CLI Reference
Docker Usage
If provided with a docker image, you can run any of the commands mentioned in this reference by prefixing it with:
docker run --gpus all -v $(pwd):/directory
Mount your current working directory on your local disk onto the path /directory inside the container. This would allow you to run the bench commands on the files located in your current working directory on your local disk.
For example, to run the compute-perplexity command via docker, we'll use:
docker run --gpus all -v $(pwd):/directory bench compute-perplexity \
--file-path ./input_path/file.json \
--model-path ./results/perplexity \
--max-length 2048
Some commands require an API key, which you can pass to the container as an environment variable. For example, to run the compute-alpaca-eval-from-model command via docker, we'll use:
docker run --gpus all -v $(pwd):/directory -e API_KEY=<KEY> bench compute-alpaca-eval-from-model \
--model-configs <MODEL-CONFIGS> \
--output-path <OUTPUT-PATH>
Docker Usage with a License Key
If provided with a docker image and a license key, you can run any of the commands mentioned in this reference by prefixing it with:
docker run -v <LICENSE-KEY-DIR-PATH>:/bench/runtime_key -e PYARMOR_RKEY=/bench/runtime_key -v $(pwd):/directory
- Mount the local path of your provided license key on your disk onto the path /bench/runtime_key inside the docker container, and then set the environment variable PYARMOR_RKEY pointing to that mounted directory containing your license key.
- Mount your current working directory on your local disk onto the path /directory inside the container. This would allow you to run the bench commands on the files located in your current working directory on your local disk.
For example, to run the compute-perplexity command via docker, we'll use (assuming your license key is locally under /keys/bench):
docker run --gpus all -it -v /keys/bench:/bench/runtime_key -e PYARMOR_RKEY=/bench/runtime_key -v $(pwd):/directory bench compute-perplexity \
--file-path ./input_path/file.json \
--model-path ./results/perplexity \
--max-length 2048
bench
BeyondBench is a tool designed to evaluate LLMs on a variety of evaluation tasks.
Usage:
bench [OPTIONS] COMMAND [ARGS]...
Options:
--install-completion Install completion for the current shell.
--show-completion Show completion for the current shell, to copy it or
customize the installation.
benchmark-serving
Evaluate a model's performance by sending requests to an API, measuring metrics like latency, throughput, TTFT and TPOT.
Usage:
bench benchmark-serving [OPTIONS]
Options:
--model-id TEXT Request payload: The model identifier sent
in the request payload \[required]
--tokenizer-id TEXT The tokenizer id that is used to get the length
in tokens for the request and response text
\[required]
--base-url TEXT The base URL for the API endpoint.
\[required]
--result-path TEXT Result Path to save the benchmark results
and the figure \[required]
--trust-remote-code / --no-trust-remote-code
Trust remote code from Hugging Face to load
the tokenizer \[default: no-trust-remote-
code]
--endpoint TEXT The specific endpoint to use for serving
requests. We can support more endpoints in
the future. \[default:
/v1/chat/completions]
--dataset-path TEXT The path to the dataset file that will be
used to generate requests for benchmarking
--story-length TEXT The length category of the stories ('1k',
'2k', '3k', '4k') \[default: 1k]
--plot-title-description TEXT Description for the plot title
--best-of INTEGER Request payload: Generates `best_of`
sequences per prompt and returns the best
one. \[default: 1]
--use-beam-search / --no-use-beam-search
Request payload: Flag to indicate whether to
use beam search for generation. \[default:
no-use-beam-search]
--max-tokens INTEGER Request payload: Maximum number of tokens to
generate in the response. \[default: 1024]
--temperature FLOAT Request payload: Sampling temperature to
control randomness of output. \[default:
0.0]
--num-requests INTEGER Number of requests to generate for the
benchmark. \[default: 20]
--request-rate FLOAT Number of requests per second. If this is
inf, then all the requests are sent at time
0. Otherwise, we use Poisson process to
synthesize the request arrival times.
\[default: inf]
--seed INTEGER Random seed for reproducibility \[default:
0]
--force-overwrite / --no-force-overwrite
Flag to determine whether to overwrite
existing results. \[default: no-force-
overwrite]
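For example, a minimal invocation against a locally hosted, OpenAI-compatible endpoint might look like the following (the model id, tokenizer path, URL, and result path are illustrative placeholders):
bench benchmark-serving \
--model-id my-model \
--tokenizer-id ./models/my-model \
--base-url http://localhost:8000 \
--result-path ./results/benchmark-serving \
--num-requests 20 \
--request-rate 2.0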
compute-alpaca-eval
Compute Alpaca Evaluation from output files.
This function is used to compute the Alpaca evaluation for a model using its outputs. The results are compared against the reference outputs using the specified annotator configuration.
Usage:
bench compute-alpaca-eval [OPTIONS] MODEL_OUTPUTS
Options:
MODEL_OUTPUTS Path to the outputs of the model under
evaluation. \[required]
--reference-outputs PATH Path to the outputs of the reference model.
If not provided, we use the default
Davinci003 outputs on the AlpacaEval set.
--annotator-config TEXT Path to annotator yaml config file.
--output-path PATH Destination path for saving Alpaca
evaluation results. If not provided, default
location will be used. \[default:
results/compute-alpaca-eval]
--force-overwrite / --no-force-overwrite
Flag to force to overwrite the results file
if it already exists \[default: no-force-
overwrite]
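For example, assuming the model's outputs have already been generated, an invocation might look like the following (the paths below are illustrative placeholders):
bench compute-alpaca-eval ./outputs/my_model_outputs.json \
--annotator-config ./configs/annotator.yaml \
--output-path ./results/compute-alpaca-eval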
compute-alpaca-eval-from-model
Evaluate a model against reference models.
This function is a tool for evaluating language models from Hugging Face, API providers, or locally trained/fine-tuned models. Initially, it executes the model on the Alpaca-eval dataset. Following this, an annotator evaluates the output by comparing it to a reference model. Finally, the function calculates the win_rate of the models and incorporates them into the leaderboard.
Notes
- When the model_config file is provided, the function will attempt to load and possibly modify its parameters based on other given arguments.
- If no model_config file is provided but a model_name is given, a default configuration template is used.
- For configurations or options not explicitly passed as arguments, default values or behaviors are assumed.
Usage:
bench compute-alpaca-eval-from-model [OPTIONS]
Options:
--model-config TEXT The path to a yaml file containing the
configuration of the model to decode from.
--annotator-config TEXT Name of annotators configuration. If None,
we use the default annotators configuration.
\[default: alpaca_eval_gpt4]
--reference-model-configs TEXT The path to a yaml file containing the
configuration of the reference model. If not
provided, we use the default Davinci003
outputs on the AlpacaEval set.
--output-path PATH Destination path for saving Alpaca
evaluation results. If not provided, default
location will be used. \[default:
results/compute-alpaca-eval]
--model-name TEXT The name or local path of the model. Used if
no model_config file is passed.
--model-config-name TEXT The display name of the model on the leaderboard.
Used if no model_config file is passed.
--prompt-template TEXT Path to the prompt template.
--max-new-tokens INTEGER Maximum number of tokens for model outputs.
--mode TEXT Whether to load the model in GPTQ mode or not.
--fn-completions TEXT Function from `alpaca_farm.decoders` for
completions.
--torch-dtype TEXT Data type for PyTorch model. Eg. 'float16'.
--trust-remote-code / --no-trust-remote-code
Whether to allow the code of the model to be
downloaded and executed from the Hugging
Face Model Hub
--batch-size INTEGER Batch size for evaluation.
--temperature FLOAT Sampling temperature for model.
--top-p FLOAT Threshold value for selecting top tokens
based on their cumulative distribution
probability during sampling.
--do-sample / --no-do-sample Whether to use sampling for model
generation.
--pretty-name TEXT A human-friendly name for the model.
--link TEXT Link to the model's location.
--device-map TEXT Device map for multi-GPU model loading.
--force-overwrite / --no-force-overwrite
Flag to force to overwrite the results file
if it already exists \[default: no-force-
overwrite]
--export-to-yaml / --no-export-to-yaml
Whether to export the config to a YAML file
or not. \[default: no-export-to-yaml]
--max-instances INTEGER The number of AlpacaEval instances for model
execution and evaluation; defaults to
evaluating all 805 prompts if not set.
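For example, to evaluate a Hugging Face model without providing a model_config file, an invocation might look like the following (the model name, display name, and output path are illustrative; the default alpaca_eval_gpt4 annotator requires a valid API key, e.g. passed as shown in the docker example above):
bench compute-alpaca-eval-from-model \
--model-name my-org/my-model \
--model-config-name my-model \
--output-path ./results/compute-alpaca-eval \
--max-instances 100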
compute-concurrency
Run a concurrency evaluation for a given number of users on a given model, using a given prompt length and number of tokens to generate.
Usage:
bench compute-concurrency [OPTIONS]
Options:
--model-path PATH The path of the model to be evaluated.
\[required]
--n-users INTEGER The number of concurrent users. \[required]
--prompt-len INTEGER The length of the system prompt combined
with the user's input question. \[required]
--num-tokens INTEGER The number of tokens to be generated.
\[required]
--num-gpus INTEGER Number of gpus to load the model.
\[default: 1]
--overall-timing / --no-overall-timing
If enabled, timing is measured over the overall tokens.
\[default: no-overall-timing]
--output-path PATH The output path for the results. If not
provided, it will use the default path
\[default: results/compute-concurrency]
--load-8bit / --no-load-8bit Load the model in 8-bit \[default: no-
load-8bit]
--gptq-mode / --no-gptq-mode Load the model in GPTQ mode \[default: no-
gptq-mode]
--device [cuda|cpu] The device type to load the model. Options:
['cpu', 'cuda'] \[default: cuda]
--force-overwrite / --no-force-overwrite
Whether to overwrite the output or not
\[default: no-force-overwrite]
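For example, to simulate 8 concurrent users against a local model, an invocation might look like this (the model path and the prompt/token lengths are illustrative):
bench compute-concurrency \
--model-path ./models/my-model \
--n-users 8 \
--prompt-len 512 \
--num-tokens 256 \
--num-gpus 1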
compute-elo-rating
Compute the Elo rating for LLM models using the pairwise battle outcomes results present in the provided JSON file.
Usage:
bench compute-elo-rating [OPTIONS] FILE_PATH MODEL_1_KEY MODEL_2_KEY PREFERENCE_KEY
Options:
FILE_PATH Path to the input JSON file which contains
models pairwise battles and results.
\[required]
MODEL_1_KEY The key of the entry corresponding to the
first model within the JSON data.
\[required]
MODEL_2_KEY The key of the entry corresponding to the
second model within the JSON data.
\[required]
PREFERENCE_KEY The key of the entry that indicates the
winning model for each pairwise battle in
the data. \[required]
--output-path PATH Destination path for saving Elo rating
results. If not provided, default location
will be used. \[default: results/compute-
elo-rating]
--force-overwrite / --no-force-overwrite
Flag to force to overwrite the results file
if it already exists \[default: no-force-
overwrite]
--k INTEGER The K-factor, which determines how much the
ratings are adjusted after each match
\[default: 32]
--scale INTEGER The scale parameter determines the range of
rating differences that have a significant
impact on the expected score. \[default:
400]
--base INTEGER The base parameter is the base of the
exponential function used to calculate the
expected score \[default: 10]
--initial-rate INTEGER The initial rating assigned to each model
\[default: 1000]
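For example, given a battles file whose entries use the keys model_a, model_b, and winner (the file path and key names are illustrative and must match your JSON data):
bench compute-elo-rating ./data/battles.json model_a model_b winner \
--k 32 \
--initial-rate 1000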
compute-lm-eval
CLI command to run the LM Evaluation Harness.
Usage:
bench compute-lm-eval [OPTIONS]
Options:
--model-path PATH Path of the model, can be a checkpoint or
from hub. \[required]
--task TEXT The task the model needs to be evaluated on,
example: [arc_challenge, mmlu, hellaswag,
...etc]. \[required]
--model TEXT Type of model, example: beyondgpt, hf, hf-
causal. \[default: beyondgpt]
--model-args TEXT Arguments to pass directly to the model,
example: --model-args
trust_remote_code=True, to pass multiple
model arguments they should be separated by
a comma, example: --model-args
argument_1=value_1,argument_2=value_2.
--num-fewshot INTEGER Number of few shots to run the evaluation
with. \[default: 0]
--batch-size INTEGER Model batch size. \[default: 1]
--output-path PATH Path to store the results. \[default:
results/compute-lm-eval/]
--force-overwrite / --no-force-overwrite
Force overwrite the results. \[default: no-
force-overwrite]
--load-in-8bit / --no-load-in-8bit
Whether to load the model in 8-bit mode.
\[default: no-load-in-8bit]
--gptq-mode / --no-gptq-mode Whether to load the model in GPTQ mode.
\[default: no-gptq-mode]
--device TEXT The device to run the evaluation on.
\[default: cuda]
--limit FLOAT The number of sentences to evaluate per
task, or percentage of sentences to evaluate
from the eval dataset.
--num-gpus INTEGER Number of GPUs used for evaluation.
\[default: 1]
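For example, to run a zero-shot MMLU evaluation on a Hugging Face model, an invocation might look like this (the model path is illustrative):
bench compute-lm-eval \
--model-path ./models/my-model \
--task mmlu \
--model hf \
--num-fewshot 0 \
--batch-size 1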
compute-perplexity
Compute the perplexity using a provided text completion jsonl file.
Usage:
bench compute-perplexity [OPTIONS]
Options:
--file-path PATH The input jsonl file path that contains the
text completion sentences to be evaluated.
Must be in the form {'text': sentence1}.
\[required]
--model-path PATH The path of the model to be evaluated.
\[required]
--max-length INTEGER The max sequence length to be generated.
\[required]
--load-8bit / --no-load-8bit Load the model in 8-bit \[default: no-
load-8bit]
--gptq-mode / --no-gptq-mode Load the model in GPTQ mode \[default: no-
gptq-mode]
--num-gpus INTEGER Number of gpus to load the model.
\[default: 1]
--device [cuda|cpu] The device type to load the model
\[default: cuda]
--batch-size INTEGER Batch size for inference \[default: 2]
--add-sos / --no-add-sos Whether to add start of sentence in the
inference or not \[default: no-add-sos]
--output-path PATH The output path for the results. If not
provided, it will use the default path
\[default: results/compute-perplexity]
--force-overwrite / --no-force-overwrite
Whether to overwrite the output or not
\[default: no-force-overwrite]
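For example, running the same evaluation as the docker example above, but directly on the host (the file and model paths are illustrative):
bench compute-perplexity \
--file-path ./input_path/file.json \
--model-path ./results/perplexity \
--max-length 2048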
compute-rag
Compute and evaluate the RAG metric scores such as: ragas_score, context_precision, faithfulness, answer_relevancy.
Usage:
bench compute-rag [OPTIONS]
Options:
--model-path PATH The path of the model to be evaluated.
\[required]
--dataset-path PATH The input jsonl file path that contains the
question, contexts and ground_truths.
\[required]
--framework [ragas|llama_index]
The RAG evaluation framework. By default =
llama_index \[default: llama_index]
--evaluator [gpt-4|gpt-4-1106-preview|gpt-3.5-turbo|gpt-3.5-turbo-1106]
The evaluator model. By default =
gpt-4-1106-preview \[default:
gpt-4-1106-preview]
--guided-generation / --no-guided-generation
Whether to use guided generation in the
evaluation or not. Currently, vLLM-served
models are supported \[default: no-guided-
generation]
--api-base-url TEXT The OpenAI-compatible API base URL
--metrics [faithfulness|relevancy|correctness]
The metrics to evaluate. If None, all
metrics (Faithfulness, Relevancy,
Correctness) will be used. By default = None
--max-new-tokens INTEGER The max sequence length to be generated.
\[default: 1024]
--temperature FLOAT The temperature of the model to be
evaluated. \[default: 0.0]
--max-gpu-memory TEXT The max GPU memory
--cpu-offloading / --no-cpu-offloading
Flag used to offload the operations to cpu
\[default: no-cpu-offloading]
--load-8bit / --no-load-8bit Load the model in 8-bit \[default: no-
load-8bit]
--gptq-mode / --no-gptq-mode Load the model in GPTQ mode \[default: no-
gptq-mode]
--use-safetensors / --no-use-safetensors
Use safetensors instead of .bin model files
\[default: no-use-safetensors]
--disable-exllama / --no-disable-exllama
Disable exllama fusing \[default: no-
disable-exllama]
--num-gpus INTEGER Number of gpus to load the model.
\[default: 1]
--device [cuda|cpu] The device type to load the model
\[default: cuda]
--output-path PATH The output path for the results. If not
provided, it will use the default path
\[default: results/compute-rag]
--force-overwrite / --no-force-overwrite
Whether to overwrite the output or not
\[default: no-force-overwrite]
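For example, to evaluate a local model with the default llama_index framework, an invocation might look like this (the paths are illustrative; the evaluator model requires access to an OpenAI-compatible API):
bench compute-rag \
--model-path ./models/my-model \
--dataset-path ./data/rag_eval.jsonl \
--framework llama_index \
--evaluator gpt-4-1106-preview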
download-model
Download the model specified by deployed_model configuration.
Usage:
bench download-model [OPTIONS]
evaluate-pass-k
Evaluate a model on the HumanEval dataset and report pass@k values.
Usage:
bench evaluate-pass-k [OPTIONS] MODEL_NAME
Options:
MODEL_NAME The model to be evaluated. \[required]
--load-in-fp16 / --no-load-in-fp16
If true, will load the model in half
precision \[default: no-load-in-fp16]
--load-in-8bit / --no-load-in-8bit
If true, will convert the loaded model into
mixed-8bit quantized model. \[default: no-
load-in-8bit]
--load-in-4bit / --no-load-in-4bit
If true, will convert the loaded model into
4bit quantized model. \[default: no-load-
in-4bit]
--weight-sharding / --no-weight-sharding
If true, will load the model in shards
\[default: no-weight-sharding]
--k TEXT The k values in pass@k, passed as a comma-
separated string. \[default: 1,10]
--top-p FLOAT If set to < 1, only the smallest set of most
probable tokens with probabilities that add
up to top_p or higher are kept for
generation. \[default: 0.95]
--top-k INTEGER Top-k parameter used for generation.
\[default: 0]
--temperature FLOAT The value used to modulate the next token
probabilities. \[default: 0.9]
--num-beams INTEGER Number of beams for beam search. \[default:
2]
--max-new-tokens INTEGER The maximum number of newly generated
tokens. \[default: 256]
--n-samples INTEGER Number of codes to generate for each sample.
\[default: 200]
--num-parallel-generations INTEGER
Number of codes to generate in parallel in
one inference \[default: 10]
--num-tasks INTEGER The number of human-eval tasks to run. If
not included, all tasks are evaluated.
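For example, to compute pass@1 and pass@10 on a subset of HumanEval tasks (the model name and sample counts are illustrative):
bench evaluate-pass-k my-org/my-code-model \
--k 1,10 \
--n-samples 20 \
--num-parallel-generations 10 \
--num-tasks 20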
evaluate-stats
Evaluate Model Efficiency Stats.
Usage:
bench evaluate-stats [OPTIONS] MODEL_NAME
Options:
MODEL_NAME The model to be evaluated. \[required]
--load-in-fp16 / --no-load-in-fp16
If true, will load the model in half
precision \[default: no-load-in-fp16]
--load-in-8bit / --no-load-in-8bit
If true, will convert the loaded model into
mixed-8bit quantized model. \[default: no-
load-in-8bit]
--load-in-4bit / --no-load-in-4bit
If true, will convert the loaded model into
4bit quantized model. \[default: no-load-
in-4bit]
--weight-sharding / --no-weight-sharding
If true, will load the model in shards
\[default: no-weight-sharding]
--write-results / --no-write-results
If true, results will be written to file
`{model_name}.txt` \[default: write-
results]
--max-new-tokens INTEGER The maximum number of newly generated
tokens. \[default: 256]
--top-p FLOAT If set to < 1, only the smallest set of most
probable tokens with probabilities that add
up to top_p or higher are kept for
generation. \[default: 0.95]
--top-k INTEGER Top-k parameter used for generation.
\[default: 0]
--temperature FLOAT The value used to modulate the next token
probabilities. \[default: 0.9]
--num-beams INTEGER Number of beams for beam search.
\[default: 2]
--repetitions INTEGER How many times to run the inference. The
calculated characteristics will be the
average of all runs' results \[default: 1]
--num-tasks INTEGER The number of human-eval tasks to run. If
not included, all tasks are evaluated.
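For example, to measure efficiency stats in half precision, averaged over a few runs (the model name is illustrative):
bench evaluate-stats my-org/my-model \
--load-in-fp16 \
--max-new-tokens 256 \
--repetitions 3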
get-config
Retrieve the value of a configuration parameter.
Usage:
bench get-config [OPTIONS] PARAMETER
Options:
PARAMETER \[required]
mt-bench
MT-Benchmarking.
Usage:
bench mt-bench [OPTIONS]
Options:
--model-id TEXT The model id \[required]
--task [generate-answers|judge|show-results|all]
The targeted task \[default: all]
--logging-path PATH The logging path \[default: results/mt-
bench/]
--gen-api-url TEXT The API server URL for generation.
--gen-api-key TEXT The API server key for generation.
--gen-model-name TEXT The API server model name for generation.
--timeout INTEGER The timeout for the request sent to the API
server. \[default: 30]
--use-api / --no-use-api Whether to use a hosted API server or not.
\[default: no-use-api]
--judge-api-base-url TEXT The API server Base URL for Judging.
--judge-api-key TEXT The API server key for Judging.
--judge-model-name TEXT The API server model name for Judging.
--model-path TEXT The model path
--models-answer-dir PATH The model answers directory
--judges-dir PATH The model judgement directory
--question-begin INTEGER The questions starting index
--question-end INTEGER The questions end index
--device [cuda|cpu] The device type to load the model
\[default: cuda]
--max-new-token INTEGER The max number of tokens to be generated
\[default: 1024]
--num-choices INTEGER The number of answers to be generated
\[default: 1]
--num-gpus INTEGER The number of GPUs \[default: 1]
--max-gpu-memory TEXT The max GPU memory
--load-8bit / --no-load-8bit Flag to load the model using 8-bit
quantization \[default: no-load-8bit]
--cpu-offloading / --no-cpu-offloading
Flag used to offload the operations to cpu
\[default: no-cpu-offloading]
--gptq-mode / --no-gptq-mode A flag used to load GPTQ model \[default:
no-gptq-mode]
--force-overwrite / --no-force-overwrite
A flag to determine whether to overwrite the files or not
\[default: no-force-overwrite]
--judge-model TEXT The judge model to be used \[default:
gpt-4]
--baseline-model TEXT The baseline model for pairwise benchmarking
\[default: gpt-3.5-turbo]
--first-n INTEGER Judging the first n questions
--parallel INTEGER The number of parallel requests to make
\[default: 1]
--model-list TEXT The model list, comma separated (e.g.
model1,model2)
--mode TEXT The benchmarking mode \[default: single]
--interrupt-before-judgement / --no-interrupt-before-judgement
A flag to interrupt the execution before
starting the judgment \[default: interrupt-
before-judgement]
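For example, to generate answers and judge them in a single run for a local model (the model id, model path, and judge settings are illustrative; judging with gpt-4 requires a valid API key):
bench mt-bench \
--model-id my-model \
--model-path ./models/my-model \
--task all \
--mode single \
--judge-model gpt-4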
print-model-zoo
View the list of code generation models that can be evaluated.
Usage:
bench print-model-zoo [OPTIONS]
serve-docs
Run the mkdocs server to serve the documentation site.
Usage:
bench serve-docs [OPTIONS]
streamlit
Run the Streamlit app associated with the project.
Usage:
bench streamlit [OPTIONS]
webservice
Run the RESTful webservice associated with the project.
Usage:
bench webservice [OPTIONS]