
CLI Reference

Docker Usage

If provided with a Docker image, you can run any of the commands in this reference by prefixing them with:

docker run --gpus all -v $(pwd):/directory
This mounts your current working directory onto the path /directory inside the container, allowing the bench commands to operate on files located in your current working directory. For example, to run the compute-perplexity command via Docker, we'll use:
docker run --gpus all -v $(pwd):/directory bench compute-perplexity \
    --file-path  ./input_path/file.json  \
    --model-path ./results/perplexity \
    --max-length 2048
If you need to pass an API key to the Docker container, you can do so as follows:
docker run --gpus all -v $(pwd):/directory -e API_KEY=<KEY> bench compute-alpaca-eval-from-model  \
    --model-configs <MODEL-CONFIGS> \
    --output-path <OUTPUT-PATH>

Docker Usage with a License key

If provided with a Docker image and a license key, you can run any of the commands in this reference by prefixing them with:

docker run -v <LICENSE-KEY-DIR-PATH>:/bench/runtime_key -e PYARMOR_RKEY=/bench/runtime_key -v $(pwd):/directory
This will do two things:

  • Mount the local path of your license key onto the path /bench/runtime_key inside the container, and set the environment variable PYARMOR_RKEY to point to that mounted directory containing your license key.

  • Mount your current working directory onto the path /directory inside the container, allowing the bench commands to operate on files located in your current working directory.

For example, to run the compute-perplexity command via Docker (assuming your license key is stored locally under /keys/bench):

docker run --gpus all -it -v /keys/bench:/bench/runtime_key -e PYARMOR_RKEY=/bench/runtime_key -v $(pwd):/directory bench compute-perplexity \
    --file-path  ./input_path/file.json  \
    --model-path ./results/perplexity \
    --max-length 2048

bench

BeyondBench is a tool designed to evaluate LLMs across a variety of evaluation tasks.

Usage:

bench [OPTIONS] COMMAND [ARGS]...

Options:

  --install-completion  Install completion for the current shell.
  --show-completion     Show completion for the current shell, to copy it or
                        customize the installation.

benchmark-serving

Evaluate a model's serving performance by sending requests to an API and measuring metrics such as latency, throughput, TTFT (time to first token), and TPOT (time per output token).

Usage:

bench benchmark-serving [OPTIONS]

Options:

  --model-id TEXT                 Request payload: The model identifier sent
                                  in the request payload  \[required]
  --tokenizer-id TEXT             The tokenizer id used to get the length in
                                  tokens of the request and response text
                                  \[required]
  --base-url TEXT                 The base URL for the API endpoint.
                                  \[required]
  --result-path TEXT              Result Path to save the benchmark results
                                  and the figure  \[required]
  --trust-remote-code / --no-trust-remote-code
                                  Trust remote code from huggingface to load
                                  the tokenizer  \[default: no-trust-remote-
                                  code]
  --endpoint TEXT                 The specific endpoint to use for serving
                                  requests. More endpoints may be supported in
                                  the future.  \[default:
                                  /v1/chat/completions]
  --dataset-path TEXT             The path to the dataset file that will be
                                  used to generate requests for benchmarking
  --story-length TEXT             The length category of the stories ('1k',
                                  '2k', '3k', '4k')  \[default: 1k]
  --plot-title-description TEXT   Description for the plot title
  --best-of INTEGER               Request payload: Generates `best_of`
                                  sequences per prompt and returns the best
                                  one.  \[default: 1]
  --use-beam-search / --no-use-beam-search
                                  Request payload: Flag to indicate whether to
                                  use beam search for generation.  \[default:
                                  no-use-beam-search]
  --max-tokens INTEGER            Request payload: Maximum number of tokens to
                                  generate in the response.  \[default: 1024]
  --temperature FLOAT             Request payload: Sampling temperature to
                                  control randomness of output.  \[default:
                                  0.0]
  --num-requests INTEGER          Number of requests to generate for the
                                  benchmark.  \[default: 20]
  --request-rate FLOAT            Number of requests per second. If this is
                                  inf, then all the requests are sent at time
                                  0. Otherwise, we use a Poisson process to
                                  synthesize the request arrival times.
                                  \[default: inf]
  --seed INTEGER                  Random seed for reproducibility  \[default:
                                  0]
  --force-overwrite / --no-force-overwrite
                                  Flag to determine whether to overwrite
                                  existing results.  \[default: no-force-
                                  overwrite]
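
For example, a typical invocation might look like the following (the model id, tokenizer id, base URL, and result path are illustrative placeholders, not values shipped with the tool):

bench benchmark-serving \
    --model-id my-model \
    --tokenizer-id my-org/my-tokenizer \
    --base-url http://localhost:8000 \
    --result-path ./results/benchmark-serving \
    --num-requests 50 \
    --request-rate 2.0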

compute-alpaca-eval

Compute Alpaca Evaluation from output files.

This function is used to compute the Alpaca evaluation for a model using its outputs. The results are compared against the reference outputs using the specified annotator configuration.

Usage:

bench compute-alpaca-eval [OPTIONS] MODEL_OUTPUTS

Options:

  MODEL_OUTPUTS                   Path to the outputs of the model under
                                  evaluation.  \[required]
  --reference-outputs PATH        Path to the outputs of the reference model.
                                  If not provided, we use the default
                                  Davinci003 outputs on the AlpacaEval set.
  --annotator-config TEXT         Path to annotator yaml config file.
  --output-path PATH              Destination path for saving Alpaca
                                  evaluation results. If not provided, default
                                  location will be used.  \[default:
                                  results/compute-alpaca-eval]
  --force-overwrite / --no-force-overwrite
                                  Flag to force to overwrite the results file
                                  if it already exists  \[default: no-force-
                                  overwrite]
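
For example, assuming the model outputs were saved to ./outputs/model_outputs.json and a custom annotator config is available (both paths are illustrative placeholders), the evaluation could be run with:

bench compute-alpaca-eval ./outputs/model_outputs.json \
    --annotator-config ./configs/annotator.yaml \
    --output-path ./results/compute-alpaca-eval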

compute-alpaca-eval-from-model

Evaluate a model against reference models.

This function is a tool for evaluating language models from Hugging Face, API providers, or locally trained/fine-tuned models. It first runs the model on the AlpacaEval dataset; an annotator then evaluates the outputs by comparing them to a reference model; finally, the function calculates the win_rate of the model and adds it to the leaderboard.

Notes

  • When the model_config file is provided, the function will attempt to load and possibly modify its parameters based on other given arguments.
  • If no model_config file is provided but a model_name is given, a default configuration template is used.
  • For configurations or options not explicitly passed as arguments, default values or behaviors are assumed.

Usage:

bench compute-alpaca-eval-from-model [OPTIONS]

Options:

  --model-config TEXT             The path to a yaml file containing the
                                  configuration of the model to decode from.
  --annotator-config TEXT         Name of annotators configuration. If None,
                                  we use the default annotators configuration.
                                  \[default: alpaca_eval_gpt4]
  --reference-model-configs TEXT  The path to a yaml file containing the
                                  configuration of the reference model. If not
                                  provided, we use the default Davinci003
                                  outputs on the AlpacaEval set.
  --output-path PATH              Destination path for saving Alpaca
                                  evaluation results. If not provided, default
                                  location will be used.  \[default:
                                  results/compute-alpaca-eval]
  --model-name TEXT               The name or local path of the model. Used if
                                  no model_config file is passed.
  --model-config-name TEXT        The display name of the model on the
                                  leaderboard. Used if no model_config file is
                                  passed.
  --prompt-template TEXT          Path to the prompt template.
  --max-new-tokens INTEGER        Maximum number of tokens for model outputs.
  --mode TEXT                     Loading a model in gptq mode or not.
  --fn-completions TEXT           Function from `alpaca_farm.decoders` for
                                  completions.
  --torch-dtype TEXT              Data type for PyTorch model. Eg. 'float16'.
  --trust-remote-code / --no-trust-remote-code
                                  Whether to allow the code of the model to be
                                  downloaded and executed from the Hugging
                                  Face Model Hub
  --batch-size INTEGER            Batch size for evaluation.
  --temperature FLOAT             Sampling temperature for model.
  --top-p FLOAT                   Threshold value for selecting top tokens
                                  based on their cumulative distribution
                                  probability during sampling.
  --do-sample / --no-do-sample    Whether to use sampling for model
                                  generation.
  --pretty-name TEXT              A human-friendly name for the model.
  --link TEXT                     Link to the model's location.
  --device-map TEXT               Device map for multi-GPU model loading.
  --force-overwrite / --no-force-overwrite
                                  Flag to force to overwrite the results file
                                  if it already exists  \[default: no-force-
                                  overwrite]
  --export-to-yaml / --no-export-to-yaml
                                  Whether to export the config to a YAML file
                                  or not.  \[default: no-export-to-yaml]
  --max-instances INTEGER         The number of AlpacaEval instances for model
                                  execution and evaluation; defaults to
                                  evaluating all 805 prompts if not set.
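
For example, to evaluate a Hugging Face model against the default reference outputs without a model_config file (the model name and display name below are illustrative placeholders):

bench compute-alpaca-eval-from-model \
    --model-name my-org/my-model \
    --model-config-name my-model \
    --max-new-tokens 512 \
    --output-path ./results/compute-alpaca-eval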

compute-concurrency

Run a concurrency evaluation for a given number of users on a given model, using a specified prompt length and number of tokens to generate.

Usage:

bench compute-concurrency [OPTIONS]

Options:

  --model-path PATH               The path of the model to be evaluated.
                                  \[required]
  --n-users INTEGER               The number of concurrent users.  \[required]
  --prompt-len INTEGER            The length of the system prompt combined
                                  with the user's input question.  \[required]
  --num-tokens INTEGER            The number of tokens to be generated.
                                  \[required]
  --num-gpus INTEGER              Number of gpus to load the model.
                                  \[default: 1]
  --overall-timing / --no-overall-timing
                                  The timing will be on the overall tokens.
                                  \[default: no-overall-timing]
  --output-path PATH              The output path for the results. If not
                                  provided, the default path will be used.
                                  \[default: results/compute-concurrency]
  --load-8bit / --no-load-8bit    Load the model in 8-bit  \[default: no-
                                  load-8bit]
  --gptq-mode / --no-gptq-mode    Load the model in GPTQ mode  \[default: no-
                                  gptq-mode]
  --device [cuda|cpu]             The device type to load the model. Options:
                                  ['cpu', 'cuda']  \[default: cuda]
  --force-overwrite / --no-force-overwrite
                                  Whether to overwrite the output or not
                                  \[default: no-force-overwrite]
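
For example, to simulate 8 concurrent users against a locally stored model (the model path is an illustrative placeholder):

bench compute-concurrency \
    --model-path ./models/my-model \
    --n-users 8 \
    --prompt-len 512 \
    --num-tokens 256 \
    --num-gpus 1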

compute-elo-rating

Compute Elo ratings for LLMs using the pairwise battle outcomes in the provided JSON file.

Usage:

bench compute-elo-rating [OPTIONS] FILE_PATH MODEL_1_KEY MODEL_2_KEY PREFERENCE_KEY

Options:

  FILE_PATH                       Path to the input JSON file which contains
                                  models pairwise battles and results.
                                  \[required]
  MODEL_1_KEY                     The key of the entry corresponding to the
                                  first model within the JSON data.
                                  \[required]
  MODEL_2_KEY                     The key of the entry corresponding to the
                                  second model within the JSON data.
                                  \[required]
  PREFERENCE_KEY                  The key of the entry that indicates the
                                  winning model for each pairwise battle in
                                  the data.  \[required]
  --output-path PATH              Destination path for saving Elo rating
                                  results. If not provided, default location
                                  will be used.  \[default: results/compute-
                                  elo-rating]
  --force-overwrite / --no-force-overwrite
                                  Flag to force to overwrite the results file
                                  if it already exists  \[default: no-force-
                                  overwrite]
  --k INTEGER                     The K-factor, which determines how much the
                                  ratings are adjusted after each match
                                  \[default: 32]
  --scale INTEGER                 The scale parameter determines the range of
                                  rating differences that have a significant
                                  impact on the expected score.  \[default:
                                  400]
  --base INTEGER                  The base parameter is the base of the
                                  exponential function used to calculate the
                                  expected score  \[default: 10]
  --initial-rate INTEGER          The initial rating assigned to each model
                                  \[default: 1000]
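
For example, given a battles file whose records store the two model names and the winner under the hypothetical keys model_a, model_b, and winner (i.e. entries of the form {"model_a": ..., "model_b": ..., "winner": ...}), the ratings could be computed with:

bench compute-elo-rating ./battles.json model_a model_b winner \
    --k 32 \
    --initial-rate 1000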

compute-lm-eval

CLI command to calculate LM Evaluation Harness.

Usage:

bench compute-lm-eval [OPTIONS]

Options:

  --model-path PATH               Path of the model, can be a checkpoint or
                                  from hub.  \[required]
  --task TEXT                     The task the model needs to be evaluated on,
                                  example: [arc_challenge, mmlu, hellaswag,
                                  ...etc].  \[required]
  --model TEXT                    Type of model, example: beyondgpt, hf, hf-
                                  causal.  \[default: beyondgpt]
  --model-args TEXT               Arguments to pass directly to the model,
                                  example: --model-args
                                  trust_remote_code=True. To pass multiple
                                  model arguments, they should be separated by
                                  a comma, example: --model-args
                                  argument_1=value_1,argument_2=value_2.
  --num-fewshot INTEGER           Number of few shots to run the evaluation
                                  with.  \[default: 0]
  --batch-size INTEGER            Model batch size.  \[default: 1]
  --output-path PATH              Path to store the results.  \[default:
                                  results/compute-lm-eval/]
  --force-overwrite / --no-force-overwrite
                                  Force overwrite the results.  \[default: no-
                                  force-overwrite]
  --load-in-8bit / --no-load-in-8bit
                                  Whether to load the model in 8-bit mode.
                                  \[default: no-load-in-8bit]
  --gptq-mode / --no-gptq-mode    Whether to load the model in GPTQ mode.
                                  \[default: no-gptq-mode]
  --device TEXT                   The device to run the evaluation on.
                                  \[default: cuda]
  --limit FLOAT                   The number of sentences to evaluate per
                                  task, or the percentage of sentences to
                                  evaluate from the eval dataset.
  --num-gpus INTEGER              Number of GPUs used for evaluation.
                                  \[default: 1]
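
For example, to run a 5-shot MMLU evaluation on a local checkpoint (the model path is an illustrative placeholder):

bench compute-lm-eval \
    --model-path ./models/my-model \
    --task mmlu \
    --num-fewshot 5 \
    --batch-size 4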

compute-perplexity

Compute the perplexity using a provided text completion jsonl file.

Usage:

bench compute-perplexity [OPTIONS]

Options:

  --file-path PATH                The input jsonl file path that contains the
                                  text completion sentences to be evaluated.
                                  Must be in the form {'text': sentence1}.
                                  \[required]
  --model-path PATH               The path of the model to be evaluated.
                                  \[required]
  --max-length INTEGER            The max sequence length to be generated.
                                  \[required]
  --load-8bit / --no-load-8bit    Load the model in 8-bit  \[default: no-
                                  load-8bit]
  --gptq-mode / --no-gptq-mode    Load the model in GPTQ mode  \[default: no-
                                  gptq-mode]
  --num-gpus INTEGER              Number of gpus to load the model.
                                  \[default: 1]
  --device [cuda|cpu]             The device type to load the model
                                  \[default: cuda]
  --batch-size INTEGER            Batch size for inference  \[default: 2]
  --add-sos / --no-add-sos        Whether to add start of sentence in the
                                  inference or not  \[default: no-add-sos]
  --output-path PATH              The output path for the results. If not
                                  provided, the default path will be used.
                                  \[default: results/compute-perplexity]
  --force-overwrite / --no-force-overwrite
                                  Whether to overwrite the output or not
                                  \[default: no-force-overwrite]
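
Each line of the input jsonl file should be a JSON object holding a sentence under the text key; the two lines below are illustrative examples, not shipped data:

{"text": "The quick brown fox jumps over the lazy dog."}
{"text": "Perplexity is computed over each sentence in the file."}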

compute-rag

Compute and evaluate RAG metric scores such as ragas_score, context_precision, faithfulness, and answer_relevancy.

Usage:

bench compute-rag [OPTIONS]

Options:

  --model-path PATH               The path of the model to be evaluated.
                                  \[required]
  --dataset-path PATH             The input jsonl file path that contains the
                                  question, contexts and ground_truths.
                                  \[required]
  --framework [ragas|llama_index]
                                  The RAG evaluation framework. By default =
                                  llama_index  \[default: llama_index]
  --evaluator [gpt-4|gpt-4-1106-preview|gpt-3.5-turbo|gpt-3.5-turbo-1106]
                                  The evaluator model. By default =
                                  gpt-4-1106-preview  \[default:
                                  gpt-4-1106-preview]
  --guided-generation / --no-guided-generation
                                  Whether to use guided generation in the
                                  evaluation or not. Currently only vLLM-served
                                  models are supported  \[default: no-guided-
                                  generation]
  --api-base-url TEXT             OpenAI compatible API
  --metrics [faithfulness|relevancy|correctness]
                                  The metrics to evaluate. If None, all
                                  metrics (Faithfulness, Relevancy,
                                  Correctness) will be used. By default = None
  --max-new-tokens INTEGER        The max sequence length to be generated.
                                  \[default: 1024]
  --temperature FLOAT             The temperature of the model to be
                                  evaluated.  \[default: 0.0]
  --max-gpu-memory TEXT           The max GPU memory
  --cpu-offloading / --no-cpu-offloading
                                  Flag used to offload the operations to cpu
                                  \[default: no-cpu-offloading]
  --load-8bit / --no-load-8bit    Load the model in 8-bit  \[default: no-
                                  load-8bit]
  --gptq-mode / --no-gptq-mode    Load the model in GPTQ mode  \[default: no-
                                  gptq-mode]
  --use-safetensors / --no-use-safetensors
                                  Use safetensors instead of .bin model files
                                  \[default: no-use-safetensors]
  --disable-exllama / --no-disable-exllama
                                  Disable exllama fusing  \[default: no-
                                  disable-exllama]
  --num-gpus INTEGER              Number of gpus to load the model.
                                  \[default: 1]
  --device [cuda|cpu]             The device type to load the model
                                  \[default: cuda]
  --output-path PATH              The output path for the results. If not
                                  provided, the default path will be used.
                                  \[default: results/compute-rag]
  --force-overwrite / --no-force-overwrite
                                  Whether to overwrite the output or not
                                  \[default: no-force-overwrite]
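
For example, to evaluate a local model with the default llama_index framework on a dataset whose records contain the question, contexts, and ground_truths fields (the paths below are illustrative placeholders):

bench compute-rag \
    --model-path ./models/my-model \
    --dataset-path ./data/rag_eval.jsonl \
    --framework llama_index \
    --evaluator gpt-4-1106-preview \
    --metrics faithfulness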

download-model

Download the model specified by the deployed_model configuration.

Usage:

bench download-model [OPTIONS]

evaluate-pass-k

Evaluate a model on the HumanEval dataset and report pass@k values.

Usage:

bench evaluate-pass-k [OPTIONS] MODEL_NAME

Options:

  MODEL_NAME                      The model to be evaluated.  \[required]
  --load-in-fp16 / --no-load-in-fp16
                                  If true, will load the model in half
                                  precision  \[default: no-load-in-fp16]
  --load-in-8bit / --no-load-in-8bit
                                  If true, will convert the loaded model into
                                  mixed-8bit quantized model.  \[default: no-
                                  load-in-8bit]
  --load-in-4bit / --no-load-in-4bit
                                  If true, will convert the loaded model into
                                  4bit quantized model.  \[default: no-load-
                                  in-4bit]
  --weight-sharding / --no-weight-sharding
                                  If true, will load the model in shards
                                  \[default: no-weight-sharding]
  --k TEXT                        The k values in pass@k, passed as a comma-
                                  separated string.  \[default: 1,10]
  --top-p FLOAT                   If set to < 1, only the smallest set of most
                                  probable tokens with probabilities that add
                                  up to top_p or higher are kept for
                                  generation.  \[default: 0.95]
  --top-k INTEGER                 Top-k parameter used for generation.
                                  \[default: 0]
  --temperature FLOAT             The value used to modulate the next token
                                  probabilities.  \[default: 0.9]
  --num-beams INTEGER             Number of beams for beam search.  \[default:
                                  2]
  --max-new-tokens INTEGER        The maximum number of newly generated
                                  tokens.  \[default: 256]
  --n-samples INTEGER             Number of codes to generate for each sample.
                                  \[default: 200]
  --num-parallel-generations INTEGER
                                  Number of codes to generate in parallel in
                                  one inference  \[default: 10]
  --num-tasks INTEGER             The number of human-eval tasks to run. If
                                  not included all tasks are evaluated.
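
For example, to compute pass@1 and pass@10 in half precision over the first 20 HumanEval tasks (the model name is an illustrative placeholder):

bench evaluate-pass-k my-org/my-code-model \
    --load-in-fp16 \
    --k 1,10 \
    --n-samples 20 \
    --num-tasks 20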

evaluate-stats

Evaluate Model Efficiency Stats.

Usage:

bench evaluate-stats [OPTIONS] MODEL_NAME

Options:

  MODEL_NAME                      The model to be evaluated.  \[required]
  --load-in-fp16 / --no-load-in-fp16
                                  If true, will load the model in half
                                  precision  \[default: no-load-in-fp16]
  --load-in-8bit / --no-load-in-8bit
                                  If true, will convert the loaded model into
                                  mixed-8bit quantized model.  \[default: no-
                                  load-in-8bit]
  --load-in-4bit / --no-load-in-4bit
                                  If true, will convert the loaded model into
                                  4bit quantized model.  \[default: no-load-
                                  in-4bit]
  --weight-sharding / --no-weight-sharding
                                  If true, will load the model in shards
                                  \[default: no-weight-sharding]
  --write-results / --no-write-results
                                  If true, results will be written to file
                                  `{model_name}.txt`  \[default: write-
                                  results]
  --max-new-tokens INTEGER        The maximum number of newly generated
                                  tokens.  \[default: 256]
  --top-p FLOAT                   If set to < 1, only the smallest set of most
                                  probable tokens with probabilities that add
                                  up to top_p or higher are kept for
                                  generation.  \[default: 0.95]
  --top-k INTEGER                 Top-k parameter used for generation.
                                  \[default: 0]
  --temperature FLOAT             The value used to modulate the next token
                                  probabilities.  \[default: 0.9]
  --num-beams INTEGER             Number of beams for beam search.
                                  \[default: 2]
  --repetitions INTEGER           How many times to run the inference. The
                                  calculated characteristics will be the
                                  average across all runs  \[default: 1]
  --num-tasks INTEGER             The number of human-eval tasks to run. If
                                  not included all tasks are evaluated.
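
For example, to measure efficiency stats in half precision, averaged over 3 inference runs (the model name is an illustrative placeholder):

bench evaluate-stats my-org/my-model \
    --load-in-fp16 \
    --max-new-tokens 256 \
    --repetitions 3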

get-config

Retrieve the value of a configuration parameter.

Usage:

bench get-config [OPTIONS] PARAMETER

Options:

  PARAMETER  \[required]

mt-bench

MT-Benchmarking.

Usage:

bench mt-bench [OPTIONS]

Options:

  --model-id TEXT                 The model id  \[required]
  --task [generate-answers|judge|show-results|all]
                                  The targeted task  \[default: all]
  --logging-path PATH             The logging path  \[default: results/mt-
                                  bench/]
  --gen-api-url TEXT              The API server URL for generation.
  --gen-api-key TEXT              The API server key for generation.
  --gen-model-name TEXT           The API server model name for generation.
  --timeout INTEGER               The timeout for the request sent to the API
                                  server.  \[default: 30]
  --use-api / --no-use-api        Whether to use a hosted API server or not.
                                  \[default: no-use-api]
  --judge-api-base-url TEXT       The API server Base URL for Judging.
  --judge-api-key TEXT            The API server key for Judging.
  --judge-model-name TEXT         The API server model name for Judging.
  --model-path TEXT               The model path
  --models-answer-dir PATH        The model answers directory
  --judges-dir PATH               The model judgement directory
  --question-begin INTEGER        The questions starting index
  --question-end INTEGER          The questions end index
  --device [cuda|cpu]             The device type to load the model
                                  \[default: cuda]
  --max-new-token INTEGER         The max number of tokens to be generated
                                  \[default: 1024]
  --num-choices INTEGER           The number of answers to be generated
                                  \[default: 1]
  --num-gpus INTEGER              The number of GPUs  \[default: 1]
  --max-gpu-memory TEXT           The max GPU memory
  --load-8bit / --no-load-8bit    Flag to load the model using 8 bits
                                  quantization  \[default: no-load-8bit]
  --cpu-offloading / --no-cpu-offloading
                                  Flag used to offload the operations to cpu
                                  \[default: no-cpu-offloading]
  --gptq-mode / --no-gptq-mode    A flag used to load GPTQ model  \[default:
                                  no-gptq-mode]
  --force-overwrite / --no-force-overwrite
                                  A flag to determine whether to overwrite the
                                  files or not  \[default: no-force-overwrite]
  --judge-model TEXT              The judge model to be used  \[default:
                                  gpt-4]
  --baseline-model TEXT           The baseline model for pairwise benchmarking
                                  \[default: gpt-3.5-turbo]
  --first-n INTEGER               Judging the first n questions
  --parallel INTEGER              The number of parallel requests to make
                                  \[default: 1]
  --model-list TEXT               The comma-separated model list (e.g.
                                  model1,model2)
  --mode TEXT                     The benchmarking mode  \[default: single]
  --interrupt-before-judgement / --no-interrupt-before-judgement
                                  A flag to interrupt the execution before
                                  starting the judgment  \[default: interrupt-
                                  before-judgement]
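
For example, a single-mode run that generates answers, judges them, and shows the results for a locally stored model might look like this (the model id and path are illustrative placeholders):

bench mt-bench \
    --model-id my-model \
    --model-path ./models/my-model \
    --task all \
    --judge-model gpt-4 \
    --no-interrupt-before-judgement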

print-model-zoo

View the list of code generation models that can be evaluated.

Usage:

bench print-model-zoo [OPTIONS]

serve-docs

Run the mkdocs server to serve the documentation site.

Usage:

bench serve-docs [OPTIONS]

streamlit

Run the Streamlit app associated with the project.

Usage:

bench streamlit [OPTIONS]

webservice

Run the RESTful webservice associated with the project.

Usage:

bench webservice [OPTIONS]