LLM settings
These variables control the core language model configuration.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
MODEL_NAME | facebook/opt-125m | str | The name or path of the Hugging Face model to use. | 
MODEL_REVISION | main | str | The model revision to load. | 
TOKENIZER | None | str | The name or path of the Hugging Face tokenizer to use. | 
SKIP_TOKENIZER_INIT | False | bool | If True, skips the initialization of the tokenizer and detokenizer. | 
TOKENIZER_MODE | auto | auto, slow | The tokenizer mode. | 
TRUST_REMOTE_CODE | False | bool | If True, trusts remote code from Hugging Face. | 
DOWNLOAD_DIR | None | str | The directory to download and load the model weights from. | 
LOAD_FORMAT | auto | str | The format of the model weights to load. | 
HF_TOKEN | - | str | Your Hugging Face token, used for private and gated models. | 
DTYPE | auto | auto, half, float16, bfloat16, float, float32 | The data type for model weights and activations. | 
KV_CACHE_DTYPE | auto | auto, fp8 | The data type for KV cache storage. | 
QUANTIZATION_PARAM_PATH | None | str | The path to the JSON file containing the KV cache scaling factors. | 
MAX_MODEL_LEN | None | int | The maximum model context length. | 
GUIDED_DECODING_BACKEND | outlines | outlines, lm-format-enforcer | The default engine for guided decoding. | 
DISTRIBUTED_EXECUTOR_BACKEND | None | ray, mp | The backend to use for distributed serving. | 
WORKER_USE_RAY | False | bool | Deprecated. Use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. | 
PIPELINE_PARALLEL_SIZE | 1 | int | The number of pipeline stages. | 
TENSOR_PARALLEL_SIZE | 1 | int | The number of tensor parallel replicas. | 
MAX_PARALLEL_LOADING_WORKERS | None | int | The number of workers to use for parallel model loading. | 
RAY_WORKERS_USE_NSIGHT | False | bool | If True, uses nsight to profile Ray workers. | 
ENABLE_PREFIX_CACHING | False | bool | If True, enables automatic prefix caching. | 
DISABLE_SLIDING_WINDOW | False | bool | If True, disables sliding window attention, capping the context to the sliding window size. | 
USE_V2_BLOCK_MANAGER | False | bool | If True, uses the BlockSpaceManagerV2. | 
NUM_LOOKAHEAD_SLOTS | 0 | int | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. | 
SEED | 0 | int | The random seed for operations. | 
NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, this value overrides the GPU profiling result for the number of GPU blocks. | 
MAX_NUM_BATCHED_TOKENS | None | int | The maximum number of batched tokens per iteration. | 
MAX_NUM_SEQS | 256 | int | The maximum number of sequences per iteration. | 
MAX_LOGPROBS | 20 | int | The maximum number of log probabilities to return when logprobs is specified in SamplingParams. | 
DISABLE_LOG_STATS | False | bool | If True, disables logging statistics. | 
QUANTIZATION | None | awq, squeezellm, gptq, bitsandbytes | The method used to quantize the model weights. | 
ROPE_SCALING | None | dict | The RoPE scaling configuration in JSON format. | 
ROPE_THETA | None | float | The RoPE theta value. Use with ROPE_SCALING. | 
TOKENIZER_POOL_SIZE | 0 | int | The size of the tokenizer pool for asynchronous tokenization. | 
TOKENIZER_POOL_TYPE | ray | str | The type of the tokenizer pool for asynchronous tokenization. | 
TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra configuration for the tokenizer pool. | 
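
For illustration, these settings are supplied to the worker as environment variables. A minimal sketch of a typical configuration follows; every value is an example to adjust for your deployment, not a recommendation:

```python
# Illustrative values only: a small set of LLM settings expressed as
# environment variables. Adjust every value for your own deployment.
llm_env = {
    "MODEL_NAME": "facebook/opt-125m",   # default model from the table above
    "DTYPE": "bfloat16",
    "MAX_MODEL_LEN": "4096",
    "TENSOR_PARALLEL_SIZE": "1",
    "TRUST_REMOTE_CODE": "False",
    "HF_TOKEN": "YOUR_HF_TOKEN",         # only needed for private or gated models
}

for name, value in llm_env.items():
    print(f"{name}={value}")
```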
LoRA settings
Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description | 
|---|---|---|---|
ENABLE_LORA | False | bool | If True, enables the handling of LoRA adapters. | 
MAX_LORAS | 1 | int | The maximum number of LoRAs in a single batch. | 
MAX_LORA_RANK | 16 | int | The maximum LoRA rank. | 
LORA_EXTRA_VOCAB_SIZE | 256 | int | The maximum size of the extra vocabulary for LoRA adapters. | 
LORA_DTYPE | auto | auto, float16, bfloat16, float32 | The data type for LoRA. | 
LONG_LORA_SCALING_FACTORS | None | tuple | Specifies multiple scaling factors for LoRA adapters. | 
MAX_CPU_LORAS | None | int | The maximum number of LoRAs to store in CPU memory. | 
FULLY_SHARDED_LORAS | False | bool | If True, enables fully sharded LoRA layers. | 
LORA_MODULES | [] | list[dict] | A list of LoRA adapters to add from Hugging Face. Example: [{"name": "adapter1", "path": "user/adapter1"}] | 
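
As a sketch of the LORA_MODULES format, the snippet below serializes a list of adapter entries into the JSON string the variable expects (the second adapter is a hypothetical addition to the example from the table):

```python
# Illustrative only: build a LORA_MODULES value from adapter entries.
# "adapter2" / "user/adapter2" is a hypothetical second adapter.
import json

lora_modules = [
    {"name": "adapter1", "path": "user/adapter1"},
    {"name": "adapter2", "path": "user/adapter2"},
]
print(json.dumps(lora_modules))  # use this string as the LORA_MODULES value
```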
Speculative decoding settings
Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
SCHEDULER_DELAY_FACTOR | 0.0 | float | Applies a delay before scheduling the next prompt. | 
ENABLE_CHUNKED_PREFILL | False | bool | If True, enables chunked prefill requests. | 
SPECULATIVE_MODEL | None | str | The name of the draft model for speculative decoding. | 
NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. | 
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | The number of tensor parallel replicas for the draft model. | 
SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. | 
SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disables speculative decoding if the number of enqueued requests is larger than this value. | 
NGRAM_PROMPT_LOOKUP_MAX | None | int | The maximum window size for ngram prompt lookup in speculative decoding. | 
NGRAM_PROMPT_LOOKUP_MIN | None | int | The minimum window size for ngram prompt lookup in speculative decoding. | 
SPEC_DECODING_ACCEPTANCE_METHOD | rejection_sampler | rejection_sampler, typical_acceptance_sampler | The acceptance method for draft token verification in speculative decoding. | 
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Sets the lower bound threshold for the posterior probability of a token to be accepted. | 
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. | 
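
As an illustration of how these settings combine for draft-model speculation, a configuration might look like the sketch below (the draft model name and token count are example values only):

```python
# Illustrative values only: environment variables for draft-model
# speculative decoding. The draft model and token count are examples.
spec_env = {
    "SPECULATIVE_MODEL": "facebook/opt-125m",   # small draft model (example)
    "NUM_SPECULATIVE_TOKENS": "5",
    "SPEC_DECODING_ACCEPTANCE_METHOD": "rejection_sampler",
}

for name, value in spec_env.items():
    print(f"{name}={value}")
```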
System performance settings
Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
GPU_MEMORY_UTILIZATION | 0.95 | float | The fraction of GPU VRAM to use, as a value between 0 and 1. | 
MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. | 
BLOCK_SIZE | 16 | 8, 16, 32 | The token block size for contiguous chunks of tokens. | 
SWAP_SPACE | 4 | int | The CPU swap space size (in GiB) per GPU. | 
ENFORCE_EAGER | False | bool | If True, always uses eager-mode PyTorch. If False, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. | 
MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | The maximum context length covered by CUDA graphs. When a sequence has a context length larger than this, the system falls back to eager mode. | 
DISABLE_CUSTOM_ALL_REDUCE | 0 | int | If 0, enables custom all-reduce. If 1, disables it. | 
Tokenizer settings
Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
TOKENIZER_NAME | None | str | The tokenizer repository to use when you want a different tokenizer than the model’s default. | 
TOKENIZER_REVISION | None | str | The tokenizer revision to load. | 
CUSTOM_CHAT_TEMPLATE | None | str of single-line jinja template | A custom chat Jinja template. See the Hugging Face documentation for more information. | 
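
Because CUSTOM_CHAT_TEMPLATE must be a single-line string, it helps to keep the whole template on one line. The sketch below is purely illustrative; a real template must match the prompt format your model was trained on:

```python
# Illustrative only: a minimal single-line Jinja chat template that could be
# passed via CUSTOM_CHAT_TEMPLATE. Real templates should follow your model's
# expected prompt format.
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "[{{ message['role'] }}] {{ message['content'] }} "
    "{% endfor %}"
    "{% if add_generation_prompt %}[assistant]{% endif %}"
)
print(CUSTOM_CHAT_TEMPLATE)
```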
Streaming and batch settings
Control how tokens are batched in HTTP responses when streaming. The batch size starts at DEFAULT_MIN_BATCH_SIZE and increases by a factor of DEFAULT_BATCH_SIZE_GROWTH_FACTOR with each request until it reaches DEFAULT_BATCH_SIZE.
For example, with the default values, the batch sizes are 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM’s internal batching.
| Variable | Default | Type(s) | Description | 
|---|---|---|---|
DEFAULT_BATCH_SIZE | 50 | int | The default and maximum batch size for token streaming. | 
DEFAULT_MIN_BATCH_SIZE | 1 | int | The initial batch size for the first request. | 
DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | The growth factor for the dynamic batch size. | 
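
The growth rule above can be reproduced with a short sketch (an illustration of the schedule, not the worker's actual implementation):

```python
# Reproduces the streaming batch-size schedule described above: start at
# DEFAULT_MIN_BATCH_SIZE and multiply by DEFAULT_BATCH_SIZE_GROWTH_FACTOR,
# capped at DEFAULT_BATCH_SIZE.
def stream_batch_sizes(min_size=1, growth_factor=3.0, max_size=50, steps=6):
    size = float(min_size)
    for _ in range(steps):
        yield min(int(size), max_size)
        size *= growth_factor

print(list(stream_batch_sizes()))  # [1, 3, 9, 27, 50, 50]
```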
OpenAI compatibility settings
Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
RAW_OPENAI_OUTPUT | 1 | boolean as int | If 1, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. | 
OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Overrides the served model name. This allows you to use a custom name in the model parameter of your OpenAI requests. | 
OPENAI_RESPONSE_ROLE | assistant | str | The role of the LLM’s response in OpenAI chat completions. | 
ENABLE_AUTO_TOOL_CHOICE | False | bool | If True, enables automatic tool selection for supported models. | 
TOOL_CALL_PARSER | None | str | The parser for tool calls. | 
REASONING_PARSER | None | str | The parser for reasoning-capable models. Setting this enables reasoning mode. | 
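
For example, with OPENAI_SERVED_MODEL_NAME_OVERRIDE set to my-model, a request made with the official openai Python client might look like the sketch below; the endpoint ID, API key, and base URL are placeholders to adjust for your deployment:

```python
# Minimal sketch using the openai client against the worker's OpenAI-compatible
# route. All identifiers below are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="my-model",  # matches OPENAI_SERVED_MODEL_NAME_OVERRIDE, not the HF model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```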
Serverless and concurrency settings
Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description | 
|---|---|---|---|
MAX_CONCURRENCY | 300 | int | The maximum number of concurrent requests per worker. vLLM’s internal queue handles VRAM limitations, so this setting is primarily for scaling and load balancing efficiency. | 
DISABLE_LOG_STATS | False | bool | If True, disables vLLM statistics logging. | 
DISABLE_LOG_REQUESTS | False | bool | If True, disables vLLM request logging. | 
Advanced settings
Additional configuration options for specialized use cases.

| Variable | Default | Type | Description | 
|---|---|---|---|
MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra configuration for the model loader. | 
PREEMPTION_MODE | None | str | The preemption mode. If recompute, the engine performs preemption-aware recomputation. If save, the engine saves activations to CPU memory during preemption. | 
PREEMPTION_CHECK_PERIOD | 1.0 | float | The frequency (in seconds) at which the engine checks for preemption. | 
PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory to use for saved activations. | 
DISABLE_LOGGING_REQUEST | False | bool | If True, disables logging requests. | 
MAX_LOG_LEN | None | int | The maximum number of prompt characters or prompt ID numbers to print in the log. | 
Docker build arguments
These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description | 
|---|---|---|---|
BASE_PATH | /runpod-volume | str | The storage directory for the Hugging Face cache and model. | 
WORKER_CUDA_VERSION | 12.1.0 | str | The CUDA version for the worker image. |
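
As an illustration, these arguments are passed with --build-arg when building the image. The sketch below drives docker from Python; the image tag and build context are hypothetical:

```python
# Illustrative only: pass the build arguments above when building a custom image.
# The image tag and build context are hypothetical.
import subprocess

subprocess.run(
    [
        "docker", "build", ".",
        "--tag", "my-registry/my-vllm-worker:latest",
        "--build-arg", "BASE_PATH=/runpod-volume",
        "--build-arg", "WORKER_CUDA_VERSION=12.1.0",
    ],
    check=True,
)
```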