LLMEngine is the main entry point for running inference with miniVLLM. It manages worker processes for multi-GPU tensor parallelism, a token-level scheduler with paged KV cache, and a model runner that supports CUDA graph replay for fast decode.
Constructor
The constructor spawns worker processes for ranks 1..world_size-1, initialises the rank-0 ModelRunner (which triggers weight loading, warmup, KV cache allocation, and CUDA graph capture), then creates the Scheduler.
The Scheduler is initialised after ModelRunner because ModelRunner.__init__ calls dist.init_process_group(), a collective barrier that blocks until every worker rank has joined. Creating the scheduler before that barrier would deadlock on multi-GPU setups.
Config parameters
The config dict is shared between the scheduler, the model runner, and memory management. All keys consumed by ModelRunner must be present.
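A pre-flight check can catch a missing key or an unsupported model name before any worker process is spawned. The following is a minimal sketch, not miniVLLM's own validation: `resolve_model_name` and `missing_keys` are hypothetical helpers, the supported-model set is taken from the two model IDs listed in this reference, and the set of required keys is passed in by the caller rather than assumed.

```python
# Final path components of the two model IDs this reference lists as supported.
SUPPORTED_MODELS = {"Qwen3-0.6B", "Llama-3.2-1B-Instruct"}


def resolve_model_name(model: str) -> str:
    """Return the final path component of an HF ID or a '/'-separated local path."""
    name = model.rstrip("/").split("/")[-1]
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model name: {name!r}")
    return name


def missing_keys(config: dict, required: set) -> list:
    """Return the required keys that are absent from config, sorted for stable output."""
    return sorted(required - set(config))


if __name__ == "__main__":
    # The key names below are illustrative; they are the identifiers that
    # appear in the text of this reference, not a complete list.
    required = {"world_size", "max_model_length", "max_num_batch_tokens"}
    config = {"world_size": 1, "max_model_length": 4096}
    print(resolve_model_name("/data/checkpoints/Qwen3-0.6B"))
    print(missing_keys(config, required))
```

Passing the required set explicitly keeps the sketch independent of the exact key names, which should be taken from the ModelRunner source.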
Scheduling and memory
HuggingFace model ID or local path to the model checkpoint. Currently supported: "Qwen/Qwen3-0.6B" and "meta-llama/Llama-3.2-1B-Instruct". The model is identified by the final path component (e.g. "Qwen3-0.6B"), so local paths and HF IDs both work as long as the name matches a supported model.
Number of GPUs to use for tensor-parallel inference. Ranks 1..world_size-1 are spawned as separate processes and communicate via NCCL.
Maximum number of sequences concurrently active in the scheduler (passed to Scheduler).
Maximum total tokens across all sequences in a single forward pass (passed to Scheduler).
Initial KV cache block pool size hint. Overridden at runtime by ModelRunner.allocate_kv_cache(), which measures actual free GPU memory and writes the true value back into this key.
Tokens per KV cache block. Must be consistent across engine, scheduler, and attention layers.
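To make the block arithmetic concrete, here is a rough sketch of how a block pool size could be derived from free memory, and how many blocks one sequence occupies. It is not the code in ModelRunner.allocate_kv_cache(): the per-block layout (one K and one V tensor per layer, 2 bytes per element) is an assumption, the function names are hypothetical, and the block size of 256 is only an example value.

```python
def blocks_for_sequence(num_tokens: int, block_size: int) -> int:
    """Blocks needed to hold num_tokens when each block stores block_size tokens."""
    return (num_tokens + block_size - 1) // block_size  # ceiling division


def bytes_per_block(block_size: int, num_layers: int, num_kv_heads: int,
                    head_dim: int, dtype_bytes: int = 2) -> int:
    """Size of one KV block under the assumed layout.

    Assumption: every layer stores a K and a V tensor of shape
    [block_size, num_kv_heads, head_dim] with dtype_bytes per element.
    """
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes


def estimate_num_blocks(free_bytes: int, memory_fraction: float, block_bytes: int) -> int:
    """Blocks that fit into memory_fraction of the free memory."""
    return int(free_bytes * memory_fraction) // block_bytes


if __name__ == "__main__":
    # Layer count, KV head count, and head dimension are the Qwen3-0.6B values
    # from this reference; block_size=256 and 8 GiB free are example inputs.
    per_block = bytes_per_block(block_size=256, num_layers=28, num_kv_heads=8, head_dim=128)
    print(per_block)
    print(estimate_num_blocks(8 * 1024**3, 0.9, per_block))
    print(blocks_for_sequence(1000, 256))
```

With tensor parallelism each rank would typically hold only its share of the KV heads, so the per-rank block size shrinks accordingly; that detail is left out of the sketch.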
EOS token ID used by the scheduler to stop generation.
When True, disables CUDA graph capture so all forward passes run eagerly. Useful for debugging. When False, CUDA graphs are captured during initialization for fast decode replay.
Fraction of free GPU memory to use for the KV cache pool. 0.9 means 90% of available memory after model weights is used for blocks.
Maximum tokens in a single warmup batch. Used by ModelRunner.warmup_model() to size the dry-run forward pass: batch_size = max_num_batch_tokens // max_model_length.
Maximum total sequence length (prompt + generated tokens). Used both for warmup sizing and for CUDA graph buffer pre-allocation.
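The warmup sizing rule quoted above is plain integer division, which is worth checking before launch. A small sketch follows; `warmup_batch_size` is a hypothetical helper and the input values are examples, not defaults.

```python
def warmup_batch_size(max_num_batch_tokens: int, max_model_length: int) -> int:
    """batch_size = max_num_batch_tokens // max_model_length, as stated in this reference."""
    if max_model_length <= 0:
        raise ValueError("max_model_length must be positive")
    return max_num_batch_tokens // max_model_length


if __name__ == "__main__":
    print(warmup_batch_size(16384, 4096))  # four full-length sequences
    print(warmup_batch_size(4096, 8192))   # floor division yields 0 here
```

When max_num_batch_tokens is smaller than max_model_length the result is 0. Whether warmup_model() guards against that case is not stated here, so keeping max_num_batch_tokens at least as large as max_model_length is the cautious choice.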
Model architecture
These keys are required when using Qwen/Qwen3-0.6B:
Vocabulary size. For Qwen3-0.6B: 151936.
Model hidden dimension. For Qwen3-0.6B: 1024.
Total query attention heads. For Qwen3-0.6B: 16.
Attention head dimension. For Qwen3-0.6B: 128.
Key/value head count (GQA). For Qwen3-0.6B: 8.
MLP hidden dimension. For Qwen3-0.6B: 3072.
Number of transformer decoder layers. For Qwen3-0.6B: 28.
Whether to share the embedding and LM head weights. For Qwen3-0.6B: True.
RoPE base frequency. For Qwen3-0.6B: 1000000.
Epsilon for RMSNorm layers. For Qwen3-0.6B: 1e-6.
Whether QKV projections include bias. For Qwen3-0.6B: False.
Attention scale multiplier. Typically 1.0.
Maximum position index for the RoPE cache. Must be >= max_model_length. For Qwen3-0.6B: 32768.
Whether MLP projections include bias. For Qwen3-0.6B: False.
Methods
generate
Encodes each prompt, submits it to the scheduler, then calls step() in a loop until every sequence has finished. Returns a dict with decoded text and raw token IDs, in the same order as the input prompts.
Parameters
List of plain-text prompt strings. Each string is encoded with the model’s tokenizer before being submitted to the scheduler.
Sampling configuration applied to every prompt in this batch. See SamplingParams for field details.
Return value
Decoded completion strings, one per input prompt, in the same order.
Raw completion token IDs (prompt tokens excluded), one list per input prompt.
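Sequences in a batch can finish at different steps, so returning results in prompt order implies a reordering pass at the end. The sketch below shows one way to do that. It assumes sequence IDs increase in submission order; the dict keys "text" and "token_ids" and the `decode` callable (a stand-in for the tokenizer's decode function) are hypothetical.

```python
def collect_in_prompt_order(finished, decode):
    """Reorder (seq_id, completion_token_ids) pairs into submission order.

    finished: pairs in the order the sequences happened to complete.
    decode:   callable mapping a list of token IDs to a string.
    """
    ordered = sorted(finished, key=lambda pair: pair[0])  # assumes IDs follow input order
    token_ids = [ids for _, ids in ordered]
    return {
        "text": [decode(ids) for ids in token_ids],
        "token_ids": token_ids,
    }


if __name__ == "__main__":
    # Toy decode: join the IDs as text instead of calling a real tokenizer.
    def toy_decode(ids):
        return " ".join(str(i) for i in ids)

    finished = [(2, [7, 8]), (0, [1]), (1, [4, 5, 6])]
    print(collect_in_prompt_order(finished, toy_decode))
```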
add_prompt
Encodes prompt and enqueues it as a new Sequence in the scheduler’s waiting queue. Use this method together with step() when you need fine-grained control over the generation loop.
Plain-text prompt string.
Sampling configuration for this sequence.
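A hand-written generation loop built on add_prompt() and step() might look like the sketch below. Because a real engine needs GPUs and model weights, the example drives a `StubEngine` that only imitates the interface: it emits fake tokens and is not part of miniVLLM. The three-value return of step() (finished pairs, token count, prefill flag) follows the order in which this reference lists the return values, which is an assumption about the actual tuple layout.

```python
from collections import deque


class StubEngine:
    """Stand-in exposing add_prompt()/step(); emits fake tokens, runs no model."""

    def __init__(self, tokens_per_seq=3):
        self.waiting = deque()
        self.running = {}
        self.next_id = 0
        self.tokens_per_seq = tokens_per_seq

    def add_prompt(self, prompt, sampling_params=None):
        # Whitespace splitting stands in for real tokenization.
        self.waiting.append((self.next_id, prompt.split()))
        self.next_id += 1

    def step(self):
        if self.waiting:
            # Prefill: admit every waiting sequence and sample its first token.
            num_tokens = 0
            while self.waiting:
                seq_id, prompt_tokens = self.waiting.popleft()
                self.running[seq_id] = [0]
                num_tokens += len(prompt_tokens)
            is_prefill = True
        else:
            # Decode: one new token per running sequence.
            for ids in self.running.values():
                ids.append(len(ids))
            num_tokens = len(self.running)
            is_prefill = False
        finished = [(sid, ids) for sid, ids in self.running.items()
                    if len(ids) >= self.tokens_per_seq]
        for sid, _ in finished:
            del self.running[sid]
        return finished, num_tokens, is_prefill


def run_until_done(engine, num_prompts):
    """Call step() until num_prompts sequences have finished; tally token counts."""
    outputs, prefill_tokens, decode_tokens = {}, 0, 0
    while len(outputs) < num_prompts:
        finished, num_tokens, is_prefill = engine.step()
        if is_prefill:
            prefill_tokens += num_tokens
        else:
            decode_tokens += num_tokens
        outputs.update(finished)
    return outputs, prefill_tokens, decode_tokens


if __name__ == "__main__":
    engine = StubEngine()
    for prompt in ["hello world", "a b c"]:
        engine.add_prompt(prompt)
    print(run_until_done(engine, num_prompts=2))
```

The loop terminates only if num_prompts matches the number of prompts actually submitted, so the caller is responsible for keeping that count.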
step
- Calls Scheduler.schedule() to select the next batch and determine whether it is a prefill or a decode step.
- Calls ModelRunner.run() via IPC to execute the forward pass and sample one token per sequence.
- Calls Scheduler.postprocess() to append sampled tokens and check stopping conditions.
- Returns metadata for sequences that finished during this step.
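The steps above can be sketched with toy components. `ToyScheduler` and `toy_run` below are illustrative stand-ins, not the miniVLLM classes: the scheduler prefills everything that is waiting and otherwise decodes everything that is running, the runner is called directly rather than over IPC, and the stopping rule (EOS token or a completion-length cap) is an assumption.

```python
class ToyScheduler:
    """Prefill all waiting sequences first, otherwise decode all running ones."""

    def __init__(self, eos_token_id, max_new_tokens):
        self.eos_token_id = eos_token_id
        self.max_new_tokens = max_new_tokens
        self.waiting = []
        self.running = []

    def add(self, seq_id, prompt_ids):
        self.waiting.append({"id": seq_id, "prompt": list(prompt_ids), "completion": []})

    def schedule(self):
        if self.waiting:
            batch, self.waiting = self.waiting, []
            self.running.extend(batch)
            return batch, True
        return list(self.running), False

    def postprocess(self, batch, sampled):
        finished = []
        for seq, token in zip(batch, sampled):
            seq["completion"].append(token)
            hit_eos = token == self.eos_token_id
            hit_cap = len(seq["completion"]) >= self.max_new_tokens
            if hit_eos or hit_cap:
                self.running.remove(seq)
                finished.append((seq["id"], seq["completion"]))
        return finished


def toy_run(batch, is_prefill):
    """Fake forward pass: the 'sampled' token is the sequence's current length."""
    return [len(seq["prompt"]) + len(seq["completion"]) for seq in batch]


def step(scheduler, run):
    batch, is_prefill = scheduler.schedule()          # 1. pick the batch
    sampled = run(batch, is_prefill)                  # 2. one token per sequence
    finished = scheduler.postprocess(batch, sampled)  # 3. append and check stops
    if is_prefill:
        num_tokens = sum(len(seq["prompt"]) for seq in batch)
    else:
        num_tokens = len(batch)
    return finished, num_tokens, is_prefill           # 4. report finished sequences


if __name__ == "__main__":
    scheduler = ToyScheduler(eos_token_id=4, max_new_tokens=8)
    scheduler.add(0, [10, 11])
    scheduler.add(1, [12, 13, 14])
    for _ in range(3):
        print(step(scheduler, toy_run))
```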
Return value
List of (seq_id, completion_token_ids) pairs for sequences that finished in this step. Empty when no sequence finished.
During prefill: total tokens in the scheduled sequences. During decode: number of sequences stepped (one token each).
True when the step processed a prefill batch; False for a decode batch.
exit
Sends an "exit" IPC call to all worker processes, deletes the ModelRunner, and joins the worker processes. This method is also registered with atexit and called automatically when the Python process exits.