LLMEngine is the main entry point for running inference with miniVLLM. It manages worker processes for multi-GPU tensor parallelism, a token-level scheduler with paged KV cache, and a model runner that supports CUDA graph replay for fast decode.

Constructor

LLMEngine(config: dict)
Creates the engine, spawns worker processes for ranks 1..world_size-1, initialises the rank-0 ModelRunner (which triggers weight loading, warmup, KV cache allocation, and CUDA graph capture), then creates the Scheduler.
The Scheduler is initialised after ModelRunner because ModelRunner.__init__ calls dist.init_process_group(), a collective barrier that blocks until every worker rank has joined. Creating the scheduler before that barrier would deadlock on multi-GPU setups.
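The start-up order described above can be illustrated with a toy analogy. This is only a sketch: threads and threading.Barrier stand in for worker processes and dist.init_process_group(), and start_engine is a hypothetical name, not part of miniVLLM.

```python
import threading


def start_engine(world_size: int) -> list[str]:
    """Toy model of the start-up order; threads stand in for worker processes."""
    # Stand-in for dist.init_process_group(): releases only when all ranks arrive.
    barrier = threading.Barrier(world_size)
    events: list[str] = []
    lock = threading.Lock()

    def record(message: str) -> None:
        with lock:
            events.append(message)

    def worker(rank: int) -> None:
        barrier.wait(timeout=5)  # each worker rank joins the collective
        record(f"rank {rank} joined")

    # 1. Spawn ranks 1..world_size-1 first so they can reach the barrier.
    workers = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(1, world_size)
    ]
    for thread in workers:
        thread.start()

    # 2. Rank 0 joins; this call returns only once every rank has arrived.
    barrier.wait(timeout=5)
    record("rank 0 joined")

    # 3. Only after the barrier has released is the scheduler created.
    record("scheduler created")

    for thread in workers:
        thread.join()
    return events
```

In the real engine the workers keep running after the barrier; here they exit immediately so the sketch terminates.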

Config parameters

The config dict is shared between the scheduler, the model runner, and memory management. All keys consumed by ModelRunner must be present.

Scheduling and memory

model_name_or_path
string
required
Hugging Face model ID or local path to the model checkpoint. Currently supported: "Qwen/Qwen3-0.6B" and "meta-llama/Llama-3.2-1B-Instruct". The model is identified by the final path component (e.g. "Qwen3-0.6B"), so local paths and HF IDs both work as long as that component matches a supported model name.
world_size
int
default:"1"
Number of GPUs to use for tensor-parallel inference. Ranks 1..world_size-1 are spawned as separate processes and communicate via NCCL.
max_num_sequences
int
default:"16"
Maximum number of sequences concurrently active in the scheduler (passed to Scheduler).
max_num_batched_tokens
int
default:"1024"
Maximum total tokens across all sequences in a single forward pass (passed to Scheduler).
max_cached_blocks
int
default:"1024"
Initial KV cache block pool size hint. Overridden at runtime by ModelRunner.allocate_kv_cache(), which measures actual free GPU memory and writes the true value back into this key.
block_size
int
default:"256"
Tokens per KV cache block. Must be consistent across engine, scheduler, and attention layers.
eos
int
default:"50256"
EOS token ID used by the scheduler to stop generation.
The default 50256 is the GPT-2 EOS token. For Qwen3-0.6B use 151645; verify your tokenizer’s eos_token_id for other models.
enforce_eager
bool
default:"False"
When True, disables CUDA graph capture so all forward passes run eagerly. Useful for debugging. When False, CUDA graphs are captured during initialization for fast decode replay.
gpu_memory_utilization
float
default:"0.9"
Fraction of free GPU memory to use for the KV cache pool. For example, 0.9 allocates 90% of the memory still free after the model weights are loaded to KV cache blocks.
max_num_batch_tokens
int
required
Maximum tokens in a single warmup batch. Used by ModelRunner.warmup_model() to size the dry-run forward pass: batch_size = max_num_batch_tokens // max_model_length. This key is distinct from max_num_batched_tokens, which is passed to Scheduler.
max_model_length
int
required
Maximum total sequence length (prompt + generated tokens). Used both for warmup sizing and for CUDA graph buffer pre-allocation.
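The sizing arithmetic behind these keys can be sketched as follows. Only the warmup formula comes from this page. The per-block byte count and the block-pool estimate are assumptions about a typical paged KV cache (separate K and V tensors per layer, 2-byte fp16/bf16 elements, KV heads split evenly across ranks), and all function names are hypothetical rather than miniVLLM APIs. The architecture values used are the Qwen3-0.6B keys documented in the next section.

```python
def warmup_batch_size(max_num_batch_tokens: int, max_model_length: int) -> int:
    """Warmup batch size as documented: max_num_batch_tokens // max_model_length."""
    return max_num_batch_tokens // max_model_length


def blocks_needed(num_tokens: int, block_size: int) -> int:
    """Blocks needed to hold num_tokens, rounding up (ceiling division)."""
    return -(-num_tokens // block_size)


def kv_block_bytes(
    num_layers: int,
    block_size: int,
    num_kv_heads: int,
    head_dim: int,
    world_size: int = 1,
    dtype_bytes: int = 2,  # assumption: fp16/bf16 cache
) -> int:
    """Assumed bytes per KV block on one rank: K and V for every layer."""
    kv_heads_per_rank = num_kv_heads // world_size  # assumption: even split
    return 2 * num_layers * block_size * kv_heads_per_rank * head_dim * dtype_bytes


def estimate_num_blocks(
    free_bytes: int, gpu_memory_utilization: float, block_bytes: int
) -> int:
    """Assumed pool size: the usable share of free memory divided by block size."""
    return int(free_bytes * gpu_memory_utilization) // block_bytes


# Values from the usage example on this page.
batch = warmup_batch_size(max_num_batch_tokens=4096, max_model_length=128)
per_block = kv_block_bytes(num_layers=28, block_size=256, num_kv_heads=8, head_dim=128)
# 8 GiB of free memory is a made-up figure for illustration only.
pool = estimate_num_blocks(8 * 2**30, 0.9, per_block)
```

Under these assumptions one 256-token block for Qwen3-0.6B comes to 28 MiB, which is why the block count that allocate_kv_cache() writes back can differ a lot from the max_cached_blocks hint.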

Model architecture

These keys are required when using Qwen/Qwen3-0.6B:
vocab_size
int
required
Vocabulary size. For Qwen3-0.6B: 151936.
hidden_size
int
required
Model hidden dimension. For Qwen3-0.6B: 1024.
num_heads
int
required
Total query attention heads. For Qwen3-0.6B: 16.
head_dim
int
required
Attention head dimension. For Qwen3-0.6B: 128.
num_kv_heads
int
required
Key/value head count (GQA). For Qwen3-0.6B: 8.
intermediate_size
int
required
MLP hidden dimension. For Qwen3-0.6B: 3072.
num_layers
int
required
Number of transformer decoder layers. For Qwen3-0.6B: 28.
tie_word_embeddings
bool
required
Whether to share the embedding and LM head weights. For Qwen3-0.6B: True.
base
int
required
RoPE base frequency. For Qwen3-0.6B: 1000000.
rms_norm_epsilon
float
required
Epsilon for RMSNorm layers. For Qwen3-0.6B: 1e-6.
qkv_bias
bool
required
Whether QKV projections include bias. For Qwen3-0.6B: False.
scale
float
required
Attention scale multiplier. Typically 1.0.
max_position
int
required
Maximum position index for the RoPE cache. Must be >= max_model_length. For Qwen3-0.6B: 32768.
ffn_bias
bool
required
Whether MLP projections include bias. For Qwen3-0.6B: False.
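Because every required key must be present before the engine starts, a small pre-flight check can catch mistakes early. The following is a minimal sketch, not part of miniVLLM: check_config, REQUIRED_KEYS, and SUPPORTED_MODELS are hypothetical names, and the key set simply mirrors the entries marked "required" on this page (the architecture keys are documented as required for Qwen/Qwen3-0.6B; other models may differ).

```python
# Keys marked "required" on this page (assumption: the list is complete).
REQUIRED_KEYS = {
    "model_name_or_path", "max_num_batch_tokens", "max_model_length",
    "vocab_size", "hidden_size", "num_heads", "head_dim", "num_kv_heads",
    "intermediate_size", "num_layers", "tie_word_embeddings", "base",
    "rms_norm_epsilon", "qkv_bias", "scale", "max_position", "ffn_bias",
}

# Final path components of the models this page lists as supported.
SUPPORTED_MODELS = {"Qwen3-0.6B", "Llama-3.2-1B-Instruct"}


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    missing = sorted(REQUIRED_KEYS - config.keys())
    if missing:
        problems.append("missing keys: " + ", ".join(missing))

    # The model is identified by the final path component.
    path = str(config.get("model_name_or_path", ""))
    name = path.rstrip("/").split("/")[-1]
    if name not in SUPPORTED_MODELS:
        problems.append(f"unsupported model name: {name!r}")

    # max_position must be >= max_model_length.
    if {"max_position", "max_model_length"} <= config.keys():
        if config["max_position"] < config["max_model_length"]:
            problems.append("max_position is smaller than max_model_length")

    return problems
```

Calling check_config(config) before LLMEngine(config) and raising if the list is non-empty gives a clearer failure than a KeyError deep inside model construction.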

Methods

generate

llm.generate(prompts: list[str], sampling_params: SamplingParams) -> dict
Tokenizes every prompt, adds all sequences to the scheduler, then calls step() in a loop until every sequence has finished. Returns a dict containing the decoded text and raw token IDs, ordered to match the input prompts.

Parameters

prompts
list[str]
required
List of plain-text prompt strings. Each string is encoded with the model’s tokenizer before being submitted to the scheduler.
sampling_params
SamplingParams
required
Sampling configuration applied to every prompt in this batch. See SamplingParams for field details.

Return value

text
list[str]
Decoded completion strings, one per input prompt, in the same order.
token_ids
list[list[int]]
Raw completion token IDs (prompt tokens excluded), one list per input prompt.

add_prompt

llm.add_prompt(prompt: str, sampling_params: SamplingParams) -> None
Tokenizes prompt and enqueues it as a new Sequence in the scheduler’s waiting queue. Use this method together with step() when you need fine-grained control over the generation loop.
prompt
str
required
Plain-text prompt string.
sampling_params
SamplingParams
required
Sampling configuration for this sequence.

step

llm.step() -> tuple[list[tuple[int, list[int]]], int, bool]
Runs one scheduling + forward-pass iteration:
  1. Calls Scheduler.schedule() to select the next batch and determine whether it is a prefill or a decode step.
  2. Calls ModelRunner.run() via IPC to execute the forward pass and sample one token per sequence.
  3. Calls Scheduler.postprocess() to append sampled tokens and check stopping conditions.
  4. Returns metadata for sequences that finished during this step.

Return value

outputs
list[tuple[int, list[int]]]
List of (seq_id, completion_token_ids) pairs for sequences that finished in this step. Empty when no sequence finished.
num_processed_tokens
int
During prefill: total tokens in the scheduled sequences. During decode: number of sequences stepped (one token each).
is_prefill
bool
True when the step processed a prefill batch; False for a decode batch.
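A manual generation loop built from add_prompt() and step() might look like the sketch below. It relies only on the signatures documented above; run_manual_loop is a hypothetical helper, and it assumes the engine has no other sequences in flight, so that counting finished sequences is a valid stopping condition.

```python
def run_manual_loop(llm, prompts, sampling_params):
    """Drive the engine step by step and collect finished sequences.

    Returns (finished, prefill_tokens, decode_tokens), where finished maps
    seq_id to the completion token IDs reported by step().
    """
    for prompt in prompts:
        llm.add_prompt(prompt, sampling_params)

    finished = {}
    prefill_tokens = 0
    decode_tokens = 0

    # Assumption: only the prompts added above are active in the engine.
    while len(finished) < len(prompts):
        outputs, num_processed_tokens, is_prefill = llm.step()

        # Track throughput separately for prefill and decode steps.
        if is_prefill:
            prefill_tokens += num_processed_tokens
        else:
            decode_tokens += num_processed_tokens

        # outputs holds only the sequences that finished during this step.
        for seq_id, completion_token_ids in outputs:
            finished[seq_id] = completion_token_ids

    return finished, prefill_tokens, decode_tokens
```

Unlike generate(), this loop returns raw token IDs keyed by seq_id, so decoding and any reordering are left to the caller. The loop body is also a natural place to add per-step logging or timing.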

exit

llm.exit() -> None
Gracefully shuts down the engine. Sends an "exit" IPC call to all worker processes, deletes the ModelRunner, and joins worker processes. This method is also registered with atexit and called automatically when the Python process exits.

Usage example

from myvllm.engine.llm_engine import LLMEngine as LLM
from myvllm.sampling_parameters import SamplingParams
from transformers import AutoTokenizer

config = {
    # Scheduling and memory
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,
    "max_model_length": 128,
    # Qwen3-0.6B model architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1.0,
    "ffn_bias": False,
    "max_position": 32768,
}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
llm = LLM(config=config)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, max_model_length=128)

prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]
# Apply chat template before passing to the engine
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, text in zip(prompts, outputs["text"]):
    print(f"Prompt:     {prompt}")
    print(f"Completion: {text}")