LLMEngine - miniVLLM

LLMEngine is the main entry point for running inference with miniVLLM. It manages worker processes for multi-GPU tensor parallelism, a token-level scheduler with paged KV cache, and a model runner that supports CUDA graph replay for fast decode.

Constructor

LLMEngine(config: dict)

Creates the engine, spawns worker processes for ranks 1..world_size-1, initialises the rank-0 ModelRunner (which triggers weight loading, warmup, KV cache allocation, and CUDA graph capture), then creates the Scheduler.

The Scheduler is initialised after ModelRunner because ModelRunner.__init__ calls dist.init_process_group(), a collective barrier that blocks until every worker rank has joined. Creating the scheduler before that barrier would deadlock on multi-GPU setups.
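The ordering can be illustrated with a toy model. This is only a sketch: threads stand in for worker processes, `threading.Barrier` stands in for `dist.init_process_group()`, and `init_engine` is a hypothetical name, not part of miniVLLM.

```python
import threading


def init_engine(world_size: int) -> list[str]:
    """Toy model of the constructor's ordering; returns the rank-0 event log."""
    events: list[str] = []
    # Stand-in for the collective barrier inside dist.init_process_group().
    barrier = threading.Barrier(world_size)

    def worker(rank: int) -> None:
        # Each worker rank blocks here until every rank has joined.
        barrier.wait(timeout=5)

    # 1. Spawn ranks 1..world_size-1 first so they can reach the barrier.
    workers = [threading.Thread(target=worker, args=(r,)) for r in range(1, world_size)]
    for w in workers:
        w.start()
    events.append("workers_spawned")

    # 2. Rank 0 joins the barrier (the ModelRunner initialisation step).
    barrier.wait(timeout=5)
    events.append("model_runner_ready")

    # 3. Only now is the scheduler created.
    events.append("scheduler_created")

    for w in workers:
        w.join()
    return events
```

With `world_size=1` the barrier has a single participant and returns immediately, so the same code path covers the single-GPU case.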

Config parameters

The config dict is shared between the scheduler, the model runner, and memory management. All keys consumed by ModelRunner must be present.

Scheduling and memory

model_name_or_path

string

required

HuggingFace model ID or local path to the model checkpoint. Currently supported: "Qwen/Qwen3-0.6B" and "meta-llama/Llama-3.2-1B-Instruct". The model is identified by the final path component (e.g. "Qwen3-0.6B"), so local paths and HF IDs both work as long as that name matches a supported model.
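A minimal sketch of that lookup, assuming the name is simply the last path component. `resolve_model_name` and `SUPPORTED_MODELS` are illustrative names, not miniVLLM identifiers, and the real matching logic may differ.

```python
from pathlib import PurePosixPath

# Final path components of the two supported checkpoints listed above.
SUPPORTED_MODELS = {"Qwen3-0.6B", "Llama-3.2-1B-Instruct"}


def resolve_model_name(model_name_or_path: str) -> str:
    """Return the final path component and check it against the supported set."""
    name = PurePosixPath(model_name_or_path.rstrip("/")).name
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {name!r}")
    return name
```

Both `"Qwen/Qwen3-0.6B"` and a local directory such as `"/data/checkpoints/Qwen3-0.6B/"` resolve to the same name under this scheme.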

world_size

int

default:"1"

Number of GPUs to use for tensor-parallel inference. Ranks 1..world_size-1 are spawned as separate processes and communicate via NCCL.

max_num_sequences

int

default:"16"

Maximum number of sequences concurrently active in the scheduler (passed to Scheduler).

max_num_batched_tokens

int

default:"1024"

Maximum total tokens across all sequences in a single forward pass (passed to Scheduler).
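To see how the two limits interact, here is a rough sketch of admitting waiting prompts into one prefill batch. It assumes first-come-first-served admission that stops at the first prompt that does not fit; the actual Scheduler policy is not described here and may differ. `admit_prefill` is a hypothetical helper.

```python
def admit_prefill(
    waiting_lengths: list[int],
    max_num_sequences: int = 16,
    max_num_batched_tokens: int = 1024,
) -> tuple[list[int], int]:
    """Return (indices of admitted prompts, total tokens in the batch)."""
    admitted: list[int] = []
    used_tokens = 0
    for idx, num_tokens in enumerate(waiting_lengths):
        # Sequence-count limit.
        if len(admitted) >= max_num_sequences:
            break
        # Token-budget limit for a single forward pass.
        if used_tokens + num_tokens > max_num_batched_tokens:
            break
        admitted.append(idx)
        used_tokens += num_tokens
    return admitted, used_tokens
```

With the defaults, prompts of 400, 500, and 300 tokens yield a batch of the first two (900 tokens); the third would push the total past 1024 and waits for a later step.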

max_cached_blocks

int

default:"1024"

Initial KV cache block pool size hint. Overridden at runtime by ModelRunner.allocate_kv_cache(), which measures actual free GPU memory and writes the true value back into this key.

block_size

int

default:"256"

Tokens per KV cache block. Must be consistent across engine, scheduler, and attention layers.
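As a quick illustration of what `block_size` implies, the arithmetic below shows how many blocks a sequence occupies and where a given token position lands. This is generic paged-cache arithmetic, not code taken from miniVLLM; both function names are hypothetical.

```python
def blocks_needed(num_tokens: int, block_size: int = 256) -> int:
    """Number of KV cache blocks required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


def locate(position: int, block_size: int = 256) -> tuple[int, int]:
    """Map a token position to (logical block index, offset inside the block)."""
    return position // block_size, position % block_size
```

A 257-token sequence needs two blocks with the default block size, and token position 300 sits at offset 44 of the second block (index 1).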

eos

int

default:"50256"

EOS token ID used by the scheduler to stop generation.

The default 50256 is the GPT-2 EOS token. For Qwen3-0.6B use 151645; verify your tokenizer’s eos_token_id for other models.
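One way to avoid a mismatched EOS ID is to copy it from the tokenizer when building the config. The helper below is a sketch; `with_tokenizer_eos` is a hypothetical name, and it relies only on the `eos_token_id` attribute mentioned above.

```python
def with_tokenizer_eos(config: dict, tokenizer) -> dict:
    """Return a copy of config whose "eos" key matches the tokenizer's EOS ID."""
    eos_token_id = tokenizer.eos_token_id
    if eos_token_id is None:
        raise ValueError("tokenizer does not define eos_token_id")
    # Copy rather than mutate, so the caller's dict is left untouched.
    return {**config, "eos": eos_token_id}
```

Typical use would be `config = with_tokenizer_eos(config, tokenizer)` with a tokenizer loaded via `AutoTokenizer.from_pretrained(...)`, before constructing the engine.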

enforce_eager

bool

default:"False"

When True, disables CUDA graph capture so all forward passes run eagerly. Useful for debugging. When False, CUDA graphs are captured during initialization for fast decode replay.

gpu_memory_utilization

float

default:"0.9"

Fraction of free GPU memory to use for the KV cache pool. A value of 0.9 means that 90% of the memory still free after the model weights are loaded is used for KV cache blocks.
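A back-of-the-envelope estimate of the resulting block count might look like the sketch below. It assumes a 2-byte dtype (fp16/bf16), one key and one value tensor per layer, and KV heads split evenly across tensor-parallel ranks; the exact accounting inside ModelRunner.allocate_kv_cache() may differ, and both function names are hypothetical.

```python
def kv_block_bytes(
    num_layers: int,
    block_size: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,   # assumption: fp16 or bf16
    world_size: int = 1,
) -> int:
    """Bytes one KV cache block occupies on a single rank."""
    kv_heads_per_rank = num_kv_heads // world_size
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * block_size * kv_heads_per_rank * head_dim * dtype_bytes


def num_kv_blocks(free_bytes: int, gpu_memory_utilization: float, block_bytes: int) -> int:
    """Blocks that fit into the chosen fraction of free memory."""
    return int(free_bytes * gpu_memory_utilization) // block_bytes
```

With the Qwen3-0.6B values from this page (28 layers, 8 KV heads, head dimension 128) and `block_size=256`, one block comes to 29,360,128 bytes (28 MiB), so 10 GiB of free memory at 0.9 utilization holds 329 blocks under these assumptions.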

max_num_batch_tokens

int

required

Maximum tokens in a single warmup batch. Used by ModelRunner.warmup_model() to size the dry-run forward pass: batch_size = max_num_batch_tokens // max_model_length. This key is distinct from max_num_batched_tokens, which is passed to the Scheduler.
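The warmup sizing formula can be checked directly. `warmup_batch_size` is a hypothetical helper that applies only the integer division stated above.

```python
def warmup_batch_size(config: dict) -> int:
    """Number of sequences in the warmup dry run, per the formula above."""
    return config["max_num_batch_tokens"] // config["max_model_length"]
```

With `max_num_batch_tokens=4096` and `max_model_length=128`, the warmup pass uses 32 sequences. Integer division yields 0 when `max_model_length` exceeds `max_num_batch_tokens`, so it is worth checking this value before starting the engine.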

max_model_length

int

required

Maximum total sequence length (prompt + generated tokens). Used both for warmup sizing and for CUDA graph buffer pre-allocation.

Model architecture

These keys are required when using Qwen/Qwen3-0.6B:

vocab_size

int

required

Vocabulary size. For Qwen3-0.6B: 151936.

hidden_size

int

required

Model hidden dimension. For Qwen3-0.6B: 1024.

num_heads

int

required

Total query attention heads. For Qwen3-0.6B: 16.

head_dim

int

required

Attention head dimension. For Qwen3-0.6B: 128.

num_kv_heads

int

required

Key/value head count (GQA). For Qwen3-0.6B: 8.

intermediate_size

int

required

MLP hidden dimension. For Qwen3-0.6B: 3072.

num_layers

int

required

Number of transformer decoder layers. For Qwen3-0.6B: 28.

tie_word_embeddings

bool

required

Whether to share the embedding and LM head weights. For Qwen3-0.6B: True.

base

int

required

RoPE base frequency. For Qwen3-0.6B: 1000000.
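For context, in the standard RoPE formulation the base determines one inverse frequency per pair of head dimensions. The sketch below shows that textbook formula; whether miniVLLM computes its RoPE cache in exactly this way is an assumption, and `rope_inv_freq` is a hypothetical name.

```python
def rope_inv_freq(head_dim: int = 128, base: float = 1_000_000) -> list[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for each pair i."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
```

A larger base lowers the frequencies of the later dimension pairs, which stretches the usable position range.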

rms_norm_epsilon

float

required

Epsilon for RMSNorm layers. For Qwen3-0.6B: 1e-6.

qkv_bias

bool

required

Whether QKV projections include bias. For Qwen3-0.6B: False.

scale

float

required

Attention scale multiplier. Typically 1.0.

max_position

int

required

Maximum position index for the RoPE cache. Must be >= max_model_length. For Qwen3-0.6B: 32768.

ffn_bias

bool

required

Whether MLP projections include bias. For Qwen3-0.6B: False.
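Because so many keys are required, a small pre-flight check can catch mistakes before the engine starts loading weights. The sketch below checks the keys marked required on this page and the `max_position >= max_model_length` constraint. The GQA divisibility check is a general property of grouped-query attention added here as an assumption, and `validate_config` is a hypothetical helper.

```python
REQUIRED_KEYS = (
    "model_name_or_path", "max_num_batch_tokens", "max_model_length",
    "vocab_size", "hidden_size", "num_heads", "head_dim", "num_kv_heads",
    "intermediate_size", "num_layers", "tie_word_embeddings", "base",
    "rms_norm_epsilon", "qkv_bias", "scale", "max_position", "ffn_bias",
)


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in config]
    if problems:
        return problems  # later checks need the keys to exist

    if config["max_position"] < config["max_model_length"]:
        problems.append("max_position must be >= max_model_length")

    # Assumption: query heads are grouped evenly over KV heads (standard GQA).
    if config["num_heads"] % config["num_kv_heads"] != 0:
        problems.append("num_heads must be divisible by num_kv_heads")
    return problems
```

Note that no check ties `hidden_size` to `num_heads * head_dim`: for Qwen3-0.6B those values are 1024 and 16 * 128 = 2048, so such a check would reject a valid config.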

Methods

`generate`

llm.generate(prompts: list[str], sampling_params: SamplingParams) -> dict

Tokenizes every prompt, adds all sequences to the scheduler, then calls step() in a loop until every sequence has finished. Returns a dict with decoded text and raw token IDs, sorted in the same order as the input prompts.

Parameters

prompts

list[str]

required

List of plain-text prompt strings. Each string is encoded with the model’s tokenizer before being submitted to the scheduler.

sampling_params

SamplingParams

required

Sampling configuration applied to every prompt in this batch. See SamplingParams for field details.

Return value

text

list[str]

Decoded completion strings, one per input prompt, in the same order.

token_ids

list[list[int]]

Raw completion token IDs (prompt tokens excluded), one list per input prompt.

`add_prompt`

llm.add_prompt(prompt: str, sampling_params: SamplingParams) -> None

Tokenizes prompt and enqueues it as a new Sequence in the scheduler’s waiting queue. Use this method together with step() when you need fine-grained control over the generation loop.

prompt

str

required

Plain-text prompt string.

sampling_params

SamplingParams

required

Sampling configuration for this sequence.

`step`

llm.step() -> tuple[list[tuple[int, list[int]]], int, bool]

Runs one scheduling + forward-pass iteration:

1. Calls Scheduler.schedule() to select the next batch and determine whether it is a prefill or a decode step.
2. Calls ModelRunner.run() via IPC to execute the forward pass and sample one token per sequence.
3. Calls Scheduler.postprocess() to append sampled tokens and check stopping conditions.
4. Returns metadata for sequences that finished during this step.

Return value

outputs

list[tuple[int, list[int]]]

List of (seq_id, completion_token_ids) pairs for sequences that finished in this step. Empty when no sequence finished.

num_processed_tokens

int

During prefill: total tokens in the scheduled sequences. During decode: number of sequences stepped (one token each).

is_prefill

bool

True when the step processed a prefill batch; False for a decode batch.
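A manual generation loop built from add_prompt() and step() could look like the sketch below. It uses only the signatures documented on this page and stops once one finished entry has been collected per submitted prompt. Two assumptions: the engine holds no other in-flight sequences, and seq_id values increase in submission order (used here to restore prompt order). `run_manual_loop` is a hypothetical helper.

```python
def run_manual_loop(llm, prompts: list[str], sampling_params):
    """Drive the engine step by step; return (completions, prefill_tokens, decode_tokens)."""
    for prompt in prompts:
        llm.add_prompt(prompt, sampling_params)

    finished: dict[int, list[int]] = {}
    prefill_tokens = 0
    decode_tokens = 0

    while len(finished) < len(prompts):
        outputs, num_processed_tokens, is_prefill = llm.step()
        # Track the two phases separately, e.g. for throughput reporting.
        if is_prefill:
            prefill_tokens += num_processed_tokens
        else:
            decode_tokens += num_processed_tokens
        for seq_id, completion_token_ids in outputs:
            finished[seq_id] = completion_token_ids

    # Assumption: ascending seq_id corresponds to submission order.
    ordered = [finished[seq_id] for seq_id in sorted(finished)]
    return ordered, prefill_tokens, decode_tokens
```

The returned token ID lists can be decoded with the model's tokenizer. Dividing the two token counters by wall-clock time measured around the loop gives separate prefill and decode throughput figures.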

`exit`

llm.exit() -> None

Gracefully shuts down the engine. Sends an "exit" IPC call to all worker processes, deletes the ModelRunner, and joins worker processes. This method is also registered with atexit and called automatically when the Python process exits.
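The shutdown pattern described here (an explicit exit() plus atexit registration) can be sketched as follows. The class, the `send_exit()` worker method, and the idempotence guard are all illustrative assumptions; whether miniVLLM itself guards against a second call is not stated.

```python
import atexit


class EngineShutdownSketch:
    """Toy engine showing exit() registered with atexit and safe to call twice."""

    def __init__(self, workers):
        # Each worker is assumed to expose send_exit() and join() (hypothetical API).
        self._workers = list(workers)
        self._closed = False
        atexit.register(self.exit)

    def exit(self) -> None:
        if self._closed:
            return  # second call (e.g. from atexit after a manual exit) is a no-op
        self._closed = True
        # Tell every worker to stop first, then wait for each one to finish.
        for worker in self._workers:
            worker.send_exit()
        for worker in self._workers:
            worker.join()
```

The guard matters because a manual `exit()` followed by interpreter shutdown would otherwise run the teardown twice.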

Usage example

from myvllm.engine.llm_engine import LLMEngine as LLM
from myvllm.sampling_parameters import SamplingParams
from transformers import AutoTokenizer

config = {
    # Scheduling and memory
    "model_name_or_path": "Qwen/Qwen3-0.6B",
    "world_size": 1,
    "max_num_sequences": 16,
    "max_num_batched_tokens": 1024,
    "max_cached_blocks": 1024,
    "block_size": 256,
    "eos": 151645,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.9,
    "max_num_batch_tokens": 4096,
    "max_model_length": 128,
    # Qwen3-0.6B model architecture
    "vocab_size": 151936,
    "hidden_size": 1024,
    "num_heads": 16,
    "head_dim": 128,
    "num_kv_heads": 8,
    "intermediate_size": 3072,
    "num_layers": 28,
    "tie_word_embeddings": True,
    "base": 1000000,
    "rms_norm_epsilon": 1e-6,
    "qkv_bias": False,
    "scale": 1.0,
    "ffn_bias": False,
    "max_position": 32768,
}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
llm = LLM(config=config)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256, max_model_length=128)

prompts = [
    "introduce yourself",
    "list all prime numbers within 100",
]
# Apply chat template before passing to the engine
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

outputs = llm.generate(prompts, sampling_params)

for prompt, text in zip(prompts, outputs["text"]):
    print(f"Prompt:     {prompt}")
    print(f"Completion: {text}")
