miniVLLM is a minimal, readable implementation of the vLLM inference engine. Built on top of Nano-vLLM, it uses fully self-contained custom Triton kernels for both paged attention (decode) and flash attention (prefill), which makes it a practical resource for learning how production LLM serving systems work, and for running one.

Quick Start

Run your first inference in under 5 minutes with a working code example.

Installation

Install miniVLLM and its dependencies with uv.

Core Concepts

Understand paged attention, flash attention, KV caching, and scheduling.

API Reference

Full reference for LLMEngine, SamplingParams, and all public APIs.

What is miniVLLM?

miniVLLM implements the full LLM inference pipeline from scratch, including:
  • Custom Triton kernels — paged attention for decode, flash attention (O(N) memory) for prefill
  • Paged KV cache — memory-efficient KV cache management with prefix caching
  • Iteration-level scheduler — prefill-first scheduling with preemption support
  • Multi-GPU tensor parallelism — distributed inference via NCCL
  • CUDA graph optimization — low-latency decode via captured replay graphs
The codebase is designed to be readable and educational. Each component maps directly to a concept in modern LLM serving.

Getting started

1. Install dependencies

Install uv and sync the project:
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

2. Run the inference demo

Execute the main inference engine demo using Qwen3:
uv run python main.py

3. Run benchmarks

Compare attention implementations across prefill and decode phases:
uv run python benchmark_prefilling.py
uv run python benchmark_decoding.py

4. Explore the architecture

Read the Architecture Guide to understand how each component fits together, or follow the step-by-step implementation guide.

Explore by topic

Paged Attention

How KV cache is managed in fixed-size pages to eliminate fragmentation.
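
As a rough illustration of the paging idea, the sketch below hands out fixed-size KV cache blocks from a free list and keeps a per-sequence block table. `BlockAllocator`, its methods, and the block sizes are hypothetical names chosen for this example; they are not miniVLLM's actual classes, and no tensors are involved.

```python
class BlockAllocator:
    """Toy page allocator: maps each sequence to a list of physical KV blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))            # unused physical block ids
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical blocks
        self.lengths: dict[int, int] = {}              # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:       # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id: int) -> None:
        """Return all of a sequence's blocks to the free list."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=2)
slots = [allocator.append_token(seq_id=0) for _ in range(3)]
print(slots)   # three (block, offset) pairs spanning two physical blocks
```

Because every block has the same size, a freed block can be reused by any sequence, so the only wasted space is the unused tail of each sequence's last block.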

Flash Attention

O(N) memory attention via online softmax, implemented in Triton.
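
To make the online-softmax idea concrete, here is a minimal pure-Python sketch for a single query vector. It is not the project's Triton kernel: the function names, the block size, and the list-based layout are assumptions made for illustration. The point is that keys and values are consumed one block at a time while a running maximum, denominator, and weighted sum are rescaled, so the full score row is never stored.

```python
import math


def attention_reference(q, keys, values):
    """Plain softmax attention for one query; materializes every score."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_online(q, keys, values, block=2):
    """Same result, computed one block of keys/values at a time."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")               # running maximum of the scores seen so far
    denom = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])    # running weighted sum of values
    for start in range(0, len(keys), block):
        ks = keys[start:start + block]
        vs = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)   # rescales earlier partial results
        weights = [math.exp(s - m_new) for s in scores]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, vs))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]
```

Both functions should agree up to floating-point rounding; the online version only ever holds one block of scores plus three running quantities.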

Scheduling

Iteration-level prefill/decode scheduling with preemption.
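
The following toy scheduler sketches what prefill-first, iteration-level scheduling with preemption can look like. Everything here is hypothetical (`Seq`, `ToyScheduler`, the token and KV budgets, and the choice to preempt the most recently admitted sequence); the real scheduler's policy and data structures may differ.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    seq_id: int
    prompt_len: int
    generated: int = 0


class ToyScheduler:
    def __init__(self, max_tokens_per_step: int, kv_budget: int) -> None:
        self.max_tokens = max_tokens_per_step   # cap on tokens in one prefill batch
        self.kv_budget = kv_budget              # total KV slots available
        self.waiting: deque[Seq] = deque()
        self.running: deque[Seq] = deque()

    def add(self, seq: Seq) -> None:
        self.waiting.append(seq)

    def kv_used(self) -> int:
        return sum(s.prompt_len + s.generated for s in self.running)

    def schedule(self) -> tuple[str, list[Seq]]:
        """Pick the work for one engine iteration."""
        batch, tokens, used = [], 0, self.kv_used()
        # Prefill first: admit waiting sequences while they fit both budgets.
        while self.waiting:
            seq = self.waiting[0]
            need = seq.prompt_len + seq.generated   # preempted seqs are recomputed
            if tokens + need > self.max_tokens or used + need > self.kv_budget:
                break
            self.waiting.popleft()
            self.running.append(seq)
            batch.append(seq)
            tokens, used = tokens + need, used + need
        if batch:
            return "prefill", batch
        # Decode: each running sequence needs one more KV slot this step.
        # If that does not fit, preempt the most recently admitted sequence.
        while self.running and self.kv_used() + len(self.running) > self.kv_budget:
            self.waiting.appendleft(self.running.pop())
        for seq in self.running:
            seq.generated += 1
        return "decode", list(self.running)
```

Calling `schedule()` once per iteration lets new requests join between decode steps instead of waiting for the whole running batch to finish.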

Multi-GPU

Tensor parallelism across GPUs using NCCL all-reduce.
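
One common way an all-reduce shows up in tensor parallelism is a row-parallel linear layer: each rank holds a slice of the weight rows, computes a partial output, and the partials are summed across ranks. The sketch below simulates that on plain Python lists with a stand-in for the NCCL collective. The function names and the sharding scheme are assumptions for illustration, not a description of miniVLLM's parallel layers.

```python
def matvec(x, w):
    """x: length-k vector, w: k x n matrix (list of rows) -> length-n vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]


def row_parallel_linear(x, w, world_size):
    """Shard the input features and weight rows across ranks, then all-reduce."""
    shard = len(x) // world_size   # assumes len(x) is divisible by world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append(matvec(x[lo:hi], w[lo:hi]))   # this rank's partial output
    return all_reduce_sum(partials)


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(row_parallel_linear(x, w, world_size=2))   # every rank holds the full result
print(matvec(x, w))                              # single-device reference
```

Each simulated rank touches only its own slice of the weights, which is what lets a layer too large for one GPU be split across several.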

Benchmarks

Comparative benchmarks of PyTorch, Triton, and Flash Attention.

Models

Qwen3 and Llama 3.2 implementations built on parallel layers.