SamplingParams is a Python dataclass that bundles the generation hyperparameters for a single inference request. One SamplingParams instance is attached to every Sequence and consulted by the Scheduler to determine when to stop generation.

Definition

from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    max_model_length: int | None = None

    def __post_init__(self):
        assert self.temperature > 1e-10, "greedy sampling is not permitted"

Fields

temperature (float, default: 1.0)
Softmax temperature applied before multinomial sampling. Higher values increase randomness; lower values make the distribution more peaked.
Greedy decoding (temperature = 0) is not supported. Setting temperature to any value <= 1e-10 raises an AssertionError in __post_init__.
# Conservative creative writing
SamplingParams(temperature=0.6)

# More varied outputs
SamplingParams(temperature=1.2)
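To make the effect of temperature concrete, here is a minimal sketch of temperature scaling followed by multinomial sampling, written in plain Python. The helper names `apply_temperature` and `sample_token` are hypothetical and are not part of the library; the actual sampler may be implemented differently (for example, on tensors).

```python
import math
import random

def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Illustrative logits, not taken from any real model.
logits = [2.0, 1.0, 0.5]
cool = apply_temperature(logits, 0.6)  # more peaked
warm = apply_temperature(logits, 1.2)  # flatter
```

With these inputs, the most likely token receives a larger share of the probability mass at temperature 0.6 than at 1.2, which is why lower temperatures produce less varied output.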
max_tokens (int, default: 64)
Maximum number of completion tokens to generate. The prompt tokens are not counted. Generation stops as soon as num_completion_tokens >= max_tokens, regardless of whether EOS has been sampled.
# Generate at most 256 new tokens
SamplingParams(temperature=0.8, max_tokens=256)
ignore_eos (bool, default: False)
When False (default), generation stops as soon as the EOS token ID configured in LLMEngine is sampled. When True, EOS tokens are treated as ordinary tokens and generation continues until max_tokens or max_model_length is reached.
max_model_length (int | None, default: None)
Maximum total sequence length, counting both prompt and completion tokens. When set, generation stops as soon as num_tokens >= max_model_length. None disables the total-length limit, so generation is bounded only by max_tokens and, unless ignore_eos is True, by the EOS token.
# Hard cap at 512 total tokens (prompt + completion)
SamplingParams(temperature=0.7, max_tokens=256, max_model_length=512)
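Because max_tokens counts only completion tokens while max_model_length counts prompt plus completion, the tighter of the two limits decides how many tokens can be generated. The arithmetic can be sketched as follows; `completion_budget` is a hypothetical helper, it ignores early stops on EOS, and it assumes the prompt is shorter than max_model_length.

```python
def completion_budget(prompt_len, max_tokens, max_model_length=None):
    """Upper bound on completion tokens, ignoring early EOS stops."""
    if max_model_length is None:
        # No total-length cap: only max_tokens limits the completion.
        return max_tokens
    # Assumption: prompt_len < max_model_length. Longer prompts are not
    # covered by this sketch.
    return min(max_tokens, max_model_length - prompt_len)

# A 300-token prompt with max_tokens=256 and max_model_length=512:
# the total-length cap is reached first, after 512 - 300 completion tokens.
budget = completion_budget(prompt_len=300, max_tokens=256, max_model_length=512)
```

For a 100-token prompt with the same settings, max_tokens would be the binding limit instead.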

Stopping conditions

The Scheduler checks three independent stopping conditions after each decode step. Generation halts as soon as any condition is satisfied:
| Condition | Field | Triggered when |
| --- | --- | --- |
| EOS token | ignore_eos | token_id == eos and ignore_eos is False |
| Completion limit | max_tokens | num_completion_tokens >= max_tokens |
| Total length limit | max_model_length | max_model_length is not None and num_tokens >= max_model_length |
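The three conditions can be combined into a single predicate. The sketch below is illustrative only: `should_stop` and its arguments are hypothetical names, the real Scheduler code may be organized differently, and `num_tokens` is assumed to equal prompt tokens plus completion tokens. The dataclass is repeated from the Definition so the example is self-contained (`Optional[int]` is equivalent to `int | None`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    max_model_length: Optional[int] = None

    def __post_init__(self):
        assert self.temperature > 1e-10, "greedy sampling is not permitted"

def should_stop(params, token_id, eos_token_id,
                num_prompt_tokens, num_completion_tokens):
    """Return True if any of the three stopping conditions holds."""
    # EOS token: only honored when ignore_eos is False.
    if not params.ignore_eos and token_id == eos_token_id:
        return True
    # Completion limit: prompt tokens are not counted.
    if num_completion_tokens >= params.max_tokens:
        return True
    # Total length limit: prompt and completion tokens together.
    num_tokens = num_prompt_tokens + num_completion_tokens
    if params.max_model_length is not None and num_tokens >= params.max_model_length:
        return True
    return False
```

The conditions are independent, so the order of the checks does not change the result; it only determines which check returns first.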

Validation

SamplingParams validates its fields in __post_init__, which runs automatically on construction:
def __post_init__(self):
    assert self.temperature > 1e-10, "greedy sampling is not permitted"
Greedy decoding is explicitly unsupported. Passing temperature=0 raises:
AssertionError: greedy sampling is not permitted
Use a small positive temperature such as 0.01 instead if near-deterministic output is needed.
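A caller that accepts temperature from user input may want to handle the failure explicitly. The following sketch repeats the dataclass from the Definition so that it runs on its own; `try_build` is a hypothetical wrapper, not part of the library. Note that Python skips assert statements when run with the -O flag, so this check is not enforced in optimized mode.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    temperature: float = 1.0
    max_tokens: int = 64
    ignore_eos: bool = False
    max_model_length: Optional[int] = None  # same as int | None

    def __post_init__(self):
        assert self.temperature > 1e-10, "greedy sampling is not permitted"

def try_build(temperature):
    """Return a SamplingParams instance, or the error message on failure."""
    try:
        return SamplingParams(temperature=temperature)
    except AssertionError as exc:
        return str(exc)

rejected = try_build(0)     # fails validation, returns the assertion message
accepted = try_build(0.01)  # small positive temperature passes validation
```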

Usage examples

from myvllm.sampling_parameters import SamplingParams

# Defaults: temperature=1.0, max_tokens=64, ignore_eos=False, no total-length cap
params = SamplingParams()