SamplingParams is a Python dataclass that bundles the generation hyperparameters for a single inference request. One SamplingParams instance is attached to every Sequence and consulted by the Scheduler to determine when to stop generation.
Definition
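A minimal sketch of what such a dataclass could look like, reconstructed from the field descriptions in this section. The field name `temperature` and the default values `1.0` and `64` are assumptions made for illustration. Only the defaults for `ignore_eos` (False) and `max_model_length` (None) are stated in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingParams:
    # Softmax temperature; the name and the default 1.0 are assumptions.
    temperature: float = 1.0
    # Completion-token limit; the default 64 is an assumption.
    max_tokens: int = 64
    # Documented default: stop when the EOS token is sampled.
    ignore_eos: bool = False
    # Documented default: None means no total-length limit.
    max_model_length: Optional[int] = None
```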
Fields
Softmax temperature applied before multinomial sampling. Higher values
increase randomness; lower values make the distribution more peaked.
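To illustrate the effect described above, the following sketch divides the logits by the temperature before a softmax and then draws one token index from the resulting distribution. The helper names `softmax_with_temperature` and `sample_token` are hypothetical and use only the standard library; the actual sampler may be implemented differently.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=None):
    """Draw one token index from the temperature-scaled distribution."""
    rng = rng if rng is not None else random
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)  # more peaked
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
token = sample_token(logits, 1.0, rng=random.Random(0))
```

With the same logits, the lower temperature puts more probability mass on the largest logit, while the higher temperature flattens the distribution.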
Maximum number of completion tokens to generate. The prompt tokens are
not counted. Generation stops as soon as
num_completion_tokens >= max_tokens,
regardless of whether EOS has been sampled.

When ignore_eos is False (default), generation stops as soon as the EOS token ID
configured in LLMEngine is sampled. When True, EOS tokens are treated
as ordinary tokens and generation continues until max_tokens or
max_model_length is reached.

Maximum total sequence length, counting both prompt and completion tokens.
When set, generation stops as soon as
num_tokens >= max_model_length.
None means no total-length limit (only max_tokens applies).
Stopping conditions
The Scheduler checks three independent stopping conditions after each decode step. Generation halts as soon as any condition is satisfied:
| Condition | Field | Triggered when |
|---|---|---|
| EOS token | ignore_eos | token_id == eos and ignore_eos is False |
| Completion limit | max_tokens | num_completion_tokens >= max_tokens |
| Total length limit | max_model_length | max_model_length is not None and num_tokens >= max_model_length |
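The three conditions in the table can be sketched as a single check. This is an illustration only: the function name `should_stop`, its argument list, and the returned reason strings are hypothetical, and the real Scheduler may structure the check differently.

```python
from typing import Optional


def should_stop(
    token_id: int,
    eos: int,
    num_completion_tokens: int,
    num_tokens: int,
    *,
    max_tokens: int,
    ignore_eos: bool = False,
    max_model_length: Optional[int] = None,
) -> Optional[str]:
    """Return the name of the first satisfied stopping condition, or None."""
    # EOS token: only honored when ignore_eos is False.
    if not ignore_eos and token_id == eos:
        return "eos"
    # Completion limit: counts completion tokens only, not the prompt.
    if num_completion_tokens >= max_tokens:
        return "max_tokens"
    # Total length limit: counts prompt and completion tokens together.
    if max_model_length is not None and num_tokens >= max_model_length:
        return "max_model_length"
    return None
```

Because the conditions are independent, the order of the checks affects only which reason is reported when several are satisfied at once, not whether generation stops.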
Validation
SamplingParams validates its fields in __post_init__, which the dataclass-generated __init__ calls automatically once the fields have been assigned.
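A minimal sketch of validation in `__post_init__`, shown on a reduced, hypothetical version of the class. The specific checks, the error type, the field name `temperature`, and the defaults `1.0` and `64` are all assumptions chosen for illustration; this section does not list the rules the real class enforces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingParams:
    temperature: float = 1.0                 # assumed name and default
    max_tokens: int = 64                     # assumed default
    ignore_eos: bool = False
    max_model_length: Optional[int] = None

    def __post_init__(self):
        # Assumed rule: the temperature must be strictly positive.
        if self.temperature <= 0:
            raise ValueError("temperature must be > 0")
        # Assumed rule: at least one completion token may be generated.
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be >= 1")
        # Assumed rule: when set, the total-length limit must be positive.
        if self.max_model_length is not None and self.max_model_length < 1:
            raise ValueError("max_model_length must be >= 1 or None")


try:
    SamplingParams(max_tokens=0)
except ValueError as exc:
    error_message = str(exc)
```

Under these assumed rules, an invalid instance is never created: the exception is raised from the constructor call itself.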