vLLM

Create chat completion (OpenAI-compatible)

post

Create a chat completion using vLLM (OpenAI-compatible API).

Available Models: openai/gpt-oss-120b

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model | string | Model ID: openai/gpt-oss-120b |
| messages | array | Array of message objects with role and content |
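
A minimal request sketch in Python using the requests library. The base URL is a placeholder, the x-api-key header comes from the header parameters documented below, and the response is assumed to follow the standard OpenAI-compatible choices layout.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder -- substitute your deployment's host

response = requests.post(
    f"{BASE_URL}/v2/vllm/v1/chat/completions",
    headers={"x-api-key": "YOUR_API_KEY"},
    json={
        "model": "openai/gpt-oss-120b",  # the only available model on this endpoint
        "messages": [
            {"role": "user", "content": "Summarize the rules of chess in two sentences."}
        ],
    },
    timeout=60,
)
response.raise_for_status()
# Assumed OpenAI-compatible response shape.
print(response.json()["choices"][0]["message"]["content"])
```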

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| reasoning_effort | string | - | Reasoning depth: low, medium, high. Omit to use model default. |
| include_reasoning | boolean | - | Include reasoning in response. Omit to use model default. |
| max_completion_tokens | integer | - | Max tokens to generate (preferred) |
| max_tokens | integer | - | Max tokens (deprecated, use max_completion_tokens) |
| min_tokens | integer | - | Minimum tokens before stopping |
| temperature | float | - | Sampling temperature (0.0-2.0) |
| top_p | float | - | Nucleus sampling threshold (0.0-1.0) |
| top_k | integer | - | Top-k sampling (-1 to disable) |
| min_p | float | - | Minimum probability threshold (0.0-1.0) |
| frequency_penalty | float | - | Penalize frequent tokens (-2.0 to 2.0) |
| presence_penalty | float | - | Penalize repeated topics (-2.0 to 2.0) |
| repetition_penalty | float | - | Repetition penalty (1.0 = no penalty) |
| n | integer | - | Number of completions (1-10) |
| stop | string/array | - | Stop sequence(s) |
| seed | integer | - | Random seed for reproducibility |
| logprobs | boolean | - | Return log probabilities |
| top_logprobs | integer | - | Number of top logprobs (0-20) |
| logit_bias | object | - | Token bias mapping |
| user | string | - | End-user identifier |
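
For illustration, a request body combining the reasoning and sampling parameters above (the values shown are arbitrary); it can be sent exactly like the minimal example earlier.

```python
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Explain nucleus sampling in one paragraph."}],
    "reasoning_effort": "medium",   # low, medium, or high; omit to use the model default
    "include_reasoning": True,      # return reasoning content alongside the answer
    "max_completion_tokens": 512,   # preferred over the deprecated max_tokens
    "temperature": 0.7,             # 0.0-2.0
    "top_p": 0.9,                   # nucleus sampling threshold
    "seed": 42,                     # fixed seed for reproducibility
    "stop": ["\n\n"],               # stop generation at the first blank line
}
```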

Async Mode

Set async: true to receive a task_address immediately, then poll /v2/tasks/{task_address} for the result.
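
A sketch of the async flow, reusing the placeholder base URL and x-api-key header from the first example. The page does not document the polling response body, so the status and result field names below are assumptions.

```python
import time

import requests

BASE_URL = "https://api.example.com"  # placeholder -- substitute your deployment's host
HEADERS = {"x-api-key": "YOUR_API_KEY"}

# Submit the job with async: true and capture the task address.
submit = requests.post(
    f"{BASE_URL}/v2/vllm/v1/chat/completions",
    headers=HEADERS,
    json={
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Write a haiku about task queues."}],
        "async": True,
    },
    timeout=60,
)
submit.raise_for_status()
task_address = submit.json()["task_address"]

# Poll /v2/tasks/{task_address} until the task settles.
# The status/result field names are assumptions; the page does not document them.
while True:
    task = requests.get(f"{BASE_URL}/v2/tasks/{task_address}", headers=HEADERS, timeout=30)
    task.raise_for_status()
    body = task.json()
    if body.get("status") in ("completed", "failed"):
        print(body)
        break
    time.sleep(2)
```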

Header parameters
x-api-key · string | null · Optional
Body

OpenAI-compatible chat completion request for vLLM.

Supports standard OpenAI API parameters plus vLLM-specific extensions. Parameters are passed directly to the vLLM provider without transformation.
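
As a sketch of those vLLM-specific extensions (values are arbitrary), the payload below uses sampling controls that the standard OpenAI API does not expose; they are forwarded to vLLM unchanged.

```python
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "List three ways to reduce repetitive output."}],
    # vLLM-specific sampling extensions, passed through without transformation:
    "top_k": 40,                # consider only the 40 most likely tokens (-1 disables)
    "min_p": 0.05,              # drop tokens below this probability threshold
    "repetition_penalty": 1.1,  # 1.0 means no penalty
    "min_tokens": 16,           # generate at least 16 tokens before stopping
}
```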

model · string · Required
Model identifier. Available: openai/gpt-oss-120b
Example: openai/gpt-oss-120b

messages · array · Required
Array of message objects with role and content

max_tokens · integer | null · Optional · min: 1 · max: 32768
Maximum tokens to generate (deprecated, use max_completion_tokens)

max_completion_tokens · integer | null · Optional · min: 1 · max: 32768
Maximum tokens to generate (preferred over max_tokens)

min_tokens · integer | null · Optional
Minimum tokens to generate before stopping

temperature · number | null · Optional · max: 2
Sampling temperature (0.0 = deterministic)

top_p · number | null · Optional · max: 1
Top-p (nucleus) sampling

top_k · integer | null · Optional · min: -1
Top-k sampling (-1 to disable)

min_p · number | null · Optional · max: 1
Minimum probability threshold for sampling

frequency_penalty · number | null · Optional · min: -2 · max: 2
Frequency penalty for token repetition

presence_penalty · number | null · Optional · min: -2 · max: 2
Presence penalty for topic repetition

repetition_penalty · number | null · Optional
Repetition penalty (1.0 = no penalty)

n · integer | null · Optional · min: 1 · max: 10
Number of completions to generate

stop · string | string[] | null · Optional
Stop sequence(s) - generation stops when encountered

seed · integer | null · Optional
Random seed for reproducibility

stream · boolean | null · Optional · Default: false
Enable streaming responses (not yet supported)

reasoning_effort · string (enum) | null · Optional · Possible values: low, medium, high
Reasoning effort level. Omit to use model default.

include_reasoning · boolean | null · Optional
Include reasoning content in response. Omit to use model default.

logprobs · boolean | null · Optional
Return log probabilities of output tokens

top_logprobs · integer | null · Optional · max: 20
Number of most likely tokens to return at each position

logit_bias · object | null · Optional
Token ID to bias value mapping (-100 to 100)

user · string | null · Optional
Unique identifier for the end-user

mode · string (enum) | null · Optional · Default: auto · Possible values: auto, direct
Routing mode: 'auto' or 'direct' (vLLM is direct-only)

async · boolean | null · Optional · Default: false
Async mode: returns task_address immediately, poll /v2/tasks/{task_address} for result. Default: false (sync mode).

Responses
200

Successful Response

application/json
post /v2/vllm/v1/chat/completions
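
The 200 body is not expanded on this page; assuming the standard OpenAI chat completion layout, the response object from the first sketch can be read like this:

```python
data = response.json()  # `response` from the request sketch above

print(data["id"])                                # completion identifier
print(data["choices"][0]["message"]["content"])  # generated text
print(data["choices"][0]["finish_reason"])       # e.g. "stop" or "length"
print(data["usage"]["total_tokens"])             # prompt + completion token count
```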

List available models (OpenAI-compatible)

get

List all available vLLM models (OpenAI-compatible format).

Returns all active vLLM models. Access control is enforced at request time via tier model_restrictions; this endpoint shows what is available.
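
A quick sketch of the models call, again with a placeholder base URL and the x-api-key header, and assuming the OpenAI-style list object with a data array:

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder -- substitute your deployment's host

resp = requests.get(
    f"{BASE_URL}/v2/vllm/v1/models",
    headers={"x-api-key": "YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
# Assumed OpenAI list format: {"object": "list", "data": [{"id": ...}, ...]}
for model in resp.json().get("data", []):
    print(model["id"])
```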

Responses
200

Successful Response

application/json
get /v2/vllm/v1/models
