Missing 'reasoning' field in response when serving gemma-4-31B-it with vLLM
Hi
I am currently trying to serve the newly released google/gemma-4-31B-it model using vLLM.
However, I noticed that the reasoning field (response.choices[0].message.reasoning) is empty or missing in the API response.
Here is my environment setup:
- Hardware: NVIDIA H200
- vLLM version: 0.19.0
- Transformers version: 5.5.0
I started the vLLM server using the following command:
CUDA_VISIBLE_DEVICES=0 vllm serve "google/gemma-4-31B-it" \
--port 5000 \
--reasoning-parser gemma4 \
--served-model-name "gemma-4-31B-it" \
--max-num-seqs 64 \
--gpu-memory-utilization 0.9 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3
My Questions:
- Is it expected behavior that the reasoning output is missing with this current setup?
- Are there any additional configurations, specific prompt templates, or flags required to properly extract and output the reasoning tokens?
Any guidance or suggestions would be greatly appreciated. Thank you!
Hey, while I haven't tried this myself yet, I did take a quick look at the vLLM docs regarding Gemma 4:
# The thinking process is in reasoning_content
if hasattr(message, "reasoning_content") and message.reasoning_content:
print("=== Thinking ===")
print(message.reasoning_content)
print("\n=== Answer ===")
print(message.content)
The fact that they explicitly check whether reasoning_content exists and is populated strongly suggests it is not included by default for every response. If it were always present, there would be no need for that conditional. So most likely you have to explicitly enable or toggle the reasoning mode to get it.
Thank you for the comment, @wonderboy !
I managed to solve the issue where the reasoning field was returning None and the thought process was being merged into the main content field.
After checking this GitHub issue (https://github.com/vllm-project/vllm/issues/38855),
I confirmed that the problem can be resolved by explicitly passing "skip_special_tokens": False inside the extra_body of the API request.
Here is a simple, working example using the OpenAI Python client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
MODEL_NAME = "gemma-4-31B-it"
response = client.chat.completions.create(
model=MODEL_NAME,
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain what a black hole is in simple terms."}
],
temperature=0,
max_tokens=2048,
extra_body={
"chat_template_kwargs": {"enable_thinking": True},
"skip_special_tokens": False # <-- This is the key to fixing the parsing issue!
}
)
print("=== Reasoning Content ===")
print(response.choices[0].message.reasoning)
print("\n=== Final Answer Content ===")
print(response.choices[0].message.content)
By adding "skip_special_tokens": False, the output is now perfectly split into message.reasoning and message.content as expected.
I hope this helps anyone else who is trying to serve Gemma-4 with reasoning enabled!
Hey,
Thank you for following up and for sharing your working solution! You have identified exactly the right approach.
To clarify why this is necessary: by default, many generation pipelines strip special tokens before returning the final output. Gemma 4 uses special control tokens to delimit reasoning segments. If skip_special_tokens is left as default, those boundaries are removed, causing reasoning to merge into the main content.
Setting "skip_special_tokens": False preserves those delimiters, allowing the --reasoning-parser gemma4 flag to correctly map reasoning to message.reasoning while keeping the final answer in message.content.
Also, passing "chat_template_kwargs": {"enable_thinking": True} ensures that the model actually generates reasoning tokens in the first place.
This thread and your code snippet should serve as a clear reference for anyone deploying Gemma 4 with reasoning enabled. Thanks again for sharing!
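To make the mechanism above concrete, here is a minimal sketch of what a reasoning parser effectively does once the delimiters survive into the raw output. Note that `<think>`/`</think>` are placeholder names for illustration, not necessarily the actual Gemma 4 control tokens, and `split_reasoning` is a hypothetical helper, not vLLM's implementation:

```python
import re

# Illustrative only: the real Gemma 4 control tokens may differ; these
# delimiter strings stand in for whatever the reasoning parser expects.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, content), mimicking what a
    reasoning parser does when the delimiter tokens are preserved."""
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, raw, re.DOTALL)
    if not match:
        # This is what happens with skip_special_tokens left at its default:
        # the delimiters are stripped before parsing, so the reasoning can no
        # longer be separated and everything lands in content.
        return "", raw
    reasoning = match.group(1).strip()
    content = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, content

# With delimiters preserved, the split works:
r, c = split_reasoning("<think>2 is prime.</think>2 is the answer.")
# Without them, nothing can be separated:
r2, c2 = split_reasoning("2 is prime. 2 is the answer.")
```

This is why the fix lives on the request side: the parser can only split on delimiters it actually receives.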
Hey everyone!
I'm trying to get the reasoning/thinking process to work with this model using vLLM, but I'm running into an issue where the reasoning output is completely empty.
I'm starting my vLLM server with the --reasoning-parser gemma4 flag like this:
CUDA_VISIBLE_DEVICES=0 vllm serve /opt/Inference/MODELS/gemma-4-26B-A4B-it/ \
--max-model-len 32000 \
--host 10.12.141.19 \
--port 9010 \
--max-num-batched-tokens 1024 \
--max-num-seqs 32 \
--tensor-parallel-size 1 \
--enable-chunked-prefill \
--enable-prefix-caching \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.95 \
--performance-mode interactivity \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4
And I'm using the OpenAI Python client, passing enable_thinking: True in the extra_body:
from openai import OpenAI
client = OpenAI(base_url="http://10.12.141.19:9010/v1", api_key="123")
response = client.chat.completions.create(
model="/opt/Inference/MODELS/gemma-4-26B-A4B-it/",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "A snail is at the bottom of a 20-foot well. Each day it climbs 3 feet, but at night it slides back 2 feet. How many days will it take to reach the top?"}
],
max_tokens=4096,
extra_body={
"chat_template_kwargs": {"enable_thinking": True},
"skip_special_tokens": False
}
)
print(response.choices[0].message.reasoning) # This comes back empty/None!
print(response.choices[0].message.content) # The final answer prints fine.
The final answer generates perfectly, but response.choices[0].message.reasoning (or reasoning_content) is just empty. Has anyone else experienced this or knows what I might be missing here? Any help would be appreciated!
Is anyone able to receive reasoning content during streaming?
No matter what I tried, I could not get this code to output reasoning content.
With the same code, setting stream=False does return reasoning_content.
response = client.chat.completions.create(
model="google/gemma-4-31B-it",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "what is 1+1"}
],
extra_body={
"chat_template_kwargs": {"enable_thinking": True},
"skip_special_tokens": False,
},
stream=True,
)
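I have not verified this against a live server, but in non-streaming responses the reasoning arrives as a vLLM-specific field, so if the server emits it during streaming at all, it would presumably show up on each chunk's delta as reasoning_content rather than on a final message object. Here is a sketch of the accumulation logic I would try, with `collect_stream` and `fake_chunk` as hypothetical helpers and simulated chunks standing in for the real `stream=True` iterator:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Accumulate reasoning and answer text from streamed chunks.

    reasoning_content is a vLLM extension, not a standard OpenAI field,
    so it is read defensively with getattr in case it is absent."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if getattr(delta, "content", None):
            answer_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(answer_parts)

# Simulated chunks standing in for the iterator returned by
# client.chat.completions.create(..., stream=True):
def fake_chunk(**delta_fields):
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(**delta_fields))]
    )

chunks = [
    fake_chunk(reasoning_content="1+1 ", content=None),
    fake_chunk(reasoning_content="is 2.", content=None),
    fake_chunk(content="The answer is 2."),
]
reasoning, answer = collect_stream(chunks)
```

In real usage you would pass the stream object directly to collect_stream. If reasoning still comes back empty, that would suggest the parser is not splitting deltas during streaming at all, rather than a client-side bug.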
Maybe don't use vLLM, then.