Qwen3-8B-GSM8K-Synth-50K

A Qwen3-8B model fine-tuned on 50,418 synthetic grade-school math problems using QLoRA, designed for step-by-step mathematical reasoning with chain-of-thought.

What This Model Does

Given a math word problem, the model produces a structured reasoning chain inside <think> tags, then outputs the final numerical answer.

Example

Input:

If 3x + 7 = 22, what is x?

Output:

<think>
Step 1: Subtract 7 from both sides: 3x = 22 - 7
Step 2: Calculate: 3x = 15
Step 3: Divide both sides by 3: x = 15 / 3
Step 4: Calculate: x = 5
</think>

The answer is 5.0.
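Because every completion follows this `<think>` pattern, the reasoning chain and the final answer can be separated with minimal string handling. A small sketch (the helper name is illustrative, not part of the model's API):

```python
import re

def split_reasoning(completion: str):
    """Split a completion into (reasoning, answer_text)."""
    # Grab everything between the <think> ... </think> tags.
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    reasoning = m.group(1).strip() if m else ""
    # Whatever follows the closing tag is the final answer sentence.
    answer_text = completion.split("</think>")[-1].strip()
    return reasoning, answer_text

example = """<think>
Step 1: Subtract 7 from both sides: 3x = 22 - 7
Step 2: Calculate: 3x = 15
Step 3: Divide both sides by 3: x = 15 / 3
Step 4: Calculate: x = 5
</think>

The answer is 5.0."""

reasoning, answer = split_reasoning(example)
print(answer)  # The answer is 5.0.
```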

Evaluation Results

Evaluated on the full GSM8K test set (1,319 questions) with greedy decoding and 4-bit quantization.

| Model | GSM8K Accuracy | Correct / Total | Time |
|---|---|---|---|
| Base Qwen3-8B | 79.4% | 1,047 / 1,319 | 121.8m |
| Qwen3-8B-GSM8K-Synth-50K | 86.2% | 1,137 / 1,319 | 45.1m |

Fine-tuning improvement: +6.8 percentage points over the base model.

The fine-tuned model also runs ~2.7x faster at inference due to shorter, more structured outputs (the base model produces verbose markdown formatting while the fine-tuned model outputs concise step-by-step solutions).
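The headline numbers above follow directly from the raw counts and wall-clock times in the table; a quick arithmetic check:

```python
# Accuracy from raw counts in the evaluation table.
base_acc = 1047 / 1319    # ≈ 79.4%
tuned_acc = 1137 / 1319   # ≈ 86.2%

# Improvement in percentage points and inference speedup.
improvement_pp = (tuned_acc - base_acc) * 100
speedup = 121.8 / 45.1    # wall-clock minutes, base vs fine-tuned

print(f"{improvement_pp:.1f} pp, {speedup:.1f}x faster")
```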

Cross-Model Comparison

| Model | Params | Training Data | GSM8K Accuracy | vs Base |
|---|---|---|---|---|
| Base Qwen3-4B | 4B | — | 74.7% | — |
| Qwen3-4B-GSM8K-Synth-35K | 4B | 35K synthetic | 85.0% | +10.3 pp |
| Base Qwen3-8B | 8B | — | 79.4% | — |
| Qwen3-8B-GSM8K-Synth-50K | 8B | 50K synthetic | 86.2% | +6.8 pp |

Key takeaways:

  • Synthetic-data fine-tuning provides a substantial accuracy boost at both model scales (+10.3 percentage points for 4B, +6.8 for 8B)
  • The 8B fine-tuned model achieves the highest absolute accuracy (86.2%)
  • Scaling from 4B to 8B improves base accuracy by 4.7 percentage points and fine-tuned accuracy by 1.2

Training Details

Base Model & Method

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | QLoRA (4-bit NF4 quantization) |
| Framework | Unsloth + HuggingFace TRL |
| Merge | Fully merged to 16-bit (no adapter needed at inference) |

QLoRA Configuration

| Parameter | Value |
|---|---|
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Dropout | 0 |
| Trainable parameters | 43.6M / 8.23B (0.53%) |

Training Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 1 (per device) |
| Gradient accumulation | 64 (effective batch = 64) |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup steps | 10 |
| Optimizer | AdamW 8-bit |
| Precision | bf16 |
| Max sequence length | 1024 |
| Max grad norm | 1.0 |
| Seed | 42 |
| Total steps | 2,364 |
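The total step count follows from the dataset size and the effective batch size; a quick sanity check (pure arithmetic, not training code):

```python
import math

examples = 50_418
effective_batch = 64   # per-device batch 1 × gradient accumulation 64
epochs = 3

steps_per_epoch = math.ceil(examples / effective_batch)  # 788
total_steps = steps_per_epoch * epochs                   # 2,364

print(steps_per_epoch, total_steps)
```

The per-epoch count of 788 also matches the epoch-boundary milestones (steps 788, 1576, 2364) in the loss table below.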

Memory Optimizations (fitting 8B in 12GB VRAM)

Training Qwen3-8B in 4-bit still uses ~7.2GB for weights alone, leaving only ~4.4GB on a 12GB GPU. The following optimizations made training possible:

  • Embedding offloading (offload_embedding=True) — input embeddings kept on CPU
  • Chunked fused CE loss (UNSLOTH_CE_LOSS_N_CHUNKS=8) — splits the large 151,936-vocab logits computation into smaller chunks
  • Unsloth gradient checkpointing — auto-offloads activations for long sequences
  • Reduced sequence length (1024 vs 4096) — data is short (median ~100 tokens, max ~260)
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True — reduces CUDA memory fragmentation
  • Peak VRAM usage: 11.8GB / 12.3GB (96%)
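Two of the optimizations above are plain environment variables. A minimal sketch of setting them from Python (they must be in place before `torch`/Unsloth are imported to take effect; exporting them in the shell before launching works equally well):

```python
import os

# Split the 151,936-vocab cross-entropy logits computation into 8 chunks.
os.environ["UNSLOTH_CE_LOSS_N_CHUNKS"] = "8"

# Let the CUDA caching allocator grow segments instead of fragmenting.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

print(os.environ["UNSLOTH_CE_LOSS_N_CHUNKS"])
```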

Training Loss Curve

```
Epoch 1: 0.735 → 0.304  (rapid descent)
Epoch 2: 0.292 → 0.277  (steady refinement)
Epoch 3: 0.271 → 0.266  (final polish)
```

Average training loss across the 3 epochs: 0.302 (final-step loss: 0.266)

| Milestone | Loss | Epoch |
|---|---|---|
| Step 50 | 0.735 | 0.06 |
| Step 250 | 0.333 | 0.32 |
| Step 500 | 0.316 | 0.63 |
| Step 788 (Epoch 1) | 0.304 | 1.02 |
| Step 1000 | 0.292 | 1.27 |
| Step 1250 | 0.291 | 1.59 |
| Step 1576 (Epoch 2) | 0.277 | 2.03 |
| Step 1750 | 0.270 | 2.22 |
| Step 2000 | 0.266 | 2.54 |
| Step 2364 (Epoch 3) | 0.266 | 2.98 |

Comparison with 4B Model

| Metric | Qwen3-4B (35K) | Qwen3-8B (50K) |
|---|---|---|
| Training data | 34,818 examples | 50,418 examples |
| Final loss | 0.291 | 0.266 |
| LoRA rank | 32 | 16 |
| Training time | 3h 18m | 9h 25m |
| Peak VRAM | ~8.1 GB | ~11.8 GB |

Hardware & Time

| Metric | Value |
|---|---|
| GPU | NVIDIA RTX 4070 SUPER (12GB VRAM) |
| Training time | 9h 25m (33,890 seconds) |
| Throughput | 4.46 samples/sec, 0.07 steps/sec |
| Peak VRAM | ~11.8 GB |

Training Data

Trained on the full 50,418 examples from clarkkitchen22/SynthGSM8K-50K — a synthetic grade-school math dataset generated by Claude Haiku 4.5 via Anthropic's Batch API, then filtered through an 8-stage quality pipeline.

Data Format

Each training example follows the Qwen3 ChatML format with thinking tags:

```
<|im_start|>user
{math word problem}<|im_end|>
<|im_start|>assistant
<think>
{step-by-step solution}
</think>

The answer is {number}.<|im_end|>
```

GSM8K-style calculation annotations (e.g., <<24*3=72>>) are stripped from solutions during preprocessing.
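The stripping step amounts to a single regular expression; a minimal sketch (the function name is illustrative):

```python
import re

def strip_calc_annotations(solution: str) -> str:
    """Remove GSM8K-style <<expr=result>> calculator annotations."""
    return re.sub(r"<<[^>]*>>", "", solution)

print(strip_calc_annotations("She earns 24*3=<<24*3=72>>72 dollars."))
# She earns 24*3=72 dollars.
```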

Dataset Highlights

  • 50,418 problems — full dataset used for this training run
  • Generated via few-shot prompting from 200 real GSM8K seed problems
  • 8-stage filter pipeline: structure, answer range, solution quality, AI detection, math verification, exact dedup, fuzzy dedup (TF-IDF @ 0.85), seed overlap
  • Average 3.0 math operations per solution
  • 92.6% integer answers, range 0-225,000
  • ~$55 generation cost (Haiku 4.5 Batch API at 50% discount)
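The fuzzy-deduplication stage can be illustrated with a small TF-IDF cosine-similarity check. This is a simplified, pure-Python sketch of the idea using the 0.85 threshold quoted above; the actual pipeline's implementation and tokenization may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

problems = [
    "sarah buys 5 apples and 4 oranges at the store",
    "at the store sarah buys 5 apples and 4 oranges",  # reworded duplicate
    "a train travels 60 miles in 2 hours",
]
vecs = tfidf_vectors([p.split() for p in problems])

# Flag pairs whose similarity meets the 0.85 threshold.
dupes = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))
         if cosine(vecs[i], vecs[j]) >= 0.85]
print(dupes)  # [(0, 1)]
```

The reworded pair scores near 1.0 and is flagged, while the unrelated problem shares no terms and scores 0.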

Usage

With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "clarkkitchen22/Qwen3-8B-GSM8K-Synth-50K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "A store sells apples for $2 each and oranges for $3 each. If Sarah buys 5 apples and 4 oranges, how much does she spend?"}]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.6, top_p=0.95)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

Answer Extraction

```python
import re

def extract_answer(text):
    """Extract the numerical answer from model output."""
    # Prefer an explicit "The answer is N." / "answer: N" statement.
    match = re.search(r"answer\s*(?:is|:)\s*([-\d,]+\.?\d*)", text, re.IGNORECASE)
    if match:
        return float(match.group(1).replace(",", ""))
    # Fall back to the last number anywhere in the text.
    matches = re.findall(r"([-\d,]+\.?\d+)", text)
    return float(matches[-1].replace(",", "")) if matches else None
```

Intended Use

  • Math tutoring: Step-by-step solutions to grade-school math problems
  • Research: Studying the effect of model scale and synthetic data on math reasoning
  • Distillation baseline: Comparing synthetic-data-trained small models against larger models
  • Further fine-tuning: Starting point for domain-specific math reasoning tasks

Limitations

  • Trained on synthetic data generated by Haiku 4.5 — bounded by that model's math ability
  • Optimized for GSM8K-style word problems (arithmetic, basic algebra) — not calculus, geometry, or advanced math
  • All training answers are non-negative; may struggle with problems requiring negative answers
  • Solutions use a specific <think> tag format — other prompting styles may give worse results
  • Evaluated on GSM8K only — performance on other math benchmarks (MATH, MMLU-Math) not yet tested

How It Was Built

End-to-End Pipeline

```
200 GSM8K seeds → Claude Haiku 4.5 (Batch API) → 83K raw problems
    → 8-stage filter → 50K clean dataset → QLoRA fine-tune Qwen3-8B
    → Merge to 16-bit → Push to HuggingFace
```

Pipeline Code

The full data-generation pipeline and training code are available at: github.com/goldbar123467/SynthDataGSM8K

Citation

```bibtex
@misc{qwen3_8b_gsm8k_synth_50k,
  title={Qwen3-8B-GSM8K-Synth-50K},
  author={clarkkitchen22},
  year={2026},
  note={Base model: Qwen/Qwen3-8B; training data: clarkkitchen22/SynthGSM8K-50K},
  url={https://huggingface.co/clarkkitchen22/Qwen3-8B-GSM8K-Synth-50K}
}
```
