YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
DocEdit Qwen2.5-3B Checkpoints
This repository contains the checkpoint artifacts for a DocEdit Game experiment built on top of:
- base model:
Qwen/Qwen2.5-3B-Instruct - hardware:
1x H200 SXM - training recipe:
SFT -> GRPO - date:
April 17, 2026
Primary Hub repo:
What This Repository Is
This repo stores the checkpoints, metrics, and supporting notes for a narrow structured document-repair experiment.
The task is:
- read a corrupted Word-style structured document
- read an edit instruction
- repair the intended corruptions
- preserve the rest of the document
- minimize collateral damage
This is a research/demo repo, not a production release.
Base Model
Qwen/Qwen2.5-3B-Instruct
We chose this model because it is:
- small enough to fine-tune quickly on a single H200
- large enough to show meaningful task adaptation
- practical for LoRA-based experimentation
Training Setup
We used a two-stage setup:
SFT
- supervised fine-tuning on paired corrupted -> repaired document examples
- implemented as a LoRA adapter
GRPO
- reinforcement learning from the DocEdit verifier reward
- continued from the SFT LoRA adapter
- still LoRA-based, not full-model fine-tuning
Important note:
This run is a rewrite-policy baseline:
- the model outputs a repaired document
- it does not yet implement the final frontier-planner -> applicator architecture we discussed later
That means this repo should be interpreted as:
- a strong small-model baseline artifact
- a checkpoint series we can compare against future tool-policy work
What Is In This Repo
qwen25_3b_sft/
LoRA adapter after supervised fine-tuning.
This stage teaches:
- document format discipline
- markup preservation
- the basic corrupted -> repaired mapping
qwen25_3b_grpo/
LoRA adapter after GRPO, plus intermediate checkpoints.
This stage optimizes for:
- verifier reward
- similarity improvement
- reduced collateral damage
- output-format obedience
metrics/
This folder contains:
- smoke eval outputs
- presentation-oriented metrics summaries
docs/
This folder contains explanatory notes and walkthrough material used to present the project.
Training Data
The training data was generated from the DocEdit benchmark pipeline.
Each task includes:
doc_seedcorruption_seeddifficultydomain- a corrupted source document
- a target repaired document
- corruption metadata
Domains include:
- legal
- pharmaceutical
Corruptions include:
- spelling
- casing
- punctuation
- content deletion/insertion
- formatting loss
- PDF artifact cleanup
- junk character cleanup
SFT Summary
Purpose
Teach the model the core repair pattern before RL.
Result
- hardware:
1x H200 - runtime: about
109.38s - final train loss: about
0.06346 - final mean token accuracy: about
0.98954
Artifact
- LoRA adapter size: about
119.8 MB
GRPO Summary
Purpose
Use the game verifier as a reward signal and continue training from the SFT adapter.
Result
- runtime: about
5562.75s - total steps:
100 - final logged train loss: about
0.03506 - final logged reward at step
100: about0.79567
Intermediate checkpoints written:
checkpoint-25checkpoint-50checkpoint-75checkpoint-100
Final GRPO adapter artifact:
adapter_model.safetensorsabout239.5 MB
Important interpretation
GRPO showed that the RL loop works end-to-end on the H200 and produces a complete adapter plus checkpoint trail.
This does not by itself prove that the rewrite-policy model is the best final product design.
Instead, it gives us:
- a trained small-model RL baseline
- a concrete artifact to compare against frontier-model tool use
- a launch point for future tool-policy or planner/executor designs
Current Evaluation Artifacts
The repo includes small smoke evaluation outputs for:
- base
Qwen2.5-3B-Instruct - SFT LoRA adapter
At the time of upload, these smoke evals are intentionally small and should be treated as sanity checks, not final benchmark conclusions.
The purpose is to show:
- that the checkpoints load
- that the evaluation path works
- that future comparisons can be made reproducibly
What The Model Currently Does
Current behavior:
- takes a corrupted structured document plus instruction
- outputs a repaired document directly
This is useful for:
- baseline benchmarking
- fast experimentation
- demonstrating that a small model can learn the task format
This is not yet the final “frontier planner + applicator executor” system.
Why This Repo Still Matters
Even though our later design discussion moved toward a more structured frontier-planner -> applicator setup, this repo remains useful because it captures:
- a reproducible small-model baseline
- a completed H200 SFT run
- a completed H200 GRPO run
- concrete weights, metrics, and checkpoints
- a reference point for future tool-policy work
In other words:
This repo answers the question:
Can a small open model be adapted and RL-tuned on this document repair task at all?
The answer is yes.
Known Limitations
This repo has several important limitations:
- The current policy is a rewrite policy, not a tool-call policy.
- The evaluation uploaded here is still mostly smoke-level, not final large-scale benchmarking.
- The architecture is evolving toward a cleaner frontier-planner -> applicator design.
- The current run used the existing reward and data scaffolding; future versions may use a better patch language or tool trajectory format.
Recommended Next Step
The next research step is not “train a bigger rewrite model.”
The better next step is:
- let a frontier model such as
GPT-5.4drive a structured edit or tool language directly - collect those successful traces
- compare cost and quality against this rewrite-policy baseline
- decide whether to train a smaller applicator model from those traces
That makes this repo a baseline artifact for future comparison.
Files You May Want To Inspect
qwen25_3b_sft/qwen25_3b_grpo/metrics/presentation_metrics.jsonmetrics/qwen25_3b_base_smoke.jsonlmetrics/qwen25_3b_sft_smoke.jsonldocs/DOCEDIT_STUDENT_WALKTHROUGH.md
Usage Note
These checkpoints are intended for:
- experimentation
- evaluation
- presentation/demo purposes
They should not be treated as production-ready legal or pharmaceutical editing systems without a much more complete evaluation program.