Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Group Tree Optimization (GTO) is a framework designed to address draft policy misalignment in speculative decoding. While standard methods optimize for a single greedy path, GTO aligns training with the actual tree-based decoding policy used during inference. This is achieved through a Draft Tree Reward objective and a stable Group-based Draft Policy Training scheme.
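The misalignment can be made concrete with a toy calculation. The sketch below is a hypothetical illustration (not the official GTO code or its exact reward): it compares the expected number of accepted draft tokens for a single greedy chain versus a small draft tree under a simplified acceptance model, which is the kind of tree-level quantity a Draft Tree Reward targets.

```python
# Hypothetical sketch: each draft node is (p, children), where p is the
# probability that the verifier accepts that drafted token given its prefix
# was accepted. At any node the verifier matches at most one child, so
# sibling probabilities are exclusive and sum to at most 1.

def expected_accepted(children):
    """Expected number of accepted draft tokens for a draft tree."""
    return sum(p * (1.0 + expected_accepted(grand)) for p, grand in children)

# A depth-2 greedy chain: first token accepted w.p. 0.5, its continuation
# accepted w.p. 0.5 after that.
chain = [(0.5, [(0.5, [])])]

# A breadth-2 tree spending the same two-token draft budget on two
# alternative first tokens (accepted w.p. 0.5 and 0.3).
tree = [(0.5, []), (0.3, [])]

print(expected_accepted(chain))  # ≈ 0.75
print(expected_accepted(tree))   # ≈ 0.80
```

With the same draft budget, the tree covers more of the target distribution than the greedy chain, so a draft model trained only for the greedy path undervalues exactly the branches that tree decoding relies on at inference time.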

Performance

GTO achieves state-of-the-art acceleration for LLM inference:

  • 5.6x faster than vanilla autoregressive decoding.
  • 7% faster than previous state-of-the-art methods like EAGLE-3.

Usage

To use this model for accelerated inference, please follow the setup instructions in the official GTO repository.

Inference via Web UI

The codebase provides a web interface for testing the acceleration. After setting up the environment and cloning the repo, you can run:

python -m application.webui --ea-model-path [path of GTO weight] \
    --base-model-path [path of the original model] \
    --model-type [vicuna|llama3|qwen] \
    --total-token [int]

The --total-token flag sets the number of draft tokens. Tuning it for your specific device and model can yield better speedups.

Citation

If you find this work useful, please cite:

@article{hu2025bridging,
  title={Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding},
  author={Hu, Shijing and Li, Jingyang and Lu, Zhihui and Zhou, Pan},
  journal={arXiv preprint arXiv:2509.22134},
  year={2025}
}

Acknowledgements

The implementation is based on the open-source repository of EAGLE. This project has been influenced by many projects in the LLM community, such as HASS and GRIFFIN.
