Llama3-8B-RAMP-4bit
This repository contains a 4-bit quantized Llama 3 8B checkpoint produced with RAMP (Reinforcement Adaptive Mixed Precision Quantization).
Paper
RAMP was introduced in:
RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
Model Summary
This model is a compressed Llama 3 8B variant intended for efficient inference with reduced memory usage.
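To make "4-bit quantized" concrete, here is a minimal NumPy sketch of symmetric per-channel 4-bit quantization. This is illustrative only: the checkpoint's actual scheme (group sizes, zero-points, Scale Folding) follows RAMP and is not reproduced here.

```python
import numpy as np

def quantize_4bit(w, axis=-1):
    """Symmetric per-channel 4-bit quantization: map floats to ints in [-8, 7]."""
    scale = np.abs(w).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero channels
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover a float approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Rounding error is bounded by half a quantization step (0.5 * scale) per element.
```

Storing `q` (4 bits per weight) plus one scale per channel is what yields the roughly 4x memory reduction over FP16.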
What is RAMP?
RAMP is a reinforcement-learning-based mixed-precision quantization method that learns per-layer bit-width assignments under a global bit budget. It also introduces Scale Folding, a preconditioning step designed to make sub-4-bit quantization more stable.
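The core constraint RAMP optimizes under can be illustrated with a toy allocator. The sketch below is not RAMP's RL policy: it is a simple greedy stand-in that gives more bits to layers with a higher (hypothetical) sensitivity score while keeping the average bit-width within the global budget.

```python
def allocate_bits(sensitivities, budget_avg_bits, choices=(2, 3, 4, 8)):
    """Toy mixed-precision allocator (NOT RAMP's learned policy): start every
    layer at the lowest bit-width, then greedily upgrade the layer with the
    best sensitivity-per-extra-bit ratio while the total stays in budget."""
    n = len(sensitivities)
    bits = [min(choices)] * n
    total_budget = budget_avg_bits * n
    upgraded = True
    while upgraded:
        upgraded = False
        # Candidate upgrades: (gain per extra bit, layer index, next bit-width).
        cands = []
        for i, b in enumerate(bits):
            higher = [c for c in choices if c > b]
            if higher:
                nb = min(higher)
                cands.append((sensitivities[i] / (nb - b), i, nb))
        for gain, i, nb in sorted(cands, reverse=True):
            if sum(bits) - bits[i] + nb <= total_budget:
                bits[i] = nb
                upgraded = True
                break
    return bits

# Under a 4-bit average budget, the most sensitive layer ends up with 8 bits
# and the least sensitive layers stay at 2 bits.
alloc = allocate_bits([0.9, 0.1, 0.5, 0.2], budget_avg_bits=4)
```

RAMP replaces this hand-written heuristic with a learned policy, but the output has the same shape: one bit-width per layer, averaging to the target budget.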
Intended Use
This model is intended for:
- efficient local inference
- edge and on-device deployment
- research on quantization and mixed-precision inference
Limitations
- This is a quantized model and may show quality degradation compared to the original FP16 model.
- Performance depends on the inference backend, calibration setup, and prompt type.
- The model may still produce incorrect, biased, or unsafe outputs.
Citation
If you use this model or the RAMP method in your work, please cite:
@misc{gautam2026ramp,
  title={RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference},
  author={Gautam, Arpit Singh and Jha, Saurabh},
  year={2026},
  eprint={2603.17891},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
Base model
meta-llama/Meta-Llama-3-8B