# GIS-Coder 7B Training Package
Fine-tune **Qwen2.5-Coder-7B-Instruct** into a GIS code specialist using QLoRA SFT.
## What's Included
| File | Description |
|------|-------------|
| `train_7b.py` | Production training script with CLI args |
| `evaluate.py` | Evaluation on 12 GIS benchmarks with scoring |
| `requirements.txt` | All dependencies |
Dataset: [`RhodWeo/gis-code-instructions`](https://huggingface.co/datasets/RhodWeo/gis-code-instructions), 70 expert-curated GIS code examples covering 13 Python libraries.
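A quick way to inspect the data with the `datasets` library (the split and column names here are assumptions; check the dataset card for the actual schema):
```python
from datasets import load_dataset

# Pull the instruction dataset from the Hugging Face Hub.
# The "train" split and any column names are assumptions; check the dataset card.
ds = load_dataset("RhodWeo/gis-code-instructions", split="train")

print(len(ds))       # expected: ~70 examples
print(ds[0].keys())  # inspect the actual schema before training
```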
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Login to HuggingFace
```bash
huggingface-cli login
```
### 3. Train (single GPU)
```bash
# Default settings (recommended for A100 80GB)
python train_7b.py
# A10G / RTX 4090 (24GB): reduce batch size
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048
# H100: can afford a larger batch and sequence length
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192
# Full precision LoRA (no quantization, needs ~30GB)
python train_7b.py --no_quantize --batch_size 1
# With Flash Attention (faster, needs flash-attn installed)
python train_7b.py --use_flash_attn
# With Trackio monitoring
python train_7b.py --use_trackio --trackio_project my-gis-coder
```
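For reference, a QLoRA run like the defaults above loads the base model in 4-bit NF4. A minimal sketch with `transformers` + `bitsandbytes`; the exact settings inside `train_7b.py` may differ:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute,
# the usual QLoRA setup; train_7b.py's actual flags may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```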
### 4. Multi-GPU
```bash
accelerate launch --num_processes 2 train_7b.py --batch_size 2 --grad_accum 4
```
### 5. Evaluate
```bash
# Evaluate fine-tuned model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B
# Compare with base model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B --compare_base
# Evaluate local checkpoint
python evaluate.py --adapter_id ./gis-coder-7b-output/final
```
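You can also chat with a trained adapter directly via `peft`. A minimal sketch; the adapter ID assumes you pushed to `RhodWeo/GIS-Coder-7B`, otherwise point it at the local checkpoint:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "RhodWeo/GIS-Coder-7B"  # or "./gis-coder-7b-output/final"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter

messages = [{"role": "user", "content": "Buffer all points in a GeoDataFrame by 500 m and dissolve the result."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```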
## Hyperparameter Guide
### Recommended defaults (battle-tested recipe):
| Parameter | Value | Source |
|-----------|-------|--------|
| `--lr` | `2e-4` | LoRA Without Regret (10× base SFT rate) |
| `--lora_r` | `32` | MapCoder-Lite optimal for code tasks |
| `--lora_alpha` | `16` | α/r = 0.5 |
| `--target_modules` | `all-linear` | LoRA Without Regret |
| `--epochs` | `3` | CFD paper: peak at epoch 2, decline after 4 |
| `--scheduler` | `cosine` | Standard for LoRA |
| `--warmup_ratio` | `0.1` | CFD paper: 10% warmup |
| `--max_length` | `4096` | Covers longest GIS code examples |
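In `peft`/`trl` terms, these defaults map roughly to the configuration sketched below. This is an illustration, not a copy of `train_7b.py`; in particular the dropout value is an assumption:
```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter: r=32, alpha=16 (alpha/r = 0.5), applied to all linear layers.
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,            # assumption; train_7b.py may use another value
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# SFT run: 3 epochs, lr 2e-4, cosine schedule, 10% warmup, 4k context,
# effective batch size 16 (2 x 8).
training_args = SFTConfig(
    output_dir="gis-coder-7b-output",
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_seq_length=4096,          # named max_length in recent trl releases
    bf16=True,
)
```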
### Hardware-specific settings:
| GPU | VRAM | `--batch_size` | `--grad_accum` | `--max_length` | Notes |
|-----|------|----------------|-----------------|----------------|-------|
| RTX 3090 | 24GB | 1 | 16 | 2048 | QLoRA only |
| RTX 4090 | 24GB | 1 | 16 | 2048 | QLoRA, slightly faster |
| A10G | 24GB | 1 | 16 | 2048 | QLoRA only |
| L40S | 48GB | 2 | 8 | 4096 | QLoRA or LoRA |
| A100 40GB | 40GB | 2 | 8 | 4096 | Recommended minimum |
| A100 80GB | 80GB | 2 | 8 | 4096 | Ideal |
| H100 | 80GB | 4 | 4 | 8192 | Fastest |
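Every row keeps the effective batch size at 16 (`batch_size × grad_accum × num_gpus`), and the multi-GPU command above does too; if you change one knob, scale the other to compensate. A quick sanity check:
```python
# Effective batch size stays constant across the table: 1x16 = 2x8 = 4x4 = 16.
def effective_batch(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    return batch_size * grad_accum * num_gpus

assert effective_batch(1, 16) == effective_batch(2, 8) == effective_batch(4, 4) == 16
assert effective_batch(2, 4, num_gpus=2) == 16  # the accelerate example above
```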
### Ablation ideas:
```bash
# Higher LoRA rank (more capacity, slower)
python train_7b.py --lora_r 64 --lora_alpha 32
# Lower learning rate (more stable, slower convergence)
python train_7b.py --lr 5e-5
# More epochs (risk overfitting on 70 examples)
python train_7b.py --epochs 5
# Target only attention layers (fewer params, faster)
python train_7b.py --target_modules q_proj,k_proj,v_proj,o_proj
```
## Expected Results
From our CPU training run with a 0.5B base model (70 examples, 3 epochs):
| Metric | Start → End |
|--------|------------|
| Loss | 1.52 → 0.88 (-42%) |
| Token accuracy | 69% → 79% |
| Eval quality score | 85% |
**With the 7B model + QLoRA, expect significantly better results**: the CFD paper achieved 88.7% accuracy with this exact recipe on a similarly sized domain-specific dataset.
## Dataset Details
**70 examples** covering 13 GIS Python libraries:
| Library | Examples | Why Important |
|---------|----------|---------------|
| OSMnx | 9 | **All models score 0%**: routing, POIs, isochrones |
| Rasterio | 9 | Satellite imagery, DEM, NDVI, reprojection |
| GeoPandas | 25 | Core: spatial joins, buffering, I/O |
| Shapely | 14 | Geometry operations, validation |
| MovingPandas | 3 | **All models score 0%**: GPS trajectories |
| GDAL | 6 | Raster processing, format conversion |
| PyProj | 2 | CRS handling (critical weakness) |
| H3 | 2 | Hexagonal indexing |
| Folium | 1 | Interactive maps |
| Fiona | 2 | Low-level vector I/O |
| xarray | 1 | Climate/raster datacubes |
| PyQGIS | 1 | Desktop GIS scripting |
| PySAL | 1 | Spatial statistics |
Each example includes the following components (a sketch of how a record might be assembled for SFT follows the list):
- System prompt establishing GIS expertise
- Natural language instruction
- Step-by-step Chain-of-Thought reasoning
- Complete, documented Python code
- Key points explaining design decisions
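A hedged sketch of folding one record into chat messages for SFT; the field names below are assumptions, not the dataset's actual schema:
```python
# Hypothetical record; the real column names in the dataset may differ.
example = {
    "system": "You are an expert GIS Python developer...",
    "instruction": "Compute NDVI from a Sentinel-2 scene with rasterio.",
    "reasoning": "1. Open the red and NIR bands. 2. Compute (NIR - red) / (NIR + red)...",
    "code": "import rasterio\n...",
    "key_points": "Mask nodata before dividing; write float32 output.",
}

def to_messages(ex: dict) -> list[dict]:
    """Fold reasoning, code, and key points into one assistant turn."""
    answer = "\n\n".join([ex["reasoning"], ex["code"], ex["key_points"]])
    return [
        {"role": "system", "content": ex["system"]},
        {"role": "user", "content": ex["instruction"]},
        {"role": "assistant", "content": answer},
    ]
```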
## Scaling to 20K+ Examples
To maximize quality, use the **OSS-Instruct pattern** (from Magicoder):
1. Crawl GitHub for GIS Python code (`import geopandas`, `import rasterio`, etc.)
2. Use GPT-4o to generate (instruction, solution) pairs from real code snippets
3. Execute and test all generated solutions
4. Add CoT annotations to passing examples (+20.9% pass@1 per CFD paper)
Target: 20K–75K examples for production-grade GIS-Coder.
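A minimal sketch of step 2 using the OpenAI Python client; the prompt and helper function are illustrative assumptions, not part of this repo:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Illustrative OSS-Instruct-style prompt: turn a real GIS snippet into an
# (instruction, solution) pair. Execution/testing happens downstream (step 3).
PROMPT = """You are writing GIS coding exercises.
Given the real-world Python snippet below, produce:
1. A self-contained natural-language instruction it could answer.
2. A complete, runnable solution with step-by-step reasoning.

Snippet:
{snippet}
"""

def generate_pair(snippet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(snippet=snippet)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```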
## References
| Paper | Key Insight |
|-------|-------------|
| [CFD Fine-tuning](https://arxiv.org/abs/2504.09602) | QLoRA SFT recipe: 7B model beats 72B on domain tasks |
| [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Qwen2.5-Coder-7B best backbone for code LoRA |
| [GIS Benchmark](https://arxiv.org/abs/2410.04617) | All models score 0% on OSMnx/MovingPandas |
| [Magicoder](https://arxiv.org/abs/2312.02120) | OSS-Instruct for synthetic data from real code |
| [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | target all-linear, r=64-256, lr=2e-4 |