# GIS-Coder 7B Training Package
Fine-tune **Qwen2.5-Coder-7B-Instruct** into a GIS code specialist using QLoRA SFT.
## What's Included
| File | Description |
|------|-------------|
| `train_7b.py` | Production training script with CLI args |
| `evaluate.py` | Evaluation on 12 GIS benchmarks with scoring |
| `requirements.txt` | All dependencies |
Dataset: [`RhodWeo/gis-code-instructions`](https://huggingface.co/datasets/RhodWeo/gis-code-instructions), 70 expert-curated GIS code examples covering 13 Python libraries.
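A quick way to inspect the data with the `datasets` library (the split and column names here are assumptions; check the dataset card for the actual schema):
```python
from datasets import load_dataset

# Pull the instruction dataset from the Hugging Face Hub.
# The "train" split and any column names are assumptions; check the dataset card.
ds = load_dataset("RhodWeo/gis-code-instructions", split="train")

print(len(ds))       # expected: ~70 examples
print(ds[0].keys())  # inspect the actual schema before training
```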
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Login to HuggingFace
```bash
huggingface-cli login
```
### 3. Train (single GPU)
```bash
# Default settings (recommended for A100 80GB)
python train_7b.py
# A10G / RTX 4090 (24GB): reduce batch size
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048
# H100: can afford a larger batch and sequence length
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192
# Full precision LoRA (no quantization, needs ~30GB)
python train_7b.py --no_quantize --batch_size 1
# With Flash Attention (faster, needs flash-attn installed)
python train_7b.py --use_flash_attn
# With Trackio monitoring
python train_7b.py --use_trackio --trackio_project my-gis-coder
```
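For reference, a QLoRA run like the defaults above loads the base model in 4-bit NF4. A minimal sketch with `transformers` + `bitsandbytes`; the exact settings inside `train_7b.py` may differ:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bf16 compute,
# the usual QLoRA setup; train_7b.py's actual flags may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```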
### 4. Multi-GPU
```bash
accelerate launch --num_processes 2 train_7b.py --batch_size 2 --grad_accum 4
```
### 5. Evaluate
```bash
# Evaluate fine-tuned model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B
# Compare with base model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B --compare_base
# Evaluate local checkpoint
python evaluate.py --adapter_id ./gis-coder-7b-output/final
```
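You can also chat with a trained adapter directly via `peft`. A minimal sketch; the adapter ID assumes you pushed to `RhodWeo/GIS-Coder-7B`, otherwise point it at the local checkpoint:
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
adapter_id = "RhodWeo/GIS-Coder-7B"  # or "./gis-coder-7b-output/final"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the LoRA adapter

messages = [{"role": "user", "content": "Buffer all points in a GeoDataFrame by 500 m and dissolve the result."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```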
## Hyperparameter Guide
### Recommended defaults (battle-tested recipe):
| Parameter | Value | Source |
|-----------|-------|--------|
| `--lr` | `2e-4` | LoRA Without Regret (10× base SFT rate) |
| `--lora_r` | `32` | MapCoder-Lite optimal for code tasks |
| `--lora_alpha` | `16` | α/r = 0.5 |
| `--target_modules` | `all-linear` | LoRA Without Regret |
| `--epochs` | `3` | CFD paper: peak at epoch 2, decline after 4 |
| `--scheduler` | `cosine` | Standard for LoRA |
| `--warmup_ratio` | `0.1` | CFD paper: 10% warmup |
| `--max_length` | `4096` | Covers longest GIS code examples |
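In `peft`/`trl` terms, these defaults map roughly to the configuration sketched below. This is an illustration, not a copy of `train_7b.py`; in particular the dropout value is an assumption:
```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter: r=32, alpha=16 (alpha/r = 0.5), applied to all linear layers.
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,            # assumption; train_7b.py may use another value
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# SFT run: 3 epochs, lr 2e-4, cosine schedule, 10% warmup, 4k context,
# effective batch size 16 (2 x 8).
training_args = SFTConfig(
    output_dir="gis-coder-7b-output",
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_seq_length=4096,          # named max_length in recent trl releases
    bf16=True,
)
```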
### Hardware-specific settings:
| GPU | VRAM | `--batch_size` | `--grad_accum` | `--max_length` | Notes |
|-----|------|----------------|-----------------|----------------|-------|
| RTX 3090 | 24GB | 1 | 16 | 2048 | QLoRA only |
| RTX 4090 | 24GB | 1 | 16 | 2048 | QLoRA, slightly faster |
| A10G | 24GB | 1 | 16 | 2048 | QLoRA only |
| L40S | 48GB | 2 | 8 | 4096 | QLoRA or LoRA |
| A100 40GB | 40GB | 2 | 8 | 4096 | Recommended minimum |
| A100 80GB | 80GB | 2 | 8 | 4096 | Ideal |
| H100 | 80GB | 4 | 4 | 8192 | Fastest |
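Every row keeps the effective batch size at 16 (`batch_size × grad_accum × num_gpus`), and the multi-GPU command above does too; if you change one knob, scale the other to compensate. A quick sanity check:
```python
# Effective batch size stays constant across the table: 1x16 = 2x8 = 4x4 = 16.
def effective_batch(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    return batch_size * grad_accum * num_gpus

assert effective_batch(1, 16) == effective_batch(2, 8) == effective_batch(4, 4) == 16
assert effective_batch(2, 4, num_gpus=2) == 16  # the accelerate example above
```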
### Ablation ideas:
```bash
# Higher LoRA rank (more capacity, slower)
python train_7b.py --lora_r 64 --lora_alpha 32
# Lower learning rate (more stable, slower convergence)
python train_7b.py --lr 5e-5
# More epochs (risk overfitting on 70 examples)
python train_7b.py --epochs 5
# Target only attention layers (fewer params, faster)
python train_7b.py --target_modules q_proj,k_proj,v_proj,o_proj
```
## Expected Results
From our CPU training run with a 0.5B base model (70 examples, 3 epochs):
| Metric | Start → End |
|--------|------------|
| Loss | 1.52 → 0.88 (-42%) |
| Token accuracy | 69% → 79% |
| Eval quality score | 85% |
**With the 7B model + QLoRA, expect significantly better results**: the CFD paper achieved 88.7% accuracy with this exact recipe on a similarly sized domain-specific dataset.
## Dataset Details
**70 examples** covering 13 GIS Python libraries:
| Library | Examples | Why Important |
|---------|----------|---------------|
| OSMnx | 9 | **All models score 0%**: routing, POIs, isochrones |
| Rasterio | 9 | Satellite imagery, DEM, NDVI, reprojection |
| GeoPandas | 25 | Core: spatial joins, buffering, I/O |
| Shapely | 14 | Geometry operations, validation |
| MovingPandas | 3 | **All models score 0%**: GPS trajectories |
| GDAL | 6 | Raster processing, format conversion |
| PyProj | 2 | CRS handling (critical weakness) |
| H3 | 2 | Hexagonal indexing |
| Folium | 1 | Interactive maps |
| Fiona | 2 | Low-level vector I/O |
| xarray | 1 | Climate/raster datacubes |
| PyQGIS | 1 | Desktop GIS scripting |
| PySAL | 1 | Spatial statistics |
Each example includes the following components (a sketch of how a record might be assembled for SFT follows the list):
- System prompt establishing GIS expertise
- Natural language instruction
- Step-by-step Chain-of-Thought reasoning
- Complete, documented Python code
- Key points explaining design decisions
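A hedged sketch of folding one record into chat messages for SFT; the field names below are assumptions, not the dataset's actual schema:
```python
# Hypothetical record; the real column names in the dataset may differ.
example = {
    "system": "You are an expert GIS Python developer...",
    "instruction": "Compute NDVI from a Sentinel-2 scene with rasterio.",
    "reasoning": "1. Open the red and NIR bands. 2. Compute (NIR - red) / (NIR + red)...",
    "code": "import rasterio\n...",
    "key_points": "Mask nodata before dividing; write float32 output.",
}

def to_messages(ex: dict) -> list[dict]:
    """Fold reasoning, code, and key points into one assistant turn."""
    answer = "\n\n".join([ex["reasoning"], ex["code"], ex["key_points"]])
    return [
        {"role": "system", "content": ex["system"]},
        {"role": "user", "content": ex["instruction"]},
        {"role": "assistant", "content": answer},
    ]
```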
## Scaling to 20K+ Examples
To maximize quality, use the **OSS-Instruct pattern** (from Magicoder):
1. Crawl GitHub for GIS Python code (`import geopandas`, `import rasterio`, etc.)
2. Use GPT-4o to generate (instruction, solution) pairs from real code snippets
3. Execute and test all generated solutions
4. Add CoT annotations to passing examples (+20.9% pass@1 per CFD paper)
Target: 20K–75K examples for production-grade GIS-Coder.
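A minimal sketch of step 2 using the OpenAI Python client; the prompt and helper function are illustrative assumptions, not part of this repo:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Illustrative OSS-Instruct-style prompt: turn a real GIS snippet into an
# (instruction, solution) pair. Execution/testing happens downstream (step 3).
PROMPT = """You are writing GIS coding exercises.
Given the real-world Python snippet below, produce:
1. A self-contained natural-language instruction it could answer.
2. A complete, runnable solution with step-by-step reasoning.

Snippet:
{snippet}
"""

def generate_pair(snippet: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(snippet=snippet)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```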
## References
| Paper | Key Insight |
|-------|-------------|
| [CFD Fine-tuning](https://arxiv.org/abs/2504.09602) | QLoRA SFT recipe: 7B model beats 72B on domain tasks |
| [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Qwen2.5-Coder-7B best backbone for code LoRA |
| [GIS Benchmark](https://arxiv.org/abs/2410.04617) | All models score 0% on OSMnx/MovingPandas |
| [Magicoder](https://arxiv.org/abs/2312.02120) | OSS-Instruct for synthetic data from real code |
| [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | target all-linear, r=64-256, lr=2e-4 |