
# GIS-Coder 7B – Training Package

Fine-tune Qwen2.5-Coder-7B-Instruct into a GIS code specialist using QLoRA SFT.

πŸ“ What's Included

File Description
train_7b.py Production training script with CLI args
evaluate.py Evaluation on 12 GIS benchmarks with scoring
requirements.txt All dependencies

Dataset: `RhodWeo/gis-code-instructions` – 70 expert-curated GIS code examples covering 13 Python libraries.

## 🚀 Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Log in to Hugging Face

```bash
huggingface-cli login
```

### 3. Train (single GPU)

```bash
# Default settings (recommended for an A100 80GB)
python train_7b.py

# A10G / RTX 4090 (24 GB) - reduce batch size
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048

# H100 - can afford a larger batch and sequence length
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192

# Full-precision LoRA (no quantization, needs ~30 GB)
python train_7b.py --no_quantize --batch_size 1

# With Flash Attention (faster; requires flash-attn to be installed)
python train_7b.py --use_flash_attn

# With Trackio monitoring
python train_7b.py --use_trackio --trackio_project my-gis-coder
```

### 4. Multi-GPU

```bash
accelerate launch --num_processes 2 train_7b.py --batch_size 2 --grad_accum 4
```

### 5. Evaluate

```bash
# Evaluate the fine-tuned model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B

# Compare with the base model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B --compare_base

# Evaluate a local checkpoint
python evaluate.py --adapter_id ./gis-coder-7b-output/final
```

βš™οΈ Hyperparameter Guide

Recommended defaults (battle-tested recipe):

Parameter Value Source
--lr 2e-4 LoRA Without Regret (10Γ— base SFT rate)
--lora_r 32 MapCoder-Lite optimal for code tasks
--lora_alpha 16 Ξ±/r = 0.5
--target_modules all-linear LoRA Without Regret
--epochs 3 CFD paper: peak at epoch 2, decline after 4
--scheduler cosine Standard for LoRA
--warmup_ratio 0.1 CFD paper: 10% warmup
--max_length 4096 Covers longest GIS code examples
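These defaults map onto a PEFT + bitsandbytes setup along the following lines. This is a minimal sketch, not `train_7b.py` itself: the dropout value is an assumption (it is not listed in the table), and the script's actual wiring may differ.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for QLoRA (skipped when training with --no_quantize)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA adapter matching the table: r=32, alpha=16 (alpha/r = 0.5), all-linear
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.05,  # assumption: dropout is not specified in the table
    task_type="CAUSAL_LM",
)
```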

Hardware-specific settings:

| GPU | VRAM | `--batch_size` | `--grad_accum` | `--max_length` | Notes |
|---|---|---|---|---|---|
| RTX 3090 | 24 GB | 1 | 16 | 2048 | QLoRA only |
| RTX 4090 | 24 GB | 1 | 16 | 2048 | QLoRA, slightly faster |
| A10G | 24 GB | 1 | 16 | 2048 | QLoRA only |
| L40S | 48 GB | 2 | 8 | 4096 | QLoRA or LoRA |
| A100 40GB | 40 GB | 2 | 8 | 4096 | Recommended minimum |
| A100 80GB | 80 GB | 2 | 8 | 4096 | Ideal |
| H100 | 80 GB | 4 | 4 | 8192 | Fastest |
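Note that every preset in the table keeps the effective batch size at 16, trading per-device batch for gradient-accumulation steps as VRAM shrinks. A quick sanity check:

```python
def effective_batch(batch_size: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Effective batch = per-device batch x accumulation steps x GPUs."""
    return batch_size * grad_accum * num_gpus

# Single-GPU presets from the table all land on 16
assert effective_batch(1, 16) == 16  # 24 GB cards (QLoRA)
assert effective_batch(2, 8) == 16   # L40S / A100
assert effective_batch(4, 4) == 16   # H100

# The multi-GPU example (--num_processes 2, batch 2, accum 4) matches too
assert effective_batch(2, 4, num_gpus=2) == 16
```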

Ablation ideas:

```bash
# Higher LoRA rank (more capacity, slower)
python train_7b.py --lora_r 64 --lora_alpha 32

# Lower learning rate (more stable, slower convergence)
python train_7b.py --lr 5e-5

# More epochs (risks overfitting on 70 examples)
python train_7b.py --epochs 5

# Target only the attention layers (fewer params, faster)
python train_7b.py --target_modules q_proj,k_proj,v_proj,o_proj
```

## 📊 Expected Results

From our CPU training run with the 0.5B base model (70 examples, 3 epochs):

| Metric | Start → End |
|---|---|
| Loss | 1.52 → 0.88 (−42%) |
| Token accuracy | 69% → 79% |
| Eval quality score | 85% |

With the 7B model + QLoRA, expect significantly better results: the CFD paper achieved 88.7% accuracy with this exact recipe on a similarly sized domain-specific dataset.
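The −42% figure above is the relative loss reduction over training, which is quick to verify:

```python
# Relative loss reduction from the start/end losses in the results table
start_loss, end_loss = 1.52, 0.88
reduction_pct = round(100 * (start_loss - end_loss) / start_loss)
assert reduction_pct == 42
```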

## 📚 Dataset Details

70 examples covering 13 GIS Python libraries:

| Library | Examples | Why Important |
|---|---|---|
| OSMnx | 9 | All models score 0%: routing, POIs, isochrones |
| Rasterio | 9 | Satellite imagery, DEM, NDVI, reprojection |
| GeoPandas | 25 | Core: spatial joins, buffering, I/O |
| Shapely | 14 | Geometry operations, validation |
| MovingPandas | 3 | All models score 0%: GPS trajectories |
| GDAL | 6 | Raster processing, format conversion |
| PyProj | 2 | CRS handling (critical weakness) |
| H3 | 2 | Hexagonal indexing |
| Folium | 1 | Interactive maps |
| Fiona | 2 | Low-level vector I/O |
| xarray | 1 | Climate/raster datacubes |
| PyQGIS | 1 | Desktop GIS scripting |
| PySAL | 1 | Spatial statistics |

Each example includes:

- System prompt establishing GIS expertise
- Natural-language instruction
- Step-by-step chain-of-thought reasoning
- Complete, documented Python code
- Key points explaining design decisions
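Put together, an example looks roughly like the chat-format record below. This is an illustrative sketch only: the field names and wording are assumptions, not the dataset's actual schema.

```python
# Hedged sketch of one training example in chat format; the "messages"
# schema and all content strings here are illustrative assumptions.
example = {
    "messages": [
        {"role": "system",
         "content": "You are an expert GIS developer specializing in Python."},
        {"role": "user",
         "content": "Buffer every point in points.shp by 1 km."},
        {"role": "assistant",
         "content": (
             "Reasoning: buffering in metres needs a projected CRS, "
             "so reproject first.\n"
             "import geopandas as gpd\n"
             "gdf = gpd.read_file('points.shp').to_crs(3857)\n"
             "gdf['geometry'] = gdf.buffer(1000)\n"
             "Key point: buffering in a geographic CRS (degrees) gives "
             "meaningless distances."
         )},
    ]
}

# System prompt, instruction, and reasoning + code response, as listed above
assert [m["role"] for m in example["messages"]] == ["system", "user", "assistant"]
```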

## 🔬 Scaling to 20K+ Examples

To maximize quality, use the OSS-Instruct pattern (from Magicoder):

1. Crawl GitHub for GIS Python code (`import geopandas`, `import rasterio`, etc.)
2. Use GPT-4o to generate (instruction, solution) pairs from real code snippets
3. Execute and test all generated solutions
4. Add CoT annotations to passing examples (+20.9% pass@1 per the CFD paper)
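Step 1 amounts to an import filter over crawled files. A minimal sketch (the library list is a subset chosen for illustration):

```python
import re

# Match top-level "import X" / "from X ..." lines for a few target GIS libraries
GIS_IMPORT = re.compile(
    r"^\s*(?:import|from)\s+"
    r"(geopandas|rasterio|osmnx|shapely|movingpandas|pyproj)\b",
    re.MULTILINE,
)

def is_gis_snippet(code: str) -> bool:
    """Return True if the snippet imports at least one target GIS library."""
    return bool(GIS_IMPORT.search(code))

assert is_gis_snippet("import geopandas as gpd\n")
assert is_gis_snippet("from shapely.geometry import Point\n")
assert not is_gis_snippet("import numpy as np\n")
```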

Target: 20K–75K examples for production-grade GIS-Coder.

## 📖 References

| Paper | Key Insight |
|---|---|
| CFD Fine-tuning | QLoRA SFT recipe: a 7B model beats 72B on domain tasks |
| MapCoder-Lite | Qwen2.5-Coder-7B is the best backbone for code LoRA |
| GIS Benchmark | All models score 0% on OSMnx/MovingPandas |
| Magicoder | OSS-Instruct for synthetic data from real code |
| LoRA Without Regret | Target all-linear, r = 64–256, lr = 2e-4 |