# GIS-Coder 7B — Training Package

Fine-tune **Qwen2.5-Coder-7B-Instruct** into a GIS code specialist using QLoRA SFT.

## πŸ“ What's Included

| File | Description |
|------|-------------|
| `train_7b.py` | Production training script with CLI args |
| `evaluate.py` | Evaluation on 12 GIS benchmarks with scoring |
| `requirements.txt` | All dependencies |

Dataset: [`RhodWeo/gis-code-instructions`](https://huggingface.co/datasets/RhodWeo/gis-code-instructions) — 70 expert-curated GIS code examples covering 13 Python libraries.
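
To sanity-check the data before training, it can be loaded with the `datasets` library. A minimal sketch (the column names come from the dataset card, not this repo):

```python
from datasets import load_dataset

# Pull the 70-example training split from the Hub.
ds = load_dataset("RhodWeo/gis-code-instructions", split="train")

print(len(ds))  # expect 70
print(ds[0])    # inspect the record layout before wiring it into a trainer
```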

## 🚀 Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Login to HuggingFace

```bash
huggingface-cli login
```

### 3. Train (single GPU)

```bash
# Default settings (recommended for A100 80GB)
python train_7b.py

# A10G / RTX 4090 (24GB) — reduce batch size
python train_7b.py --batch_size 1 --grad_accum 16 --max_length 2048

# H100 — can afford larger batch and sequence length
python train_7b.py --batch_size 4 --grad_accum 4 --max_length 8192

# Full precision LoRA (no quantization, needs ~30GB)
python train_7b.py --no_quantize --batch_size 1

# With Flash Attention (faster, needs flash-attn installed)
python train_7b.py --use_flash_attn

# With Trackio monitoring
python train_7b.py --use_trackio --trackio_project my-gis-coder
```

### 4. Multi-GPU

```bash
accelerate launch --num_processes 2 train_7b.py --batch_size 2 --grad_accum 4
```

### 5. Evaluate

```bash
# Evaluate fine-tuned model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B

# Compare with base model
python evaluate.py --adapter_id RhodWeo/GIS-Coder-7B --compare_base

# Evaluate local checkpoint
python evaluate.py --adapter_id ./gis-coder-7b-output/final
```
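
For quick manual prompting outside `evaluate.py`, the adapter can be loaded with PEFT. A minimal inference sketch (it assumes the adapter config records the base model ID, which `AutoPeftModelForCausalLM` resolves automatically):

```python
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# Load the base model and apply the LoRA adapter in one call.
model = AutoPeftModelForCausalLM.from_pretrained(
    "RhodWeo/GIS-Coder-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

messages = [{"role": "user", "content": "Buffer every geometry in a GeoDataFrame by 500 m."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```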

## ⚙️ Hyperparameter Guide

### Recommended defaults (battle-tested recipe):

| Parameter | Value | Source |
|-----------|-------|--------|
| `--lr` | `2e-4` | LoRA Without Regret (10× base SFT rate) |
| `--lora_r` | `32` | MapCoder-Lite optimal for code tasks |
| `--lora_alpha` | `16` | α/r = 0.5 |
| `--target_modules` | `all-linear` | LoRA Without Regret |
| `--epochs` | `3` | CFD paper: peak at epoch 2, decline after 4 |
| `--scheduler` | `cosine` | Standard for LoRA |
| `--warmup_ratio` | `0.1` | CFD paper: 10% warmup |
| `--max_length` | `4096` | Covers longest GIS code examples |
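
These defaults map roughly onto `peft`/`transformers` config objects as below; this is an illustrative sketch, not the actual contents of `train_7b.py`:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization: the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings from the table above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,  # alpha/r = 0.5
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Optimizer schedule from the table above.
training_args = TrainingArguments(
    output_dir="gis-coder-7b-output",
    learning_rate=2e-4,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
)
```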

### Hardware-specific settings:

| GPU | VRAM | `--batch_size` | `--grad_accum` | `--max_length` | Notes |
|-----|------|----------------|-----------------|----------------|-------|
| RTX 3090 | 24GB | 1 | 16 | 2048 | QLoRA only |
| RTX 4090 | 24GB | 1 | 16 | 2048 | QLoRA, slightly faster |
| A10G | 24GB | 1 | 16 | 2048 | QLoRA only |
| L40S | 48GB | 2 | 8 | 4096 | QLoRA or LoRA |
| A100 40GB | 40GB | 2 | 8 | 4096 | Recommended minimum |
| A100 80GB | 80GB | 2 | 8 | 4096 | Ideal |
| H100 | 80GB | 4 | 4 | 8192 | Fastest |

Note that every preset keeps the effective batch size at 16 (`batch_size × grad_accum`); the rows trade per-step memory against sequence length and speed.

### Ablation ideas:

```bash
# Higher LoRA rank (more capacity, slower; alpha scaled to keep alpha/r = 0.5)
python train_7b.py --lora_r 64 --lora_alpha 32

# Lower learning rate (more stable, slower convergence)
python train_7b.py --lr 5e-5

# More epochs (risk overfitting on 70 examples)
python train_7b.py --epochs 5

# Target only attention layers (fewer params, faster)
python train_7b.py --target_modules q_proj,k_proj,v_proj,o_proj
```

## 📊 Expected Results

From our CPU training run with a 0.5B base model (70 examples, 3 epochs):

| Metric | Start → End |
|--------|------------|
| Loss | 1.52 → 0.88 (−42%) |
| Token accuracy | 69% → 79% |
| Eval quality score | 85% |

**With the 7B model + QLoRA, expect significantly better results.** The CFD paper achieved 88.7% accuracy with this exact recipe on a similarly sized domain-specific dataset.

## 📚 Dataset Details

**70 examples** covering 13 GIS Python libraries:

| Library | Examples | Why Important |
|---------|----------|---------------|
| OSMnx | 9 | **All models score 0%** — routing, POIs, isochrones |
| Rasterio | 9 | Satellite imagery, DEM, NDVI, reprojection |
| GeoPandas | 25 | Core: spatial joins, buffering, I/O |
| Shapely | 14 | Geometry operations, validation |
| MovingPandas | 3 | **All models score 0%** — GPS trajectories |
| GDAL | 6 | Raster processing, format conversion |
| PyProj | 2 | CRS handling (critical weakness) |
| H3 | 2 | Hexagonal indexing |
| Folium | 1 | Interactive maps |
| Fiona | 2 | Low-level vector I/O |
| xarray | 1 | Climate/raster datacubes |
| PyQGIS | 1 | Desktop GIS scripting |
| PySAL | 1 | Spatial statistics |

Each example includes:
- System prompt establishing GIS expertise
- Natural language instruction
- Step-by-step Chain-of-Thought reasoning
- Complete, documented Python code
- Key points explaining design decisions
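
For orientation, a single record might look like the following; the field names here are assumptions for illustration, not the dataset's actual schema (check the dataset card):

```python
# Hypothetical record shape; field names are illustrative assumptions.
example = {
    "system": "You are an expert GIS Python developer.",
    "instruction": "Compute NDVI from a Sentinel-2 scene with rasterio.",
    "reasoning": "1. Open the red and NIR bands. 2. Cast to float. 3. ...",
    "code": "import rasterio\n# ...",
    "key_points": ["Cast bands to float before dividing to avoid integer NDVI."],
}
```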

## 🔬 Scaling to 20K+ Examples

To maximize quality, use the **OSS-Instruct pattern** (from Magicoder):

1. Crawl GitHub for GIS Python code (`import geopandas`, `import rasterio`, etc.)
2. Use GPT-4o to generate (instruction, solution) pairs from real code snippets
3. Execute and test all generated solutions
4. Add CoT annotations to passing examples (+20.9% pass@1 per CFD paper)

Target: 20K–75K examples for production-grade GIS-Coder.
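
A heavily simplified sketch of steps 2 and 3; the prompt is illustrative and `run_in_sandbox` is a hypothetical helper, not part of this repo:

```python
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Here is a real GIS code snippet:\n\n{snippet}\n\n"
    "Write a self-contained (instruction, solution) pair inspired by it. "
    'Reply as JSON: {{"instruction": "...", "solution": "..."}}'
)

def oss_instruct(snippets):
    """Yield (instruction, solution) pairs whose solution actually runs."""
    for snippet in snippets:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": PROMPT.format(snippet=snippet)}],
            response_format={"type": "json_object"},
        )
        pair = json.loads(resp.choices[0].message.content)
        # `run_in_sandbox` is hypothetical: execute the code, return True on success.
        if run_in_sandbox(pair["solution"]):
            yield pair
```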

## 📖 References

| Paper | Key Insight |
|-------|-------------|
| [CFD Fine-tuning](https://arxiv.org/abs/2504.09602) | QLoRA SFT recipe: 7B model beats 72B on domain tasks |
| [MapCoder-Lite](https://arxiv.org/abs/2509.17489) | Qwen2.5-Coder-7B best backbone for code LoRA |
| [GIS Benchmark](https://arxiv.org/abs/2410.04617) | All models score 0% on OSMnx/MovingPandas |
| [Magicoder](https://arxiv.org/abs/2312.02120) | OSS-Instruct for synthetic data from real code |
| [LoRA Without Regret](https://arxiv.org/abs/2410.13732) | target all-linear, r=64-256, lr=2e-4 |