# PTv3-DALES: Point Transformer V3 for Aerial LiDAR Semantic Segmentation

A Point Transformer V3 (PTv3) model trained from scratch on the DALES aerial LiDAR dataset for 9-class semantic segmentation of airborne point clouds.

## Model Description

Existing pre-trained point cloud models (e.g., from Open3D-ML) were trained on indoor scenes (S3DIS, ScanNet) or autonomous driving data (SemanticKITTI, nuScenes). These models fail on aerial LiDAR due to fundamental domain mismatch: different viewpoint (top-down vs. street-level), scale (kilometres vs. rooms), and class semantics (ground/buildings/vegetation vs. walls/chairs). This model addresses the gap by training PTv3 directly on aerial LiDAR data.
- Model architecture: Point Transformer V3 (m1_base variant) with U-Net style encoder-decoder
- Parameters: ~46M
- Input: XYZ coordinates + normalised return number (4 channels)
- Output: Per-point class probabilities (9 classes)
- Training framework: PyTorch + Pointcept
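
The 4-channel input above can be sketched as follows. This is an illustrative reconstruction, not the training code: the field names and the normalisation (return number divided by number of returns) are assumptions.

```python
import numpy as np

def build_features(xyz, return_number, number_of_returns):
    """Stack XYZ with a normalised return number into the (N, 4) input.

    xyz: (N, 3) float array; the return fields are (N,) int arrays.
    The normalisation here (return_number / number_of_returns) is an
    assumption about how "normalised return number" is computed.
    """
    norm_return = return_number / np.maximum(number_of_returns, 1)
    return np.concatenate([xyz, norm_return[:, None]], axis=1).astype(np.float32)

xyz = np.array([[1.0, 2.0, 10.0], [1.5, 2.5, 12.0]])
feats = build_features(xyz, np.array([1, 2]), np.array([2, 2]))
print(feats.shape)  # (2, 4)
```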

## Training Details

### Dataset

DALES (Dayton Aerial LiDAR Data Set): 40 aerial LiDAR tiles (~500 m × 500 m each) from urban and suburban areas:
- Train: 25 tiles (2,567 blocks after spatial splitting)
- Val: 4 tiles (411 blocks, 15% held-out from train split)
- Test: 11 tiles
- Preprocessing: Tiles split into 50 m × 50 m non-overlapping blocks with mean-centred coordinates
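
The block split described above can be sketched like this. The grouping key and mean-centring follow the description; the function and variable names are illustrative, not taken from the preprocessing script.

```python
import numpy as np

def split_into_blocks(points, block_size=50.0):
    """points: (N, 3) XYZ array. Returns {block_key: mean-centred points}."""
    # Assign each point to a block by flooring its XY position.
    keys = np.floor(points[:, :2] / block_size).astype(int)
    blocks = {}
    for key in np.unique(keys, axis=0):
        mask = np.all(keys == key, axis=1)
        blk = points[mask]
        # Mean-centre the coordinates within each block.
        blocks[tuple(int(v) for v in key)] = blk - blk.mean(axis=0)
    return blocks

pts = np.array([[10.0, 10.0, 5.0], [20.0, 30.0, 7.0], [60.0, 10.0, 6.0]])
blocks = split_into_blocks(pts)
print(sorted(blocks))  # [(0, 0), (1, 0)]
```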

### Classes

| ID | Class | Train Distribution |
|---|---|---|
| 0 | Unknown | Ignored (weight = 0) |
| 1 | Ground | Dominant |
| 2 | Vegetation | Dominant |
| 3 | Cars | Rare |
| 4 | Trucks | Very rare |
| 5 | Power lines | Rare |
| 6 | Fences | Rare |
| 7 | Poles | Very rare |
| 8 | Buildings | Common |

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 100 |
| Optimizer | AdamW |
| Learning rate | 0.0005 |
| Scheduler | OneCycleLR (warmup 10%) |
| Effective batch size | 16 (batch=1 x accum=16) |
| Loss | Cross-entropy + Lovasz softmax |
| Gradient clipping | max_norm = 1.0 |
| Grid size (voxel) | 0.15 m |
| Max points per sample | 40,000 |
| Mixed precision | Disabled (fp32 only) |
| Class weighting | Inverse-frequency |
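
The inverse-frequency weighting with the "unknown" class ignored (weight = 0) can be sketched as below. The exact normalisation used in training is not documented here; this variant rescales the weights so they average to 1 over the 8 scored classes.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes=9, ignore_index=0):
    """Per-class loss weights proportional to 1 / class frequency."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    weights = np.zeros(num_classes)
    valid = np.arange(num_classes) != ignore_index
    # Classes absent from `labels` fall back to a count of 1 here.
    weights[valid] = 1.0 / np.maximum(counts[valid], 1)
    # Rescale so the scored classes have mean weight 1 (an assumption).
    weights[valid] *= valid.sum() / weights[valid].sum()
    return weights

labels = np.array([1] * 900 + [2] * 90 + [3] * 10)
w = inverse_frequency_weights(labels)
print(w[0])  # 0.0 -- "unknown" is excluded from the loss
```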

### Architecture

| Component | Configuration |
|---|---|
| Serialisation orders | z, z-trans, hilbert, hilbert-trans |
| Encoder depths | (2, 2, 2, 6, 2) |
| Encoder channels | (32, 64, 128, 256, 512) |
| Encoder heads | (2, 4, 8, 16, 32) |
| Decoder depths | (2, 2, 2, 2) |
| Decoder channels | (64, 64, 128, 256) |
| Decoder heads | (4, 4, 8, 16) |
| Patch size | 1024 |
| MLP ratio | 4 |
| Drop path | 0.3 |
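
PTv3 attends over patches of a 1-D sequence obtained by serialising points along space-filling curves (the four orders in the table). As a minimal illustration of the z-order case only, here is Morton-code bit interleaving; Pointcept's actual implementation works on voxelised integer grids and also supports Hilbert curves.

```python
def morton_encode(x, y, z):
    """Interleave the bits of three 16-bit coordinates (z-order curve)."""
    code = 0
    for bit in range(16):
        code |= ((x >> bit) & 1) << (3 * bit)
        code |= ((y >> bit) & 1) << (3 * bit + 1)
        code |= ((z >> bit) & 1) << (3 * bit + 2)
    return code

# Sorting by Morton code places spatially adjacent voxels near each other
# in the 1-D sequence from which attention patches (size 1024) are taken.
voxels = [(3, 1, 0), (0, 0, 0), (1, 1, 1), (2, 0, 0)]
order = sorted(voxels, key=lambda v: morton_encode(*v))
print(order)  # [(0, 0, 0), (1, 1, 1), (2, 0, 0), (3, 1, 0)]
```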

### Compute

- GPU: NVIDIA Tesla V100-PCIE-32GB (single GPU)
- Training time: 27.5 hours (~16.5 min/epoch)
- Precision: fp32 (fp16 AMP causes NaN on V100 due to attention softmax overflow)

## Results

### Test Set (26.2M points, 11 tiles)

| Class | IoU | Precision | Recall | Support |
|---|---|---|---|---|
| Ground | 95.69% | 97.41% | 98.18% | 12,065,396 |
| Vegetation | 90.01% | 95.80% | 93.71% | 8,691,162 |
| Buildings | 93.73% | 97.57% | 95.98% | 4,880,561 |
| Power lines | 88.63% | 96.11% | 91.92% | 73,735 |
| Cars | 71.40% | 86.05% | 80.74% | 273,015 |
| Poles | 57.48% | 66.48% | 80.94% | 29,046 |
| Fences | 37.76% | 41.06% | 82.43% | 172,857 |
| Trucks | 19.60% | 28.54% | 38.49% | 38,279 |

| Metric | Score |
|---|---|
| Overall Accuracy | 95.88% |
| mIoU (8 classes) | 69.29% |
| Best epoch | 83 |
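
As a cross-check, the reported mIoU is the unweighted mean of the eight per-class IoUs in the table above ("unknown" is excluded):

```python
# Per-class test IoUs from the table, in table order.
ious = [95.69, 90.01, 93.73, 88.63, 71.40, 57.48, 37.76, 19.60]
miou = sum(ious) / len(ious)
print(round(miou, 2))  # 69.29
```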

### Validation Set (8.5M points, 4 tiles)

| Metric | Score |
|---|---|
| Overall Accuracy | 92.14% |
| mIoU (8 classes) | 63.46% |

## Usage

### Evaluate on DALES

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jayakumarpujar/Ptv3",
    filename="base_model_ptv3_dales_.pth",
)
print(ckpt_path)  # pass this path as --checkpoint below
```

```bash
python scripts/train_ptv3_dales.py \
    --data_root data/dales_ptv3 \
    --eval_only \
    --checkpoint $CKPT_PATH \
    --no_amp
```

### Fine-tune on custom aerial LiDAR data

```bash
# 1. Preprocess your LAS files (same 9-class scheme or remap labels)
python scripts/preprocess_dales_ptv3.py \
    --input_dir your_las_data \
    --output_dir data/your_dataset

# 2. Fine-tune from the pre-trained checkpoint
python scripts/train_ptv3_dales.py \
    --data_root data/your_dataset \
    --resume $CKPT_PATH \
    --epochs 50 --lr 0.0001 \
    --no_amp
```
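
Remapping custom labels to the 9-class scheme (step 1 above) can be sketched as follows. The source class IDs on the left are illustrative ASPRS-style examples; only the target IDs on the right are fixed by the class table.

```python
import numpy as np

# Hypothetical mapping from a custom/ASPRS-style labelling to the
# 9-class DALES scheme used by this model.
REMAP = {
    0: 0,  # not classified -> unknown (ignored in the loss)
    2: 1,  # bare earth     -> ground
    5: 2,  # high veg       -> vegetation
    6: 8,  # building       -> buildings
}

def remap_labels(labels, mapping, unknown=0):
    """Apply a lookup table; unmapped source IDs become `unknown`."""
    lut = np.full(max(mapping) + 1, unknown, dtype=np.int64)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[labels]

labels = np.array([2, 5, 6, 3])  # 3 has no mapping -> unknown
remapped = remap_labels(labels, REMAP)
print(remapped)  # [1 2 8 0]
```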

## Checkpoint Contents

The `.pth` file contains:

```python
{
    "epoch": int,
    "model_state_dict": OrderedDict,      # full model weights
    "optimizer_state_dict": OrderedDict,  # AdamW state (for resuming)
    "scheduler_state_dict": dict,         # OneCycleLR state
    "scaler_state_dict": dict,            # GradScaler state
    "best_miou": float,
    "best_epoch": int,
    "num_classes": 9,
    "class_names": {0: "unknown", 1: "ground", ...},
}
```
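
For inference, only `model_state_dict` is needed; the optimizer, scheduler, and scaler entries exist to resume training. A minimal sketch (a dummy checkpoint stands in for the real `.pth` so the snippet is self-contained; substitute the downloaded path in practice, and note that checkpoints carrying optimizer state may need `torch.load(..., weights_only=False)` on PyTorch >= 2.6):

```python
import os
import tempfile

import torch

# Write a tiny stand-in checkpoint with the same top-level layout.
path = os.path.join(tempfile.mkdtemp(), "ckpt_demo.pth")
torch.save({
    "epoch": 83,
    "model_state_dict": {"head.weight": torch.zeros(9, 64)},
    "num_classes": 9,
}, path)

# Load on CPU and keep only the weights a deployed model needs.
ckpt = torch.load(path, map_location="cpu")
state_dict = ckpt["model_state_dict"]
print(ckpt["num_classes"], list(state_dict))  # 9 ['head.weight']
```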

## Dependencies

- PyTorch >= 2.0
- Pointcept (with compiled CUDA pointops extension)
- NumPy >= 1.24
- laspy >= 2.5

## Limitations

- Trucks (19.6% IoU) and fences (37.8% IoU) perform poorly due to very low representation in the training data and geometric ambiguity with other classes.
- Requires a CUDA GPU: Pointcept's serialisation-based attention relies on custom CUDA kernels.
- V100 GPUs must use fp32 (`--no_amp`); fp16 mixed precision causes NaN in the attention softmax. Ampere+ GPUs (A100, H100) can use AMP and flash attention.
- Trained on DALES (Canadian urban/suburban scenes). Performance on other geographic regions or landscapes may vary.

## Citation

```bibtex
@inproceedings{wu2024ptv3,
  title={Point Transformer V3: Simpler, Faster, Stronger},
  author={Wu, Xiaoyang and Jiang, Li and Wang, Peng-Shuai and Liu, Zhijian and Liu, Xihui and Qiao, Yu and Ouyang, Wanli and He, Tong and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2024}
}

@inproceedings{varney2020dales,
  title={DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation},
  author={Varney, Nina and Asari, Vijayan K. and Graehling, Quinn},
  booktitle={CVPRW},
  year={2020}
}
```

## Links

- Training code: opengeos/geoai, `scripts/train_ptv3_dales.py`
- Notebook: `train_point_cloud_ptv3.ipynb`
- PTv3 reference: Pointcept
- DALES dataset: University of Dayton Vision Lab