# PTv3-DALES: Point Transformer V3 for Aerial LiDAR Semantic Segmentation

A Point Transformer V3 (PTv3) model trained from scratch on the DALES aerial LiDAR dataset for 9-class semantic segmentation of airborne point clouds.

## Model Description

Existing pre-trained point cloud models (e.g., from Open3D-ML) were trained on indoor scenes (S3DIS, ScanNet) or autonomous driving data (SemanticKITTI, nuScenes). These models fail on aerial LiDAR due to fundamental domain mismatch: different viewpoint (top-down vs. street-level), scale (kilometres vs. rooms), and class semantics (ground/buildings/vegetation vs. walls/chairs). This model addresses the gap by training PTv3 directly on aerial LiDAR data.
- Model architecture: Point Transformer V3 (m1_base variant) with U-Net style encoder-decoder
- Parameters: ~46M
- Input: XYZ coordinates + normalised return number (4 channels)
- Output: Per-point class probabilities (9 classes)
- Training framework: PyTorch + Pointcept
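
The 4-channel input above can be sketched as follows. This is an illustrative reconstruction, not the training code: the field names and the normalisation (return number divided by number of returns) are assumptions.

```python
import numpy as np

def build_features(xyz, return_number, number_of_returns):
    """Stack XYZ with a normalised return number into the (N, 4) input.

    xyz: (N, 3) float array; the return fields are (N,) int arrays.
    The normalisation here (return_number / number_of_returns) is an
    assumption about how "normalised return number" is computed.
    """
    norm_return = return_number / np.maximum(number_of_returns, 1)
    return np.concatenate([xyz, norm_return[:, None]], axis=1).astype(np.float32)

xyz = np.array([[1.0, 2.0, 10.0], [1.5, 2.5, 12.0]])
feats = build_features(xyz, np.array([1, 2]), np.array([2, 2]))
print(feats.shape)  # (2, 4)
```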

## Training Details

### Dataset

DALES (Dayton Aerial LiDAR Data Set): 40 aerial LiDAR tiles (~500 m × 500 m each) from urban and suburban areas:
- Train: 25 tiles (2,567 blocks after spatial splitting)
- Val: 4 tiles (411 blocks, 15% held-out from train split)
- Test: 11 tiles
- Preprocessing: Tiles split into 50 m × 50 m non-overlapping blocks with mean-centred coordinates
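
The block split described above can be sketched like this. The grouping key and mean-centring follow the description; the function and variable names are illustrative, not taken from the preprocessing script.

```python
import numpy as np

def split_into_blocks(points, block_size=50.0):
    """points: (N, 3) XYZ array. Returns {block_key: mean-centred points}."""
    # Assign each point to a block by flooring its XY position.
    keys = np.floor(points[:, :2] / block_size).astype(int)
    blocks = {}
    for key in np.unique(keys, axis=0):
        mask = np.all(keys == key, axis=1)
        blk = points[mask]
        # Mean-centre the coordinates within each block.
        blocks[tuple(int(v) for v in key)] = blk - blk.mean(axis=0)
    return blocks

pts = np.array([[10.0, 10.0, 5.0], [20.0, 30.0, 7.0], [60.0, 10.0, 6.0]])
blocks = split_into_blocks(pts)
print(sorted(blocks))  # [(0, 0), (1, 0)]
```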

### Classes

| ID | Class | Train Distribution |
|---|---|---|
| 0 | Unknown | Ignored (weight = 0) |
| 1 | Ground | Dominant |
| 2 | Vegetation | Dominant |
| 3 | Cars | Rare |
| 4 | Trucks | Very rare |
| 5 | Power lines | Rare |
| 6 | Fences | Rare |
| 7 | Poles | Very rare |
| 8 | Buildings | Common |

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 100 |
| Optimizer | AdamW |
| Learning rate | 0.0005 |
| Scheduler | OneCycleLR (warmup 10%) |
| Effective batch size | 16 (batch=1 x accum=16) |
| Loss | Cross-entropy + Lovasz softmax |
| Gradient clipping | max_norm = 1.0 |
| Grid size (voxel) | 0.15 m |
| Max points per sample | 40,000 |
| Mixed precision | Disabled (fp32 only) |
| Class weighting | Inverse-frequency |
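
The inverse-frequency weighting with the "unknown" class ignored (weight = 0) can be sketched as below. The exact normalisation used in training is not documented here; this variant rescales the weights so they average to 1 over the 8 scored classes.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes=9, ignore_index=0):
    """Per-class loss weights proportional to 1 / class frequency."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    weights = np.zeros(num_classes)
    valid = np.arange(num_classes) != ignore_index
    # Classes absent from `labels` fall back to a count of 1 here.
    weights[valid] = 1.0 / np.maximum(counts[valid], 1)
    # Rescale so the scored classes have mean weight 1 (an assumption).
    weights[valid] *= valid.sum() / weights[valid].sum()
    return weights

labels = np.array([1] * 900 + [2] * 90 + [3] * 10)
w = inverse_frequency_weights(labels)
print(w[0])  # 0.0 -- "unknown" is excluded from the loss
```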

### Architecture

| Component | Configuration |
|---|---|
| Serialisation orders | z, z-trans, hilbert, hilbert-trans |
| Encoder depths | (2, 2, 2, 6, 2) |
| Encoder channels | (32, 64, 128, 256, 512) |
| Encoder heads | (2, 4, 8, 16, 32) |
| Decoder depths | (2, 2, 2, 2) |
| Decoder channels | (64, 64, 128, 256) |
| Decoder heads | (4, 4, 8, 16) |
| Patch size | 1024 |
| MLP ratio | 4 |
| Drop path | 0.3 |
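
PTv3 attends over patches of a 1-D sequence obtained by serialising points along space-filling curves (the four orders in the table). As a minimal illustration of the z-order case only, here is Morton-code bit interleaving; Pointcept's actual implementation works on voxelised integer grids and also supports Hilbert curves.

```python
def morton_encode(x, y, z):
    """Interleave the bits of three 16-bit coordinates (z-order curve)."""
    code = 0
    for bit in range(16):
        code |= ((x >> bit) & 1) << (3 * bit)
        code |= ((y >> bit) & 1) << (3 * bit + 1)
        code |= ((z >> bit) & 1) << (3 * bit + 2)
    return code

# Sorting by Morton code places spatially adjacent voxels near each other
# in the 1-D sequence from which attention patches (size 1024) are taken.
voxels = [(3, 1, 0), (0, 0, 0), (1, 1, 1), (2, 0, 0)]
order = sorted(voxels, key=lambda v: morton_encode(*v))
print(order)  # [(0, 0, 0), (1, 1, 1), (2, 0, 0), (3, 1, 0)]
```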

### Compute

- GPU: NVIDIA Tesla V100-PCIE-32GB (single GPU)
- Training time: 27.5 hours (~16.5 min/epoch)
- Precision: fp32 (fp16 AMP causes NaN on V100 due to attention softmax overflow)

## Results

### Test Set (26.2M points, 11 tiles)

| Class | IoU | Precision | Recall | Support |
|---|---|---|---|---|
| Ground | 95.69% | 97.41% | 98.18% | 12,065,396 |
| Vegetation | 90.01% | 95.80% | 93.71% | 8,691,162 |
| Buildings | 93.73% | 97.57% | 95.98% | 4,880,561 |
| Power lines | 88.63% | 96.11% | 91.92% | 73,735 |
| Cars | 71.40% | 86.05% | 80.74% | 273,015 |
| Poles | 57.48% | 66.48% | 80.94% | 29,046 |
| Fences | 37.76% | 41.06% | 82.43% | 172,857 |
| Trucks | 19.60% | 28.54% | 38.49% | 38,279 |

| Metric | Score |
|---|---|
| Overall Accuracy | 95.88% |
| mIoU (8 classes) | 69.29% |
| Best epoch | 83 |
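
As a cross-check, the reported mIoU is the unweighted mean of the eight per-class IoUs in the table above ("unknown" is excluded):

```python
# Per-class test IoUs from the table, in table order.
ious = [95.69, 90.01, 93.73, 88.63, 71.40, 57.48, 37.76, 19.60]
miou = sum(ious) / len(ious)
print(round(miou, 2))  # 69.29
```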

### Validation Set (8.5M points, 4 tiles)

| Metric | Score |
|---|---|
| Overall Accuracy | 92.14% |
| mIoU (8 classes) | 63.46% |

## Usage

### Evaluate on DALES

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jayakumarpujar/Ptv3",
    filename="base_model_ptv3_dales_.pth",
)
print(ckpt_path)  # pass this path as --checkpoint below
```

```bash
python scripts/train_ptv3_dales.py \
    --data_root data/dales_ptv3 \
    --eval_only \
    --checkpoint $CKPT_PATH \
    --no_amp
```

### Fine-tune on custom aerial LiDAR data

```bash
# 1. Preprocess your LAS files (same 9-class scheme or remap labels)
python scripts/preprocess_dales_ptv3.py \
    --input_dir your_las_data \
    --output_dir data/your_dataset

# 2. Fine-tune from the pre-trained checkpoint
python scripts/train_ptv3_dales.py \
    --data_root data/your_dataset \
    --resume $CKPT_PATH \
    --epochs 50 --lr 0.0001 \
    --no_amp
```
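
Remapping custom labels to the 9-class scheme (step 1 above) can be sketched as follows. The source class IDs on the left are illustrative ASPRS-style examples; only the target IDs on the right are fixed by the class table.

```python
import numpy as np

# Hypothetical mapping from a custom/ASPRS-style labelling to the
# 9-class DALES scheme used by this model.
REMAP = {
    0: 0,  # not classified -> unknown (ignored in the loss)
    2: 1,  # bare earth     -> ground
    5: 2,  # high veg       -> vegetation
    6: 8,  # building       -> buildings
}

def remap_labels(labels, mapping, unknown=0):
    """Apply a lookup table; unmapped source IDs become `unknown`."""
    lut = np.full(max(mapping) + 1, unknown, dtype=np.int64)
    for src, dst in mapping.items():
        lut[src] = dst
    return lut[labels]

labels = np.array([2, 5, 6, 3])  # 3 has no mapping -> unknown
remapped = remap_labels(labels, REMAP)
print(remapped)  # [1 2 8 0]
```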

## Checkpoint Contents

The `.pth` file contains:

```python
{
    "epoch": int,
    "model_state_dict": OrderedDict,      # full model weights
    "optimizer_state_dict": OrderedDict,  # AdamW state (for resuming)
    "scheduler_state_dict": dict,         # OneCycleLR state
    "scaler_state_dict": dict,            # GradScaler state
    "best_miou": float,
    "best_epoch": int,
    "num_classes": 9,
    "class_names": {0: "unknown", 1: "ground", ...},
}
```
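
For inference, only `model_state_dict` is needed; the optimizer, scheduler, and scaler entries exist to resume training. A minimal sketch (a dummy checkpoint stands in for the real `.pth` so the snippet is self-contained; substitute the downloaded path in practice, and note that checkpoints carrying optimizer state may need `torch.load(..., weights_only=False)` on PyTorch >= 2.6):

```python
import os
import tempfile

import torch

# Write a tiny stand-in checkpoint with the same top-level layout.
path = os.path.join(tempfile.mkdtemp(), "ckpt_demo.pth")
torch.save({
    "epoch": 83,
    "model_state_dict": {"head.weight": torch.zeros(9, 64)},
    "num_classes": 9,
}, path)

# Load on CPU and keep only the weights a deployed model needs.
ckpt = torch.load(path, map_location="cpu")
state_dict = ckpt["model_state_dict"]
print(ckpt["num_classes"], list(state_dict))  # 9 ['head.weight']
```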

## Dependencies

- PyTorch >= 2.0
- Pointcept (with compiled CUDA pointops extension)
- NumPy >= 1.24
- laspy >= 2.5

## Limitations

- Trucks (19.6% IoU) and fences (37.8% IoU) perform poorly due to very low representation in the training data and geometric ambiguity with other classes.
- Requires a CUDA GPU: Pointcept's serialisation-based attention relies on custom CUDA kernels.
- V100 GPUs must use fp32 (`--no_amp`); fp16 mixed precision causes NaN in the attention softmax. Ampere+ GPUs (A100, H100) can use AMP and flash attention.
- Trained on DALES (Canadian urban/suburban scenes). Performance on other geographic regions or landscapes may vary.

## Citation

```bibtex
@inproceedings{wu2024ptv3,
  title={Point Transformer V3: Simpler, Faster, Stronger},
  author={Wu, Xiaoyang and Jiang, Li and Wang, Peng-Shuai and Liu, Zhijian and Liu, Xihui and Qiao, Yu and Ouyang, Wanli and He, Tong and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2024}
}

@inproceedings{varney2020dales,
  title={DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation},
  author={Varney, Nina and Asari, Vijayan K. and Graehling, Quinn},
  booktitle={CVPRW},
  year={2020}
}
```

## Links

- Training code: opengeos/geoai, `scripts/train_ptv3_dales.py`
- Notebook: `train_point_cloud_ptv3.ipynb`
- PTv3 reference: Pointcept
- DALES dataset: University of Dayton Vision Lab