# SegFormer-B0 – Road Scene Segmentation (7 Game-Asset Classes)
A SegFormer-B0 model fine-tuned on Cityscapes for 7-class semantic segmentation of road scenes. The model uses a purpose-built taxonomy in which every class maps directly to a game element: road texture, sky backdrop, tree sprites, building silhouettes, and so on.
| Property | Value |
|---|---|
| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| Parameters | 3.7 M |
| Input | RGB image, any resolution (resized to 512×512) |
| Output | 7-class pixel mask (upsampled to input resolution at inference) |
| Best val mIoU | ≥ 84.0 % |
| Format | SafeTensors (14.2 MB) |
## 7-Class Game Taxonomy
| ID | Class | IoU | Game Function |
|---|---|---|---|
| 0 | road | 97.5 % | Driving surface → grey asphalt texture sampling |
| 1 | sidewalk | 80.3 % | Ground-level non-road surfaces (sidewalk, terrain) |
| 2 | building | 88.1 % | Background vertical structures → building silhouettes |
| 3 | vegetation | 89.1 % | Tall greenery → tree sprite extraction |
| 4 | sky | 92.2 % | Sky band → direct crop for game background |
| 5 | vehicle | 87.8 % | Road obstacles |
| 6 | roadside_object | 52.7 % | Thin vertical roadside elements (poles, signs, people) |
## Key design decisions
- Road and sidewalk kept separate so road color sampling produces pure grey asphalt without contamination from sidewalk/terrain tones.
- Terrain grouped with sidewalk, not vegetation: ground-level grass strips serve the same game function as sidewalk.
- All building-like structures merged (building, wall, fence): the game treats them identically as background geometry.
- All vehicle types merged: the game treats every vehicle as a road obstacle.
- Roadside objects (poles, traffic lights, signs, persons, riders) are grouped into a single thin-element class with fallback stock sprites when extraction quality is low.
## Usage
```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-road-scene-7class")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-road-scene-7class")
model.eval()

image = Image.open("road_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upsample logits to the original image size
mask = F.interpolate(
    outputs.logits,
    size=image.size[::-1],  # PIL gives (W, H); interpolate expects (H, W)
    mode="bilinear",
    align_corners=False,
).argmax(dim=1)[0]

# mask values: 0=road, 1=sidewalk, 2=building, 3=vegetation, 4=sky, 5=vehicle, 6=roadside_object
```
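For a quick visual sanity check, the class mask can be colorized with a palette lookup. The colors below are arbitrary illustrative choices, not part of the model:

```python
import numpy as np

# Hypothetical display palette: one RGB color per class ID 0-6.
PALETTE = np.array([
    [128, 128, 128],  # 0 road
    [244, 164,  96],  # 1 sidewalk
    [139,  69,  19],  # 2 building
    [ 34, 139,  34],  # 3 vegetation
    [135, 206, 235],  # 4 sky
    [220,  20,  60],  # 5 vehicle
    [255, 215,   0],  # 6 roadside_object
], dtype=np.uint8)

def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) class-ID mask to an (H, W, 3) RGB image via fancy indexing."""
    return PALETTE[mask]

demo = np.array([[0, 4], [3, 6]])
print(colorize(demo).shape)  # (2, 2, 3)
```

The resulting array can be passed to `PIL.Image.fromarray` for display or blended with the input photo as an overlay.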
## Training Details
### Dataset
`Chris1/cityscapes_segmentation` – urban street scenes from 50 European cities.
| Split | Images |
|---|---|
| Train | 2,975 |
| Val | 500 |
The original Cityscapes masks use label IDs 0–33 stored as 3-channel RGB images. A custom preprocessing pipeline extracts channel 0 and applies a 256-element lookup table to remap all 34 Cityscapes classes into the 7-class game taxonomy in a single vectorized operation (unmapped classes → 255 = ignore).
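The remap step can be sketched as follows. The source IDs are the standard Cityscapes label IDs; the groupings mirror the design decisions above, but the exact mapping shipped with this model may differ:

```python
import numpy as np

IGNORE = 255

# Assumed mapping: standard Cityscapes label IDs -> 7-class game taxonomy.
CITYSCAPES_TO_GAME = {
    7: 0,                                       # road
    8: 1, 22: 1,                                # sidewalk, terrain
    11: 2, 12: 2, 13: 2,                        # building, wall, fence
    21: 3,                                      # vegetation
    23: 4,                                      # sky
    26: 5, 27: 5, 28: 5, 31: 5, 32: 5, 33: 5,  # car, truck, bus, train, motorcycle, bicycle
    17: 6, 19: 6, 20: 6, 24: 6, 25: 6,         # pole, traffic light, traffic sign, person, rider
}

# Build the 256-element lookup table: unmapped IDs -> 255 (ignore).
lut = np.full(256, IGNORE, dtype=np.uint8)
for src, dst in CITYSCAPES_TO_GAME.items():
    lut[src] = dst

def remap(rgb_mask: np.ndarray) -> np.ndarray:
    """Remap an (H, W, 3) RGB label image in one vectorized indexing operation."""
    return lut[rgb_mask[..., 0]]  # label IDs live in channel 0
```

Indexing `lut` with the whole channel-0 array remaps every pixel at once, with no Python-level loop over the image.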
### Hyperparameters
| Parameter | Value |
|---|---|
| Base weights | nvidia/segformer-b0-finetuned-ade-512-512 (encoder only) |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10 % of total steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 gradient accumulation) |
| Training resolution | 512 Γ 512 |
| Precision | FP16 mixed precision |
| Epochs | 50 |
| Augmentation | ColorJitter (brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1), train only |
| Best-model selection | Highest mean IoU on validation set (load_best_model_at_end=True) |
| Hardware | NVIDIA T4 (16 GB VRAM) |
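The table above maps onto a `transformers` Trainer configuration roughly as follows. This is a sketch, not the actual training script: argument names follow the current `TrainingArguments` API, and anything not stated in the table (output directory, eval/save strategy) is an assumption.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="segformer-b0-road-scene-7class",  # assumed
    learning_rate=6e-5,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    num_train_epochs=50,
    fp16=True,
    eval_strategy="epoch",           # assumed; required for best-model selection
    save_strategy="epoch",           # assumed
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou",
    remove_unused_columns=False,     # keep image columns (see Implementation Notes)
    label_names=["labels"],
)
```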
### Training Curve
| Epoch | mIoU | Road | Sidewalk | Building | Vegetation | Sky | Vehicle | Roadside Obj |
|---|---|---|---|---|---|---|---|---|
| 1 | 59.5 % | 93.2 % | 54.6 % | 74.3 % | 73.7 % | 56.9 % | 63.9 % | 0.0 % |
| 3 | 73.8 % | 95.4 % | 69.4 % | 82.7 % | 82.5 % | 84.3 % | 79.4 % | 23.1 % |
| 5 | 80.4 % | 96.7 % | 76.5 % | 86.2 % | 87.0 % | 88.9 % | 84.3 % | 43.2 % |
| 9 | 82.9 % | 97.4 % | 79.2 % | 87.5 % | 88.3 % | 91.0 % | 86.7 % | 50.4 % |
| 16 | 84.0 % | 97.5 % | 80.3 % | 88.1 % | 89.1 % | 92.2 % | 87.8 % | 52.7 % |
The model converges quickly thanks to transfer learning: the pretrained encoder already understands road-scene features; only the 7-class decoder head is learned from scratch.
## Implementation Notes
Several non-obvious flags are required for correct training:

- `ignore_mismatched_sizes=True` – the pretrained decoder head has a different number of output classes.
- `remove_unused_columns=False` – prevents the Trainer from dropping image data columns.
- `label_names=["labels"]` – tells the Trainer which key holds the segmentation targets.
- `do_reduce_labels=False` – Cityscapes labels don't need ADE20K-style background subtraction.
- Logit upsampling in `compute_metrics` – SegFormer outputs at ¼ resolution; logits must be upsampled before comparison with ground-truth masks.
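The metric computation can be sketched in isolation: per-class IoU over an upsampled prediction, ignoring 255-labeled pixels. Nearest-neighbor upsampling via `repeat` stands in here for the bilinear logit upsampling used in the real `compute_metrics`:

```python
import numpy as np

def per_class_iou(pred: np.ndarray, target: np.ndarray,
                  num_classes: int = 7, ignore_index: int = 255) -> np.ndarray:
    """IoU per class over full-resolution masks, skipping ignored pixels."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = np.full(num_classes, np.nan)  # NaN for classes absent from both masks
    for c in range(num_classes):
        inter = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:
            ious[c] = inter / union
    return ious

# Quarter-resolution prediction upsampled 4x (nearest) to match the label mask.
pred_small = np.array([[0, 4], [3, 5]])
pred_full = pred_small.repeat(4, axis=0).repeat(4, axis=1)  # (8, 8)
target = pred_full.copy()
target[0, 0] = 255  # one ignored pixel
print(np.nanmean(per_class_iou(pred_full, target)))  # mean IoU over classes present
```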
## Intended Use
This model is designed for a game asset extraction pipeline where a user uploads a road photograph and the runtime transforms it into game elements:
- Road shape: fit road boundaries from the road mask; derive perspective and horizon.
- Color palette: sample dominant colors from each masked region of the original image.
- Sky: crop the sky band directly using the sky mask.
- Tree sprites: blob detection on the vegetation mask; crop with alpha transparency.
- Building silhouettes: extract from the building mask for background geometry.
- Fallback system: when extraction quality is poor for any element, use palette-matched stock assets.
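The color-palette step above can be sketched as a per-class statistic over the masked pixels. The median is an assumption for illustration; the actual pipeline may use a different estimator (mode, k-means, etc.):

```python
import numpy as np

def dominant_color(image: np.ndarray, mask: np.ndarray, class_id: int):
    """Median RGB color of the pixels assigned to class_id; None if the class is absent."""
    pixels = image[mask == class_id]  # boolean indexing -> (N, 3) pixel list
    if pixels.size == 0:
        return None
    return np.median(pixels, axis=0).astype(np.uint8)

# Toy 2x2 image: left column grey "road" (class 0), right column blue "sky" (class 4).
img = np.array([[[100, 100, 100], [135, 206, 235]],
                [[110, 110, 110], [135, 206, 235]]], dtype=np.uint8)
msk = np.array([[0, 4], [0, 4]])
print(dominant_color(img, msk, 0))  # median of the two grey pixels
```

Returning `None` for absent classes gives the caller a natural hook for the fallback system described above.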
## Limitations
- Trained exclusively on European urban street scenes (Cityscapes). Performance may degrade on rural roads, highways without sidewalks, non-European road styles, or indoor scenes.
- The `roadside_object` class (52.7 % IoU) is the weakest: thin elements like poles and signs are inherently difficult at 512×512 resolution. The intended runtime uses fallback sprites for this class.
- Not suitable for safety-critical autonomous driving: the merged taxonomy intentionally discards distinctions (truck vs. car, wall vs. fence) that matter for driving but not for game art.
## Citation
Please cite this model if you use it:
```bibtex
@misc{corbetta_segformer_road_scene_7class_2026,
  author       = {Marco Corbetta},
  title        = {segformer-b0-road-scene-7class: SegFormer-B0 fine-tuned on Cityscapes for 7-class game-asset segmentation},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Marco333/segformer-b0-road-scene-7class}}
}
```
SegFormer:
```bibtex
@inproceedings{xie2021segformer,
  title     = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author    = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  booktitle = {NeurIPS},
  year      = {2021}
}
```