# SegFormer-B0: Road Scene Segmentation (7 Game-Asset Classes)

A SegFormer-B0 model fine-tuned on Cityscapes for 7-class semantic segmentation of road scenes. The model uses a purpose-built taxonomy in which every class maps directly to a game element: road texture, sky backdrop, tree sprites, building silhouettes, and so on.

| Property | Value |
|---|---|
| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| Parameters | 3.7 M |
| Input | RGB image, any resolution (resized to 512 × 512) |
| Output | 7-class pixel mask (upsampled to input resolution at inference) |
| Best val mIoU | ≥ 84.0 % |
| Format | SafeTensors (14.2 MB) |

## 7-Class Game Taxonomy

| ID | Class | IoU | Game Function |
|---:|---|---:|---|
| 0 | `road` | 97.5 % | Driving surface: grey asphalt texture sampling |
| 1 | `sidewalk` | 80.3 % | Ground-level non-road surfaces (sidewalk, terrain) |
| 2 | `building` | 88.1 % | Background vertical structures: building silhouettes |
| 3 | `vegetation` | 89.1 % | Tall greenery: tree sprite extraction |
| 4 | `sky` | 92.2 % | Sky band: direct crop for game background |
| 5 | `vehicle` | 87.8 % | Road obstacles |
| 6 | `roadside_object` | 52.7 % | Thin vertical roadside elements (poles, signs, people) |

## Key design decisions

- Road and sidewalk kept separate so road color sampling produces pure grey asphalt without contamination from sidewalk / terrain tones.
- Terrain grouped with sidewalk, not vegetation: ground-level grass strips serve the same game function as sidewalk.
- All building-like structures merged (building, wall, fence): the game treats them identically as background geometry.
- All vehicle types merged: the game treats every vehicle as a road obstacle.
- Roadside objects (poles, traffic lights, signs, persons, riders) grouped into a single thin-element class, with fallback stock sprites when extraction quality is low.
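The merges above can be sketched as a name-level mapping. This is illustrative only: the class names follow the Cityscapes label set, the game-class IDs follow the taxonomy table, and the actual training pipeline remaps numeric label IDs rather than names.

```python
# Illustrative name-level grouping of Cityscapes classes into the 7 game
# classes. The real remap operates on Cityscapes label IDs via a lookup
# table; this dict just mirrors the design decisions listed above.
GAME_CLASS = {
    "road": 0,
    "sidewalk": 1, "terrain": 1,                      # ground-level non-road
    "building": 2, "wall": 2, "fence": 2,             # background geometry
    "vegetation": 3,
    "sky": 4,
    "car": 5, "truck": 5, "bus": 5, "train": 5,       # all road obstacles
    "motorcycle": 5, "bicycle": 5,
    "pole": 6, "traffic light": 6, "traffic sign": 6, # thin elements
    "person": 6, "rider": 6,
}

def to_game_class(cityscapes_name: str) -> int:
    # Anything not listed falls through to 255 (ignore), mirroring the
    # lookup-table default used in preprocessing.
    return GAME_CLASS.get(cityscapes_name, 255)
```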

## Usage

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-road-scene-7class")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-road-scene-7class")

image = Image.open("road_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SegFormer outputs logits at 1/4 of the processing resolution;
# upsample to the original image size before taking the argmax.
mask = F.interpolate(
    outputs.logits,
    size=image.size[::-1],  # PIL size is (W, H); reversed -> (H, W)
    mode="bilinear",
    align_corners=False,
).argmax(dim=1)[0]

# mask values: 0=road, 1=sidewalk, 2=building, 3=vegetation, 4=sky, 5=vehicle, 6=roadside_object
```
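For a quick visual check of the predicted mask, it can be colorized per class. The palette below is arbitrary (chosen here for illustration, not part of the model), and `colorize` / `class_mask` are helper names introduced in this sketch:

```python
import numpy as np

# Arbitrary display palette (RGB) for the 7 game classes; purely illustrative.
PALETTE = np.array([
    [128,  64, 128],  # 0 road
    [244,  35, 232],  # 1 sidewalk
    [ 70,  70,  70],  # 2 building
    [107, 142,  35],  # 3 vegetation
    [ 70, 130, 180],  # 4 sky
    [  0,   0, 142],  # 5 vehicle
    [220, 220,   0],  # 6 roadside_object
], dtype=np.uint8)

def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs 0-6 to an (H, W, 3) RGB image."""
    return PALETTE[mask]

def class_mask(mask: np.ndarray, class_id: int) -> np.ndarray:
    """Boolean (H, W) mask selecting one class, e.g. the sky band."""
    return mask == class_id
```

`colorize(mask.numpy())` blended over the input image gives a fast sanity check before running any downstream extraction.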

## Training Details

### Dataset

Chris1/cityscapes_segmentation: urban street scenes from 50 European cities.

| Split | Images |
|---|---:|
| Train | 2,975 |
| Val | 500 |

The original Cityscapes masks store label IDs 0–33 as 3-channel RGB images. A custom preprocessing pipeline extracts channel 0 and applies a 256-element lookup table that remaps all 34 Cityscapes classes into the 7-class game taxonomy in a single vectorized operation (unmapped classes → 255 = ignore).
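That lookup-table remap can be sketched as follows. Only a few Cityscapes IDs are filled in here to show the pattern; the full table covers all 34 classes:

```python
import numpy as np

# 256-entry LUT: index = Cityscapes label ID, value = game class ID.
# Unmapped IDs default to 255 (ignore). The entries shown are a sample
# (Cityscapes: 7=road, 8=sidewalk, 11=building, 21=vegetation, 23=sky, 26=car).
lut = np.full(256, 255, dtype=np.uint8)
lut[7]  = 0  # road
lut[8]  = 1  # sidewalk
lut[11] = 2  # building
lut[21] = 3  # vegetation
lut[23] = 4  # sky
lut[26] = 5  # car -> vehicle

def remap(label_rgb: np.ndarray) -> np.ndarray:
    """Take channel 0 of an (H, W, 3) label image and remap every pixel
    in one vectorized fancy-indexing operation."""
    return lut[label_rgb[..., 0]]
```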

### Hyperparameters

| Parameter | Value |
|---|---|
| Base weights | nvidia/segformer-b0-finetuned-ade-512-512 (encoder only) |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10 % of total steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Training resolution | 512 × 512 |
| Precision | FP16 mixed precision |
| Epochs | 50 |
| Augmentation | ColorJitter (brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1), train only |
| Best-model selection | Highest mean IoU on validation set (`load_best_model_at_end=True`) |
| Hardware | NVIDIA T4 (16 GB VRAM) |
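Expressed as a transformers `TrainingArguments` config fragment, the table above corresponds roughly to the following. The `output_dir`, evaluation cadence, and `metric_for_best_model` name are placeholders, not confirmed values from the training run:

```python
from transformers import TrainingArguments

# Rough reconstruction of the hyperparameter table; paths and the
# eval/save cadence are assumptions, not the exact values used.
args = TrainingArguments(
    output_dir="segformer-b0-road-scene-7class",  # placeholder
    learning_rate=6e-5,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,    # effective batch size 8
    num_train_epochs=50,
    fp16=True,
    eval_strategy="epoch",            # must match save_strategy below
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou", # assumed metric key
    remove_unused_columns=False,      # see Implementation Notes
    label_names=["labels"],
)
```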

### Training Curve

| Epoch | mIoU | Road | Sidewalk | Building | Vegetation | Sky | Vehicle | Roadside Obj |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 59.5 % | 93.2 % | 54.6 % | 74.3 % | 73.7 % | 56.9 % | 63.9 % | 0.0 % |
| 3 | 73.8 % | 95.4 % | 69.4 % | 82.7 % | 82.5 % | 84.3 % | 79.4 % | 23.1 % |
| 5 | 80.4 % | 96.7 % | 76.5 % | 86.2 % | 87.0 % | 88.9 % | 84.3 % | 43.2 % |
| 9 | 82.9 % | 97.4 % | 79.2 % | 87.5 % | 88.3 % | 91.0 % | 86.7 % | 50.4 % |
| 16 | 84.0 % | 97.5 % | 80.3 % | 88.1 % | 89.1 % | 92.2 % | 87.8 % | 52.7 % |

The model converges quickly thanks to transfer learning: the pretrained encoder already understands road-scene features, so only the 7-class decoder head is learned from scratch.

## Implementation Notes

Several non-obvious flags are required for correct training:

- `ignore_mismatched_sizes=True`: the pretrained decoder head has a different number of output classes.
- `remove_unused_columns=False`: prevents the Trainer from dropping image data columns.
- `label_names=["labels"]`: tells the Trainer which key holds the segmentation targets.
- `do_reduce_labels=False`: Cityscapes labels don't need ADE20K-style background subtraction.
- Logit upsampling in `compute_metrics`: SegFormer outputs at ¼ resolution; logits must be upsampled before comparison with ground-truth masks.
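The upsampling requirement in `compute_metrics` can be sketched as below. This is not the card's actual metric code: pixel accuracy stands in for the real per-class IoU computation, and the function name and metric key are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F

def compute_metrics(eval_pred):
    """Sketch of the metric hook: SegFormer logits arrive at 1/4 of the
    label resolution, so they are bilinearly upsampled to the label size
    before comparison. Pixels labeled 255 are ignored."""
    logits, labels = eval_pred  # logits: (N, 7, H/4, W/4), labels: (N, H, W)
    if isinstance(logits, np.ndarray):
        logits = torch.from_numpy(logits)
    if isinstance(labels, np.ndarray):
        labels = torch.from_numpy(labels)
    upsampled = F.interpolate(logits, size=labels.shape[-2:],
                              mode="bilinear", align_corners=False)
    preds = upsampled.argmax(dim=1)
    valid = labels != 255  # 255 = ignore index from the LUT remap
    acc = (preds[valid] == labels[valid]).float().mean().item()
    return {"pixel_accuracy": acc}
```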

## Intended Use

This model is designed for a game asset extraction pipeline where a user uploads a road photograph and the runtime transforms it into game elements:

  1. Road shape: fit road boundaries from the road mask; derive perspective and horizon.
  2. Color palette: sample dominant colors from each masked region of the original image.
  3. Sky: crop the sky band directly using the sky mask.
  4. Tree sprites: blob detection on the vegetation mask; crop with alpha transparency.
  5. Building silhouettes: extract from the building mask for background geometry.
  6. Fallback system: when extraction quality is poor for any element, use palette-matched stock assets.
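The palette-sampling step (2) and the fallback cue (6) can be sketched together. A median over masked pixels is one simple, outlier-robust choice for "dominant color"; the function name and the None-as-fallback convention are illustrative, not the pipeline's actual API:

```python
import numpy as np

def dominant_color(image: np.ndarray, mask: np.ndarray, class_id: int):
    """Median RGB of the pixels a class occupies: a simple stand-in for
    the pipeline's palette-sampling step. Returns None when the class is
    absent, which a caller could treat as the cue to fall back to
    palette-matched stock assets."""
    pixels = image[mask == class_id]  # (K, 3) pixels of that class
    if pixels.size == 0:
        return None
    return tuple(int(c) for c in np.median(pixels, axis=0))
```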

## Limitations

- Trained exclusively on European urban street scenes (Cityscapes). Performance may degrade on rural roads, highways without sidewalks, non-European road styles, or indoor scenes.
- The `roadside_object` class (52.7 % IoU) is the weakest: thin elements like poles and signs are inherently difficult at 512 × 512 resolution. The intended runtime uses fallback sprites for this class.
- Not suitable for safety-critical autonomous driving: the merged taxonomy intentionally discards distinctions (truck vs. car, wall vs. fence) that matter for driving but not for game art.

## Citation

Please cite this model if you use it:

```bibtex
@misc{corbetta_segformer_road_scene_7class_2026,
  author       = {Marco Corbetta},
  title        = {segformer-b0-road-scene-7class: SegFormer-B0 fine-tuned on Cityscapes for 7-class game-asset segmentation},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Marco333/segformer-b0-road-scene-7class}}
}
```

SegFormer:

```bibtex
@inproceedings{xie2021segformer,
  title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  booktitle={NeurIPS},
  year={2021}
}
```