# SegFormer-B0: Road Scene Segmentation (7 Game-Asset Classes)

A SegFormer-B0 model fine-tuned on Cityscapes for 7-class semantic segmentation of road scenes. The model uses a purpose-built taxonomy in which every class maps directly to a game element: road texture, sky backdrop, tree sprites, building silhouettes, and so on.

| Property | Value |
|---|---|
| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| Parameters | 3.7 M |
| Input | RGB image, any resolution (resized to 512 × 512) |
| Output | 7-class pixel mask (upsampled to input resolution at inference) |
| Best val mIoU | ≥ 84.0 % |
| Format | SafeTensors (14.2 MB) |

## 7-Class Game Taxonomy

| ID | Class | IoU | Game Function |
|---:|---|---:|---|
| 0 | `road` | 97.5 % | Driving surface: grey asphalt texture sampling |
| 1 | `sidewalk` | 80.3 % | Ground-level non-road surfaces (sidewalk, terrain) |
| 2 | `building` | 88.1 % | Background vertical structures: building silhouettes |
| 3 | `vegetation` | 89.1 % | Tall greenery: tree sprite extraction |
| 4 | `sky` | 92.2 % | Sky band: direct crop for game background |
| 5 | `vehicle` | 87.8 % | Road obstacles |
| 6 | `roadside_object` | 52.7 % | Thin vertical roadside elements (poles, signs, people) |

## Key design decisions

- Road and sidewalk kept separate so road color sampling produces pure grey asphalt without contamination from sidewalk / terrain tones.
- Terrain grouped with sidewalk, not vegetation: ground-level grass strips serve the same game function as sidewalk.
- All building-like structures merged (building, wall, fence): the game treats them identically as background geometry.
- All vehicle types merged: the game treats every vehicle as a road obstacle.
- Roadside objects (poles, traffic lights, signs, persons, riders) grouped into a single thin-element class, with fallback stock sprites when extraction quality is low.
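The merges above can be sketched as a name-level mapping. This is illustrative only: the class names follow the Cityscapes label set, the game-class IDs follow the taxonomy table, and the actual training pipeline remaps numeric label IDs rather than names.

```python
# Illustrative name-level grouping of Cityscapes classes into the 7 game
# classes. The real remap operates on Cityscapes label IDs via a lookup
# table; this dict just mirrors the design decisions listed above.
GAME_CLASS = {
    "road": 0,
    "sidewalk": 1, "terrain": 1,                      # ground-level non-road
    "building": 2, "wall": 2, "fence": 2,             # background geometry
    "vegetation": 3,
    "sky": 4,
    "car": 5, "truck": 5, "bus": 5, "train": 5,       # all road obstacles
    "motorcycle": 5, "bicycle": 5,
    "pole": 6, "traffic light": 6, "traffic sign": 6, # thin elements
    "person": 6, "rider": 6,
}

def to_game_class(cityscapes_name: str) -> int:
    # Anything not listed falls through to 255 (ignore), mirroring the
    # lookup-table default used in preprocessing.
    return GAME_CLASS.get(cityscapes_name, 255)
```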

## Usage

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import torch
import torch.nn.functional as F

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-road-scene-7class")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-road-scene-7class")

image = Image.open("road_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SegFormer outputs logits at 1/4 of the processing resolution;
# upsample to the original image size before taking the argmax.
mask = F.interpolate(
    outputs.logits,
    size=image.size[::-1],  # PIL size is (W, H); reversed -> (H, W)
    mode="bilinear",
    align_corners=False,
).argmax(dim=1)[0]

# mask values: 0=road, 1=sidewalk, 2=building, 3=vegetation, 4=sky, 5=vehicle, 6=roadside_object
```
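For a quick visual check of the predicted mask, it can be colorized per class. The palette below is arbitrary (chosen here for illustration, not part of the model), and `colorize` / `class_mask` are helper names introduced in this sketch:

```python
import numpy as np

# Arbitrary display palette (RGB) for the 7 game classes; purely illustrative.
PALETTE = np.array([
    [128,  64, 128],  # 0 road
    [244,  35, 232],  # 1 sidewalk
    [ 70,  70,  70],  # 2 building
    [107, 142,  35],  # 3 vegetation
    [ 70, 130, 180],  # 4 sky
    [  0,   0, 142],  # 5 vehicle
    [220, 220,   0],  # 6 roadside_object
], dtype=np.uint8)

def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs 0-6 to an (H, W, 3) RGB image."""
    return PALETTE[mask]

def class_mask(mask: np.ndarray, class_id: int) -> np.ndarray:
    """Boolean (H, W) mask selecting one class, e.g. the sky band."""
    return mask == class_id
```

`colorize(mask.numpy())` blended over the input image gives a fast sanity check before running any downstream extraction.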

## Training Details

### Dataset

Chris1/cityscapes_segmentation: urban street scenes from 50 European cities.

| Split | Images |
|---|---:|
| Train | 2,975 |
| Val | 500 |

The original Cityscapes masks store label IDs 0–33 as 3-channel RGB images. A custom preprocessing pipeline extracts channel 0 and applies a 256-element lookup table that remaps all 34 Cityscapes classes into the 7-class game taxonomy in a single vectorized operation (unmapped classes → 255 = ignore).
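That lookup-table remap can be sketched as follows. Only a few Cityscapes IDs are filled in here to show the pattern; the full table covers all 34 classes:

```python
import numpy as np

# 256-entry LUT: index = Cityscapes label ID, value = game class ID.
# Unmapped IDs default to 255 (ignore). The entries shown are a sample
# (Cityscapes: 7=road, 8=sidewalk, 11=building, 21=vegetation, 23=sky, 26=car).
lut = np.full(256, 255, dtype=np.uint8)
lut[7]  = 0  # road
lut[8]  = 1  # sidewalk
lut[11] = 2  # building
lut[21] = 3  # vegetation
lut[23] = 4  # sky
lut[26] = 5  # car -> vehicle

def remap(label_rgb: np.ndarray) -> np.ndarray:
    """Take channel 0 of an (H, W, 3) label image and remap every pixel
    in one vectorized fancy-indexing operation."""
    return lut[label_rgb[..., 0]]
```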

### Hyperparameters

| Parameter | Value |
|---|---|
| Base weights | nvidia/segformer-b0-finetuned-ade-512-512 (encoder only) |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10 % of total steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 grad accum) |
| Training resolution | 512 × 512 |
| Precision | FP16 mixed precision |
| Epochs | 50 |
| Augmentation | ColorJitter (brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1), train only |
| Best-model selection | Highest mean IoU on validation set (`load_best_model_at_end=True`) |
| Hardware | NVIDIA T4 (16 GB VRAM) |
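Expressed as a transformers `TrainingArguments` config fragment, the table above corresponds roughly to the following. The `output_dir`, evaluation cadence, and `metric_for_best_model` name are placeholders, not confirmed values from the training run:

```python
from transformers import TrainingArguments

# Rough reconstruction of the hyperparameter table; paths and the
# eval/save cadence are assumptions, not the exact values used.
args = TrainingArguments(
    output_dir="segformer-b0-road-scene-7class",  # placeholder
    learning_rate=6e-5,
    lr_scheduler_type="polynomial",
    warmup_ratio=0.1,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,    # effective batch size 8
    num_train_epochs=50,
    fp16=True,
    eval_strategy="epoch",            # must match save_strategy below
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mean_iou", # assumed metric key
    remove_unused_columns=False,      # see Implementation Notes
    label_names=["labels"],
)
```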

### Training Curve

| Epoch | mIoU | Road | Sidewalk | Building | Vegetation | Sky | Vehicle | Roadside Obj |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 59.5 % | 93.2 % | 54.6 % | 74.3 % | 73.7 % | 56.9 % | 63.9 % | 0.0 % |
| 3 | 73.8 % | 95.4 % | 69.4 % | 82.7 % | 82.5 % | 84.3 % | 79.4 % | 23.1 % |
| 5 | 80.4 % | 96.7 % | 76.5 % | 86.2 % | 87.0 % | 88.9 % | 84.3 % | 43.2 % |
| 9 | 82.9 % | 97.4 % | 79.2 % | 87.5 % | 88.3 % | 91.0 % | 86.7 % | 50.4 % |
| 16 | 84.0 % | 97.5 % | 80.3 % | 88.1 % | 89.1 % | 92.2 % | 87.8 % | 52.7 % |

The model converges quickly thanks to transfer learning: the pretrained encoder already understands road-scene features, so only the 7-class decoder head is learned from scratch.

## Implementation Notes

Several non-obvious flags are required for correct training:

- `ignore_mismatched_sizes=True`: the pretrained decoder head has a different number of output classes.
- `remove_unused_columns=False`: prevents the Trainer from dropping image data columns.
- `label_names=["labels"]`: tells the Trainer which key holds the segmentation targets.
- `do_reduce_labels=False`: Cityscapes labels don't need ADE20K-style background subtraction.
- Logit upsampling in `compute_metrics`: SegFormer outputs at ¼ resolution; logits must be upsampled before comparison with ground-truth masks.
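The upsampling requirement in `compute_metrics` can be sketched as below. This is not the card's actual metric code: pixel accuracy stands in for the real per-class IoU computation, and the function name and metric key are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F

def compute_metrics(eval_pred):
    """Sketch of the metric hook: SegFormer logits arrive at 1/4 of the
    label resolution, so they are bilinearly upsampled to the label size
    before comparison. Pixels labeled 255 are ignored."""
    logits, labels = eval_pred  # logits: (N, 7, H/4, W/4), labels: (N, H, W)
    if isinstance(logits, np.ndarray):
        logits = torch.from_numpy(logits)
    if isinstance(labels, np.ndarray):
        labels = torch.from_numpy(labels)
    upsampled = F.interpolate(logits, size=labels.shape[-2:],
                              mode="bilinear", align_corners=False)
    preds = upsampled.argmax(dim=1)
    valid = labels != 255  # 255 = ignore index from the LUT remap
    acc = (preds[valid] == labels[valid]).float().mean().item()
    return {"pixel_accuracy": acc}
```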

## Intended Use

This model is designed for a game asset extraction pipeline where a user uploads a road photograph and the runtime transforms it into game elements:

  1. Road shape: fit road boundaries from the road mask; derive perspective and horizon.
  2. Color palette: sample dominant colors from each masked region of the original image.
  3. Sky: crop the sky band directly using the sky mask.
  4. Tree sprites: blob detection on the vegetation mask; crop with alpha transparency.
  5. Building silhouettes: extract from the building mask for background geometry.
  6. Fallback system: when extraction quality is poor for any element, use palette-matched stock assets.
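The palette-sampling step (2) and the fallback cue (6) can be sketched together. A median over masked pixels is one simple, outlier-robust choice for "dominant color"; the function name and the None-as-fallback convention are illustrative, not the pipeline's actual API:

```python
import numpy as np

def dominant_color(image: np.ndarray, mask: np.ndarray, class_id: int):
    """Median RGB of the pixels a class occupies: a simple stand-in for
    the pipeline's palette-sampling step. Returns None when the class is
    absent, which a caller could treat as the cue to fall back to
    palette-matched stock assets."""
    pixels = image[mask == class_id]  # (K, 3) pixels of that class
    if pixels.size == 0:
        return None
    return tuple(int(c) for c in np.median(pixels, axis=0))
```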

## Limitations

- Trained exclusively on European urban street scenes (Cityscapes). Performance may degrade on rural roads, highways without sidewalks, non-European road styles, or indoor scenes.
- The `roadside_object` class (52.7 % IoU) is the weakest: thin elements like poles and signs are inherently difficult at 512 × 512 resolution. The intended runtime uses fallback sprites for this class.
- Not suitable for safety-critical autonomous driving: the merged taxonomy intentionally discards distinctions (truck vs. car, wall vs. fence) that matter for driving but not for game art.

## Citation

Please cite this model if you use it:

```bibtex
@misc{corbetta_segformer_road_scene_7class_2026,
  author       = {Marco Corbetta},
  title        = {segformer-b0-road-scene-7class: SegFormer-B0 fine-tuned on Cityscapes for 7-class game-asset segmentation},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Marco333/segformer-b0-road-scene-7class}}
}
```

SegFormer:

```bibtex
@inproceedings{xie2021segformer,
  title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  booktitle={NeurIPS},
  year={2021}
}
```