SoilFormer

A multimodal tabular transformer trained on LUCAS-MEGA.

Manuscript

Introduction

SoilFormer is a multimodal transformer for representation learning in soil–environment systems. It is trained on LUCAS-MEGA, a large-scale dataset built from European soil and environmental observations, with the LUCAS soil survey as its backbone. LUCAS-MEGA integrates heterogeneous sources into a machine-learning-ready sample–feature table, covering numerical, categorical, textual, and visual modalities across soil physical, chemical, hydrological, environmental, and site-related properties.

SoilFormer learns from partially observed multimodal samples using masked feature modeling. During training, a subset of observed categorical and numerical features is masked, and the model reconstructs them from the remaining tabular and visual context. The architecture combines grouped categorical embedding, grouped numerical encoding/decoding, vision feature extraction and compression, transformer layers, and heteroscedastic prediction heads for uncertainty-aware reconstruction.
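
The masking step described above can be illustrated with a minimal sketch. It is not the training code from this repository: the function name `sample_feature_mask`, the use of `None` for unobserved values, and the ratio of 0.5 are assumptions chosen for illustration. The point it shows is that only features that are actually observed are eligible for masking, and that the hidden values become the reconstruction targets.

```python
import random


def sample_feature_mask(observed, mask_ratio, rng):
    """Return a boolean list that is True where an observed feature is hidden.

    Hypothetical helper: unobserved features are never selected, because
    there is no ground truth to reconstruct for them.
    """
    candidates = [i for i, is_obs in enumerate(observed) if is_obs]
    n_mask = round(len(candidates) * mask_ratio)
    hidden = set(rng.sample(candidates, n_mask))
    return [i in hidden for i in range(len(observed))]


# One partially observed sample; None marks a value missing from the data.
values = [0.3, None, 1.2, -0.7, None, 0.1, 2.4, -1.1]
observed = [v is not None for v in values]

rng = random.Random(0)
mask = sample_feature_mask(observed, mask_ratio=0.5, rng=rng)

# The model would see only the unmasked, observed values ...
model_input = [
    v if (is_obs and not is_masked) else None
    for v, is_obs, is_masked in zip(values, observed, mask)
]
# ... and would be trained to reconstruct the hidden ones.
targets = {i: values[i] for i, is_masked in enumerate(mask) if is_masked}
```

In the real pipeline the masking ratios come from config/config_data.json, and the reconstruction is performed by the transformer rather than by anything shown here.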

Figure: SoilFormer architecture.

Training

Train SoilFormer with:

python modelling/train.py

Main configuration files:

  • config/config_model.json: model architecture parameters, including embedding sizes, transformer layer settings, decoder settings, dtype, and vision model configuration.
  • config/config_data.json: data parameters, including CSV path, vocab paths, numeric statistics, photo mapping, image root, train/eval split, batch size, and masking ratios.
  • config/config_train.json: training hyperparameters, including runtime device, seed, optimizer settings, scheduler settings, checkpoint behavior, loss options, logging, and output paths.
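
Before launching a run it can help to review what the three files contain. The sketch below loads them and lists their top-level keys. The helper names `load_configs` and `summarize` are hypothetical and not part of this repository, and no assumption is made about which keys the files actually define.

```python
import json
from pathlib import Path

# File names as listed above; the role labels are only for this sketch.
CONFIG_FILES = {
    "model": "config_model.json",
    "data": "config_data.json",
    "train": "config_train.json",
}


def load_configs(config_dir="config"):
    """Load the three JSON config files into one dict keyed by role."""
    config_dir = Path(config_dir)
    configs = {}
    for role, filename in CONFIG_FILES.items():
        with open(config_dir / filename, encoding="utf-8") as f:
            configs[role] = json.load(f)
    return configs


def summarize(configs):
    """Return the sorted top-level keys of each config for a quick review."""
    return {role: sorted(cfg) for role, cfg in configs.items()}
```

Calling `summarize(load_configs("config"))` from the repository root would then print an overview without modifying any file.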

Inference

Inference uses readable JSON input cards. The workflow is:

  1. Create input cards from one dataset row.
  2. Edit the masked card manually if desired.
  3. Run model prediction from the edited card.
  4. Optionally compare predictions against the unmasked answer card.

1. Create input cards

python create_input_card_from_dataset.py \
  --row_index 10 \
  --output example/input_card.json

This writes two files, named by appending __unmasked and __masked to the --output file stem:

example/input_card__unmasked.json
example/input_card__masked.json

The unmasked card contains the raw, readable values from the CSV row. The masked card randomly replaces a fraction of the categorical and numeric values with null. Naturally missing values remain as empty strings "", while active masks are represented as null.

Both masking ratios default to 0.15. They can also be set explicitly, together with the random seed:

python create_input_card_from_dataset.py \
  --row_index 10 \
  --output example/input_card.json \
  --cat_mask_ratio 0.15 \
  --num_mask_ratio 0.15 \
  --seed 42
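
To make the distinction between "" and null concrete, here is a minimal sketch of card-level masking. It is an illustration, not the implementation of create_input_card_from_dataset.py: the function `mask_card`, the way the number of masked fields is rounded, and the placeholder field names are all assumptions. The parameter names simply mirror the command-line flags.

```python
import copy
import random


def mask_card(card, cat_mask_ratio=0.15, num_mask_ratio=0.15, seed=42):
    """Return a copy of `card` with a fraction of observed values set to None."""
    rng = random.Random(seed)
    masked = copy.deepcopy(card)
    sections = (("categorical", cat_mask_ratio), ("numeric", num_mask_ratio))
    for section, ratio in sections:
        fields = masked.get(section, {})
        # Only observed values are candidates; "" (naturally missing) is kept.
        observed = [n for n, v in fields.items() if v not in ("", None)]
        n_mask = round(len(observed) * ratio)
        for name in rng.sample(observed, n_mask):
            fields[name] = None  # serialized as JSON null
    return masked


# Placeholder card with made-up field names and values.
card = {
    "categorical": {"cat_a": "x", "cat_b": "y", "cat_c": "", "cat_d": "z", "cat_e": "w"},
    "numeric": {"num_a": 1.0, "num_b": 2.0, "num_c": 3.0, "num_d": 4.0},
}
masked_card = mask_card(card, cat_mask_ratio=0.5, num_mask_ratio=0.5, seed=0)
```

With four observed values per section and a ratio of 0.5, two values per section become None, while the empty string in `cat_c` is left untouched.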

The card format is intentionally simple and user-editable. Users can copy this card as a template, replace the values with their own soil sample information, and set variables to null to indicate which fields should be predicted during inference:

{
  "categorical": {
    "land_site:land_cover_primary": "B16: Cropland => Cereals => Maize",
    "land_site:land_use_primary": null,
    "soil_type:WRB_soil_group": "Cambisol",
    "texture:ISSS_class": "silty clay",
    "...": "..."
  },
  "numeric": {
    "carbon:CaCO3_content (g/kg)": 7.0,
    "carbon:SOC_saturation_ratio": 0.3647958934307098,
    "geographic:latitude (deg)": 38.8513900000485,
    "geographic:longitude (deg)": -9.29050000007487,
    "mass_density:bulk_density (g/cm³)": null,
    "...": "..."
  },
  "vision": {
    "image_path_suffix": "relative/path/to/photo.jpg"
  }
}
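
A card can also be edited programmatically. The sketch below parses a shortened card, marks one more field for prediction, and lists every field the model would be asked to predict. The helper `fields_to_predict` is hypothetical, and the empty string for texture:ISSS_class is inserted here only to illustrate a naturally missing value.

```python
import json


def fields_to_predict(card):
    """List (section, field) pairs whose value is null (None) in the card."""
    return [
        (section, name)
        for section in ("categorical", "numeric")
        for name, value in card.get(section, {}).items()
        if value is None
    ]


card_text = """
{
  "categorical": {
    "land_site:land_use_primary": null,
    "soil_type:WRB_soil_group": "Cambisol",
    "texture:ISSS_class": ""
  },
  "numeric": {
    "carbon:CaCO3_content (g/kg)": 7.0,
    "mass_density:bulk_density (g/cm³)": null
  },
  "vision": {
    "image_path_suffix": "relative/path/to/photo.jpg"
  }
}
"""
card = json.loads(card_text)

# Ask the model to predict a value that was originally provided.
card["numeric"]["carbon:CaCO3_content (g/kg)"] = None

to_predict = fields_to_predict(card)

# ensure_ascii=False keeps units such as "g/cm³" readable in the output file.
edited_text = json.dumps(card, indent=2, ensure_ascii=False)
```

Writing `edited_text` to a file yields a card that can be passed to the prediction step; the empty string stays a naturally missing value and is not listed for prediction.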

2. Run prediction

python inference_predict_output_card.py \
  --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
  --input_card example/input_card__masked.json \
  --output example/output_card.json

This writes:

example/output_card.json

output_card.json contains readable predictions:

  • categorical outputs are decoded back to raw category labels;
  • numeric outputs are converted from z-score space back to the original physical units;
  • vision input is read from vision.image_path_suffix together with photo_root in config/config_data.json.
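
The two decoding steps can be sketched as follows. This is a generic illustration rather than the repository's code: inverting a z-score is x = z * std + mean, and a categorical prediction is taken here as the highest-scoring vocabulary entry. The mean, standard deviation, scores, and three-entry vocabulary are made-up values; the real statistics and vocabularies are the ones referenced in config/config_data.json.

```python
def denormalize(z, mean, std):
    """Map a z-score back to the original physical units."""
    return z * std + mean


def decode_category(scores, vocab):
    """Return the vocabulary label with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return vocab[best]


# Hypothetical statistics for one numeric feature.
value = denormalize(0.5, mean=1.0, std=0.5)

# Hypothetical scores over a hypothetical three-label vocabulary.
label = decode_category([0.1, 2.3, -0.5], ["Cambisol", "Luvisol", "Podzol"])
```

Because the transform is linear, an uncertainty predicted in z-score space would scale by the same `std` when expressed in physical units.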

3. Evaluation with an answer card

python inference_predict_output_card.py \
  --checkpoint model_weights/soilformer_pretrain/hetero_epoch_200.pt \
  --input_card example/input_card__masked.json \
  --answer_card example/input_card__unmasked.json \
  --output example/output_card.json

This additionally writes:

example/output_card__acc.json

When --answer_card is provided, output_card__acc.json reports reconstruction metrics over fields that are null in the masked input card:

  • categorical accuracy for masked categorical fields;
  • numeric MAE for masked numeric fields, measured in the original feature units.
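
The sketch below shows how such metrics can be computed from three cards. It assumes that the prediction card mirrors the layout of the input card, which is not confirmed here, and it reports one absolute error per numeric field because the way the script aggregates across features with different units is not specified. The function names and all example values are hypothetical.

```python
def masked_names(masked_section, answer_section):
    """Fields that are null in the masked card and have a known answer."""
    return [
        name
        for name, value in masked_section.items()
        if value is None and answer_section.get(name) not in ("", None)
    ]


def categorical_accuracy(masked, answer, predicted):
    """Fraction of masked categorical fields predicted exactly."""
    names = masked_names(masked, answer)
    if not names:
        return None
    return sum(predicted.get(n) == answer[n] for n in names) / len(names)


def numeric_abs_errors(masked, answer, predicted):
    """Absolute error per masked numeric field, in that field's own units."""
    names = masked_names(masked, answer)
    return {n: abs(predicted[n] - answer[n]) for n in names}


# Placeholder cards; labels and numbers are invented for the example.
masked = {
    "categorical": {"land_site:land_use_primary": None, "soil_type:WRB_soil_group": None},
    "numeric": {"carbon:CaCO3_content (g/kg)": None, "geographic:latitude (deg)": 38.85},
}
answer = {
    "categorical": {"land_site:land_use_primary": "use A", "soil_type:WRB_soil_group": "Cambisol"},
    "numeric": {"carbon:CaCO3_content (g/kg)": 7.0, "geographic:latitude (deg)": 38.85},
}
predicted = {
    "categorical": {"land_site:land_use_primary": "use A", "soil_type:WRB_soil_group": "Luvisol"},
    "numeric": {"carbon:CaCO3_content (g/kg)": 9.5, "geographic:latitude (deg)": 38.85},
}

accuracy = categorical_accuracy(
    masked["categorical"], answer["categorical"], predicted["categorical"]
)
errors = numeric_abs_errors(masked["numeric"], answer["numeric"], predicted["numeric"])
```

Here one of the two masked categorical fields is correct, and only the masked CaCO3 field contributes a numeric error; the latitude was never masked and is ignored.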