FinBERT Macro Sentiment

A fine-tuned ProsusAI/finbert (109M params, BERT-base) for 3-class sentiment analysis on financial, macroeconomic, and climate/ESG text. Trained on 20K samples from 5 financial NLP datasets.

This model is the default FinBERT head and the backbone of the macro-sentiment-finbert ensemble pipeline. The topic router selects it for general financial news and social/tweet content. It can also be used standalone as a drop-in upgrade for the original ProsusAI/finbert.

Why Fine-Tune FinBERT?

The original ProsusAI/finbert was trained only on Financial PhraseBank (Araci, 2019) — ~4,800 sentences from English financial news. By fine-tuning on a broader mix of 5 datasets spanning news headlines, financial tweets, auditor reports, financial QA, and climate disclosures, we significantly improve recall on minority classes and cross-domain generalization:

Metric            Off-the-shelf FinBERT   This model   Δ
Accuracy          ~0.670                  0.897        +23pp
F1 (macro)        ~0.578                  0.881        +30pp
Negative recall   ~0.47                   0.89         +42pp
Positive recall   ~0.35                   0.91         +56pp

The original model's weakness is its strong neutral bias — it achieves decent accuracy by defaulting to "neutral" but badly under-predicts both positive and negative classes. Our broader training mix and class weighting fix this.

Quick Start

Standalone Usage

from transformers import pipeline

pipe = pipeline("text-classification", model="peyterho/finbert-macro-sentiment", top_k=None)

pipe("Tesla shares surged 15% after beating earnings expectations.")
# [[{'label': 'positive', 'score': 0.99}, {'label': 'neutral', 'score': 0.01}, {'label': 'negative', 'score': 0.00}]]

pipe("Markets crashed amid recession fears and massive layoffs.")
# [[{'label': 'negative', 'score': 0.99}, {'label': 'neutral', 'score': 0.01}, {'label': 'positive', 'score': 0.00}]]

pipe("The Federal Reserve raised rates by 75bps citing persistent inflation.")
# [[{'label': 'neutral', 'score': 0.91}, {'label': 'negative', 'score': 0.07}, {'label': 'positive', 'score': 0.02}]]

As Part of the Full Pipeline

Within the macro-sentiment-finbert pipeline, this model is the default head — selected for general financial news, social media, and any text that doesn't trigger the policy or climate routing keywords:

from macro_sentiment import MacroSentimentPipeline

pipe = MacroSentimentPipeline(device="cpu")

result = pipe("$AAPL crushed earnings, revenue up 12% YoY. Bullish!")
print(result.summary())
# Sentiment: Positive (+0.687) | Policy: Neutral (+0.000) | Crisis: Normal (0.000) | Domain: financial_news
print(result.head_used)
# "finbert"

result = pipe("The latest jobs report showed unemployment rising to 4.2%, above consensus.")
print(result.summary())
# Sentiment: Negative (-0.312) | Policy: Neutral (+0.000) | Crisis: Normal (0.052) | Domain: financial_news

Label Mapping

This model preserves FinBERT's native label encoding:

Label ID   Label      Unified Sentiment Score Mapping
0          positive   +1.0
1          negative   -1.0
2          neutral     0.0

Note: FinBERT's label ordering (positive=0, negative=1, neutral=2) differs from the standard convention (negative=0, neutral=1, positive=2). The ensemble pipeline handles this remapping internally. If using standalone, interpret labels by name, not by ID.
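
For standalone use, the safest pattern is to key everything off label names. Below is a minimal sketch that also computes the unified sentiment score from the table above; the output-shape check covers both the nested and flat return shapes seen across Transformers versions:

from transformers import pipeline

pipe = pipeline("text-classification", model="peyterho/finbert-macro-sentiment", top_k=None)

# Unified score mapping from the table above, keyed by label *name*,
# never by label ID, since FinBERT's IDs are non-standard (0=positive).
SCORE_MAP = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}

def unified_score(text):
    results = pipe(text)
    # Some versions return [[{...}, ...]] for a single string, others [{...}, ...].
    scores = results[0] if isinstance(results[0], list) else results
    return sum(SCORE_MAP[s["label"]] * s["score"] for s in scores)

unified_score("Tesla shares surged 15% after beating earnings expectations.")
# ≈ +0.99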

Evaluation Results

In-Domain (Combined Test Set — 4,333 samples)

All three fine-tuned heads compared on the same held-out test split:

Model                  Params   Accuracy   F1 (macro)   F1 (weighted)
RoBERTa-Large          355M     0.9130     0.9023       0.9137
FinBERT (this model)   109M     0.8973     0.8813       0.8984
ClimateBERT            82M      0.8885     0.8716       0.8898

Per-Class Breakdown

              precision    recall  f1-score   support

    negative     0.8051    0.8945    0.8474       711
     neutral     0.9518    0.8925    0.9212      2613
    positive     0.8417    0.9118    0.8754      1009

    accuracy                         0.8973      4333
   macro avg     0.8662    0.8996    0.8813      4333
weighted avg     0.9021    0.8973    0.8984      4333

Balanced recall across all three classes (89–91%) is the key improvement over the original FinBERT, which suffered from strong neutral bias (~90%+ neutral recall but <50% for positive/negative).

Out-of-Domain: Financial PhraseBank (785 samples)

Evaluated on Jean-Baptiste/financial_news_sentiment_mixte_with_phrasebank_75 — PhraseBank (75% agreement) + financial news articles. Not in the training mix.

Model                  Params   Accuracy   F1 (macro)
RoBERTa-Large          355M     0.9414     0.9357
ClimateBERT            82M      0.9248     0.9213
FinBERT (this model)   109M     0.9236     0.9134

All three heads generalize well, with >92% accuracy on this OOD financial text benchmark.
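
An OOD evaluation of this kind can be reproduced with a sketch along these lines (the split and column names "test", "text", and "label" are assumptions; adjust them to the dataset's actual schema):

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import pipeline

# Split/column names are assumptions about the dataset's schema.
ds = load_dataset("Jean-Baptiste/financial_news_sentiment_mixte_with_phrasebank_75", split="test")

pipe = pipeline("text-classification", model="peyterho/finbert-macro-sentiment",
                truncation=True, batch_size=32)

preds = [p["label"] for p in pipe(list(ds["text"]))]
gold = list(ds["label"])  # assumed to be string labels matching the model's label names

print("accuracy:", accuracy_score(gold, preds))
print("F1 macro:", f1_score(gold, preds, average="macro"))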

Out-of-Domain: Stock News Headlines (30,150 samples)

Evaluated on ic-fspml/stock_news_sentiment — 5-class labels mapped to 3-class. Not in the training mix.

Model                  Params   Accuracy   F1 (macro)
RoBERTa-Large          355M     0.7211     0.7265
FinBERT (this model)   109M     0.6781     0.6765
ClimateBERT            82M      0.6472     0.6441

FinBERT's financial pre-training gives it an edge over ClimateBERT on stock-specific headlines. The overall lower numbers reflect the 5→3 class mapping noise and significant domain shift (short, noisy headlines vs. the longer sentences in the training mix).
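
The 5→3 mapping collapses the graded labels into polarity buckets. A hypothetical version looks like the following (the dataset's actual label names may differ; verify before reusing):

# Hypothetical 5-class -> 3-class collapse; check against the dataset's
# real label names before using.
FIVE_TO_THREE = {
    "strong negative": "negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "strong positive": "positive",
}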

Comparison Against Baselines

Method                            Accuracy   F1 (macro)   Notes
FinBERT (this model)              0.8973     0.8813       109M params, fine-tuned
Off-the-shelf ProsusAI/finbert    ~0.670     ~0.578       Same architecture, no fine-tuning on combined data
Dict-only meta-classifier (GBT)   0.6693     0.5781       GradientBoosting on 24 dictionary features
Dict-only rules (LM + Henry)      0.5684     0.5277       Threshold-based, no learned parameters

Training Data

Fine-tuned on 20,034 training samples combined from 5 public financial/climate sentiment datasets:

Dataset                                     Domain                     Train    Test     Label Mapping
nickmuchi/financial-classification          Financial PhraseBank       ~4,800   ~1,200   negative / neutral / positive
zeroshot/twitter-financial-news-sentiment   Financial tweets           ~9,900   ~2,500   bearish→neg, bullish→pos, neutral
FinanceInc/auditor_sentiment                Auditor reports            ~3,600   ~900     negative / neutral / positive
pauri32/fiqa-2018                           Financial QA + microblog   ~938     ~235     Continuous score thresholded at ±0.15
climatebert/climate_sentiment               Climate disclosures        ~1,000   ~500     risk→neg, neutral, opportunity→pos

Label distribution (train): 15.9% negative, 58.4% neutral, 25.7% positive — handled via class-weighted loss.
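
For reference, the FiQA thresholding noted in the table amounts to the following (the treatment of scores falling exactly at ±0.15 is an assumption):

def fiqa_to_label(score):
    # Map FiQA's continuous sentiment score to 3 classes at the ±0.15 threshold.
    if score > 0.15:
        return "positive"
    if score < -0.15:
        return "negative"
    return "neutral"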

Training Details

Hyperparameter         Value
Base model             ProsusAI/finbert (109M, BERT-base, 12 layers, 768 hidden, 12 heads)
Learning rate          2e-5
Batch size             32 × 2 gradient accumulation = 64 effective
Epochs                 6 (best checkpoint at epoch 5 by F1 macro)
Scheduler              Linear decay with 20% warmup
Optimizer              AdamW (weight decay 0.01)
Max length             128 tokens
Precision              FP16
Class weighting        √(inverse frequency): positive=1.02, negative=1.30, neutral=0.68
Seed                   42
Best model selection   Highest F1 (macro) on validation set
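
As a sketch, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows. The actual training script lives in the parent repo and may differ; "f1_macro" assumes a compute_metrics function that reports that key:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finbert-macro-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,   # 32 × 2 = 64 effective
    num_train_epochs=6,
    lr_scheduler_type="linear",
    warmup_ratio=0.2,
    weight_decay=0.01,
    fp16=True,
    seed=42,
    eval_strategy="epoch",           # "evaluation_strategy" on older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
)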

Class Weighting Strategy

Following the approach in the original FinBERT paper (Araci, 2019, Section 4.4), sqrt-inverse-frequency class weights are applied to the cross-entropy loss to counteract the ~58% neutral class imbalance:

weight[c] = √(N_total / N_c) / mean_c′ √(N_total / N_c′)

i.e., √(inverse frequency) weights normalized so the three weights average to 1.

This upweights the minority negative class (1.30×) and downweights the majority neutral class (0.68×) while keeping the positive class near unity (1.02×).
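
The quoted weights can be reproduced from the training label distribution above. A minimal sketch (the parent repo's implementation may differ in detail):

import numpy as np
import torch
import torch.nn as nn

# Train label fractions from the Training Data section.
freq = {"positive": 0.257, "negative": 0.159, "neutral": 0.584}

raw = {c: np.sqrt(1.0 / f) for c, f in freq.items()}   # sqrt inverse frequency
mean = np.mean(list(raw.values()))
weights = {c: w / mean for c, w in raw.items()}        # normalize to mean 1
# -> {'positive': ~1.02, 'negative': ~1.30, 'neutral': ~0.68}

# Plug into the loss using FinBERT's label order (0=positive, 1=negative, 2=neutral).
class_weights = torch.tensor(
    [weights["positive"], weights["negative"], weights["neutral"]],
    dtype=torch.float32,
)
loss_fn = nn.CrossEntropyLoss(weight=class_weights)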

Training Curve

Epoch   Train Loss   Val Loss   Accuracy   F1 (macro)   F1 (weighted)
1       0.894        0.376      0.858      0.838        0.860
2       0.629        0.353      0.859      0.846        0.862
3       0.360        0.317      0.892      0.876        0.893
4       0.160        0.377      0.894      0.879        0.895
5       0.124        0.405      0.897      0.881        0.898
6       0.086        0.429      0.896      0.880        0.897

Validation loss bottomed at epoch 3, but F1 macro continued improving through epoch 5. The divergence between val loss and F1 suggests the model becomes less calibrated (overconfident) on later epochs while still improving classification boundaries — a common pattern with class-weighted loss. The best checkpoint (epoch 5) trades slightly worse calibration for better discrimination.

Role in the Ensemble Pipeline

This model serves as the default financial news head in the macro-sentiment-finbert pipeline. It is selected when no specialized routing keyword is triggered, making it the workhorse for general financial content.

Input Text → Topic Router → policy keywords?     → RoBERTa-Large
                          → climate keywords?    → ClimateBERT
                          → non-English text?    → XLM-RoBERTa
                          → default (this model) → FinBERT ★

The pipeline fuses this model's prediction with four dictionary-based signals (Loughran-McDonald, Henry earnings tone, Sautner-style climate exposure, and macro policy/crisis dictionaries) using crisis-adaptive weighting — during crisis language, more weight shifts to the dictionary composite, since crisis keywords carry clearer signal than neural softmax outputs.
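
The actual fusion logic lives in the pipeline repo. Purely as an illustration of crisis-adaptive weighting, the idea is the following (the function name, base weight, and linear schedule are all hypothetical):

def fuse(neural_score, dict_score, crisis_level, base_neural_weight=0.7):
    # crisis_level in [0, 1]: as crisis language intensifies, weight shifts
    # from the neural head's score toward the dictionary composite.
    w = base_neural_weight * (1.0 - crisis_level)
    return w * neural_score + (1.0 - w) * dict_score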

Ensemble Siblings

Head            Model                                                        Params   Role
FinBERT         peyterho/finbert-macro-sentiment                             109M     Default: financial news, tweets
RoBERTa-Large   peyterho/financial-roberta-large-macro-sentiment             355M     Policy/macro communications
ClimateBERT     peyterho/climatebert-macro-sentiment                         82M      Climate/ESG text
XLM-RoBERTa     cardiffnlp/twitter-xlm-roberta-base-sentiment-multilingual   278M     Non-English text (pre-trained, not fine-tuned)

Custom Fine-Tuning

Fine-tune this model further on your own labelled financial data:

pip install transformers datasets accelerate evaluate torch

python -m macro_sentiment.finetune \
    --data my_labels.csv \
    --text-column headline \
    --label-column sentiment \
    --base-model peyterho/finbert-macro-sentiment \
    --output my-org/my-custom-model \
    --push-to-hub \
    --epochs 4 \
    --lr 1e-5 \
    --batch-size 32

The fine-tuning script (from the parent repo) accepts CSV, TSV, JSON, or JSONL. Labels can be strings ("positive", "negative", "neutral", "bullish", "bearish") or integers (0/1/2). Automatic label remapping handles FinBERT's non-standard label ordering (positive=0, negative=1, neutral=2). Class-weighted loss is applied by default.
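
For example, a minimal input file matching the flags above could be built like this (the rows are hypothetical):

import csv

# Hypothetical example rows matching --text-column headline --label-column sentiment.
rows = [
    ("Shares rally after earnings beat", "positive"),
    ("Regulator fines bank over disclosure failures", "negative"),
    ("Company to report Q3 results on Tuesday", "neutral"),
]
with open("my_labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["headline", "sentiment"])
    writer.writerows(rows)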

Limitations

  • English only — FinBERT was pre-trained on English financial text (Reuters TRC2). Non-English text should use the multilingual head in the full pipeline.
  • 128-token max length — fine-tuned with 128-token truncation. Longer documents (earnings calls, 10-K filings) should be chunked at paragraph or sentence level before scoring (see the chunking sketch after this list).
  • Neutral class dominance — despite class weighting, neutral precision (0.95) is noticeably higher than positive (0.84) or negative (0.81), reflecting the ~58% neutral training distribution. The model remains slightly conservative on polarity predictions.
  • Label noise — the FiQA dataset uses continuous sentiment scores thresholded at ±0.15, introducing boundary noise. The Twitter dataset contains informal language, $cashtags, and abbreviations that differ substantially from formal financial text.
  • No temporal or entity awareness — treats each text independently. Cannot reason about whether "rates rising" is positive (for banks) or negative (for growth stocks) without entity context.
  • FinBERT label ordering — label IDs are {0: positive, 1: negative, 2: neutral}, which differs from the standard convention. Use label names, not IDs, when interpreting results.
  • Parameter count trade-off — at 109M parameters, this model sits between ClimateBERT (82M) and RoBERTa-Large (355M). For maximum accuracy (+1.6 F1 points), use the RoBERTa-Large head; for minimum latency (−25% params), use ClimateBERT.
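
As noted in the 128-token limitation above, long documents should be chunked before scoring. A minimal sketch follows; paragraph splitting and equal-weight averaging are assumptions, and sentence splitting or length-weighted averaging are reasonable alternatives:

from transformers import pipeline

pipe = pipeline("text-classification", model="peyterho/finbert-macro-sentiment",
                truncation=True, max_length=128)

SCORE_MAP = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}

def score_long_document(text):
    # Split on blank lines into paragraph-sized chunks.
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    preds = pipe(chunks)  # one top-1 {label, score} dict per chunk
    # Equal-weight average of signed chunk scores.
    return sum(SCORE_MAP[p["label"]] * p["score"] for p in preds) / len(preds)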

Citation

@article{araci2019finbert,
    title={FinBERT: Financial Sentiment Analysis with Pre-Trained Language Models},
    author={Araci, Dogu},
    journal={arXiv preprint arXiv:1908.10063},
    year={2019}
}

@article{loughran2011liability,
    title={When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks},
    author={Loughran, Tim and McDonald, Bill},
    journal={The Journal of Finance},
    volume={66},
    number={1},
    pages={35--65},
    year={2011}
}

@article{henry2008earnings,
    title={Are investors influenced by how earnings press releases are written?},
    author={Henry, Elaine},
    journal={Journal of Business Communication},
    volume={45},
    number={4},
    pages={363--407},
    year={2008}
}

Framework Versions

  • Transformers 5.6.2
  • PyTorch 2.11.0+cu130
  • Datasets 4.8.4
  • Tokenizers 0.22.2

License

Apache 2.0
