Confirmation of bias-detector training data for apples-to-apples academic comparison
Hi Himel,
I'm preparing an academic paper comparing several approaches to media-bias classification, and your bias-detector model (himel7/bias-detector on HuggingFace) is one of the baselines I want to include. Before I freeze my methodology section I'd like to confirm the training-data protocol so my comparisons represent your work fairly.
Your HuggingFace model card states:
"Training was done on the BABE Dataset... evaluated with K-fold Cross Validation (K=5)."
Could you confirm two things for me?
- Published checkpoint's training data
Was the checkpoint currently available on HuggingFace at himel7/bias-detector trained on:
(a) the entire BABE dataset (all 4,121 sentences), as the final model after K-fold evaluation; or
(b) a specific train subset of BABE (e.g., the canonical BABE train split of ~3,121 sentences, holding out the ~1,000-sentence test split); or
(c) one of the five K-fold training subsets; or
(d) some other protocol?
- Held-out samples
Are there any BABE sentences that the published checkpoint has not seen during training? If yes, could you share which subset was held out? My aim is to run a fair evaluation where I can confirm your model is not scoring on sentences it was trained on.
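For context on what I mean by confirming this: once I know which subset was held out, the check I'd run is just a sentence-level intersection. A minimal sketch, assuming (hypothetically) that the training and held-out sentences are available as CSV files with a "text" column:

```python
# Minimal sketch of the overlap check; file and column names are placeholders,
# not from your repo or paper.
import pandas as pd

train_sents = set(pd.read_csv("babe_train_sentences.csv")["text"].str.strip())
heldout_sents = set(pd.read_csv("babe_heldout_sentences.csv")["text"].str.strip())

# Any non-empty intersection means the evaluation would score on seen sentences.
overlap = train_sents & heldout_sents
print(f"{len(overlap)} held-out sentences also appear in the training data")
```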
Why this matters:
I observed a binary F1 of 94.66% when running your checkpoint on a 3,121-sentence split of BABE (the train split minus the test split). Given the ~92% F1 your paper reports under K=5 cross-validation on the full dataset, the 94.66% on this specific subset suggests the subset is part of the published checkpoint's training data. I want to avoid misrepresenting your results, so I'd rather hear it from you directly than speculate.
If the published checkpoint was trained on all of BABE (which is the normal pattern after K-fold evaluation), my paper will disclose this transparently and move the head-to-head comparison with your model to out-of-distribution datasets (BASIL and anno-lexical), where neither system has training-data contamination. This is not a criticism of your methodology; it's just a constraint I need to respect for the paper's comparisons to be meaningful.
I'm happy to share the bias-detector per-sample predictions and evaluation script I used if that would help verify my setup.
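For reference, here is a minimal sketch of that setup; the file path and column names below are placeholders for my actual script, and it assumes the checkpoint's class index 1 corresponds to the "biased" label:

```python
# Minimal sketch of the evaluation, not the exact script: assumes the 3,121-sentence
# BABE split is in a local CSV with "text" and "label" columns (1 = biased, 0 = unbiased).
import pandas as pd
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "himel7/bias-detector"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

df = pd.read_csv("babe_eval_split.csv")  # placeholder path for the 3,121-sentence split

preds = []
with torch.no_grad():
    for start in range(0, len(df), 32):  # score sentences in batches of 32
        batch = df["text"].iloc[start:start + 32].tolist()
        enc = tokenizer(batch, padding=True, truncation=True, max_length=128,
                        return_tensors="pt")
        preds.extend(model(**enc).logits.argmax(dim=-1).tolist())

# Binary F1 on the positive ("biased") class
print(f"binary F1 = {f1_score(df['label'], preds, average='binary'):.4f}")
```

I'm happy to adjust the batching, max length, or label mapping to match whatever your original evaluation used.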
Thanks for your work on bias-detector; it's a useful baseline for the field.
Best,
Narrative Control Research
Hi,
Please follow the details given in this paper: https://arxiv.org/abs/2505.13010
"To Bias or Not to Bias: Detecting bias in News with bias-detector" by Himel Ghosh, Ahmed Mosharafa, and Georg Groh.
This paper has all the details on the evaluation of this model, and the exact checkpoint it refers to is the one uploaded here on Hugging Face. The model card is not fully updated to match the paper; I will revise it soon.
If you still need further details and/or my collaboration, I will be eager to collaborate; in that case, write to me at himel.ghosh@tum.de.
Please cite this paper in your research.
@misc{ghosh2025biasbiasdetectingbias,
  title={To Bias or Not to Bias: Detecting bias in News with bias-detector},
  author={Himel Ghosh and Ahmed Mosharafa and Georg Groh},
  year={2025},
  eprint={2505.13010},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.13010},
}
Thank you, Himel7,
We will be in touch.
Best,
Narrative Control Research