PopulationHealthScreener (Anthro)

This Model Card provides the details of the model from "A fine-tuned transformer model for population health surveillance".

  • To use PopulationHealthScreener (Anthro) in a new PubMed search, please use the open-access application below:

    Open in app

  • To use PopulationHealthScreener (Anthro) in Python, please go to this section.

  • To use the fine-tuned encoder PopulationHealthBERT, please go to this section.

  • To provide feedback on the model, please email ncdrisc@imperial.ac.uk.


  • Developed by: Fulvio Deo, Vishwa Nath, Majid Ezzati, Bin Zhou; NCD Risk Factor Collaboration (NCD-RisC), Imperial College London
  • License: BSD-3-clause-clear
  • Base model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  • Paper: [to be added upon publication]
  • Training data and code: [to be added upon publication]

Description

PopulationHealthScreener (Anthro) is trained to screen articles reporting objective measurements of anthropometrics (height, weight, and waist and hip circumference) from population-representative samples of people aged five years and older. These data sources are essential for the population health surveillance of underweight, overweight, and obesity, which affect billions of people worldwide.

PopulationHealthScreener (Anthro) classifies each abstract as inclusion or exclusion. Users choose a recall target, i.e. the proportion of truly suitable articles to retrieve, and the model screens accordingly. For example, at a recall target of 95%, a user can expect to miss only 5% of the truly suitable articles while eliminating the need to manually review around 65% of articles (work saved). Among the articles retained after screening, the user can expect 56% to be truly suitable (precision).

Expected work saved and precision at each target recall level

  Recall   Work saved   Precision
  95%      65.7%        56.1%
  90%      71.3%        63.5%
  80%      78.5%        75.5%
  70%      82.2%        79.4%
  60%      85.5%        83.4%
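Given a small labelled validation set, a recall target from the table above can be converted into a probability threshold. A minimal sketch using scikit-learn (the helper name and the toy data are illustrative, not part of the model):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, probs, target_recall):
    """Highest probability threshold whose recall still meets the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, probs)
    # thresholds has one fewer entry than recall; recall falls as the threshold rises
    ok = recall[:-1] >= target_recall
    return float(thresholds[ok].max()) if ok.any() else 0.0

# Toy validation labels and model probabilities
y_val = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
p_val = np.array([0.10, 0.20, 0.90, 0.80, 0.60, 0.30, 0.70, 0.40, 0.55, 0.95])
thr = threshold_for_recall(y_val, p_val, 0.90)
```

Articles with predicted probability at or above `thr` would then be retained for manual review.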

Alternatively, users can use only the encoder component of the model, called PopulationHealthBERT, to encode abstracts into vectors. Since PopulationHealthBERT is good at distinguishing sampling designs that are suitable for population health surveillance, its vectors can be used as input to train a classifier to screen articles for a different health metric.


How to use: PopulationHealthScreener (Anthro)

PopulationHealthScreener (Anthro) can be used via an open-access interface, either with the launch button at the top of the page or by clicking this link.

Alternatively, PopulationHealthScreener (Anthro) can be used directly in Python:

import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# ----- Configuration -----
THRESHOLD = 0.5  # Change this as needed
BASE_MODEL = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
FINE_TUNED_MODEL = "ncdrisc/PopulationHealthScreener-Anthro"

# ----- Model deployment -----

# Load articles to screen
df = pd.read_csv("new_articles_to_screen.csv")  # Required columns: "title", "abstract"

# Concatenate title and abstract
df["title_abstract"] = df["title"].fillna("") + " " + df["abstract"].fillna("")

# Tokenise the text
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
inputs = tokenizer(
    df["title_abstract"].tolist(),
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Instantiate the model
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=1,
    problem_type="multi_label_classification"
)
model.load_state_dict(
    torch.load(
        hf_hub_download(
            repo_id=FINE_TUNED_MODEL,
            filename="model.pth"
        ),
        map_location="cpu"  # load safely on machines without a GPU
    ),
    strict=False
)

# Make predictions
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).squeeze(-1)
    preds = (probs >= THRESHOLD).long()

# Store results
df["prob"] = probs.numpy()
df["prediction"] = preds.numpy()

# Save output
df.to_csv("predictions.csv", index=False)
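The fixed THRESHOLD above trades recall against work saved. On a labelled validation set, the screening metrics quoted in this card can be checked with a few lines of NumPy (a sketch; the function name and toy data are illustrative):

```python
import numpy as np

def screening_metrics(labels, preds):
    """Recall, work saved, and precision for a binary screening run."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    recall = preds[labels == 1].mean()     # share of truly suitable articles retained
    work_saved = (preds == 0).mean()       # share of articles excluded from manual review
    precision = labels[preds == 1].mean()  # share of retained articles that are suitable
    return recall, work_saved, precision

# Toy example: 6 articles, 3 truly suitable
recall, work_saved, precision = screening_metrics([1, 1, 0, 0, 0, 1], [1, 1, 0, 0, 1, 1])
```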

How to use: the fine-tuned encoder PopulationHealthBERT

PopulationHealthBERT is the fine-tuned encoder component of PopulationHealthScreener (Anthro). It produces a 768-dimensional vector representation of each abstract.

PopulationHealthBERT has been trained to encode features of sampling design that determine whether a data source is suitable for population health surveillance. For example, it distinguishes measurements from primary school students from measurements from university students, even when the two are described in similar language. The encodings can be used to visualise and explore a set of abstracts.
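As a sketch of that exploration step, the 768-dimensional encodings can be projected to two dimensions with PCA before plotting; random vectors stand in here for real PopulationHealthBERT output:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for PopulationHealthBERT encodings: one 768-dimensional vector per abstract
rng = np.random.default_rng(0)
encodings = rng.normal(size=(100, 768))

# Project to 2D for exploration (e.g. with matplotlib's scatter)
coords = PCA(n_components=2).fit_transform(encodings)
```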

Researchers can also leverage PopulationHealthBERT to train a new classifier for a different application. They would start with a dataset of abstracts labelled according to any number of specific inclusion and exclusion criteria. They would directly use PopulationHealthBERT to encode the abstracts and then train a classifier on these pre-generated encodings, making the training stage efficient. Once trained, the model can be used directly to classify new abstracts for the same application, much like PopulationHealthScreener (Anthro) can be used for the surveillance of underweight, overweight, and obesity.

To train a new model and use it to screen articles:

import pandas as pd
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression  # or any other classifier

# ----- Configuration -----
BASE_MODEL = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
FINE_TUNED_MODEL = "ncdrisc/PopulationHealthScreener-Anthro"
THRESHOLD = 0.5  # adjust as needed

# ----- Model training -----

# Load labelled dataset for training
train_df = pd.read_csv("train.csv")  # Required columns: "title", "abstract", "label"

# Concatenate title and abstract
train_df["title_abstract"] = train_df["title"].fillna("") + " " + train_df["abstract"].fillna("")

# Tokenise the text
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
train_inputs = tokenizer(
    train_df["title_abstract"].tolist(),
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Instantiate the encoder
model = AutoModel.from_pretrained(BASE_MODEL)
model.load_state_dict(
    torch.load(
        hf_hub_download(
            repo_id=FINE_TUNED_MODEL,
            filename="model.pth"
        ),
        map_location="cpu"  # load safely on machines without a GPU
    ),
    strict=False
)

# Generate encodings
model.eval()
with torch.no_grad():
    train_outputs = model(**train_inputs)
    train_encodings = train_outputs.last_hidden_state[:, 0, :]  # CLS token

# Train the classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(train_encodings.numpy(), train_df["label"].values)

# ----- Model deployment -----

# Load articles to screen
df = pd.read_csv("new_articles_to_screen.csv")  # Required columns: "title", "abstract"

# Concatenate title and abstract
df["title_abstract"] = df["title"].fillna("") + " " + df["abstract"].fillna("")

# Tokenise the text
inputs = tokenizer(
    df["title_abstract"].tolist(),
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

# Generate encodings
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    encodings = outputs.last_hidden_state[:, 0, :]

# Make predictions
probs = clf.predict_proba(encodings.numpy())[:, 1]  # probability of positive class
preds = (probs >= THRESHOLD).astype(int)

# Store results
df["prob"] = probs
df["prediction"] = preds

# Save output
df.to_csv("predictions.csv", index=False)

Training details

Data

  • Source: NCD Risk Factor Collaboration (NCD-RisC). Trends in adult body-mass index in 200 countries from 1975 to 2014: a pooled analysis of 1698 population-based measurement studies with 19.2 million participants. Lancet 2016, 387:1377-1396
  • 20,694 articles; 4,174 inclusions (20.2%)
  • Inclusion and exclusion criteria: articles reporting objective measurements of height and weight from population-representative samples aged five years and older; see detailed criteria here
  • Labels assigned by expert reviewers; consensus applied under uncertainty
  • Input: titles and abstracts, tokenised with BiomedBERT tokeniser (uncased), truncated to 512 tokens

Training procedure

  • Base model: BiomedBERT (microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext), 12 encoder blocks
  • First 6 encoder blocks frozen throughout training
  • Classification error backpropagated through the classifier and the last 6 encoder blocks
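The freezing scheme can be sketched with the Hugging Face BertModel API. A tiny randomly initialised configuration stands in for BiomedBERT here (whether the embedding layer was also frozen is not stated in this card):

```python
from transformers import BertConfig, BertModel

# Tiny stand-in for the 12-layer BiomedBERT architecture (smaller dims, random weights)
config = BertConfig(hidden_size=32, num_hidden_layers=12, num_attention_heads=2,
                    intermediate_size=64, vocab_size=100)
model = BertModel(config)

# Freeze the first 6 encoder blocks; gradients flow only through the last 6
for layer in model.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False
```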