LABOR-LLM Replication Models
This repository contains replication models for the paper "LABOR-LLM: Language-Based Occupational Representations with Large Language Models" (Athey et al., 2024).
Link to the paper: https://arxiv.org/abs/2406.17972
These models are Llama-2 checkpoints fine-tuned on longitudinal survey data (NLSY79 and NLSY97) to predict labor market transitions. By converting tabular career histories into text-based "resumes," these models leverage the semantic knowledge of LLMs to outperform traditional econometric benchmarks in predicting a worker's next occupation.
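As a rough illustration of that conversion, the sketch below serializes a toy tabular record into a resume-style prompt using the same phrasing as the example in the Usage section. The exact templates used for fine-tuning are defined in the paper and its replication code, so treat this as illustrative only.
```python
# Illustrative sketch only: the exact resume templates are defined in the paper's
# replication code; the phrasing here just mirrors the example prompt shown below.
def career_to_resume(birth_year: int, history: list[tuple[int, str]]) -> str:
    """Serialize a tabular career history into a text 'resume' prompt."""
    parts = [f"Born in {birth_year}."]
    for year, job_title in history:
        parts.append(f"In {year}, works as a {job_title}.")
    return " ".join(parts)

print(career_to_resume(1980, [(2002, "Waiter"), (2003, "Cook")]))
# -> "Born in 1980. In 2002, works as a Waiter. In 2003, works as a Cook."
```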
Model Variants & Ablations
The repository hosts 12 model checkpoints. You must specify the subfolder argument to load the variant you want.
| Model Size | Dataset | Variant Type | Description |
|---|---|---|---|
| 7B / 13B | NLSY79 / NLSY97 | with_birth_year | Main Model. Uses natural language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| 7B | NLSY79 / NLSY97 | numeric | Ablation Baseline. Replaces natural language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |
Note on PSID Models: Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).
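To make the numeric ablation concrete, here is a hypothetical contrast between the two prompt styles. The occupation codes shown are invented; the actual ablation uses the occupation coding described in the paper.
```python
# Hypothetical illustration: the numeric code below is made up; the real ablation
# replaces job titles with the dataset's occupation codes (see the paper).
natural_language = "Born in 1980. In 2002, works as a Waiter."
numeric_ablation = "Born in 1980. In 2002, works as occupation 4110."
```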
Understanding Checkpoints (ckpt_3 vs ckpt_bo5)
- `ckpt_3`: The model with the lowest validation loss from a training run scheduled for 3 epochs.
- `ckpt_bo5`: The model with the lowest validation loss from a training run scheduled for 5 epochs.
Note: Due to learning rate scheduling (e.g., warmup and decay steps being calculated based on total epochs), the first 3 epochs of the "5-epoch run" are different from the "3-epoch run." Therefore, ckpt_bo5 is not simply a longer version of ckpt_3; they represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` model achieves better performance.
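Because the checkpoint choice is encoded in the subfolder name, switching between the two is just a change of suffix, e.g.:
```python
# Same configuration, two distinct training trajectories; only the checkpoint suffix differs.
variant_3_epochs = "ft_7b_NLSY79_with_birth_year_ckpt_3"
variant_best_of_5 = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"  # empirically the stronger choice
```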
Usage
Crucial: You cannot load the model using just the repository ID. You must provide the subfolder name corresponding to the specific variant you want to use.
Installation
```bash
pip install transformers torch accelerate
```
(`accelerate` is required for `device_map="auto"` in the loading example below.)
Loading a Model
You can load the model either directly from the Hugging Face Hub or by downloading the checkpoints manually to your local disk.
Note: For a complete usage demonstration, see the demo_notebook.ipynb included in this repository.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best-of-5 checkpoint)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; check the paper for exact templates):
input_text = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
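For the paper's core task, next-occupation prediction, you typically want probabilities over candidate occupations rather than free-form generations. The sketch below (continuing from the snippet above, so it reuses `model`, `tokenizer`, and `input_text`) scores each candidate title by the summed log-probability of its tokens as a continuation of the career-history prompt. This is only an illustrative approach, not the paper's evaluation code, and `score_candidates` is a hypothetical helper; details such as leading-space tokenization and normalization over the full occupation set are left to the reader.
```python
import torch
import torch.nn.functional as F

# Assumes `model`, `tokenizer`, and `input_text` from the loading snippet above.
def score_candidates(prompt: str, candidates: list[str]) -> dict[str, float]:
    """Score candidate job titles by their summed log-probability as continuations of the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    scores = {}
    for title in candidates:
        cand_ids = tokenizer(" " + title, add_special_tokens=False,
                             return_tensors="pt").input_ids.to(model.device)
        input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Positions P-1 .. T-2 predict the candidate tokens at positions P .. T-1.
        log_probs = F.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
        token_log_probs = log_probs.gather(1, cand_ids[0].unsqueeze(1)).squeeze(1)
        scores[title] = token_log_probs.sum().item()
    return scores

print(score_candidates(input_text, ["Waiter", "Cook", "Registered Nurse"]))
```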
Full List of Subfolders
Main Models (Natural Language + Birth Year)
- ft_7b_NLSY79_with_birth_year_ckpt_3
- ft_7b_NLSY79_with_birth_year_ckpt_bo5
- ft_7b_NLSY97_with_birth_year_ckpt_3
- ft_7b_NLSY97_with_birth_year_ckpt_bo5
- ft_13b_NLSY79_with_birth_year_ckpt_3
- ft_13b_NLSY79_with_birth_year_ckpt_bo5
- ft_13b_NLSY97_with_birth_year_ckpt_3
- ft_13b_NLSY97_with_birth_year_ckpt_bo5
Ablation Models (Numeric Codes)
- ft_7b_NLSY79_numeric_ckpt_3
- ft_7b_NLSY79_numeric_ckpt_bo5
- ft_7b_NLSY97_numeric_ckpt_3
- ft_7b_NLSY97_numeric_ckpt_bo5
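Since these names follow a regular pattern, a small helper (hypothetical, not part of this repository) can assemble them from their components:
```python
def subfolder_name(size: str, dataset: str, variant: str, checkpoint: str) -> str:
    """Hypothetical helper to build a subfolder name from its components.

    size: "7b" or "13b"; dataset: "NLSY79" or "NLSY97";
    variant: "with_birth_year" or "numeric"; checkpoint: "3" or "bo5".
    Note that the numeric ablation is only released for the 7B models (see the lists above).
    """
    return f"ft_{size}_{dataset}_{variant}_ckpt_{checkpoint}"

assert subfolder_name("7b", "NLSY79", "with_birth_year", "bo5") == "ft_7b_NLSY79_with_birth_year_ckpt_bo5"
```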
Citation
If you use these models, please cite the original paper:
```bibtex
@article{athey2024labor,
  title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
  author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
  journal={arXiv preprint arXiv:2406.17972},
  year={2024},
  url={https://arxiv.org/abs/2406.17972}
}
```