LABOR-LLM Replication Models
This repository contains replication models for the paper "LABOR-LLM: Language-Based Occupational Representations with Large Language Models" (Athey et al., 2024).
Link to the paper: https://arxiv.org/abs/2406.17972
These models are Llama-2 checkpoints fine-tuned on longitudinal survey data (NLSY79 and NLSY97) to predict labor market transitions. By converting tabular career histories into text-based "resumes," these models leverage the semantic knowledge of LLMs to outperform traditional econometric benchmarks in predicting a worker's next occupation.
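As a rough illustration of that conversion, the sketch below serializes a toy tabular record into a resume-style prompt using the same phrasing as the example in the Usage section. The exact templates used for fine-tuning are defined in the paper and its replication code, so treat this as illustrative only.
```python
# Illustrative sketch only: the exact resume templates are defined in the paper's
# replication code; the phrasing here just mirrors the example prompt shown below.
def career_to_resume(birth_year: int, history: list[tuple[int, str]]) -> str:
    """Serialize a tabular career history into a text 'resume' prompt."""
    parts = [f"Born in {birth_year}."]
    for year, job_title in history:
        parts.append(f"In {year}, works as a {job_title}.")
    return " ".join(parts)

print(career_to_resume(1980, [(2002, "Waiter"), (2003, "Cook")]))
# -> "Born in 1980. In 2002, works as a Waiter. In 2003, works as a Cook."
```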
Model Variants & Ablations
The repository hosts 12 model checkpoints. You must specify the subfolder argument to load the variant you want.
| Model Size | Dataset | Variant Type | Description |
|---|---|---|---|
| 7B / 13B | NLSY79 / NLSY97 | with_birth_year | Main Model. Uses natural language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| 7B | NLSY79 / NLSY97 | numeric | Ablation Baseline. Replaces natural language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |
Note on PSID Models: Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).
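To make the numeric ablation concrete, here is a hypothetical contrast between the two prompt styles. The occupation codes shown are invented; the actual ablation uses the occupation coding described in the paper.
```python
# Hypothetical illustration: the numeric code below is made up; the real ablation
# replaces job titles with the dataset's occupation codes (see the paper).
natural_language = "Born in 1980. In 2002, works as a Waiter."
numeric_ablation = "Born in 1980. In 2002, works as occupation 4110."
```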
Understanding Checkpoints (ckpt_3 vs ckpt_bo5)
- `ckpt_3`: The model with the lowest validation loss from a training run scheduled for 3 epochs.
- `ckpt_bo5`: The model with the lowest validation loss from a training run scheduled for 5 epochs.
Note: Due to learning rate scheduling (e.g., warmup and decay steps being calculated based on total epochs), the first 3 epochs of the "5-epoch run" are different from the "3-epoch run." Therefore, ckpt_bo5 is not simply a longer version of ckpt_3; they represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` model achieves better performance.
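Because the checkpoint choice is encoded in the subfolder name, switching between the two is just a change of suffix, e.g.:
```python
# Same configuration, two distinct training trajectories; only the checkpoint suffix differs.
variant_3_epochs = "ft_7b_NLSY79_with_birth_year_ckpt_3"
variant_best_of_5 = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"  # empirically the stronger choice
```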
Usage
Crucial: You cannot load the model using just the repository ID. You must provide the subfolder name corresponding to the specific variant you want to use.
Installation
```bash
pip install transformers torch accelerate
```
(`accelerate` is required for `device_map="auto"` in the loading example below.)
Loading a Model
You can load the model either directly from the Hugging Face Hub or by downloading the checkpoints manually to your local disk.
Note: For a complete usage demonstration, see the demo_notebook.ipynb included in this repository.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best-of-5 checkpoint)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; check the paper for exact templates):
input_text = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
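For the paper's core task, next-occupation prediction, you typically want probabilities over candidate occupations rather than free-form generations. The sketch below (continuing from the snippet above, so it reuses `model`, `tokenizer`, and `input_text`) scores each candidate title by the summed log-probability of its tokens as a continuation of the career-history prompt. This is only an illustrative approach, not the paper's evaluation code, and `score_candidates` is a hypothetical helper; details such as leading-space tokenization and normalization over the full occupation set are left to the reader.
```python
import torch
import torch.nn.functional as F

# Assumes `model`, `tokenizer`, and `input_text` from the loading snippet above.
def score_candidates(prompt: str, candidates: list[str]) -> dict[str, float]:
    """Score candidate job titles by their summed log-probability as continuations of the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    scores = {}
    for title in candidates:
        cand_ids = tokenizer(" " + title, add_special_tokens=False,
                             return_tensors="pt").input_ids.to(model.device)
        input_ids = torch.cat([prompt_ids, cand_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Positions P-1 .. T-2 predict the candidate tokens at positions P .. T-1.
        log_probs = F.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
        token_log_probs = log_probs.gather(1, cand_ids[0].unsqueeze(1)).squeeze(1)
        scores[title] = token_log_probs.sum().item()
    return scores

print(score_candidates(input_text, ["Waiter", "Cook", "Registered Nurse"]))
```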
Full List of Subfolders
Main Models (Natural Language + Birth Year)
- ft_7b_NLSY79_with_birth_year_ckpt_3
- ft_7b_NLSY79_with_birth_year_ckpt_bo5
- ft_7b_NLSY97_with_birth_year_ckpt_3
- ft_7b_NLSY97_with_birth_year_ckpt_bo5
- ft_13b_NLSY79_with_birth_year_ckpt_3
- ft_13b_NLSY79_with_birth_year_ckpt_bo5
- ft_13b_NLSY97_with_birth_year_ckpt_3
- ft_13b_NLSY97_with_birth_year_ckpt_bo5
Ablation Models (Numeric Codes)
- ft_7b_NLSY79_numeric_ckpt_3
- ft_7b_NLSY79_numeric_ckpt_bo5
- ft_7b_NLSY97_numeric_ckpt_3
- ft_7b_NLSY97_numeric_ckpt_bo5
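Since these names follow a regular pattern, a small helper (hypothetical, not part of this repository) can assemble them from their components:
```python
def subfolder_name(size: str, dataset: str, variant: str, checkpoint: str) -> str:
    """Hypothetical helper to build a subfolder name from its components.

    size: "7b" or "13b"; dataset: "NLSY79" or "NLSY97";
    variant: "with_birth_year" or "numeric"; checkpoint: "3" or "bo5".
    Note that the numeric ablation is only released for the 7B models (see the lists above).
    """
    return f"ft_{size}_{dataset}_{variant}_ckpt_{checkpoint}"

assert subfolder_name("7b", "NLSY79", "with_birth_year", "bo5") == "ft_7b_NLSY79_with_birth_year_ckpt_bo5"
```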
Citation
If you use these models, please cite the original paper:
```bibtex
@article{athey2024labor,
  title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
  author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
  journal={arXiv preprint arXiv:2406.17972},
  year={2024},
  url={https://arxiv.org/abs/2406.17972}
}
```