LAPVQA - Pretrain (Sigmoid)

Part of the LAPVQA collection.

Description

A ViT-L/14 vision encoder trained from scratch on MIMIC-CXR using a sigmoid (multi-label binary cross-entropy) contrastive loss. Unlike InfoNCE, where pairs compete within the batch through a softmax, this loss treats each image-text pair as an independent binary decision.
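The per-pair formulation can be sketched in PyTorch as follows. This is a minimal illustration of a SigLIP-style loss, not the training code from this repository: the function name, the fixed `temperature` and `bias` values, and the random toy embeddings are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def sigmoid_contrastive_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    img_emb, txt_emb: (N, D) tensors; row i of each forms a matched pair.
    temperature and bias are illustrative constants (assumed, not from the repo).
    """
    # Cosine similarity between every image and every text: shape (N, N).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = temperature * (img_emb @ txt_emb.t()) + bias

    # Target is +1 on the diagonal (matched pairs) and -1 everywhere else.
    n = logits.size(0)
    labels = 2.0 * torch.eye(n, device=logits.device, dtype=logits.dtype) - 1.0

    # Each pair is scored on its own; there is no softmax over the batch.
    return -F.logsigmoid(labels * logits).sum() / n


torch.manual_seed(0)
img = torch.randn(4, 512)  # toy image embeddings
txt = torch.randn(4, 512)  # toy text embeddings, unrelated to img
loss_random = sigmoid_contrastive_loss(img, txt)
loss_aligned = sigmoid_contrastive_loss(img, img)  # perfectly matched pairs
```

With unrelated embeddings the matched pairs are scored as negatives, so `loss_random` comes out larger than `loss_aligned`.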

Architecture

| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24 layers, 1024-dim, 16 heads, patch size 14, 384 px input |
| Text encoder | 6-layer, 512-dim bidirectional transformer, GPT-2 vocabulary (50,257 tokens) |
| Projection | Linear → 512-dim shared embedding space |
| Loss | Per-pair sigmoid BCE (SigLIP-style) |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
| Epochs | 50 |

Downstream Evaluation (frozen encoder + linear probe)

| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.650 |
| CheXpert-5 (5-class) | 0.785 |
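For reference, a mean AUC over a multi-label dataset can be computed as sketched below. The sketch assumes the reported figure is a macro-average (one AUC per class, then an unweighted mean), which the card does not state explicitly. The helper names and toy values are made up for illustration.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_rows, score_rows):
    """Macro-average: one AUC per class column, then the unweighted mean."""
    per_class = [
        binary_auc(list(ys), list(ss))
        for ys, ss in zip(zip(*label_rows), zip(*score_rows))
    ]
    return sum(per_class) / len(per_class)


# Toy data: 4 images, 2 classes (values are invented for the example).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.7, 0.3], [0.4, 0.6]]
result = mean_auc(y_true, y_score)  # class AUCs are 1.0 and 0.75
```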

Files

| File | Description |
|---|---|
| encoder_final.pt | Vision encoder weights at end of training |
| model_best.pt | Full model at best validation loss |
| model_epochXXX.pt | Periodic epoch snapshots (every 10 epochs) |

Usage

import torch
from lapvqa.pretrain.model import ContrastiveModel

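# Load the vision-encoder weights onto the CPU.
ckpt = torch.load("encoder_final.pt", map_location="cpu")

# Build the full contrastive model, then load the weights into its vision encoder only.
model = ContrastiveModel()
model.vision_encoder.load_state_dict(ckpt)
model.eval()  # inference mode: disables dropout and other training-only behavior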
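To reuse the encoder as a frozen feature extractor with a linear probe, as in the downstream evaluation, one training step could look roughly like the sketch below. The `encoder` here is a small stand-in module so the example runs on its own. The 1024-dim feature size, the 14-label head, the optimizer settings and the random toy data are all assumptions, not details taken from the repository.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the loaded vision encoder: image -> 1024-dim feature.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 1024))

# Freeze the encoder so only the probe receives gradient updates.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# Linear probe: one logit per label (14 here, matching NIH CXR-14).
probe = nn.Linear(1024, 14)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # multi-label targets

# Toy batch of 8 small random "images" with random binary labels.
images = torch.randn(8, 3, 28, 28)
targets = torch.randint(0, 2, (8, 14)).float()

# Features are computed without tracking gradients through the encoder.
with torch.no_grad():
    feats = encoder(images)

loss = criterion(probe(feats), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the features are computed under `torch.no_grad()`, they could also be cached once per dataset and reused across probe epochs.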

Collection including dmusingu/lapvqa-pretrain-sigmoid