Tags: Image-Text-to-Text · Transformers · Safetensors · trinity_vlm · text-generation · vision-language-model · multimodal · custom_code · trinity · moondream · conversational
```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "NyxKrage/TrinityVLM-Nano", trust_remote_code=True, dtype="auto"
)
```
# TrinityVLM

Trinity VLM is a vision-language model built on top of arcee-ai/Trinity-Nano-Preview, using the vision encoder extracted from moondream/moondream3-preview.

This is not intended to be a good model; it is only an experiment in adding vision capabilities to a text-only model from scratch.
The model is trained using the following datamix:
- 20% anthracite-org/pixmo-cap-images
- 30% anthracite-org/pixmo-cap-qa-images
- 25% anthracite-org/pixmo-point-explanations-images
- 25% nvidia/Llama-Nemotron-Post-Training-Dataset chat examples, with irrelevant PixMo images attached to avoid overfitting on image explanation when the prompt does not require image context.
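The 20/30/25/25 split above can be sketched as weighted sampling over the source datasets. The snippet below is a hypothetical illustration of those proportions only (the dataset names mirror the Hugging Face repos; the sampling function itself is an assumption, not the actual training code).

```python
import random

# Datamix proportions from the model card; the mixing logic is illustrative.
DATAMIX = {
    "anthracite-org/pixmo-cap-images": 0.20,
    "anthracite-org/pixmo-cap-qa-images": 0.30,
    "anthracite-org/pixmo-point-explanations-images": 0.25,
    "nvidia/Llama-Nemotron-Post-Training-Dataset": 0.25,
}

def sample_source(rng: random.Random) -> str:
    """Pick the dataset to draw the next training example from."""
    names = list(DATAMIX)
    weights = list(DATAMIX.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Drawing many samples recovers roughly the stated proportions.
rng = random.Random(0)
counts = {name: 0 for name in DATAMIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

In an actual training loop this kind of per-example sampling is usually done once per batch element, so the realized mix fluctuates slightly around the target ratios.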
The model is licensed under the BSL 1.1 terms of Moondream 3.
Model tree for NyxKrage/TrinityVLM-Nano:
- arcee-ai/Trinity-Nano-Base-Pre-Anneal (base model)
- arcee-ai/Trinity-Nano-Base (finetuned from the above)
- arcee-ai/Trinity-Nano-Preview (finetuned from the above)
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="NyxKrage/TrinityVLM-Nano", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```