| library_name: transformers | |
| pipeline_tag: text-generation | |
| # ChatNT | |
| [ChatNT](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1) is the first multimodal conversational agent designed with a deep understanding of biological sequences (DNA, RNA, proteins). | |
| It enables users, even those with no coding background, to interact with biological data through natural language, and it generalizes across multiple biological tasks and modalities. | |
| **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) | |
| ### Model Sources | |
| - **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) | |
| - **Paper:** [ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1.full.pdf) | |
| ### License Summary | |
| 1. The Licensed Models are **only** available under this License for Non-Commercial Purposes. | |
| 2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License. | |
| 3. You may **not** use the Licensed Models or any of its Outputs in connection with: | |
|    1. any Commercial Purposes, unless agreed by Us under a separate licence; | |
|    2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models; | |
|    3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or | |
|    4. in violation of any applicable laws and regulations. | |
| ### Architecture and Parameters | |
| ChatNT is built from three modules: a 500M‑parameter [Nucleotide Transformer v2](https://www.nature.com/articles/s41592-024-02523-z) DNA encoder pre‑trained on genomes from 850 species | |
| (handling up to 12 kb per sequence, Dalla‑Torre et al., 2024); an English‑aware Perceiver Resampler that linearly projects 2048 DNA‑token embeddings and compresses them, | |
| via gated attention, into 64 task‑conditioned vectors (REF); and a frozen 7B‑parameter [Vicuna‑7B](https://lmsys.org/blog/2023-03-30-vicuna/) decoder. | |
| Users provide a natural‑language prompt containing one or more `<DNA>` placeholders, together with the corresponding DNA sequences (tokenized as 6‑mers). | |
| The projection layer inserts the 64 resampled DNA embeddings at each placeholder, and the Vicuna decoder generates free‑form English responses | |
| autoregressively, using low‑temperature sampling to produce classification labels, multi‑label statements, or numeric values. | |
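
The placeholder substitution described above can be pictured with a toy example. The sketch below is an illustration only, not the ChatNT implementation: the function `splice_dna_embeddings`, the whitespace-split prompt, and the strings standing in for embedding vectors are all hypothetical. It shows how each `<DNA>` position expands into 64 resampled vectors, so the decoder sees a longer sequence than the prompt.

```python
from typing import List, Sequence

NUM_RESAMPLED = 64  # resampled vectors per DNA sequence, as stated in the text


def splice_dna_embeddings(
    prompt_tokens: Sequence[str],
    dna_blocks: Sequence[Sequence[str]],
    placeholder: str = "<DNA>",
) -> List[str]:
    """Replace each placeholder with its block of resampled vectors.

    Strings stand in for embedding vectors; this is a toy illustration.
    """
    if list(prompt_tokens).count(placeholder) != len(dna_blocks):
        raise ValueError("one DNA block is required per <DNA> placeholder")
    blocks = iter(dna_blocks)
    spliced: List[str] = []
    for token in prompt_tokens:
        if token == placeholder:
            spliced.extend(next(blocks))  # 64 vectors take the placeholder's place
        else:
            spliced.append(token)
    return spliced


prompt = "Is there a splice site in <DNA> ?".split()
block = [f"dna_vec_{i}" for i in range(NUM_RESAMPLED)]
decoder_inputs = splice_dna_embeddings(prompt, [block])
print(len(prompt), len(decoder_inputs))  # 8 71
```

The 8-token prompt becomes 71 positions: the 7 English tokens plus 64 DNA vectors.
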
| ### Training Data | |
| ChatNT was instruction‑tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes. | |
| This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for 2 billion instruction tokens. | |
| Examples of questions and sequences for each task, as well as additional task information, can be found in [Datasets_overview.csv](Datasets_overview.csv). | |
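
The base count quoted above follows from the 6-mer tokenization. A quick check, assuming exactly six bases per DNA token and ignoring padding and special tokens:

```python
BASES_PER_TOKEN = 6  # assumption: one 6-mer token covers six nucleotides

dna_tokens = 605_000_000
dna_bases = dna_tokens * BASES_PER_TOKEN
print(f"{dna_bases / 1e9:.2f} billion bases")  # 3.63 billion bases

max_tokens = 2048  # maximum DNA tokens per sequence
max_bases = max_tokens * BASES_PER_TOKEN
print(f"{max_bases / 1000:.1f} kb")  # 12.3 kb
```
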
| ### Tokenization | |
| DNA inputs are split into non‑overlapping 6‑mer tokens and padded or truncated to 2048 tokens (~ 12 kb). English prompts and | |
| outputs use the LLaMA tokenizer, augmented with `<DNA>` as a special token to mark sequence insertion points. | |
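
To make the DNA tokenization concrete, here is a simplified sketch of splitting a sequence into 6-mers and fixing its length. It is not the `bio_tokenizer` shipped with the model: the function name `kmer_tokenize`, the `<pad>` string, and the choice to emit leftover bases as single-nucleotide tokens are assumptions for illustration, and special tokens are omitted. It also assumes consecutive, non-overlapping 6-mers, which is what the figure of 2048 tokens for roughly 12 kb implies.

```python
from typing import List


def kmer_tokenize(
    seq: str, k: int = 6, max_length: int = 2048, pad: str = "<pad>"
) -> List[str]:
    """Split a DNA string into consecutive k-mers, then pad or truncate."""
    seq = seq.upper()
    n_full = len(seq) // k * k  # number of bases covered by complete k-mers
    tokens = [seq[i:i + k] for i in range(0, n_full, k)]
    tokens.extend(seq[n_full:])  # assumption: leftover bases become single-base tokens
    tokens = tokens[:max_length]  # truncate long inputs
    return tokens + [pad] * (max_length - len(tokens))  # pad short inputs


tokens = kmer_tokenize("ATCGGAAAAAGATCC", max_length=8)
print(tokens)
# ['ATCGGA', 'AAAAGA', 'T', 'C', 'C', '<pad>', '<pad>', '<pad>']
```
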
| ### Limitations and Disclaimer | |
| ChatNT can only handle questions related to the 27 tasks it was trained on, with DNA sequences in the same format as the training data. ChatNT is **not** a clinical or diagnostic tool. | |
| It can produce incorrect or “hallucinated” answers, particularly on out‑of‑distribution inputs, and its numeric predictions may suffer digit‑level errors. Confidence | |
| estimates require post‑hoc calibration. Users should always validate critical outputs against experiments or specialized bioinformatics | |
| pipelines. | |
| ### Other notes | |
| We also provide the parameters for the JAX version of ChatNT in `jax_params`. | |
| ## How to use | |
| Until the next release of the transformers library, it must be installed from source with the following command in order to use the model. | |
| PyTorch must also be installed. | |
| ``` | |
| pip install --upgrade git+https://github.com/huggingface/transformers.git | |
| pip install torch sentencepiece | |
| ``` | |
| The snippet below **generates ChatNT answers from a pipeline (high-level)**. | |
| - The prompt used to train ChatNT is already built into the pipeline and reads: | |
| "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, | |
| detailed, and polite answers to the user's questions." | |
| ``` | |
| # Load pipeline | |
| from transformers import pipeline | |
| pipe = pipeline(model="InstaDeepAI/ChatNT", trust_remote_code=True) | |
| # Define custom inputs (note that the number of <DNA> tokens in the English sequence must equal len(dna_sequences)) | |
| english_sequence = "Is there any evidence of an acceptor splice site in this sequence <DNA> ?" | |
| dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"] | |
| # Generate sequence | |
| generated_english_sequence = pipe( | |
|     inputs={ | |
|         "english_sequence": english_sequence, | |
|         "dna_sequences": dna_sequences | |
|     } | |
| ) | |
| # Expected output: "Yes, an acceptor splice site is without question present in the sequence." | |
| ``` | |
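
Because the pipeline expects exactly one `<DNA>` placeholder per sequence, it can help to check inputs before calling it. The helper below is a hypothetical convenience, not part of the ChatNT pipeline: the name `check_chatnt_inputs` and the restriction to the characters A, C, G, T and N are assumptions.

```python
from typing import Sequence


def check_chatnt_inputs(english_sequence: str, dna_sequences: Sequence[str]) -> None:
    """Raise ValueError if placeholders and sequences do not line up."""
    n_placeholders = english_sequence.count("<DNA>")
    if n_placeholders != len(dna_sequences):
        raise ValueError(
            f"found {n_placeholders} <DNA> placeholder(s) "
            f"but {len(dna_sequences)} DNA sequence(s)"
        )
    for i, seq in enumerate(dna_sequences):
        # Assumption: only these characters are accepted as nucleotides.
        invalid = set(seq.upper()) - set("ACGTN")
        if invalid:
            raise ValueError(
                f"sequence {i} contains unexpected characters: {sorted(invalid)}"
            )


# Passes silently: one placeholder, one valid sequence.
check_chatnt_inputs("Is there a splice site in this sequence <DNA> ?", ["ATCGGAAA"])
```
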
| The snippet below **runs inference with the model directly, without the pipeline abstraction (low-level)**. | |
| ``` | |
| import numpy as np | |
| from transformers import AutoModel, AutoTokenizer | |
| # Load model and tokenizers | |
| model = AutoModel.from_pretrained("InstaDeepAI/ChatNT", trust_remote_code=True) | |
| english_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="english_tokenizer") | |
| bio_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="bio_tokenizer") | |
| # Define custom inputs (note that the number of <DNA> tokens in the English sequence must equal len(dna_sequences)) | |
| # Here the English sequence should include the prompt | |
| english_sequence = "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?" | |
| dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"] | |
| # Tokenize | |
| english_tokens = english_tokenizer(english_sequence, return_tensors="pt", padding="max_length", truncation=True, max_length=512).input_ids | |
| bio_tokens = bio_tokenizer(dna_sequences, return_tensors="pt", padding="max_length", max_length=512, truncation=True).input_ids.unsqueeze(0) # unsqueeze to simulate batch_size = 1 | |
| # Predict | |
| outs = model( | |
|     multi_omics_tokens_ids=(english_tokens, bio_tokens), | |
|     projection_english_tokens_ids=english_tokens, | |
|     projected_bio_embeddings=None, | |
| ) | |
| # Expected output: Dictionary of logits and projected_bio_embeddings | |
| ``` |