	---
	library_name: transformers
	pipeline_tag: text-generation
	---

	# ChatNT

	[ChatNT](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1) is the first multimodal conversational agent designed with a deep understanding of biological sequences (DNA, RNA, proteins).
	It lets users, including those with no coding background, interact with biological data through natural language, and it generalizes across multiple biological tasks and modalities.

	Developed by: [InstaDeep](https://huggingface.co/InstaDeepAI)

	### Model Sources

	- Repository: [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
	- Paper: [ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1.full.pdf)


	### License Summary
	1. The Licensed Models are only available under this License for Non-Commercial Purposes.
	2. You are permitted to reproduce, publish, share and adapt the Output generated by the Licensed Model only for Non-Commercial Purposes and in accordance with this License.
	3. You may not use the Licensed Models or any of its Outputs in connection with:
	   1. any Commercial Purposes, unless agreed by Us under a separate licence;
	   2. to train, improve or otherwise influence the functionality or performance of any other third-party derivative model that is commercial or intended for a Commercial Purpose and is similar to the Licensed Models;
	   3. to create models distilled or derived from the Outputs of the Licensed Models, unless such models are for Non-Commercial Purposes and open-sourced under the same license as the Licensed Models; or
	   4. in violation of any applicable laws and regulations.

	### Architecture and Parameters
	ChatNT is built on a three‑module design: a 500M‑parameter [Nucleotide Transformer v2](https://www.nature.com/articles/s41592-024-02523-z) DNA encoder pre‑trained on genomes from 850 species
	(handling up to 12 kb per sequence; Dalla‑Torre et al., 2024), an English‑aware Perceiver Resampler that linearly projects 2048 DNA‑token embeddings and compresses them
	with gated attention into 64 task‑conditioned vectors (REF), and a frozen 7B‑parameter [Vicuna‑7B](https://lmsys.org/blog/2023-03-30-vicuna/) decoder.

	Users provide a natural‑language prompt containing one or more `<DNA>` placeholders and the corresponding DNA sequences (tokenized as 6‑mers).
	The projection layer inserts 64 resampled DNA embeddings at each placeholder, and the Vicuna decoder generates free‑form English responses in
	an autoregressive fashion, using low‑temperature sampling to produce classification labels, multi‑label statements, or numeric values.
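
The figures above imply a fixed size budget per DNA sequence: up to 2048 encoder tokens go in, and 64 resampled vectors come out for each `<DNA>` placeholder. The sketch below only does that arithmetic. The function name `resampler_budget` and the constant names are hypothetical and are not part of the ChatNT code.

```python
# Sizes quoted in this README; the names are illustrative only.
DNA_TOKENS_PER_SEQUENCE = 2048        # DNA-token embeddings per sequence
RESAMPLED_VECTORS_PER_SEQUENCE = 64   # vectors inserted per <DNA> placeholder


def resampler_budget(num_dna_sequences: int) -> dict:
    """Return the embedding counts before and after resampling."""
    if num_dna_sequences < 0:
        raise ValueError("num_dna_sequences must be non-negative")
    return {
        "encoder_tokens": num_dna_sequences * DNA_TOKENS_PER_SEQUENCE,
        "resampled_vectors": num_dna_sequences * RESAMPLED_VECTORS_PER_SEQUENCE,
        "compression_ratio": DNA_TOKENS_PER_SEQUENCE / RESAMPLED_VECTORS_PER_SEQUENCE,
    }


budget = resampler_budget(2)
print(budget)
```

With two DNA sequences in a prompt, the decoder therefore receives 128 DNA-derived vectors in place of 4096 encoder embeddings, a 32-fold compression.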

	### Training Data
	ChatNT was instruction‑tuned on a unified corpus covering 27 diverse tasks from DNA, RNA and proteins, spanning multiple species, tissues and biological processes.
	This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for 2 billion instruction tokens.
	Examples of questions and sequences for each task, as well as additional task information, can be found in [Datasets_overview.csv](Datasets_overview.csv).

	### Tokenization
	DNA inputs are broken into non-overlapping 6‑mer tokens and padded or truncated to 2048 tokens (≈ 12 kb). English prompts and
	outputs use the LLaMA tokenizer, augmented with `<DNA>` as a special token to mark sequence insertion points.
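
As a rough illustration of the DNA side, the sketch below splits a sequence into 6‑mers and pads or truncates the result to 2048 tokens. It assumes a stride of 6 bases, which is what the 2048‑token ≈ 12 kb figure implies (2048 × 6 = 12,288). It also assumes that leftover bases become single‑nucleotide tokens and that the padding token is spelled `<pad>`. It omits any special tokens the real tokenizer may add. The function `kmer_tokenize` is hypothetical; for actual inference, use the `bio_tokenizer` shipped with the model.

```python
K = 6               # bases per token
MAX_TOKENS = 2048   # fixed DNA sequence length in tokens
PAD = "<pad>"       # assumed spelling of the padding token


def kmer_tokenize(sequence: str, k: int = K, max_tokens: int = MAX_TOKENS,
                  pad_token: str = PAD) -> list:
    """Split a DNA string into k-mers, then pad or truncate to max_tokens."""
    sequence = sequence.upper()
    # Full k-mers, taken with a stride of k (no overlap).
    tokens = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]
    # Assumption: bases that do not fill a k-mer become single-base tokens.
    tokens.extend(sequence[len(tokens) * k:])
    # Truncate long inputs, pad short ones.
    tokens = tokens[:max_tokens]
    tokens += [pad_token] * (max_tokens - len(tokens))
    return tokens


print(kmer_tokenize("ATCGGAAAA", max_tokens=6))
print(MAX_TOKENS * K)  # bases covered by a full-length input
```

Any input longer than 12,288 bases is cut at the token limit, so sequences beyond roughly 12 kb lose their tail under this scheme.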

	### Limitations and Disclaimer
	ChatNT can only handle questions related to the 27 tasks it was trained on, with DNA sequences in the same format as those used during training. ChatNT is not a clinical or diagnostic tool.
	It can produce incorrect or “hallucinated” answers, particularly on out‑of‑distribution inputs, and its numeric predictions may suffer digit‑level errors. Confidence
	estimates require post‑hoc calibration. Users should always validate critical outputs against experiments or specialized bioinformatics
	pipelines.

	### Other notes
	We also provide the parameters of the JAX version of ChatNT in `jax_params`.

	## How to use

	Until the next release of the `transformers` library, it must be installed from source with the following commands in order to use the model.
	PyTorch should also be installed.

	```
	pip install --upgrade git+https://github.com/huggingface/transformers.git
	pip install torch sentencepiece
	```

	The following snippet generates ChatNT answers with the high-level pipeline. The system prompt used to train ChatNT is already built into the pipeline,
	so it does not need to be included in the input. The prompt is:
	"A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful,
	detailed, and polite answers to the user's questions."

	```
	# Load pipeline
	from transformers import pipeline
	pipe = pipeline(model="InstaDeepAI/ChatNT", trust_remote_code=True)

	# Define custom inputs (the number of <DNA> tokens in the English sequence must equal len(dna_sequences))
	english_sequence = "Is there any evidence of an acceptor splice site in this sequence <DNA> ?"
	dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"]

	# Generate sequence
	generated_english_sequence = pipe(
	    inputs={
	        "english_sequence": english_sequence,
	        "dna_sequences": dna_sequences,
	    }
	)

	# Expected output: "Yes, an acceptor splice site is without question present in the sequence."
	```
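
Because the number of `<DNA>` placeholders must match the number of DNA sequences, it can help to check inputs before calling the pipeline. The helper below is a minimal sketch and is not part of the ChatNT repository. The function name `validate_chatnt_inputs` is hypothetical, and the accepted alphabet (`A`, `C`, `G`, `T`, `N`) is an assumption about what the DNA tokenizer handles.

```python
ALLOWED_BASES = set("ACGTN")  # assumed alphabet, not taken from the model card


def validate_chatnt_inputs(english_sequence: str, dna_sequences: list,
                           placeholder: str = "<DNA>") -> dict:
    """Check placeholder/sequence consistency and return pipeline inputs."""
    num_placeholders = english_sequence.count(placeholder)
    if num_placeholders != len(dna_sequences):
        raise ValueError(
            f"Found {num_placeholders} {placeholder} placeholder(s) "
            f"but {len(dna_sequences)} DNA sequence(s)."
        )
    for index, sequence in enumerate(dna_sequences):
        if not sequence:
            raise ValueError(f"DNA sequence {index} is empty.")
        unexpected = set(sequence.upper()) - ALLOWED_BASES
        if unexpected:
            raise ValueError(
                f"DNA sequence {index} contains unexpected characters: "
                f"{sorted(unexpected)}"
            )
    # Same dictionary layout as the `inputs` argument used with the pipeline.
    return {"english_sequence": english_sequence,
            "dna_sequences": list(dna_sequences)}


inputs = validate_chatnt_inputs(
    "Is there any evidence of an acceptor splice site in this sequence <DNA> ?",
    ["ATCGGAAAAAGATCCAGAAAGTTATACC"],
)
print(inputs["dna_sequences"])
```

The returned dictionary can then be passed as `pipe(inputs=inputs)`.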

	The following snippet runs inference with the model directly, without the pipeline abstraction (low-level).

	```
	import numpy as np
	from transformers import AutoModel, AutoTokenizer

	# Load model and tokenizers
	model = AutoModel.from_pretrained("InstaDeepAI/ChatNT", trust_remote_code=True)
	english_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="english_tokenizer")
	bio_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="bio_tokenizer")

	# Define custom inputs (the number of <DNA> tokens in the English sequence must equal len(dna_sequences))
	# Here the English sequence must include the system prompt
	english_sequence = "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?"
	dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"]

	# Tokenize
	english_tokens = english_tokenizer(english_sequence, return_tensors="pt", padding="max_length", truncation=True, max_length=512).input_ids
	bio_tokens = bio_tokenizer(dna_sequences, return_tensors="pt", padding="max_length", max_length=512, truncation=True).input_ids.unsqueeze(0) # unsqueeze to simulate batch_size = 1

	# Predict
	outs = model(
	    multi_omics_tokens_ids=(english_tokens, bio_tokens),
	    projection_english_tokens_ids=english_tokens,
	    projected_bio_embeddings=None,
	)

	# Expected output: Dictionary of logits and projected_bio_embeddings
	```