Difference between `<image>`, `<img>`, and `<im_start>`

by mbrhd - opened Feb 20, 2025

Feb 20, 2025

What exactly is the difference between these tokens: <image>, <img>, and <im_start>? According to the model card, they all seem to be related to images. However, the code snippet uses <image> in the prompt, while the <img> token was used for training.

czczup

OpenGVLab org Mar 3, 2025

The differences are as follows:

<image>: This is used as a placeholder in the prompt. In the code, it will eventually be replaced by a sequence that starts with <img>, followed by several <IMG_CONTEXT> tokens (which act as placeholders for the actual visual tokens produced by a Vision Transformer), and ends with </img>.
<img> and </img>: These tokens mark the start and end of an image, respectively. They encapsulate the visual context (i.e., the <IMG_CONTEXT> tokens) that represents the processed image.
<|im_start|>: This token is part of the ChatML template and is not directly related to image processing. It is used for formatting or structuring the dialogue rather than representing any image data.
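To make the expansion of <image> concrete, here is a minimal sketch of the replacement step. The function name `expand_image_placeholder` and the parameters `tokens_per_tile` and `num_tiles` are hypothetical names chosen for illustration, and the number of <IMG_CONTEXT> tokens per image depends on the model configuration, so the value used here is only an assumption.

```python
IMG_START = "<img>"
IMG_END = "</img>"
IMG_CONTEXT = "<IMG_CONTEXT>"


def expand_image_placeholder(prompt: str, tokens_per_tile: int, num_tiles: int = 1) -> str:
    """Replace the first <image> placeholder with <img> + N x <IMG_CONTEXT> + </img>.

    tokens_per_tile and num_tiles are illustrative parameters (assumptions),
    not values taken from the model card.
    """
    image_tokens = IMG_START + IMG_CONTEXT * (tokens_per_tile * num_tiles) + IMG_END
    # Only the first occurrence is replaced, so a prompt with several images
    # would call this once per image.
    return prompt.replace("<image>", image_tokens, 1)


# Tiny value so the result is readable; real models use far more context tokens.
expanded = expand_image_placeholder("<image>\nDescribe the picture.", tokens_per_tile=4)
print(expanded)
```

After this step the <IMG_CONTEXT> positions are the ones whose embeddings get overwritten with the visual features, which is why the user-facing prompt only ever needs <image>.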

In summary, <image> is a higher-level placeholder that gets expanded into a specific image token structure (<img>... </img> with visual tokens), while <|im_start|> is a formatting symbol for the chat interface unrelated to image tokens.
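As a rough illustration of why <|im_start|> is unrelated to images, the sketch below wraps messages in the generic ChatML layout. The helper `to_chatml` is a hypothetical name, and the exact template this model uses (for example, any system prompt) is not shown in this thread, so treat it as an assumption.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Wrap each message as <|im_start|>role\\ncontent<|im_end|> (generic ChatML layout)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # Open the assistant turn so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


text_only = to_chatml([{"role": "user", "content": "Hello"}])
with_image = to_chatml([{"role": "user", "content": "<image>\nDescribe the picture."}])
print(text_only)
print(with_image)
```

Both conversations get the same <|im_start|> and <|im_end|> markers whether or not the content contains an <image> placeholder; the markers only delimit dialogue turns.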

czczup changed discussion status to closed Mar 3, 2025
