🚀 Excited to share our technical report on the Southeast Asian multilingual model Sailor2 and its latest updates!
Our 49-page report details Sailor2's development journey, covering multilingual data cleaning, small-model data-mixture simulations, multi-stage continual pre-training, multi-stage post-training, and multicultural, multilingual evaluation. Sailor2 aims to make efficient multilingual pre-training accessible to the community.
🧭 We highlight Sailor2's impressive performance in low-resource language translation scenarios and its cultural understanding advantages in Southeast Asia, promoting practical applications for regional languages.
Model updates include:
💡 More precise outputs: Reduced redundancy in model outputs through refined post-training data and optimization techniques.
📚 Handling longer texts: Expanded to handle up to 128K context length in Southeast Asian languages through long-text training.
⚡️ Faster inference: Achieved 2.5x faster inference speed with speculative decoding.
🌪️ More model sizes: Introduced new sizes of 3B and 14B through model pruning.
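Speculative decoding, the technique behind the speedup above, has a small draft model cheaply propose several tokens that the large target model then verifies in a single pass, keeping the longest agreeing prefix plus one corrected token. A minimal toy sketch of the accept/reject loop (the `draft_model` and `target_model` functions here are deterministic stand-ins, not Sailor2's actual models):

```python
# Toy sketch of speculative decoding (illustration only, not Sailor2's code).
# A cheap draft model proposes k tokens; the expensive target model verifies
# them, keeping the longest prefix it agrees with, plus one corrected token.

def draft_model(prefix, k):
    # Hypothetical cheap model: guesses the next k tokens as a simple count-up.
    return [(prefix[-1] + 1 + i) % 10 for i in range(k)]

def target_model(prefix):
    # Hypothetical expensive model: mostly counts up, but resets after a 5,
    # so the draft is sometimes wrong and a proposal gets rejected.
    return 0 if prefix[-1] == 5 else (prefix[-1] + 1) % 10

def speculative_step(prefix, k=4):
    """One decoding step: propose k draft tokens, verify with the target."""
    accepted = []
    for tok in draft_model(prefix, k):
        if tok == target_model(prefix + accepted):
            accepted.append(tok)  # draft and target agree: accepted for free
        else:
            accepted.append(target_model(prefix + accepted))  # target's fix
            break  # stop at the first disagreement
    return accepted

seq = [0]
for _ in range(3):
    seq += speculative_step(seq)
print(seq)  # → [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4]
```

In the real setting, the "free" accepted tokens are what yield the reported ~2.5x speedup: the target model scores all draft proposals in one forward pass instead of one pass per token, and output quality is unchanged because every kept token is one the target model agrees with.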
📜 All models are Apache-licensed for commercial use; development tools (code, resources) are open-source.
📖 Technical report: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs (2502.12982)
🛤️ Models: sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b
💬 Demo: sail/Sailor2-20B-Chat
📣 Sailor2 community: sailor2