Sarvam AI introduced Sarvam-1, a Large Language Model (LLM) built specifically for Indian languages, on October 24, 2024. The model is notable for having been trained entirely on domestic infrastructure and is described as India’s first indigenous multilingual LLM.
With roughly 2 billion parameters and support for ten major Indian languages in addition to English, Sarvam-1 aims to advance AI capabilities in a linguistically diverse nation such as India.
What’s New:
Sarvam-1 is a significant development in artificial intelligence, especially for Indian languages. It supports ten Indian languages: Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Oriya, Punjabi, and Hindi.
The model was trained on the Sarvam-2T dataset, which comprises around 2 trillion tokens and was curated to raise the quality of training data for Indic languages. Training was carried out on domestic AI infrastructure powered by NVIDIA’s H100 GPUs.
Key Insight:
One of the most impressive features of Sarvam-1 is its token efficiency. Many existing models break words in Indian languages into 4 to 8 tokens each; Sarvam-1, by contrast, averages 1.4 to 2.1 tokens per word. This means it can process the same text with far fewer tokens, and therefore more quickly, than its predecessors. Sarvam AI also claims the model outperforms larger models such as Meta’s Llama-3.2-3B on several benchmarks while maintaining competitive performance.
The model can be downloaded from the Hugging Face 🤗 Hub.
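As a rough illustration of the tokens-per-word comparison above, the sketch below measures how many tokens a tokenizer produces per word of input. The Hugging Face model ID and the `AutoTokenizer` loading path shown in the comments are assumptions, not confirmed by this article; the stand-in tokenizer lets the snippet run without downloading anything.

```python
# Minimal sketch: measuring tokens-per-word for a tokenizer.
# In practice you would load Sarvam-1's tokenizer from the Hub, e.g.:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")  # assumed model ID
#   tokenize = tok.tokenize

def tokens_per_word(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Stand-in tokenizer: splits each word into chunks of up to 3 characters,
# mimicking how subword tokenizers fragment words they were not trained on.
def naive_subword_tokenize(text, chunk=3):
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

sample = ["Sarvam-1 supports ten Indian languages"]
print(round(tokens_per_word(naive_subword_tokenize, sample), 2))  # → 2.4
```

Running the same measurement with two real tokenizers on the same Indic-language corpus is how a claim like "1.4–2.1 tokens per word versus 4–8" would be verified in practice.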
How This Works:
The development of Sarvam-1 involved addressing two major challenges: token inefficiency and poor data quality in Indic languages. By using synthetic data generation techniques, Sarvam AI built a robust training corpus that supports better performance on tasks such as cross-lingual translation and question answering. The model’s architecture allows it to process language more efficiently, making it suitable for practical applications across different devices.
Result:
Sarvam-1 has demonstrated strong performance on industry benchmarks such as MMLU, ARC-Challenge, and IndicGenBench. It achieved an accuracy of 86.11 on the TriviaQA benchmark across Indic languages, significantly higher than larger models such as Llama-3.1 8B. Moreover, it is reported to run inference 4 to 6 times faster than larger models, making it particularly well suited to real-time applications.
Why This Matters:
The launch of Sarvam-1 is crucial for several reasons:
- Inclusivity: It makes advanced AI technology accessible to speakers of diverse Indian languages.
- Efficiency: The improved token efficiency can lead to faster processing times in applications like chatbots and translation services.
- Local Development: By developing this model domestically, Sarvam AI contributes to India’s growing tech ecosystem and reduces reliance on foreign technology.
This development aligns with India’s ambition to become a leader in AI innovation tailored to its unique linguistic landscape.
We’re Thinking:
The launch of Sarvam-1 could reshape how AI interacts with Indian languages. Its open-source availability on platforms such as Hugging Face should motivate researchers and developers to explore applications and improve the model, and its success may encourage similar initiatives in other linguistically diverse regions of the world.