Indian AI startup Sarvam AI has released OpenHathi-Hi-v0.1, the first Hindi large language model (LLM) in the OpenHathi series. Leveraging Meta AI’s Llama2-7B architecture, this model is positioned to deliver performance on par with the renowned GPT-3.5, specifically tailored for Indian languages.
Constructed with a 48,000-token extension of Llama2-7B’s tokenizer, OpenHathi-Hi-v0.1 undergoes a two-phase training process. The initial phase focuses on embedding alignment, aligning the randomly initialised Hindi embeddings with the pretrained ones. The subsequent phase, bilingual language modelling, trains the model to attend cross-lingually to tokens.
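The vocabulary-extension step can be sketched schematically. Everything below is an illustrative assumption — the token strings, vocabulary sizes, and function name are stand-ins, not Sarvam AI's actual artifacts:

```python
# Schematic sketch of extending a base tokenizer's vocabulary with new
# Hindi tokens, in the spirit of the OpenHathi-Hi-v0.1 description above.
# All names and sizes here are illustrative, not the real 32K/48K vocabularies.

base_vocab = {"<s>": 0, "</s>": 1, "the": 2, "model": 3}  # stand-in for Llama2-7B's vocab

new_hindi_tokens = ["नमस्ते", "भाषा", "मॉडल"]  # stand-in for the added Hindi pieces

def extend_vocab(vocab, new_tokens):
    """Append new tokens after the existing ids, keeping old ids stable
    so pretrained embeddings remain valid for the base vocabulary."""
    extended = dict(vocab)
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in extended:  # skip tokens the base vocab already covers
            extended[tok] = next_id
            next_id += 1
    return extended

extended_vocab = extend_vocab(base_vocab, new_hindi_tokens)
print(len(extended_vocab))  # 7: the original 4 ids plus 3 new Hindi ids
```

Keeping the original ids stable is what makes the first training phase possible: only the embeddings for the newly appended ids are randomly initialised and then aligned, while the pretrained English embeddings are left untouched.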
Sarvam AI asserts that OpenHathi-Hi-v0.1 exhibits comparable, if not superior, performance to GPT-3.5 across various Hindi tasks while maintaining proficiency in English. This marks a significant milestone for the startup, demonstrating its ability to build language models tailored to specific linguistic nuances.
Beyond standard Natural Language Generation (NLG) tasks, Sarvam AI evaluated OpenHathi-Hi-v0.1’s capabilities in real-world scenarios. The company’s focus on practical use underscores the model’s versatility and potential impact across diverse domains.
In a notable collaboration, Sarvam AI joined forces with KissanAI to refine its base model using conversational data gathered from a GPT-powered bot engaging with farmers in different languages. This strategic partnership demonstrates the startup’s dedication to refining and enhancing OpenHathi-Hi-v0.1 through real-world interactions, contributing to its adaptability and effectiveness in dynamic linguistic environments.
The startup, a mere five months old, has rapidly gained recognition and support in the AI landscape. Securing $41 million in a recent funding round led by Lightspeed Ventures, with contributions from Peak XV Partners and Khosla Ventures, Sarvam AI is positioned for continued growth and innovation.
To enhance OpenHathi-Hi-v0.1’s Hindi capabilities, Sarvam AI outlines steps such as reducing the tokenizer’s fertility score on Hindi text — the average number of tokens emitted per word — to improve efficiency. The company details the creation of a sentence-piece tokenizer from a subsample of 100K documents from the Sangraha corpus, in collaboration with AI4Bharat, resulting in a new tokenizer with a 48K vocabulary.
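Fertility — tokens emitted per word — can be measured in a few lines. The counts below are made-up illustrations of why a Hindi-aware vocabulary lowers the score; they are not measurements from the actual tokenizers:

```python
def fertility(token_counts, word_counts):
    """Average number of tokens emitted per word across a corpus sample."""
    return sum(token_counts) / sum(word_counts)

# Hypothetical per-sentence counts: a tokenizer with no Hindi vocabulary
# shatters Hindi words into many byte-level pieces, while a Hindi-aware
# vocabulary covers most words in one or two tokens. Numbers are illustrative.
words_per_sentence = [6, 9, 7]
base_tokens        = [31, 44, 36]  # base Llama2-style tokenizer (assumed counts)
extended_tokens    = [9, 13, 11]   # 48K extended tokenizer (assumed counts)

print(round(fertility(base_tokens, words_per_sentence), 2))      # high fertility
print(round(fertility(extended_tokens, words_per_sentence), 2))  # much closer to 1 token/word
```

A lower fertility score means fewer tokens per Hindi sentence, which translates directly into faster inference and a longer effective context window for Hindi text.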
Sarvam AI’s commitment to linguistic diversity and practical applications, together with its strategic partnerships and the technology underpinning OpenHathi-Hi-v0.1, positions the startup as a key player in large language models tailored to Hindi and other Indian languages. As Sarvam AI continues to evolve, the unveiling of OpenHathi-Hi-v0.1 sets a promising trajectory for AI-driven linguistic innovation.