India’s artificial intelligence landscape has reached a significant milestone with Sarvam AI’s launch of Sarvam-1, the country’s first homegrown multilingual large language model (LLM). The model, built specifically for Indian languages, represents a major breakthrough in making advanced AI technology accessible to India’s diverse linguistic population.
Developed with 2 billion parameters, Sarvam-1 supports 10 major Indian languages – Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu – alongside English. The model was built from scratch using domestic AI infrastructure powered by NVIDIA H100 Tensor Core GPUs, in collaboration with key partners including NVIDIA, Yotta, and AI4Bharat.
The innovation addresses two critical challenges in Indian language computing: token inefficiency and poor data quality. Sarvam-1’s tokenizer achieves fertility rates of 1.4 to 2.1 tokens per word, a significant improvement over existing models that often require 4-8 tokens per word for Indian languages. Lower fertility means shorter token sequences, which translates into faster and more efficient processing of Indian-language text.
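For illustration, here is a minimal sketch of how tokenizer fertility can be measured with the Hugging Face transformers library. The repository ID `sarvamai/sarvam-1` and the sample sentence are assumptions for the example, not figures from Sarvam’s own evaluation.

```python
# Minimal sketch: estimating tokenizer fertility (average tokens per word).
# Assumes the checkpoint is published on Hugging Face as "sarvamai/sarvam-1";
# verify the actual repository name on the Hub before running.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")

def fertility(sentences):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in sentences:
        total_tokens += len(tokenizer.tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words

# Hypothetical Hindi sample; a real measurement would use a large corpus.
samples = ["भारत एक विशाल और विविधतापूर्ण देश है।"]
print(f"Fertility: {fertility(samples):.2f} tokens per word")
```

A lower number here means fewer tokens per word, so the same text fits in fewer model steps.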
On the performance front, Sarvam-1 has demonstrated impressive results across various benchmarks. The model achieved an accuracy of 86.11 on the TriviaQA benchmark across Indic languages, substantially outperforming Llama-3.1 8B’s score of 61.47. Its performance on the IndicGenBench for cross-lingual tasks has also been noteworthy, achieving an average chrF++ score of 46.81 on Flores for English-to-Indic translation.
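As a reference for how the translation metric is computed, the sketch below scores a hypothetical output with sacreBLEU’s chrF++ (chrF with word order 2); the hypothesis and reference sentences are placeholders, not data from the IndicGenBench or Flores evaluations.

```python
# Minimal sketch: computing chrF++ with sacrebleu.
# The hypothesis/reference pair below is a placeholder, not benchmark data.
from sacrebleu.metrics import CHRF

chrf_pp = CHRF(word_order=2)  # word_order=2 turns chrF into chrF++

hypotheses = ["यह एक परीक्षण वाक्य है।"]
references = [["यह एक परीक्षण वाक्य है।"]]  # one reference stream

print(chrf_pp.corpus_score(hypotheses, references))
```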
The model’s training corpus, Sarvam-2T, comprises approximately 2 trillion tokens distributed across the supported languages. Hindi constitutes about 20% of the dataset, while the other Indic languages are represented in roughly equal shares. The corpus also includes substantial English and programming-language content, which supports strong performance on both monolingual and multilingual tasks.
Key Statistics:
- 2 billion parameters in the model
- 10 Indian languages supported plus English
- 1.4-2.1 tokens per word efficiency rate
- 4-6 times faster inference speed compared to larger models
- 86.11 accuracy on TriviaQA benchmark
- $41 million Series A funding secured in December 2023
The launch comes at a crucial time for India’s GenAI market, which is projected to grow at a CAGR of 48% between 2023 and 2030, potentially becoming a $17 billion opportunity. This development has significant implications for the Indian startup ecosystem, particularly in democratizing AI access across language barriers and establishing India’s capability to develop sophisticated AI models domestically.
Sarvam-1’s launch marks a turning point in India’s AI journey, demonstrating that carefully curated training data can yield superior performance even with modest parameter counts. With the model available on Hugging Face, developers and businesses can create language-inclusive applications, potentially transforming how millions of Indians interact with technology in their preferred languages.
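For developers who want to experiment, a minimal sketch of loading and prompting the model with transformers might look like the following. The repository ID `sarvamai/sarvam-1` is again an assumption, and as a base (non-instruction-tuned) model it is best suited to completion-style prompts.

```python
# Minimal sketch: generating text with the base model via transformers.
# Assumes the checkpoint is hosted on Hugging Face as "sarvamai/sarvam-1";
# check the actual repository name on the Hub before running.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvamai/sarvam-1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Completion-style prompt in Hindi (the base model is not instruction-tuned).
prompt = "भारत की राजधानी"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```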