Sunday, March 2, 2025

A Multilingual VLM by Krutrim AI Labs


India is steadily progressing in the field of artificial intelligence, demonstrating notable growth and innovation. Krutrim AI Labs, a part of the Ola Group, is one of the organizations actively contributing to this progress. Krutrim recently introduced Chitrarth-1, a Vision Language Model (VLM) developed specifically for India’s diverse linguistic and cultural landscape. The model supports 10 major Indian languages, including Hindi, Tamil, Bengali, and Telugu, along with English, addressing the varied needs of the country. This article explores Chitrarth-1 and India’s expanding capabilities in AI.

What is Chitrarth?

Chitrarth (derived from Chitra: Image and Artha: Meaning) is a 7.5 billion-parameter VLM that combines cutting-edge language and vision capabilities. Developed to serve India’s linguistic diversity, it supports 10 prominent Indian languages – Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese – alongside English.

This model is a testament to Krutrim’s mission: creating AI “for our country, of our country, and for our citizens.”

By leveraging a culturally rich and multilingual dataset, Chitrarth minimizes biases, enhances accessibility, and ensures robust performance across Indic languages and English. It stands as a step toward equitable AI advancements, making technology inclusive and representative for users in India and beyond.

The research behind Chitrarth-1 has been presented in academic papers such as “Chitrarth: Bridging Vision and Language for a Billion People” (NeurIPS) and “Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation” (Ninth Conference on Machine Translation).


Chitrarth Architecture and Parameters

Chitrarth builds on the Krutrim-7B LLM as its backbone, augmented by a vision encoder based on the SigLIP (siglip-so400m-patch14-384) model. Its architecture includes:

  • A pretrained SigLIP vision encoder to extract image features.
  • A trainable linear mapping layer that projects these features into the LLM’s token space.
  • Fine-tuning with instruction-following image-text datasets for enhanced multimodal performance.

This design ensures seamless integration of visual and linguistic data, enabling Chitrarth to excel in complex reasoning tasks.
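The adapter design above can be sketched in a few lines of plain Python. This is an illustrative toy, not the real model: the dimensions, weights, and function names here are invented stand-ins, and the actual projector operates on SigLIP patch features inside a deep-learning framework.

```python
# Toy sketch of the adapter idea: a frozen vision encoder emits patch feature
# vectors, and a small trainable linear layer projects them into the LLM's
# token-embedding space. All sizes and values below are illustrative.

import random

VISION_DIM = 4   # stand-in for SigLIP's feature width
TOKEN_DIM = 6    # stand-in for the LLM's embedding width

random.seed(0)
# Trainable projection: weight matrix W (TOKEN_DIM x VISION_DIM) and bias b.
W = [[random.uniform(-0.1, 0.1) for _ in range(VISION_DIM)]
     for _ in range(TOKEN_DIM)]
b = [0.0] * TOKEN_DIM

def project(patch_features):
    """Map vision feature vectors to pseudo token embeddings via W·x + b."""
    tokens = []
    for feat in patch_features:
        tokens.append([sum(w * f for w, f in zip(row, feat)) + bias
                       for row, bias in zip(W, b)])
    return tokens

# Two fake image patches from the (frozen) vision encoder:
patches = [[1.0, 0.5, -0.2, 0.3], [0.0, -1.0, 0.7, 0.1]]
visual_tokens = project(patches)
print(len(visual_tokens), len(visual_tokens[0]))  # prints: 2 6
```

In the real model, these projected “visual tokens” are concatenated with text token embeddings and fed to the LLM, which is why only the lightweight mapping layer needs training in the first stage.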

Training Data and Methodology

Chitrarth’s training process unfolds in two stages, utilizing a diverse, multilingual dataset:

Stage 1: Adapter Pre-Training (PT)

  • Pre-trained on a carefully selected dataset, translated into multiple Indic languages using an open-source model.
  • Maintains a balanced split between English and Indic languages to ensure linguistic diversity and equitable performance.
  • Prevents bias toward any single language, optimizing for computational efficiency and robust capabilities.

Stage 2: Instruction Tuning (IT)

  • Fine-tuned on a complex instruction dataset to boost multimodal reasoning.
  • Incorporates an English-based instruction-tuning dataset and its multilingual translations.
  • Includes a vision-language dataset with academic tasks and culturally diverse Indian imagery, such as:
    • Prominent personalities
    • Monuments
    • Artwork
    • Culinary dishes
  • Features high-quality proprietary English text data, ensuring balanced representation across domains.

This two-step process equips Chitrarth to handle sophisticated multimodal tasks with cultural and linguistic nuance.
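The two-stage recipe can be summarized as a small configuration sketch: Stage 1 trains only the projector on translated caption data, while Stage 2 also fine-tunes the LLM on instruction data. The module names, the exact 50/50 split, and the dataset labels here are assumptions for illustration, not Krutrim's published configuration.

```python
# Illustrative two-stage training configuration for an adapter-based VLM.
# Which modules are frozen vs. trainable per stage is the key idea; the
# specific names and ratios are placeholders.

STAGES = [
    {
        "name": "adapter_pretraining",
        "trainable": {"vision_encoder": False, "projector": True, "llm": False},
        # Balanced English/Indic split, per the article's description.
        "data_mix": {"english_captions": 0.5, "indic_translated_captions": 0.5},
    },
    {
        "name": "instruction_tuning",
        "trainable": {"vision_encoder": False, "projector": True, "llm": True},
        "data_mix": {"english_instructions": None,
                     "multilingual_instructions": None,
                     "india_centric_imagery": None},  # ratios unpublished
    },
]

def trainable_modules(stage):
    """Return the sorted names of modules updated during this stage."""
    return sorted(m for m, on in stage["trainable"].items() if on)

for stage in STAGES:
    print(stage["name"], "->", trainable_modules(stage))
```

Freezing the vision encoder throughout keeps training cheap, while unfreezing the LLM only in Stage 2 lets the model learn instruction-following behavior after the adapter has aligned the two modalities.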


Performance and Evaluation

Chitrarth has been rigorously evaluated against state-of-the-art VLMs such as IDEFICS 2 (7B) and PALO 7B, outperforming them on several benchmarks while remaining competitive on tasks like TextVQA and VizWiz. It also surpasses LLaMA 3.2 11B Vision Instruct on key metrics.

BharatBench: A New Standard

Krutrim introduces BharatBench, a comprehensive evaluation suite for 10 under-resourced Indic languages across three tasks. Chitrarth’s performance on BharatBench sets a baseline for future research, showcasing its unique ability to handle all included languages. Below are sample results:

Language     POPE     LLaVA-Bench     MMVet
Telugu       79.90    54.80           43.76
Hindi        78.68    51.50           38.85
Bengali      83.24    53.70           33.24
Malayalam    85.29    55.50           25.36
Kannada      85.52    58.10           46.19
English      87.63    67.90           30.49
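Re-keying the sample results above as a dictionary makes them easy to compare programmatically, for example to average each task across languages. The scores are copied from the table in this article; the helper function is illustrative.

```python
# Chitrarth's sample BharatBench scores, as reported in this article.
scores = {
    "Telugu":    {"POPE": 79.90, "LLaVA-Bench": 54.8, "MMVet": 43.76},
    "Hindi":     {"POPE": 78.68, "LLaVA-Bench": 51.5, "MMVet": 38.85},
    "Bengali":   {"POPE": 83.24, "LLaVA-Bench": 53.7, "MMVet": 33.24},
    "Malayalam": {"POPE": 85.29, "LLaVA-Bench": 55.5, "MMVet": 25.36},
    "Kannada":   {"POPE": 85.52, "LLaVA-Bench": 58.1, "MMVet": 46.19},
    "English":   {"POPE": 87.63, "LLaVA-Bench": 67.9, "MMVet": 30.49},
}

def mean(values):
    return sum(values) / len(values)

# Average score per task across all six listed languages:
for task in ("POPE", "LLaVA-Bench", "MMVet"):
    print(task, round(mean([s[task] for s in scores.values()]), 2))
```

A quick scan of the averages shows the pattern visible in the table: POPE scores cluster in the high 70s to high 80s across languages, while MMVet varies much more widely.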


How to Access Chitrarth?

# Clone the repository and set up a Python 3.10 environment
git clone https://github.com/ola-krutrim/Chitrarth.git
conda create --name chitrarth python=3.10
conda activate chitrarth
cd Chitrarth
pip install -e .

# Run inference on a sample image
python chitrarth/inference.py --model-path "krutrim-ai-labs/Chitrarth" --image-file "assets/govt_school.jpeg" --query "Explain the image."
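If you prefer to drive the model from Python, a thin wrapper around the repository's `inference.py` command shown above is one option. This sketch only reproduces the flags from that command; it assumes you run it from the repo root inside the activated environment, and does not assume any other flags or APIs.

```python
# Minimal wrapper around the repo's CLI shown above, built with subprocess.
import subprocess

def build_command(image_file, query, model_path="krutrim-ai-labs/Chitrarth"):
    """Assemble the inference.py invocation as an argument list."""
    return ["python", "chitrarth/inference.py",
            "--model-path", model_path,
            "--image-file", image_file,
            "--query", query]

def run_chitrarth(image_file, query):
    # check=True raises CalledProcessError if the script exits non-zero.
    result = subprocess.run(build_command(image_file, query),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example (requires the cloned repo and installed environment):
# print(run_chitrarth("assets/govt_school.jpeg", "Explain the image."))
print(build_command("assets/govt_school.jpeg", "Explain the image."))
```

Keeping the command assembly in its own function makes it easy to inspect or log the exact invocation before launching a potentially slow inference run.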

Chitrarth-1 Examples

1. Image Analysis

2. Image Caption Generation

3. UI/UX Screen Analysis


End Note

A part of the Ola Group, Krutrim is dedicated to creating the AI computing stack of tomorrow. Alongside Chitrarth, its offerings include GPU as a Service, AI Studio, Ola Maps, Krutrim Assistant, Language Labs, Krutrim Silicon, and Contact Center AI. With Chitrarth-1, Krutrim AI Labs sets a new standard for inclusive, culturally aware AI, paving the way for a more equitable technological future.


Hello, I am Nitika, a tech-savvy Content Creator and Marketer. Creativity and learning new things come naturally to me. I have expertise in creating result-driven content strategies. I am well versed in SEO Management, Keyword Operations, Web Content Writing, Communication, Content Strategy, Editing, and Writing.
