The field of natural language processing (NLP) has seen significant advances in recent years, with post-training techniques playing a crucial role in refining language models. While proprietary models like OpenAI’s GPT-4 and Anthropic’s Claude lead the market, open-source alternatives often lag behind due to limited access to post-training data and methodologies. Tülu 3 addresses this gap with a fully open-source, state-of-the-art post-training framework that incorporates novel techniques and rigorous evaluation methods. In this article, we will look at the Tülu 3 405B model, including its training process and how to access it.
Learning Objectives
- Get familiar with the new open-source model – Tülu 3.
- Understand how the model works.
- Explore the four-stage post-training pipeline that Tülu 3 follows.
- Learn how to access the Tülu 3 405B AI chatbot.
- See how Tülu 3 performs in comparison to other existing models such as Llama 3.1 8B-Instruct.
What is Tülu 3?
Tülu 3 is the result of a collaboration between the Allen Institute for AI and the University of Washington, and it comes with complete transparency in post-training datasets, methodologies, and evaluation frameworks. Built on Llama 3.1 base models, Tülu 3 surpasses the performance of other instruct-tuned open models, even competing with closed models like GPT-4o-mini and Claude 3.5-Haiku.
Tülu 3 is designed to refine the capabilities of open-source language models across multiple skill areas, including:
- Knowledge recall (e.g., MMLU benchmarks)
- Reasoning (e.g., BigBenchHard, DROP)
- Mathematics (e.g., GSM8K, MATH dataset)
- Coding (e.g., HumanEval, CodeAlpaca)
- Instruction following (e.g., IFEval, AlpacaEval 2)
- Safety & compliance (e.g., Tülu 3 Safety suite)
Tülu 3 Data
Data plays a critical role in training and refining language models. Tülu 3 introduces a diverse and well-curated dataset that combines publicly available sources with synthetically generated data.
Data Sources
The dataset includes:
- Publicly available datasets (e.g., FLAN v2, Open Assistant, No Robots, WildChat)
- Skill-specific datasets (e.g., NuminaMath, SciRIFF, OpenMathInstruct)
- Synthetically generated datasets using a persona-driven approach for skills like math, coding, and instruction following (an illustrative example follows this list)
- Noncompliance & safety data (e.g., WildJailbreak, CoCoNot, WildGuardMix)
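To make the persona-driven idea concrete, here is a purely hypothetical sketch of what such a synthetic record might look like; the field names and content are illustrative assumptions and are not taken from the actual Tülu 3 data.

# Hypothetical persona-driven synthetic record (illustrative only, not from the real Tülu 3 dataset)
synthetic_example = {
    "persona": "a high-school math teacher preparing a geometry quiz",
    "prompt": "Write a word problem that requires the Pythagorean theorem, then solve it step by step.",
    "response": "A 13 m ladder leans against a wall with its base 5 m from the wall. "
                "By the Pythagorean theorem, the wall height is sqrt(13**2 - 5**2) = sqrt(144) = 12 m.",
}

The idea is that conditioning generation on many different personas yields diverse prompts and responses for the same target skill.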
Prompt Decontamination
A crucial step in ensuring model integrity is decontaminating training datasets to prevent test set contamination. The decontamination process involves 8-gram matching, ensuring that evaluation data does not overlap with training data. Several datasets (e.g., Evol CodeAlpaca, WildChat) were filtered and re-released with decontaminated samples.
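As a rough illustration of how 8-gram matching can work, here is a minimal sketch; the function names and the 50% overlap threshold are assumptions for illustration, not the exact Tülu 3 decontamination code.

# Illustrative 8-gram decontamination sketch (not the exact Tülu 3 implementation)
def ngrams(tokens, n=8):
    # Return the set of n-grams (as tuples) contained in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_text, eval_texts, n=8, threshold=0.5):
    # Flag a training example if a large fraction of its n-grams also appear in any evaluation example.
    train_grams = ngrams(train_text.lower().split(), n)
    if not train_grams:
        return False
    for eval_text in eval_texts:
        eval_grams = ngrams(eval_text.lower().split(), n)
        if len(train_grams & eval_grams) / len(train_grams) >= threshold:
            return True
    return False

# Keep only training prompts whose 8-gram overlap with the evaluation set stays below the threshold.
train_prompts = ["A train travels 60 km in 45 minutes; what is its average speed in km/h?"]
eval_prompts = ["What is the capital of France?"]
clean_prompts = [p for p in train_prompts if not is_contaminated(p, eval_prompts)]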
Training Process
Tülu 3 follows a four-stage post-training pipeline:
- Data Curation: Prompts are curated from various datasets and synthetically generated for specific skills, and the strict decontamination process described above is applied so that evaluation benchmarks do not leak into the training data.
- Supervised Finetuning (SFT): SFT trains the model using high-quality instruction-following data. Data mixing experiments were conducted to optimize performance across different tasks while maintaining generalization.
- Preference Finetuning (DPO): DPO is applied to fine-tune models using pairwise preference data. On-policy data is generated by comparing Tülu 3 completions against outputs from other models.
- Reinforcement Learning with Verifiable Rewards (RLVR): A novel RL-based approach, RLVR optimizes model performance by rewarding only verifiably correct answers. This method is particularly effective for tasks like math problem-solving and precise instruction-following; a minimal sketch of a preference record and a verifiable reward function follows this list.
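To make the last two stages more concrete, below is a minimal, hypothetical sketch of a DPO preference record and an RLVR-style verifiable reward function; the field names and the answer-checking logic are illustrative assumptions, not the exact Tülu 3 implementation.

# Hypothetical DPO preference record: training nudges the model toward "chosen" over "rejected".
preference_example = {
    "prompt": "What is 17 * 24?",
    "chosen": "17 * 24 = 408.",
    "rejected": "17 * 24 = 398.",
}

# RLVR-style verifiable reward: grant a reward only when the answer can be checked programmatically.
import re

def verifiable_math_reward(completion, ground_truth):
    # Return 1.0 if the last number in the completion equals the ground-truth answer, else 0.0.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

print(verifiable_math_reward("17 * 24 = 408.", "408"))   # 1.0 (verifiably correct)
print(verifiable_math_reward("The answer is 398.", "408"))  # 0.0 (incorrect, no reward)

Because the reward depends only on an automatic check rather than a learned reward model, it is hard for the policy to game, which is what makes RLVR well suited to math and precise instruction-following tasks.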
Evaluation Process
Tülu 3 introduces Tülu 3 Eval, a standardized and transparent evaluation framework. The evaluation suite consists of:
- Development evaluations – Used to guide model improvement during training.
- Unseen evaluations – Held-out tests to measure overfitting and generalization.
- Safety evaluations – Assess compliance and robustness to adversarial prompts.
The evaluation suite is based on benchmarks like MMLU, GSM8K, BigBenchHard, HumanEval, and AlpacaEval 2. All evaluations and decontamination tools are open-sourced for reproducibility.
How to Get Started with Llama-3.1-Tulu-3-405B
Tülu 3 is an advanced instruction-following model family. Below are steps to start using the Llama-3.1-Tulu-3-405B model:
Step 1. Loading the Model with Hugging Face Transformers
To load the model with the Hugging Face transformers library, use the following Python snippet:
from transformers import AutoModelForCausalLM
tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-405B")
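Note that the 405B checkpoint needs multi-GPU hardware; for a quick sanity check you may want to load a smaller Tülu 3 variant instead. A minimal generation sketch, with illustrative prompt and generation settings, might look like this:

# Minimal generation sketch (assumes enough GPU memory and the accelerate package; swap in a smaller Tülu 3 checkpoint to test locally)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3-405B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Format the conversation with the model's chat template and generate a short reply.
messages = [{"role": "user", "content": "How are you doing?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))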
Step 2. Running with vLLM
Because Tülu 3 is built on a Llama base model, it can easily be served with vLLM:
vllm serve allenai/Llama-3.1-Tulu-3-405B --max_model_len=8192
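Once the server is running, vLLM exposes an OpenAI-compatible endpoint (by default at http://localhost:8000/v1), which you can query from Python; the api_key value below is just a placeholder since no key is required locally.

# Query the locally served model through vLLM's OpenAI-compatible API (default port 8000)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="allenai/Llama-3.1-Tulu-3-405B",
    messages=[{"role": "user", "content": "How are you doing?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)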
Step 3. Using the Chat Template
The chat template for the model follows this format:
<|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
Or with expanded new lines:
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
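In practice, you rarely need to write this template by hand; the tokenizer can render it for you. A short illustrative sketch:

# Render the chat template as a string to inspect the exact prompt format the model expects
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-405B")
messages = [{"role": "user", "content": "How are you doing?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # Shows the <|user|>/<|assistant|> structure applied automatically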
Results & Comparisons
Tülu 3 achieves state-of-the-art results among open-weight models, outperforming models like Llama 3.1 Instruct, Mistral, and Qwen 2.5 Instruct. At the 70B model scale, Tülu 3 even rivals Claude 3.5 Haiku and GPT-4o-mini. Key results include:
- Tülu 3-70B surpasses Llama 3.1 70B Instruct and Nous Hermes 3
- Tülu 3-8B outperforms Qwen 2.5 7B and Ministral 8B
- Tülu 3-405B competes with DeepSeek V3 and GPT-4o (the November 2024 version)
Key Contributions of Tülu 3
Tülu 3 represents a major advancement in open language model post-training by introducing:
- Open-source datasets, code, and training recipes, enabling full transparency and reproducibility.
- Advanced decontamination strategies to prevent data leakage and ensure fair evaluations.
- Scalable preference tuning methodology, leveraging on-policy data for better alignment.
- Reinforcement Learning with Verifiable Rewards (RLVR), a novel RL training method that ensures correctness in verifiable tasks.
- Robust evaluation framework, providing reproducible benchmarks and safety assessments.
Conclusion
Tülu 3 establishes a new benchmark for open-weight language models, demonstrating that open-source models can rival proprietary solutions. With full access to model weights, training code, evaluation tools, and datasets, Tülu 3 lays the foundation for future advancements in post-training research.
Future work includes scaling the methodology to larger models, improving multimodal capabilities, and further optimizing RLVR techniques. The Tülu 3 release marks a significant milestone in the open AI community, enabling further innovation and research in large-scale language model post-training.
Key Takeaways
- Tülu 3 is an open-source post-training framework competing with proprietary models like GPT-4o-mini and Claude 3.5 Haiku.
- It follows a four-stage post-training pipeline: Data Curation, Supervised Fine-Tuning (SFT), Preference Fine-Tuning (DPO), and Reinforcement Learning with Verifiable Rewards (RLVR).
- The model is trained using diverse datasets, including public sources, skill-specific data, and synthetic persona-driven data, with strict decontamination to prevent test contamination.
- Tülu 3 outperforms several open-weight models, with the 70B version surpassing Llama 3.1 70B Instruct and Nous Hermes 3, and the 405B version competing with DeepSeek V3 and GPT-4o.
- The project promotes full transparency by open-sourcing datasets, training code, and evaluation tools, laying the foundation for future research in open-source AI.
Frequently Asked Questions
Q. What is Tülu 3?
A. Tülu 3 is an open-source post-training framework designed to enhance language models through supervised finetuning, preference tuning, and reinforcement learning.
Q. What is Reinforcement Learning with Verifiable Rewards (RLVR)?
A. Reinforcement Learning with Verifiable Rewards (RLVR) optimizes models using rewards granted only for verifiably correct outputs, improving accuracy in structured tasks like mathematics and instruction-following.
Q. Can I fine-tune Tülu 3 for my own use case?
A. Yes, all datasets, model weights, and training recipes are open-source, allowing users to fine-tune Tülu 3 for specific needs.
Q. How does Tülu 3 compare to proprietary models?
A. Tülu 3 competes closely with proprietary models like GPT-4o-mini and Claude 3.5-Haiku, achieving strong performance in various benchmarks.
Q. Where can I access Tülu 3?
A. You can find Tülu 3 models, code, and datasets on Hugging Face and GitHub.