As part of the ongoing #OpenSourceWeek, DeepSeek announced the release of DeepGEMM, a cutting-edge library designed for efficient FP8 General Matrix Multiplications (GEMMs). This library is tailored to support both dense and Mixture-of-Experts (MoE) GEMMs, making it a powerful tool for V3/R1 training and inference. With DeepGEMM, DeepSeek aims to push the boundaries of performance and efficiency in AI workloads, furthering its commitment to advancing open-source innovation in the field.
This release marks Day 3 of DeepSeek's Open Source Week celebrations, following the successful launches of DeepSeek FlashMLA on Day 1 and DeepSeek DeepEP on Day 2.
What is GEMM?
General Matrix Multiplication (GEMM) is an operation that multiplies two matrices and accumulates the result into a third matrix. It is a fundamental operation in linear algebra, widely used in various applications. Its formula is

C ← αAB + βC

where A and B are the input matrices, C is the output matrix, and α and β are scalar coefficients.
GEMM is critical for optimizing model performance. It is particularly important in deep learning, where it accounts for most of the computation in both training and inference of neural networks.
This image depicts GEMM (General Matrix Multiplication), showing matrices A, B, and the resulting C. It highlights tiling, dividing matrices into smaller blocks (Mtile, Ntile, Ktile) for optimized cache usage. The blue and yellow tiles illustrate the multiplication process, contributing to the green “Block_m,n” tile in C. This technique improves performance by enhancing data locality and parallelism.
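The tiling idea described above can be sketched in a few lines of NumPy. This is a minimal illustration of blocked matrix multiplication, not DeepGEMM's actual kernel: the result C is accumulated block by block over tiles of the M, N, and K dimensions.

```python
import numpy as np

def tiled_gemm(A, B, tile=2):
    """Blocked matrix multiply: accumulate C tile by tile for better data locality."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(0, M, tile):          # rows of C (Mtile)
        for j in range(0, N, tile):      # columns of C (Ntile)
            for k in range(0, K, tile):  # reduction dimension (Ktile)
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 6)), rng.standard_normal((6, 4))
assert np.allclose(tiled_gemm(A, B), A @ B)
```

Real GPU kernels apply the same decomposition, but choose tile sizes to fit shared memory and registers and run the tiles in parallel.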
What is FP8?
FP8, or 8-bit floating point, is a format designed for high-performance computing that trades precision for a compact representation of real-valued data. Large datasets can impose a heavy computational load in machine learning and deep learning applications; this is where FP8 plays a vital role, reducing both memory traffic and computational cost.
The FP8 format comes in two common variants:
- E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits
- E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits
This compact representation allows for faster computations and reduced memory usage, making it ideal for training large models on modern hardware. The trade-off is a potential loss of precision, but in many deep learning scenarios, this loss is acceptable and can even lead to improved performance due to reduced computational load.
This image illustrates FP8 (8-bit Floating Point) formats, specifically E4M3 and E5M2, alongside FP16 and BF16 for comparison. It shows how FP8 representations allocate bits for sign, exponent, and mantissa, affecting precision and range. E4M3 uses 4 exponent bits and 3 mantissa bits, while E5M2 uses 5 and 2 respectively. The image highlights the trade-offs in precision and range between different floating-point formats, with FP8 offering reduced precision but lower memory footprint.
Need for DeepGEMM
DeepGEMM addresses the challenges in Matrix Multiplication by providing a lightweight, high-performance library that is easy to use and flexible enough to handle a variety of GEMM operations.
- Addresses a Critical Need: DeepGEMM fills a gap in the AI community by providing optimized FP8 GEMM.
- High-Performance and Lightweight: It offers fast computation with a small memory footprint.
- Supports Dense and MoE Layouts: It’s versatile, handling both standard and Mixture-of-Experts model architectures.
- Essential for Large-Scale AI: Its efficiency is crucial for training and running complex AI models.
- Optimizes MoE Architectures: DeepGEMM implements specialized GEMM types (contiguous-grouped, masked-grouped) for MoE efficiency.
- Enhances DeepSeek’s Models: It directly improves the performance of DeepSeek’s AI models.
- Benefits the Global AI Ecosystem: By offering a highly efficient tool, it aids AI developers worldwide.
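The contiguous-grouped MoE layout mentioned above can be illustrated with a toy NumPy sketch. This is a deliberate simplification under assumed shapes, not DeepGEMM's implementation: tokens routed to each expert are stored contiguously, and each expert runs one dense GEMM over its own slice.

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, hidden, out_dim = 3, 8, 4
# One weight matrix per expert, as in an MoE feed-forward layer.
W = rng.standard_normal((num_experts, hidden, out_dim))

# Contiguous-grouped layout: tokens sorted by expert, with per-group sizes.
group_sizes = [5, 2, 3]                       # tokens assigned to experts 0, 1, 2
tokens = rng.standard_normal((sum(group_sizes), hidden))

out = np.empty((tokens.shape[0], out_dim))
start = 0
for e, size in enumerate(group_sizes):
    # One dense GEMM per expert over its contiguous token slice.
    out[start:start + size] = tokens[start:start + size] @ W[e]
    start += size
print(out.shape)  # (10, 4)
```

Specialized grouped kernels fuse this loop into a single launch, which avoids paying per-GEMM overhead for every expert.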
Key Features of DeepGEMM
DeepGEMM stands out with its impressive features:
- High Performance: Achieving up to 1350+ FP8 TFLOPS on NVIDIA Hopper GPUs, DeepGEMM is optimized for speed and efficiency.
- Lightweight Design: The library has no heavy dependencies and is written to be as clean and readable as a tutorial, keeping the focus on core functionality rather than elaborate setup.
- Just-In-Time Compilation: DeepGEMM is fully Just-In-Time (JIT) compiled, building all kernels at runtime. This removes the need for a lengthy compile step at install time and lets users concentrate on the actual implementation.
- Concise Core Logic: With core logic comprising approximately 300 lines of code, DeepGEMM outperforms many expert-tuned kernels across a wide range of matrix sizes. This compact design not only facilitates easier understanding and modification but also ensures high efficiency.
- Support for Diverse Layouts: The library supports both dense layouts and two types of MoE layouts, catering to different computational needs.
Performance Metrics
DeepGEMM has been rigorously tested across various matrix shapes, demonstrating significant speedups compared to existing implementations. Below is a summary of performance metrics:
| M | N | K | Computation | Memory Bandwidth | Speedup |
|---|---|---|---|---|---|
| 64 | 2112 | 7168 | 206 TFLOPS | 1688 GB/s | 2.7x |
| 128 | 7168 | 2048 | 510 TFLOPS | 2277 GB/s | 1.7x |
| 4096 | 4096 | 7168 | 1304 TFLOPS | 500 GB/s | 1.1x |
Table 1: Performance metrics showcasing DeepGEMM’s efficiency across various configurations.
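For context on how such throughput figures are derived: a single GEMM performs roughly 2·M·N·K floating-point operations (one multiply and one add per inner-product term), so TFLOPS is that count divided by the measured runtime. The runtime below is a hypothetical value chosen to roughly reproduce the first table row, not a measurement.

```python
def gemm_tflops(M: int, N: int, K: int, seconds: float) -> float:
    """Throughput of one GEMM: 2*M*N*K floating-point ops over the runtime."""
    flops = 2 * M * N * K
    return flops / seconds / 1e12

# Hypothetical runtime picked to roughly match the first table row.
M, N, K = 64, 2112, 7168
print(round(gemm_tflops(M, N, K, seconds=9.4e-6)))  # ≈ 206 TFLOPS
```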
Installation Guide
Getting started with DeepGEMM is straightforward. Here’s a quick guide to install the library:
Step 1: Prerequisites
- Hopper architecture GPUs (sm_90a)
- Python 3.8 or above
- CUDA 12.3 or above (recommended: 12.8 or above)
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (can be cloned as a Git submodule)
Step 2: Clone the DeepGEMM Repository

```shell
git clone --recursive [email protected]:deepseek-ai/DeepGEMM.git
```

Step 3: Install the Library

```shell
python setup.py install
```

Step 4: Import DeepGEMM in your Python Project

```python
import deep_gemm
```
For detailed installation instructions and additional information, visit the DeepGEMM GitHub repository.
Conclusion
DeepGEMM stands out as a powerful FP8 GEMM library, known for its speed and ease of use, making it a great fit for tackling the challenges of advanced machine learning tasks. With its lightweight design, fast execution, and flexibility to work with different data layouts, DeepGEMM is a go-to tool for developers everywhere. Whether you’re working on training or inference, this library is built to simplify complex workflows, helping researchers and practitioners push the boundaries of what’s possible in AI.
Stay tuned to Analytics Vidhya Blog for our detailed analysis on DeepSeek’s Day 4 release!