
DeepSeek V3 LLM: An In-Depth Look
Ever since the DeepSeek V3 LLM was released in December 2024, it has been on everyone's lips: not only because of the model's impressive performance, with claims of beating Anthropic's Claude 3.5 or OpenAI's GPT-4o on a number of tasks, but also because DeepSeek V3 allegedly achieved that at a fraction of the cost. In this blog post we take a deep dive into what sets DeepSeek V3 apart from other large language models (LLMs) and what the Chinese open-source model has to offer to the global LLM community.
Introduction to DeepSeek Large Language Models
DeepSeek V3 is a powerful open-source language model that builds on its predecessors, DeepSeek V1 and V2. Like LLaMA, the DeepSeek models use an auto-regressive Transformer decoder architecture.
Architecture of DeepSeek V3 LLM
Now we will take an in-depth look at the model architecture to better understand where its strengths come from. DeepSeek-AI attributes the cost-effective training and efficient inference of its model to "an innovative Transformer architecture". Side note: for those who are unfamiliar with the intricacies of the Transformer architecture, we recommend reading this seminal blog post. Concretely, the model adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training.
Mixture-of-Experts (MoE) model
For cost-effective training, DeepSeek V3 adopts a Mixture-of-Experts (MoE) architecture (DeepSeekMoE). Unlike a dense model (e.g. Llama 3.1 405B), which runs every parameter for every token, an MoE model routes each token to a small subset of expert sub-networks, so only a fraction of the total parameters is active in any forward pass. This is what lets the model grow very large without training and inference cost growing at the same rate.
Multi-head Latent Attention (MLA)
For efficient inference, DeepSeek V3 adopts Multi-head Latent Attention (MLA). Instead of caching the full keys and values of every attention head during generation, MLA compresses them into a much smaller latent vector, which shrinks the key-value (KV) cache and with it the memory footprint of serving long contexts.
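A rough sketch of the core mechanism, assuming the essential trick is a low-rank down-projection of keys and values: only the small latent is cached during decoding, and the per-head keys/values are re-expanded from it. The real MLA also compresses queries and handles rotary position embeddings separately, which this toy version omits; all names and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class LatentKVSketch(nn.Module):
    """Toy illustration of MLA-style key/value compression (not DeepSeek's implementation)."""

    def __init__(self, d_model: int = 4096, d_latent: int = 512,
                 n_heads: int = 32, d_head: int = 128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # re-expand to values

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model). During decoding only `latent` needs to be cached:
        # d_latent numbers per token instead of 2 * n_heads * d_head for full K and V.
        latent = self.down(h)
        k = self.up_k(latent)
        v = self.up_v(latent)
        return latent, k, v

# With these toy sizes the cache shrinks from 2 * 32 * 128 = 8192 to 512 values per token.
```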
Multi-Token Prediction (MTP) objective
DeepSeek V3 is additionally trained with a Multi-Token Prediction (MTP) objective: at each position the model learns to predict several future tokens instead of only the next one, which is said to "enhance the overall performance on evaluation benchmarks" (a toy version of the objective is sketched below).
- the MTP modules can also be repurposed for speculative decoding, accelerating inference by proposing several tokens per step
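As a toy illustration of the objective, here is a simplified multi-token prediction loss in which separate output heads predict the tokens one and two steps ahead. DeepSeek's actual design predicts the extra tokens with small sequential MTP modules rather than independent heads, so the function, heads, and weighting here are assumptions for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_prediction_loss(hidden: torch.Tensor,
                                targets: torch.Tensor,
                                heads: nn.ModuleList,
                                mtp_weight: float = 0.3) -> torch.Tensor:
    """Simplified MTP loss: head d predicts the token d+1 steps ahead of each position.

    hidden:  (batch, seq, d_model) final hidden states
    targets: (batch, seq) token ids
    heads:   one vocabulary projection per prediction depth (illustrative setup)
    """
    total = torch.zeros((), device=hidden.device)
    for d, head in enumerate(heads):
        logits = head(hidden[:, : hidden.size(1) - (d + 1)])   # drop the last d+1 positions
        labels = targets[:, d + 1 :]                           # tokens d+1 steps ahead
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        total = total + (1.0 if d == 0 else mtp_weight) * loss  # extra depths are down-weighted
    return total

# Usage sketch: the first head is the usual next-token head, the second looks two tokens ahead.
d_model, vocab = 64, 1000
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(2)])
loss = multi_token_prediction_loss(torch.randn(2, 10, d_model),
                                   torch.randint(0, vocab, (2, 10)), heads)
```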
FP8 mixed precision training framework
DeepSeek V3 is trained in an FP8 mixed precision framework: most of the heavy matrix multiplications run in 8-bit floating point with fine-grained scaling, while precision-sensitive components stay in BF16 or FP32. In the authors' words: "In order to achieve efficient training, we support the FP8 mixed precision training and implement comprehensive optimizations for the training framework."
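To give a feel for what FP8 training involves, here is a toy block-wise quantize/dequantize pair: each block of values gets its own scale so that its maximum fits the FP8 E4M3 range, which is the kind of fine-grained scaling the report describes. The real framework performs FP8 matrix multiplications with higher-precision accumulation, which this omits; the block size and function names are illustrative, and the snippet assumes a PyTorch build that ships the float8_e4m3fn dtype.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Toy block-wise FP8 quantization (illustrative; assumes x.numel() % block == 0)."""
    flat = x.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (flat / scale).to(torch.float8_e4m3fn)   # store in 8-bit floating point
    return q, scale

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.to(torch.float32) * scale).reshape(shape)

w = torch.randn(256, 128)
q, s = quantize_fp8_blockwise(w)
w_approx = dequantize_fp8_blockwise(q, s, w.shape)  # close to w, at half the bytes of FP16
```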
Auxiliary-loss-free strategy for load balancing
A classic failure mode of MoE training is that the router sends most tokens to a few favourite experts. Many MoE models counteract this with an auxiliary balancing loss, which can hurt the main language-modeling objective. DeepSeek V3 instead uses an auxiliary-loss-free strategy: a per-expert bias added to the routing scores is nudged up for underloaded experts and down for overloaded ones during training.
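A minimal sketch of the bias-update idea, assuming the rule described in the report (raise the bias of underloaded experts, lower it for overloaded ones by a small fixed step, and use the bias only for expert selection, not for the gating weights); the function name and step size are illustrative.

```python
import torch

def update_routing_bias(bias: torch.Tensor,
                        tokens_per_expert: torch.Tensor,
                        step: float = 0.001) -> torch.Tensor:
    """Nudge per-expert routing biases toward balanced load (no auxiliary loss term).

    bias:              (num_experts,) biases added to routing scores for top-k selection
    tokens_per_expert: (num_experts,) how many tokens each expert received in the last batch
    """
    mean_load = tokens_per_expert.float().mean()
    overloaded = tokens_per_expert.float() > mean_load
    # Overloaded experts become slightly less attractive, underloaded ones slightly more.
    return torch.where(overloaded, bias - step, bias + step)

# Usage sketch: after each training step, update biases from the observed expert loads.
bias = torch.zeros(8)
bias = update_routing_bias(bias, torch.tensor([900, 50, 120, 80, 300, 100, 60, 40]))
```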
Caching
Performance
According to the technical report, comprehensive evaluations show that DeepSeek V3 outperforms other open-source models and achieves performance comparable to leading closed-source models across a wide range of benchmarks.
Compute
The technical report puts the full training run at roughly 2.788 million H800 GPU hours, which the authors price at about $5.576M assuming a rental rate of $2 per GPU hour, a fraction of what comparable frontier models are believed to cost.