MiniMax-01: Advanced Language Model with 456B Parameters
MiniMax-01 represents a significant advance in large language models, featuring 456B total parameters, of which 45.9B are activated per token. The model adopts a hybrid architecture combining Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE), enabling strong performance across a wide range of tasks while keeping per-token compute manageable.
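For quick reference, the specifications quoted throughout this article can be collected into a small summary, shown here as a plain Python dictionary. This is an informal sketch; the field names are illustrative and are not taken from any official configuration file.

```python
# Summary of MiniMax-01's published specifications (illustrative only;
# field names are informal and do not match any official config file).
MINIMAX_01_SPECS = {
    "total_params": 456e9,               # 456B total parameters
    "active_params_per_token": 45.9e9,   # 45.9B activated per token (MoE)
    "num_layers": 80,
    "attention_pattern": "7x lightning attention + 1x softmax attention",
    "num_attention_heads": 64,
    "head_dim": 128,
    "num_experts": 32,
    "experts_per_token": 2,              # top-2 routing
    "expert_hidden_dim": 9216,
    "train_context_tokens": 1_000_000,
    "inference_context_tokens": 4_000_000,
}

if __name__ == "__main__":
    for key, value in MINIMAX_01_SPECS.items():
        print(f"{key}: {value}")
```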
Superior Performance on Benchmarks
MiniMax-01 demonstrates strong results across multiple benchmarks, achieving 88.5% on MMLU, 75.7% on MMLU-Pro, and 94.8% on GSM8K. The model performs particularly well on mathematical reasoning, coding, and complex problem solving.
Advanced Architecture
The model features an 80-layer architecture with a hybrid attention mechanism, in which a softmax attention layer follows every 7 lightning attention layers. Each attention layer uses 64 heads with a head dimension of 128. This design keeps attention cost low over long sequences while retaining full softmax attention at regular intervals.
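As a rough illustration of this layering pattern, the sketch below builds an 80-layer attention schedule, assuming a repeating block of 7 lightning attention layers followed by 1 softmax attention layer. The names and structure are hypothetical, not MiniMax's implementation.

```python
# Hypothetical sketch of the hybrid attention layout described above:
# 80 layers, where every 8th layer uses softmax attention and the rest
# use lightning (linear) attention. Names are illustrative only.

NUM_LAYERS = 80
LIGHTNING_PER_BLOCK = 7  # 7 lightning layers, then 1 softmax layer


def build_layer_types(num_layers: int = NUM_LAYERS) -> list[str]:
    """Return the attention type used at each layer index."""
    layer_types = []
    for i in range(num_layers):
        # Layers 8, 16, 24, ... (1-indexed) become softmax attention layers.
        if (i + 1) % (LIGHTNING_PER_BLOCK + 1) == 0:
            layer_types.append("softmax_attention")
        else:
            layer_types.append("lightning_attention")
    return layer_types


if __name__ == "__main__":
    types = build_layer_types()
    print(types[:8])  # first block: 7 lightning layers, then 1 softmax layer
    print("softmax layers:", types.count("softmax_attention"))      # 10
    print("lightning layers:", types.count("lightning_attention"))  # 70
```

Under this schedule, 10 of the 80 layers use softmax attention and the remaining 70 use lightning attention.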
Long Context Capabilities
MiniMax-01 supports context lengths up to 4 million tokens during inference, with a training context length of 1 million tokens. This extensive context window enables effective processing of long documents and complex tasks requiring broad context understanding.
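Part of what makes such long contexts tractable is that linear-attention variants like lightning attention can be computed recurrently with a fixed-size state, whereas softmax attention's key-value cache grows with sequence length. The toy example below illustrates that general idea only; it is not MiniMax's lightning attention kernel, and the feature map and dimensions are simplifying assumptions.

```python
# Toy illustration of why linear attention keeps memory constant in sequence
# length: each new token updates a fixed-size state (d x d matrix) instead of
# appending to a growing KV cache. This is a generic linear-attention
# recurrence, NOT the actual lightning attention implementation.
import numpy as np


def linear_attention_stream(q, k, v):
    """Process tokens one at a time with an O(d*d) recurrent state.

    q, k, v: arrays of shape (seq_len, d).
    Returns outputs of shape (seq_len, d).
    """
    seq_len, d = q.shape
    state = np.zeros((d, d))   # running sum of outer(k_t, v_t), fixed size
    norm = np.zeros(d)         # running sum of k_t, for normalization
    outputs = np.zeros((seq_len, d))
    for t in range(seq_len):
        # Simple positive feature map (an assumption; real kernels differ).
        q_t, k_t = np.maximum(q[t], 0), np.maximum(k[t], 0)
        state += np.outer(k_t, v[t])
        norm += k_t
        outputs[t] = q_t @ state / (q_t @ norm + 1e-6)
    return outputs


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = linear_attention_stream(q, k, v)
print(out.shape)  # (16, 8) -- per-step memory is independent of seq_len
```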
Mixture-of-Experts Architecture
The model employs 32 experts, each with a hidden dimension of 9216, and routes each token to its top 2 experts. Because only a small fraction of the experts is active for any given token, the model combines a large total parameter count (456B) with a much smaller activated footprint (45.9B per token), while still allowing experts to specialize on different types of inputs.
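To make the idea concrete, the sketch below shows generic top-2 routing over a set of toy feed-forward experts. It is a simplified illustration under assumed dimensions, not MiniMax's router or load-balancing scheme.

```python
# Toy top-2 Mixture-of-Experts routing: each token is sent to the 2 experts
# with the highest router scores, and their outputs are combined with the
# renormalized router weights. Dimensions and router are illustrative only.
import numpy as np

D_MODEL, D_EXPERT, NUM_EXPERTS, TOP_K = 64, 96, 32, 2
rng = np.random.default_rng(0)

# One tiny two-layer feed-forward network per expert.
experts = [
    (rng.standard_normal((D_MODEL, D_EXPERT)) * 0.02,
     rng.standard_normal((D_EXPERT, D_MODEL)) * 0.02)
    for _ in range(NUM_EXPERTS)
]
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02


def moe_layer(x):
    """x: (num_tokens, D_MODEL) -> (num_tokens, D_MODEL) via top-2 routing."""
    logits = x @ router_w                              # (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]  # top-2 expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top_idx[t]]
        weights = np.exp(scores) / np.exp(scores).sum()  # renormalize
        for w, e in zip(weights, top_idx[t]):
            w_in, w_out = experts[e]
            out[t] += w * (np.maximum(x[t] @ w_in, 0) @ w_out)
    return out


tokens = rng.standard_normal((4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 64): each token uses only 2 of 32 experts
```

In a production MoE layer the routing is batched and balanced across experts, but the core idea is the same: each token pays the compute cost of only 2 experts rather than all 32.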
Practical Applications
From advanced mathematics and programming to complex reasoning tasks, MiniMax-01 offers comprehensive support across diverse domains. The model's extensive training and advanced architecture make it an invaluable tool for both academic and professional applications.