This document provides guidance on deploying and running the MiniMax-M1 model using the Transformers library.

Applicable Models

This document applies to the MiniMax-M1 models; during deployment you only need to change the model name. Note: Transformers-compatible models are published in repositories with the -hf suffix. Compared with the versions without the suffix, only the config.json file differs; the weight files are identical.
The deployment example below uses MiniMax-M1-40k-hf.
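Switching to another applicable model only means changing the repository name when loading. The 80k repository name below is illustrative; use the exact -hf repository name listed for your model:

MODEL_PATH = "MiniMaxAI/MiniMax-M1-40k-hf"    # the model used throughout this guide
# MODEL_PATH = "MiniMaxAI/MiniMax-M1-80k-hf"  # illustrative: swap in the -hf repository of another applicable model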

Environment Setup

  • Python: 3.9+
We recommend using a virtual environment (e.g., venv, conda, or uv) to avoid dependency conflicts. Install Transformers, PyTorch, and related dependencies with the following commands:
# Using CUDA 12.8
# Install with pip
pip install transformers torch accelerate --extra-index-url https://download.pytorch.org/whl/cu128

# Or install with uv
uv pip install transformers torch accelerate --torch-backend=auto
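
Optionally, verify the installation before loading the model. This is a minimal sketch that only checks library versions and GPU visibility:

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))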

Running with Python

Ensure that all dependencies are correctly installed and CUDA drivers are properly configured.
The following example demonstrates how to load and run the MiniMax-M1 model with Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "MiniMaxAI/MiniMax-M1-40k-hf"

# Load the model; device_map="auto" spreads the weights across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Multi-turn chat history in the format expected by the chat template
messages = [
    {"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
    {"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
]

# Apply the model's chat template and move the token ids to the GPU
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

# Generate up to 100 new tokens with sampling enabled
generated_ids = model.generate(model_inputs, max_new_tokens=100, do_sample=True)

# Decode the full sequence (prompt + generated tokens)
response = tokenizer.batch_decode(generated_ids)[0]

print(response)
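
Note that batch_decode above returns the full sequence, including the prompt and special tokens. If you only want the newly generated reply, one small variation (building on the variables defined above) is to slice off the prompt tokens before decoding:

# Keep only the tokens generated after the prompt, and drop special tokens
new_tokens = generated_ids[:, model_inputs.shape[1]:]
response = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)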

Accelerating Inference with Flash Attention

Flash Attention is an efficient attention implementation that accelerates model inference.
Make sure your GPU supports Flash Attention, as some older GPUs may not be compatible; a quick compatibility check is shown after the install commands below.
First, install the flash_attn package:
# Install with pip
pip install flash_attn --no-build-isolation

# Or install with uv
uv pip install flash_attn --torch-backend=auto --no-build-isolation
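
Flash Attention 2 generally requires a recent NVIDIA GPU (Ampere or newer, i.e. compute capability 8.0+). The sketch below is one way to check before enabling it; the 8.0 threshold comes from flash-attn's published requirements, not anything specific to MiniMax-M1:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    if major >= 8:
        print("This GPU should support Flash Attention 2.")
    else:
        print("This GPU likely does not support Flash Attention 2; keep the default attention implementation.")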
To enable Flash Attention 2 when loading and running the MiniMax-M1 model, add the following parameters to from_pretrained:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,        # Added parameter
    attn_implementation="flash_attention_2"  # Added parameter
)
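
If flash_attn cannot be installed or your GPU is not supported, PyTorch's built-in scaled dot-product attention is a reasonable middle ground. This sketch keeps the other loading parameters unchanged:

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # PyTorch scaled dot-product attention fallback
)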

Support

If you encounter issues while deploying the MiniMax models, please let us know. We continuously improve the deployment experience on Transformers and welcome your feedback!