diff --git a/docs/flagrelease_en/model_list.txt b/docs/flagrelease_en/model_list.txt index b4792605..54cf7dc0 100644 --- a/docs/flagrelease_en/model_list.txt +++ b/docs/flagrelease_en/model_list.txt @@ -30,6 +30,10 @@ FlagRelease/Emu3.5-FlagOS FlagRelease/GLM-4.5-FlagOS FlagRelease/GLM-5-FP8-FlagOS FlagRelease/GLM-5-ascend-FlagOS +FlagRelease/GLM-5.2-hygon-FlagOS +FlagRelease/GLM-5.2-metax-FlagOS +FlagRelease/GLM-5.2-mthreads-FlagOS +FlagRelease/GLM-5.2-zhenwu-FlagOS FlagRelease/HY-MT2-1.8B-ascend-FlagOS FlagRelease/HY-MT2-1.8B-hygon-FlagOS FlagRelease/HY-MT2-1.8B-metax-FlagOS @@ -79,8 +83,13 @@ FlagRelease/MiniMax-M2.7-iluvatar-FlagOS FlagRelease/MiniMax-M2.7-metax-FlagOS FlagRelease/MiniMax-M2.7-nvidia-FlagOS FlagRelease/MiniMax-M2.7-zhenwu-FlagOS +FlagRelease/MiniMax-M3-ascend-FlagOS +FlagRelease/MiniMax-M3-hygon-FlagOS +FlagRelease/MiniMax-M3-kunlunxin-FlagOS +FlagRelease/MiniMax-M3-metax-FlagOS FlagRelease/MiniMax-M3-mthreads-FlagOS FlagRelease/MiniMax-M3-nvidia-FlagOS +FlagRelease/MiniMax-M3-zhenwu-FlagOS FlagRelease/QwQ-32B-FlagOS-Cambricon FlagRelease/QwQ-32B-FlagOS-Iluvatar FlagRelease/QwQ-32B-FlagOS-Nvidia @@ -139,6 +148,7 @@ FlagRelease/RoboBrain2.5-8B-FlagOS FlagRelease/RoboBrain2.5-8B-ascend-FlagOS FlagRelease/Seed-OSS-36B-Instruct-FlagOS FlagRelease/TeleChat3-36B-Thinking-mthreads-FlagOS +FlagRelease/deepseek-r1-1.5b-nvidia-FlagOS FlagRelease/farm_molecular_representation-hygon-FlagOS FlagRelease/farm_molecular_representation-nvidia-FlagOS FlagRelease/gpt-oss-120b-FlagOS diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-hygon-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-hygon-FlagOS.md new file mode 100644 index 00000000..96f453ab --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-hygon-FlagOS.md @@ -0,0 +1,194 @@ +--- +license: apache-2.0 +language: +- zh +- en +--- + +# Introduction +Zhipu officially released its next-generation open-source flagship model **GLM-5.2**, the latest flagship 
targeting **Long Horizon Tasks**. Compared to its predecessor GLM-5.1, it achieves a significant leap in long-horizon task capabilities and is open-sourced under the **MIT License**. The **FlagOS Zhongzhi Community** completed multi-chip adaptation and inference deployment at the first opportunity, currently covering four chips: +**Moore Threads S5000, T-Head 810E, Metax C550 and Hygon DCU BW1000**. + +Developers can rapidly deploy via the FlagOS unified, open-source software stack; model files and deployment guides are simultaneously available on **ModelScope** and **HuggingFace**. GLM-5.2 is a model featuring a stable and usable **1M context window**, purpose-built for Long Horizon Tasks. Its core capabilities include: + +- **Solid 1M context**: Stably supports a 1,000,000-token context window for long-horizon workloads +- **Flexible advanced coding**: Enhanced coding capabilities with support for multiple inference effort levels to balance performance and latency +- **Improved architecture**: Introduces **IndexShare**, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at 1M context length; improves the MTP layer to support speculative decoding, increasing acceptance length by up to **20%** +- **Fully open-source**: MIT license, with no geographic restrictions + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Hygon** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+ + +# Evaluation Results +## Benchmark Result +| Metrics | GLM-5.2-Nvidia-Origin | GLM-5.2-Hygon-FlagOS | +|--------------|-----------------------|----------------------| +| GPQA_Diamond | 85.85 | Evaluating | +| musr_generative | 69.2 | Evaluating | + +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 28.2.2, build 28.2.2-0ubuntu1~22.04.1 | +| Operating System | Ubuntu 22.04.4 LTS (Jammy Jellyfish) | + +## Operation Steps + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-hygon-tree_none-gems_5.0.2-vllm_0.20.0_das.dtk2604-plugin_0.2.0rc2.post1-cx_none-python_3.10.12-torch_2.10.0_das.opt1.dtk2604.20260325.g6b060a-pcp_dtk-25.04.4-dri:202606171534 +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/GLM-5.2-hygon-FlagOS --local_dir /data/GLM-5.2 +``` + +### Start the Container +```bash +docker run \ + --name flagos \ + --network=host \ + --ipc=host \ + --device=/dev/kfd \ + --device=/dev/mkfd \ + --device=/dev/dri \ + -v /opt/hyhal:/opt/hyhal \ + -v /data:/data \ + --group-add video \ + --cap-add=SYS_PTRACE \ + --security-opt seccomp=unconfined \ + -itd harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-hygon-tree_none-gems_5.0.2-vllm_0.20.0_das.dtk2604-plugin_0.2.0rc2.post1-cx_none-python_3.10.12-torch_2.10.0_das.opt1.dtk2604.20260325.g6b060a-pcp_dtk-25.04.4-dri:202606171534 \ + /bin/bash +docker exec -it flagos /bin/bash +``` +### Start the Server +```bash +# In node 0 +export VLLM_PLUGINS=fl +VLLM_FL_FLAGOS_BLACKLIST='attention_backend,rotary_embedding,rms_norm,silu_and_mul,gelu_and_mul,grouped_topk,topk_softmax,invoke_fused_moe_triton_kernel,moe_align_block_size,moe_sum' \ +vllm serve /data/GLM-5.2 \ + --served-model-name "flagOS" \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 8 \ + --max-model-len 32768 \ + --trust-remote-code \ + 
--enforce-eager \ + --pipeline-parallel-size 4 \ + --nnodes 4 \ + --node-rank 0 \ + --master-addr + +# In node 1 +export VLLM_PLUGINS=fl +VLLM_FL_FLAGOS_BLACKLIST='attention_backend,rotary_embedding,rms_norm,silu_and_mul,gelu_and_mul,grouped_topk,topk_softmax,invoke_fused_moe_triton_kernel,moe_align_block_size,moe_sum' vllm serve /data/GLM-5.2 \ + --served-model-name "flagOS" \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 8 \ + --max-model-len 32768 \ + --trust-remote-code \ + --enforce-eager \ + --pipeline-parallel-size 4 \ + --nnodes 4 --node-rank 1 \ + --master-addr \ + --headless + +# In node 2 +export VLLM_PLUGINS=fl +VLLM_FL_FLAGOS_BLACKLIST='attention_backend,rotary_embedding,rms_norm,silu_and_mul,gelu_and_mul,grouped_topk,topk_softmax,invoke_fused_moe_triton_kernel,moe_align_block_size,moe_sum' vllm serve /data/GLM-5.2 \ + --served-model-name "flagOS" \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 8 \ + --max-model-len 32768 \ + --trust-remote-code \ + --enforce-eager \ + --pipeline-parallel-size 4 \ + --nnodes 4 --node-rank 2 \ + --master-addr \ + --headless +# In node 3 +export VLLM_PLUGINS=fl +VLLM_FL_FLAGOS_BLACKLIST='attention_backend,rotary_embedding,rms_norm,silu_and_mul,gelu_and_mul,grouped_topk,topk_softmax,invoke_fused_moe_triton_kernel,moe_align_block_size,moe_sum' vllm serve /data/GLM-5.2 \ + --served-model-name "flagOS" \ + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 8 \ + --max-model-len 32768 \ + --trust-remote-code \ + --enforce-eager \ + --pipeline-parallel-size 4 \ + --nnodes 4 --node-rank 3 \ + --master-addr \ + --headless +``` + +## Service Invocation +### Invocation Script +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "flagOS", + "messages": [{"role": "user", "content": "你好"}] + }' +``` + + +### AnythingLLM Integration Guide + +#### 1. 
Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application. +## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms.
+## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
+ - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from ZhipuAI/GLM-5.2 and are open‑sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-metax-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-metax-FlagOS.md new file mode 100644 index 00000000..6103978a --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-metax-FlagOS.md @@ -0,0 +1,159 @@ +--- +license: apache-2.0 +language: +- zh +- en +--- + +# Introduction +Zhipu officially released its next-generation open-source flagship model **GLM-5.2**, the latest flagship targeting **Long Horizon Tasks**. Compared to its predecessor GLM-5.1, it achieves a significant leap in long-horizon task capabilities and is open-sourced under the **MIT License**. The **FlagOS Zhongzhi Community** completed multi-chip adaptation and inference deployment at the first opportunity, currently covering four chips: +**Moore Threads S5000, T-Head 810E, Metax C550 and Hygon DCU BW1000**. + +Developers can rapidly deploy via the FlagOS unified, open-source software stack; model files and deployment guides are simultaneously available on **ModelScope** and **HuggingFace**. GLM-5.2 is a model featuring a stable and usable **1M context window**, purpose-built for Long Horizon Tasks.
Its core capabilities include: + +- **Solid 1M context**: Stably supports a 1,000,000-token context window for long-horizon workloads +- **Flexible advanced coding**: Enhanced coding capabilities with support for multiple inference effort levels to balance performance and latency +- **Improved architecture**: Introduces **IndexShare**, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at 1M context length; improves the MTP layer to support speculative decoding, increasing acceptance length by up to **20%** +- **Fully open-source**: MIT license, with no geographic restrictions + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Metax** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+ + +# Evaluation Results +## Benchmark Result +| Metrics | GLM-5.2-Nvidia-Origin | GLM-5.2-Metax-FlagOS | +|--------------|--------------------------------|----------------------| +| GPQA_Diamond | 85.85 | 84.34 | +| musr_generative | 69.2 | Evaluating | + +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 27.5.1, build 27.5.1-0ubuntu3~22.04.2 | +| Operating System | Ubuntu 22.04.5 LTS (Jammy Jellyfish) | + +## Operation Steps + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-metax-tree_0.5.1_metax3.0-gems_5.0.2-vllm_0.13.0_empty-plugin_0.1.0_vllm0.13.0-cx_0.8.0-python_3.12.11-torch_2.8.0_metax3.3.0.2-pcp_maca3.3.0.15-gpu_metax001-arc_amd64-driver_3.8.1:202606172035 +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/GLM-5.2-metax-FlagOS --local_dir /data/GLM-5.2 +``` + +### Start the Container +```bash +docker run -itd \ + --name flagos \ + --privileged \ + --network=host \ + --security-opt seccomp=unconfined \ + --security-opt apparmor=unconfined \ + --shm-size '100gb' \ + --ulimit memlock=-1 \ + --group-add video \ + --device=/dev/dri \ + --device=/dev/mxcd \ + --device=/dev/mem \ + --device=/dev/infiniband \ + -v /usr/local/:/usr/local/ \ + -v /data/:/data/ \ + harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-metax-tree_0.5.1_metax3.0-gems_5.0.2-vllm_0.13.0_empty-plugin_0.1.0_vllm0.13.0-cx_0.8.0-python_3.12.11-torch_2.8.0_metax3.3.0.2-pcp_maca3.3.0.15-gpu_metax001-arc_amd64-driver_3.8.1:202606172035 \ + /bin/bash +docker exec -it flagos /bin/bash +``` +### Start the Server +This inference deployment requires 4 physical machines. All node startup scripts are located under /data/GLM-5.2/script/, with the filename prefix start_. 
+Full list of scripts: +- /data/GLM-5.2/script/start_node0_tp32_pytorch.sh +- /data/GLM-5.2/script/start_node1_tp32_pytorch.sh +- /data/GLM-5.2/script/start_node2_tp32_pytorch.sh +- /data/GLM-5.2/script/start_node3_tp32_pytorch.sh + +You need to modify the service startup scripts for the four machines according to your actual environment. +```bash +cd /data/GLM-5.2/script + +# Run on node0 +bash start_node0_tp32_pytorch.sh + +# Run on node1 +bash start_node1_tp32_pytorch.sh + +# Run on node2 +bash start_node2_tp32_pytorch.sh + +# Run on node3 +bash start_node3_tp32_pytorch.sh +``` + +## Service Invocation +### Invocation Script +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "glm52", + "messages": [{"role": "user", "content": "你好"}] +}' +``` + + +### AnythingLLM Integration Guide + +#### 1. Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. 
With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application. +## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms. +## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework.
+## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation. + - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from ZhipuAI/GLM-5.2 and are open‑sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-mthreads-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-mthreads-FlagOS.md new file mode 100644 index 00000000..8a6bbf75 --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-mthreads-FlagOS.md @@ -0,0 +1,212 @@ +--- +license: apache-2.0 +language: +- zh +- en +--- + +# Introduction +Zhipu officially released its next-generation open-source flagship model **GLM-5.2**, the latest flagship targeting **Long Horizon Tasks**.
Compared to its predecessor GLM-5.1, it achieves a significant leap in long-horizon task capabilities and is open-sourced under the **MIT License**. The **FlagOS Zhongzhi Community** completed multi-chip adaptation and inference deployment at the first opportunity, currently covering four chips: +**Moore Threads S5000, T-Head 810E, Metax C550 and Hygon DCU BW1000**. + +Developers can rapidly deploy via the FlagOS unified, open-source software stack; model files and deployment guides are simultaneously available on **ModelScope** and **HuggingFace**. GLM-5.2 is a model featuring a stable and usable **1M context window**, purpose-built for Long Horizon Tasks. Its core capabilities include: + +- **Solid 1M context**: Stably supports a 1,000,000-token context window for long-horizon workloads +- **Flexible advanced coding**: Enhanced coding capabilities with support for multiple inference effort levels to balance performance and latency +- **Improved architecture**: Introduces **IndexShare**, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at 1M context length; improves the MTP layer to support speculative decoding, increasing acceptance length by up to **20%** +- **Fully open-source**: MIT license, with no geographic restrictions + + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Mthreads** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+ + +# Evaluation Results +## Benchmark Result +| Metrics | GLM-5.2-Nvidia-Origin | GLM-5.2-Mthreads-FlagOS | +|--------------|--------------------------------|-------------------------| +| GPQA_Diamond | 85.85 | Evaluating | +| musr_generative | 69.2 | 67.2 | +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 27.5.1, build 9f9e405 | +| Operating System | Ubuntu 22.04.4 LTS (Jammy Jellyfish) | + +## Operation Steps + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-mthreads-sglang_0.5.11-plugin_0.1.0-tree_none-gems_5.0.2-vllm_none-cx_none-python_3.10.12-torch_2.9.0-pcp_musa4.3.5-driver_3.3.6-server:202606160848 +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/GLM-5.2-mthreads-FlagOS --local_dir /data/GLM-5.2 +``` + +### Start the Container +```bash +docker run -itd \ + --name flagos \ + --network host \ + --privileged \ + -v /data:/data \ + harbor.baai.ac.cn/flagrelease-public/flagrelease-glm-5.2-mthreads-sglang_0.5.11-plugin_0.1.0-tree_none-gems_5.0.2-vllm_none-cx_none-python_3.10.12-torch_2.9.0-pcp_musa4.3.5-driver_3.3.6-server:202606160848 \ + bash +docker exec -it flagos bash +``` +### Start the Server +In Rank 0 +```bash +# ====================== Rank 0 node operations ====================== +## Step 1: Activate the virtual environment and set the environment variables for distributed communication and performance debugging +source /root/.virtualenvs/sglang-0.5.6/bin/activate +export TORCH_MCCL_ASYNC_ERROR_HANDLING=0 +export MCCL_SOCKET_IFNAME=bond0 +export GLOO_SOCKET_IFNAME=bond0 +export MCCL_TIMEOUT=14400 +export MCCL_IB_DISABLE=1 +export SGLANG_FLAGGEMS_RECORD=1 +export SGLANG_FLAGGEMS_LOG_PATH=/tmp/flaggems_op.txt +export SGLANG_FL_DISPATCH_DEBUG=1 +export TORCH_COMPILE_DISABLE=1 +export TRITON_CACHE_DIR=/root/triton_cache/ +export SGLANG_FL_FLAGOS_BLACKLIST=unique,sort,count_nonzero,cumsum,mm +export MUSA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" + +## Step 2: 
Start the SGLang inference service in the background (Rank 0, pipeline stage 0) +# Run with nohup, write logs to /tmp/serve_glm_pp0.log, and use & to put the process in the background +nohup python3 -m sglang.launch_server \ + --model-path /data/GLM-5.2 \ + --tp-size 8 --pp-size 2 --nnodes 2 \ + --node-rank 0 \ + --dist-init-addr ":29500" \ + --host 0.0.0.0 --port 30000 \ + --served-model-name glm-5.2 \ + --tool-call-parser glm47 --reasoning-parser glm45 \ + --kv-cache-dtype fp8_e4m3 \ + --attention-backend triton \ + --cuda-graph-bs 1 2 4 6 8 12 16 20 24 32 40 48 \ + --chunked-prefill-size 2048 \ + --mem-fraction-static 0.85 \ + --trust-remote-code \ + --watchdog-timeout 3600 \ + > /tmp/serve_glm_pp0.log 2>&1 & + +## Step 3: Follow the Rank 0 startup log in real time and watch for the ready state +# The log prints continuously; the ready message means Rank 0 has finished initializing +tail -f /tmp/serve_glm_pp0.log + +# Ready message in the log: [2026-xx-xx xx:xx:xx] The server is fired up and ready to roll! +# Full cluster initialization takes about 3 to 5 minutes +# Important: after starting, Rank 0 blocks until the Rank 1 node establishes the distributed connection. Seeing "Init torch distributed begin" in the log is normal; at that point, run the Rank 1 script on the remote node +``` +In Rank 1 +```bash +# ====================== Rank 1 remote node (10.1.15.176) operations ====================== +## Step 1: Enter the inference container, activate the virtual environment, and set the same distributed environment variables +# Enter the container that runs sglang +docker exec -it flagos bash +# Activate the same virtual environment as on Rank 0 +source /root/.virtualenvs/sglang-0.5.6/bin/activate +export TORCH_MCCL_ASYNC_ERROR_HANDLING=0 +export MCCL_SOCKET_IFNAME=bond0 +export GLOO_SOCKET_IFNAME=bond0 +export MCCL_TIMEOUT=14400 +export MCCL_IB_DISABLE=1 +export TORCH_COMPILE_DISABLE=1 +export SGLANG_FLAGGEMS_RECORD=1 +export SGLANG_FLAGGEMS_LOG_PATH=/tmp/flaggems_op.txt +export SGLANG_FL_DISPATCH_DEBUG=1 +export TRITON_CACHE_DIR=/root/triton_cache/ +export SGLANG_FL_FLAGOS_BLACKLIST=unique,sort,count_nonzero,cumsum,mm +export MUSA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" + +## Step 2: Start the SGLang inference service in the background (Rank 1, pipeline stage 1) +nohup python3 -m sglang.launch_server \ + --model-path /data/GLM-5.2 \ + --tp-size 8 --pp-size 2 --nnodes 2 \ + --node-rank 1 \ + --dist-init-addr ":29500" \ + --host 0.0.0.0 --port 30000 \ + --served-model-name glm-5.2 \ + --tool-call-parser glm47 --reasoning-parser glm45 \ + --kv-cache-dtype fp8_e4m3 \ +
--attention-backend triton \ + --cuda-graph-bs 1 2 4 6 8 12 16 20 24 32 40 48 \ + --chunked-prefill-size 2048 \ + --mem-fraction-static 0.85 \ + --trust-remote-code \ + --watchdog-timeout 3600 \ + > /tmp/serve_glm_pp1.log 2>&1 & + + +``` + +## Service Invocation +### Invocation Script +```bash +curl http://localhost:30000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "glm-5.2", + "messages": [{"role": "user", "content": "你好"}] + }' +``` + + +### AnythingLLM Integration Guide + +#### 1. Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \.
This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application. +## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms. +## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
+ +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation. + - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from ZhipuAI/GLM-5.2 and are open‑sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-zhenwu-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-zhenwu-FlagOS.md new file mode 100644 index 00000000..ada86d57 --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_GLM-5.2-zhenwu-FlagOS.md @@ -0,0 +1,146 @@ +--- +license: apache-2.0 +language: +- zh +- en +--- + +# Introduction +Zhipu officially released its next-generation open-source flagship model **GLM-5.2**, the latest flagship targeting **Long Horizon Tasks**. Compared to its predecessor GLM-5.1, it achieves a significant leap in long-horizon task capabilities and is open-sourced under the **MIT License**. The **FlagOS Zhongzhi Community** completed multi-chip adaptation and inference deployment at the first opportunity, currently covering four chips: +**Moore Threads S5000, T-Head 810E, Metax C550 and Hygon DCU BW1000**.
+
+Developers can rapidly deploy via the FlagOS unified, open-source software stack; model files and deployment guides are simultaneously available on **ModelScope** and **HuggingFace**. GLM-5.2 is a model featuring a stable and usable **1M context window**, purpose-built for Long Horizon Tasks. Its core capabilities include:
+
+- **Solid 1M context**: Stably supports a 1,000,000-token context window for long-horizon workloads
+- **Flexible advanced coding**: Enhanced coding capabilities with support for multiple inference effort levels to balance performance and latency
+- **Improved architecture**: Introduces **IndexShare**, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at 1M context length; improves the MTP layer to support speculative decoding, increasing acceptance length by up to **20%**
+- **Fully open-source**: MIT license, with no geographic restrictions
+
+
+### Integrated Deployment
+- Out-of-the-box inference scripts with pre-configured hardware and software parameters
+- Released **FlagOS-Zhenwu** container image supporting deployment within minutes
+### Consistency Validation
+- Rigorously evaluated through benchmark testing: performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+
+
+# Evaluation Results
+## Benchmark Result
+| Metrics | GLM-5.2-Nvidia-Origin | GLM-5.2-Zhenwu-FlagOS |
+|--------------|-------------------------------|-----------------------|
+| GPQA_Diamond | 85.85 | 84.62 |
+| musr_generative | 69.2 | Evaluating |
+
+# User Guide
+## Environment Setup
+
+| Item | Version |
+|------------------|----------------------|
+| Docker Version | Docker version 28.1.0, build 4d8c241 |
+| Operating System | Ubuntu 24.04.2 LTS |
+
+## Operation Steps
+The image for this task is exported from Alibaba Cloud PAI and can be used on Alibaba Cloud EAS and DSW, both of which are container-based resource services. For detailed instructions on how to use this image, please contact the PAI platform support team. The task released by BAAI is developed based on the container environment launched via the PAI platform.
+
+### Download FlagOS Image
+```bash
+docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-glm5.2-zhenwu-tree_none-gems_5.0.2-vllm_0.20.2_empty-plugin_0.2.0rc2.post1_g672dedc42-cx_none-python_3.12.3-torch_2.10.0-pcp_hggc13.0-gpu_pp001-arc_amd64-driver_1.3.2-d7f5a2:202606161003
+```
+
+### Download Open-source Model Weights
+```bash
+pip install modelscope
+modelscope download --model FlagRelease/GLM-5.2-zhenwu-FlagOS --local_dir /data/GLM-5.2
+```
+
+### Start the Server
+```bash
+export NCCL_ALGO=Ring # the Ring algorithm is more stable across nodes
+export NCCL_MIN_NCHANNELS=16 # increase the number of parallel channels (default: 8)
+export NCCL_NTHREADS=512 # number of NCCL threads
+export NCCL_IB_GID_INDEX=3 # RoCE network optimization
+export NCCL_SOCKET_IFNAME=eth0
+
+# On node 0
+VLLM_RPC_TIMEOUT=3000 NCCL_DEBUG=INFO VLLM_PLUGINS=fl nohup vllm serve /data/GLM-5.2 \
+--served-model-name "glm5.2" --host 0.0.0.0 --port 8000 \
+--tensor-parallel-size 32 \
+--nnodes 2 --node-rank 0 \
+--master-addr 10.11.0.3 --master-port 29500 \
+--trust-remote-code --enforce-eager \
+--max-model-len 32768 --gpu-memory-utilization 0.95 \
+--max-num-batched-tokens 8192 \
+> glm5_2.log 2>&1 &
+
+# On node 1
+VLLM_RPC_TIMEOUT=3000 NCCL_DEBUG=INFO VLLM_PLUGINS=fl nohup vllm serve /data/GLM-5.2 \
+--served-model-name "glm5.2" --host 0.0.0.0 --port 8000 \
+--tensor-parallel-size 32 \
+--nnodes 2 --node-rank 1 \
+--master-addr 10.11.0.3 --master-port 29500 \
+--trust-remote-code --enforce-eager --headless \
+--max-model-len 32768 --gpu-memory-utilization 0.95 \
+--max-num-batched-tokens 8192 \
+> glm5_2-2.log 2>&1 &
+```
+
+## Service Invocation
+### Invocation Script
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "glm5.2",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+```
+
+
+### AnythingLLM Integration Guide
+
+#### 1. Download & Install
+
+- Visit the official site: https://anythingllm.com/
+- Choose the appropriate version for your OS (Windows/macOS/Linux)
+- Follow the installation wizard to complete the setup
+
+#### 2. Configuration
+
+- Launch AnythingLLM
+- Open settings (bottom left, fourth tab)
+- Configure core LLM parameters
+- Click "Save Settings" to apply changes
+
+#### 3. Model Interaction
+
+- After model loading is complete:
+- Click **"New Conversation"**
+- Enter your question (e.g., “Explain the basics of quantum computing”)
+- Click the send button to get a response
+# Technical Overview
+**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application.
+## FlagGems
+FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms.
+## FlagTree
+FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration.
+## FlagScale and vllm-plugin-fl
+FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models.
+vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework.
+## **FlagCX**
+FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
+
+## **FlagEval Evaluation Framework**
+FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features:
+- **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
+- **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation.
+
+# Contributing
+
+We warmly welcome global developers to join us:
+
+1. Submit Issues to report problems
+2. Create Pull Requests to contribute code
+3. Improve technical documentation
+4. Expand hardware adaptation support
+# License
+The model weights are derived from ZhipuAI/GLM-5.2 and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt
diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-ascend-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-ascend-FlagOS.md
new file mode 100644
index 00000000..5ca863b9
--- /dev/null
+++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-ascend-FlagOS.md
@@ -0,0 +1,128 @@
+---
+base_model:
+- ""
+language:
+- zh
+- en
+license: apache-2.0
+---
+
+# Introduction
+MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality, and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup. On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3%, a 9.4× speedup.
+
+### Integrated Deployment
+- Out-of-the-box inference scripts with pre-configured hardware and software parameters
+- Released **FlagOS-Ascend** container image supporting deployment within minutes
+### Consistency Validation
+- Rigorously evaluated through benchmark testing: performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+
+
+# Evaluation Results
+## Benchmark Result
+| Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Ascend-FlagOS |
+|--------------|------------------------|------------------------|
+| GPQA_Diamond | 86.36 | 75.56 |
+
+# User Guide
+## Environment Setup
+
+| Item | Version |
+|------------------|----------------------|
+| Docker Version | Docker version 20.10.8, build 3967b7d |
+| Operating System | Linux 5.10.0-216.0.0.115.oe2203sp4.aarch64 |
+
+## Operation Steps
+
+### Download FlagOS Image
+```bash
+docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-ascend-tree_none-gems_5.0.2-vllm_none-plugin_none-cx_none-python_3.11.14-torch_npu_2.8.0.post2-pcp_cann8.5.0-gpu_ascend001-arc_arm64-driver_25.2.0:202606051452
+```
+
+### Download Open-source Model Weights
+```bash
+pip install modelscope
+modelscope download --model FlagRelease/MiniMax-M3-ascend-FlagOS --local_dir /data/MiniMax-M3
+```
+
+### Start the Container
+```bash
+docker run -dit \
+  --name flagos \
+  --privileged \
+  --network=host --ipc=host --shm-size=64g \
+  --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
+  --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
+  --device=/dev/davinci_manager \
+  --device=/dev/hisi_hdc \
+  --volume /usr/local/sbin:/usr/local/sbin \
+  --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+  --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
+  --volume /etc/ascend_install.info:/etc/ascend_install.info \
+  --volume /var/queue_schedule:/var/queue_schedule \
+  --entrypoint=bash \
+  -v /data:/data \
+  harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-ascend-tree_none-gems_5.0.2-vllm_none-plugin_none-cx_none-python_3.11.14-torch_npu_2.8.0.post2-pcp_cann8.5.0-gpu_ascend001-arc_arm64-driver_25.2.0:202606051452
+docker exec -it flagos bash
+```
+
+## Service Invocation
+### Invocation Script
+```bash
+cd /workspace
+# on node 1
+bash run_dual_rank0.sh
+
+# on node 2
+bash run_dual_rank1.sh
+```
+
+
+### AnythingLLM Integration Guide
+
+#### 1. Download & Install
+
+- Visit the official site: https://anythingllm.com/
+- Choose the appropriate version for your OS (Windows/macOS/Linux)
+- Follow the installation wizard to complete the setup
+
+#### 2. Configuration
+
+- Launch AnythingLLM
+- Open settings (bottom left, fourth tab)
+- Configure core LLM parameters
+- Click "Save Settings" to apply changes
+
+#### 3. Model Interaction
+
+- After model loading is complete:
+- Click **"New Conversation"**
+- Enter your question (e.g., “Explain the basics of quantum computing”)
+- Click the send button to get a response
+# Technical Overview
+**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application.
+## FlagGems
+FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms.
+## FlagTree
+FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration.
+## FlagScale and vllm-plugin-fl
+FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models.
+vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework.
+## **FlagCX**
+FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
+
+## **FlagEval Evaluation Framework**
+FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features:
+- **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
+- **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation.
+
+# Contributing
+
+We warmly welcome global developers to join us:
+
+1. Submit Issues to report problems
+2. Create Pull Requests to contribute code
+3. Improve technical documentation
+4. Expand hardware adaptation support
+# License
+The model weights are derived from MiniMaxAI/MiniMax-M3 and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt
+
diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-hygon-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-hygon-FlagOS.md
new file mode 100644
index 00000000..812c2ed5
--- /dev/null
+++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-hygon-FlagOS.md
@@ -0,0 +1,124 @@
+---
+base_model:
+- ""
+language:
+- zh
+- en
+license: apache-2.0
+---
+
+# Introduction
+MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality, and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup. On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3%, a 9.4× speedup.
+
+### Integrated Deployment
+- Out-of-the-box inference scripts with pre-configured hardware and software parameters
+- Released **FlagOS-Hygon** container image supporting deployment within minutes
+### Consistency Validation
+- Rigorously evaluated through benchmark testing: performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+
+
+# Evaluation Results
+## Benchmark Result
+| Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Hygon-FlagOS |
+|--------------|--------------------------|-----------------------|
+| GPQA_Diamond | 86.36 | 77.50 |
+
+# User Guide
+## Environment Setup
+
+| Item | Version |
+|------------------|----------------------|
+| Docker Version | Docker version 28.2.2, build 28.2.2-0ubuntu1~22.04.1 |
+| Operating System | Ubuntu 22.04.4 LTS (Jammy Jellyfish) |
+
+## Operation Steps
+
+### Download FlagOS Image
+```bash
+docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-hygon-tree_0.5.0_hcu3.0-gems_5.0.2-vllm_0.15.1-plugin_none-cx_none-python_3.10.12-torch_2.9.0_das.opt1.dtk2604.20260206.g275d08c2-pcp_cudanone-gpu_nvidia003-arc_amd64-driver:202606051553
+```
+
+### Download Open-source Model Weights
+```bash
+pip install modelscope
+modelscope download --model FlagRelease/MiniMax-M3-hygon-FlagOS --local_dir /data/MiniMax-M3
+```
+
+### Start the Container
+```bash
+docker run \
+  --name flagos \
+  --network=host \
+  --ipc=host \
+  --device=/dev/kfd \
+  --device=/dev/mkfd \
+  --device=/dev/dri \
+  -v /opt/hyhal:/opt/hyhal \
+  -v /root/perfxlab:/workspace \
+  -v /data:/data \
+  -v /baai/model-share:/baai/model-share \
+  --group-add video \
+  --cap-add=SYS_PTRACE \
+  --security-opt seccomp=unconfined \
+  -itd \
+  harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-hygon-tree_0.5.0_hcu3.0-gems_5.0.2-vllm_0.15.1-plugin_none-cx_none-python_3.10.12-torch_2.9.0_das.opt1.dtk2604.20260206.g275d08c2-pcp_cudanone-gpu_nvidia003-arc_amd64-driver:202606051553
+docker exec -it flagos /bin/bash
+```
+
+## Service Invocation
+### Invocation Script
+```bash
+cd /root/M3pytorch_code
+bash run_inference.sh
+```
+
+
+### AnythingLLM Integration Guide
+
+#### 1. Download & Install
+
+- Visit the official site: https://anythingllm.com/
+- Choose the appropriate version for your OS (Windows/macOS/Linux)
+- Follow the installation wizard to complete the setup
+
+#### 2. Configuration
+
+- Launch AnythingLLM
+- Open settings (bottom left, fourth tab)
+- Configure core LLM parameters
+- Click "Save Settings" to apply changes
+
+#### 3. Model Interaction
+
+- After model loading is complete:
+- Click **"New Conversation"**
+- Enter your question (e.g., “Explain the basics of quantum computing”)
+- Click the send button to get a response
+# Technical Overview
+**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application.
+## FlagGems
+FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels that aim to accelerate LLM (large language model) training and inference across diverse hardware platforms.
+## FlagTree
+FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration.
+## FlagScale and vllm-plugin-fl
+FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models.
+vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend that helps FlagScale support multiple chips on the vLLM framework.
+## **FlagCX**
+FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community.
+
+## **FlagEval Evaluation Framework**
+FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features:
+- **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
+- **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation.
+
+# Contributing
+
+We warmly welcome global developers to join us:
+
+1. Submit Issues to report problems
+2. Create Pull Requests to contribute code
+3. Improve technical documentation
+4. Expand hardware adaptation support
+# License
+The model weights are derived from MiniMaxAI/MiniMax-M3 and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt
+
diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-kunlunxin-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-kunlunxin-FlagOS.md
new file mode 100644
index 00000000..628b4ffc
--- /dev/null
+++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-kunlunxin-FlagOS.md
@@ -0,0 +1,118 @@
+---
+base_model:
+- ""
+language:
+- zh
+- en
+license: apache-2.0
+---
+
+# Introduction
+MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality, and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup. On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3%, a 9.4× speedup.
+
+### Integrated Deployment
+- Out-of-the-box inference scripts with pre-configured hardware and software parameters
+- Released **FlagOS-Kunlunxin** container image supporting deployment within minutes
+### Consistency Validation
+- Rigorously evaluated through benchmark testing: performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+
+
+# Evaluation Results
+## Benchmark Result
+| Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Kunlunxin-FlagOS |
+|--------------|--------------------------|-----------------------------|
+| GPQA_Diamond | 86.36 | 22.3 |
+
+# User Guide
+## Environment Setup
+
+| Item | Version |
+|------------------|----------------------|
+| Docker Version | Docker version 28.2.2, build e6534b4 |
+| Operating System | Ubuntu 22.04.4 LTS (Jammy Jellyfish) |
+
+## Operation Steps
+
+### Download FlagOS Image
+```bash
+docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-kunlunxin-tree_0.5.1_xpu3.0-gems_5.0.2-vllm_none-plugin_none-cx_none-python_3.10.15-torch_2.5.1_cu118-pcp_xpu-rtnone-gpu_kunlunxin001-arc_amd64-driver_515.58:202606051936
+```
+
+### Download Open-source Model Weights
+```bash
+pip install modelscope
+modelscope download --model FlagRelease/MiniMax-M3-kunlunxin-FlagOS --local_dir /data/MiniMax-M3
+```
+
+### Start the Container
+```bash
+docker run -itd \
+  --security-opt=seccomp=unconfined \
+  --cap-add=SYS_PTRACE \
+  --ulimit=memlock=-1 --ulimit=nofile=120000 --ulimit=stack=67108864 \
+  --shm-size=128G \
+  --privileged \
+  --net=host \
+  --name zylm3 \
+  -v /data:/data \
+  -w /workspace \
+  harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-kunlunxin-tree_0.5.1_xpu3.0-gems_5.0.2-vllm_none-plugin_none-cx_none-python_3.10.15-torch_2.5.1_cu118-pcp_xpu-rtnone-gpu_kunlunxin001-arc_amd64-driver_515.58:202606051936
+```
+
+## Service Invocation
+### Invocation Script
+```bash
+cd /root/M3pytorch_code
+bash run_inference.sh
+```
+
+
+### AnythingLLM Integration Guide
+
+#### 1. Download & Install
+
+- Visit the official site: https://anythingllm.com/
+- Choose the appropriate version for your OS (Windows/macOS/Linux)
+- Follow the installation wizard to complete the setup
+
+#### 2. Configuration
+
+- Launch AnythingLLM
+- Open settings (bottom left, fourth tab)
+- Configure core LLM parameters
+- Click "Save Settings" to apply changes
+
+#### 3. Model Interaction
+
+- After model loading is complete:
+- Click **"New Conversation"**
+- Enter your question (e.g., “Explain the basics of quantum computing”)
+- Click the send button to get a response
+# Technical Overview
+**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application.
+## FlagGems +FlagGems is a high-performance, generic operator libraryimplemented in [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutralkernels that aims to accelerate LLM (Large-Language Models) training and inference across diverse hardware platforms. +## FlagTree +FlagTree is an open source, unified compiler for multipleAI chips project dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. Forupstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +Flagscale is a comprehensive toolkit designed to supportthe entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend, to help flagscale support multi-chip on vllm framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. 
It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ modelevaluations across NLP, CV, Audio, and Multimodal fields,covering 20+ downstream tasks including language understanding and image-text generation. + - **Industry-Grade Use Cases**: Has completed horizonta1 evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from MiniMaxAI/MiniMax-M3 and are open‑sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt + diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-metax-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-metax-FlagOS.md new file mode 100644 index 00000000..83131807 --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-metax-FlagOS.md @@ -0,0 +1,127 @@ +--- +base_model: +- "" +language: +- zh +- en +license: apache-2.0 +--- + +# Introduction +MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality — and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup. On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. 
In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3%, a 9.4× speedup. + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Metax** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks. + + +# Evaluation Results +## Benchmark Result +| Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Metax-FlagOS | +|--------------|--------------------------|-----------------------| +| GPQA_Diamond | 86.36 | 78.24 | + +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 27.5.1, build 27.5.1-0ubuntu3~22.04.2 | +| Operating System | Ubuntu 22.04.5 LTS (Jammy Jellyfish) | + +## Operation Steps + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-metax-tree_0.5.1_metax3.0-gems_5.0.2-vllm_0.13.0-plugin_none-cx_none-python_3.12.11-torch_2.8.0_metax3.3.0.2-pcp_maca3.3.0.15-gpu_metax001-arc_amd64-driver_3.3.12:202606051553 + +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/MiniMax-M3-metax-FlagOS --local_dir /data/MiniMax-M3 +``` + +### Start the Container +```bash +docker run -itd \ + --name flagos \ + --privileged \ + --network=host \ + --security-opt seccomp=unconfined \ + --security-opt apparmor=unconfined \ + --shm-size '100gb' \ + --ulimit memlock=-1 \ + --group-add video \ + --device=/dev/dri \ + --device=/dev/mxcd \ + --device=/dev/mem \ + --device=/dev/infiniband \ + -v /usr/local/:/usr/local/ \ + -v /data/:/data/ \ +
harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-metax-tree_0.5.1_metax3.0-gems_5.0.2-vllm_0.13.0-plugin_none-cx_none-python_3.12.11-torch_2.8.0_metax3.3.0.2-pcp_maca3.3.0.15-gpu_metax001-arc_amd64-driver_3.3.12:202606051553 \ + /bin/bash +docker exec -it flagos /bin/bash +``` + +## Service Invocation +### Invocation Script +```bash +cd /root/M3pytorch_code +bash run_inference.sh +``` + + +### AnythingLLM Integration Guide + +#### 1. Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application.
+## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels and aims to accelerate LLM (large language model) training and inference across diverse hardware platforms. +## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend; it helps FlagScale support multiple chips on the vLLM framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023.
It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation. + - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from MiniMaxAI/MiniMax-M3 and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt + + diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-mthreads-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-mthreads-FlagOS.md index a34a15bd..6548809f 100644 --- a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-mthreads-FlagOS.md +++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-mthreads-FlagOS.md @@ -1,12 +1,9 @@ --- -base_model: -- "" +license: apache-2.0 language: - zh - en -license: apache-2.0 --- - # Introduction MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality, and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup.
On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3% — a 9.4× speedup. @@ -20,9 +17,9 @@ MiniMax M3, released on June 1st, is the first Chinese model to simultaneously d # Evaluation Results ## Benchmark Result | Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Mthreads-FlagOS | -|--------------|-------------------------------|--------------------------------------| -| GPQA_Diamond | 0.8636 | 0.8182 | - +|--------------|-------------------------------|----------------------------| +| GPQA_Diamond | 86.36 | 83.68 | +| ERQA | 52.25 | 53.75 | # User Guide Environment Setup @@ -47,15 +44,16 @@ modelscope download --model FlagRelease/MiniMax-M3-mthreads-FlagOS --local_dir / ### Start the Container ```bash docker run -dit \ - --name flagos \ - --privileged \ - --ipc host \ - --network host \ - --shm-size 64g \ - --env MTHREADS_VISIBLE_DEVICES=all \ - -v /data:/data \ - harbor.baai.ac.cn/flagrelease-public/flagrelease-minimaxm3-mthreads-tree_0.5.2-gems_5.0.2-sglang_0.5.11-plugin_01.0-cx_none-python_3.10.12-torch_2.9.0-pcp_musa4.3.5-gpu_mthreads001-arc_amd64-driver_3.3.6-server:202606121704 \ - sleep infinity + --name flagos \ + --privileged \ + --ipc host \ + --network host \ + --shm-size 64g \ + --env MTHREADS_VISIBLE_DEVICES=all \ + -v /data:/data \ + harbor.baai.ac.cn/flagrelease-public/flagrelease-minimaxm3-mthreads-tree_0.5.2-gems_5.0.2-sglang_0.5.11-plugin_01.0-cx_none-python_3.10.12-torch_2.9.0-pcp_musa4.3.5-gpu_mthreads001-arc_amd64-driver_3.3.6-server:202606121704 \ + sleep infinity +docker exec -it flagos bash ``` ### Start the Server ```bash @@ -69,7 +67,7 @@ SGLANG_FL_DISPATCH_LOG=/tmp/flaggems_dispatch.log nohup python -m sglang.launch_ --model-path 
/data/MiniMax-M3 \ --tp-size 8 --pp-size 2 \ --nnodes 2 --node-rank 0 \ ---dist-init-addr 10.1.15.176:29500 \ +--dist-init-addr :29500 \ --host 0.0.0.0 --port 30000 \ --page-size 1 --disable-cuda-graph --disable-piecewise-cuda-graph \ --trust-remote-code --watchdog-timeout 3600 --mem-fraction-static 0.75 --max-running-requests 1 \ @@ -80,7 +78,7 @@ SGLANG_FL_DISPATCH_LOG=/tmp/flaggems_dispatch.log nohup python -m sglang.launch_ --model-path /data/MiniMax-M3 \ --tp-size 8 --pp-size 2 \ --nnodes 2 --node-rank 1 \ ---dist-init-addr 10.1.15.176:29500 \ +--dist-init-addr :29500 \ --host 0.0.0.0 --port 30000 \ --page-size 1 --disable-cuda-graph --disable-piecewise-cuda-graph \ --trust-remote-code --watchdog-timeout 3600 --mem-fraction-static 0.75 --max-running-requests 1 \ diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-nvidia-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-nvidia-FlagOS.md index 3762ddda..28afacd5 100644 --- a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-nvidia-FlagOS.md +++ b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-nvidia-FlagOS.md @@ -1,10 +1,8 @@ --- -base_model: -- "" +license: apache-2.0 language: - zh - en -license: apache-2.0 --- # Introduction @@ -20,9 +18,9 @@ MiniMax M3, released on June 1st, is the first Chinese model to simultaneously d # Evaluation Results ## Benchmark Result | Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Nvidia-FlagOS | -|--------------|-------------------------------|-------------------------------------| -| GPQA_Diamond | 0.8636 | 0.8283 | - +|--------------|-------------------------------|--------------------------| +| GPQA_Diamond | 86.36 | 84.77 | +| ERQA | 52.25 | 52.75 | # User Guide Environment Setup diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-zhenwu-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-zhenwu-FlagOS.md new file mode 100644 index 00000000..33ae8af6 --- /dev/null +++ 
b/docs/flagrelease_en/model_readmes/FlagRelease_MiniMax-M3-zhenwu-FlagOS.md @@ -0,0 +1,105 @@ +--- +base_model: +- "" +language: +- zh +- en +license: apache-2.0 +--- + +# Introduction +MiniMax M3, released on June 1st, is the first Chinese model to simultaneously deliver frontier coding/agentic capabilities, 1M ultra-long context, and native multimodality, and the only open-source model in the world with all three. The core innovation is a proprietary MSA sparse attention architecture: at 1M context, compute per token is just 1/20th of the previous generation, with 9× prefilling speedup and 15× decoding speedup. On SWE-Bench Pro, M3 scores 59.0%, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7; on the multimodal benchmark OmniDocBench, it also outperforms Gemini 3.1 Pro. In real-world tests, M3 autonomously ran for nearly 12 hours to successfully reproduce an ICLR award-winning paper, and within ~24 hours pushed FP8 GEMM kernel utilization from 7.6% to 71.3%, a 9.4× speedup. + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Zhenwu** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+ + +# Evaluation Results +## Benchmark Result +| Metrics | MiniMax-M3-Nvidia-Origin | MiniMax-M3-Zhenwu-FlagOS | +|--------------|--------------------------|--------------------------| +| GPQA_Diamond | 86.36 | 73.08 | + +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 28.1.0, build 4d8c241 | +| Operating System | Ubuntu 24.04.2 LTS | + +## Operation Steps +The image for this task is exported from Alibaba Cloud PAI and can be used on Alibaba Cloud EAS and DSW, both of which are container‑based resource services. For detailed instructions on how to use this image, please contact the PAI platform support team. The task released by BAAI is developed based on the container environment launched via the PAI platform. + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/flagrelease-public/flagrelease-minimax-m3-zhenwu-tree_none-gems_5.0.1rc0-vllm_0.13.1.dev0_g72506c983.d20260218-plugin_0.0.0-cx_none-python_3.12.3-torch_2.9.0-pcp_hggc13.0-gpu_pp001-arc_amd64-driver_1.3.2-d7f5a2:202606052018 +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/MiniMax-M3-zhenwu-FlagOS --local_dir /data/MiniMax-M3 +``` + +## Service Invocation +### Invocation Script +```bash +cd /root +USE_FLAGGEMS=1 python inference_bf16.py --num_gpus 16 --prompt "你好" + +``` + + +### AnythingLLM Integration Guide + +#### 1. Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. 
Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application. +## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels and aims to accelerate LLM (large language model) training and inference across diverse hardware platforms. +## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support.
For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend; it helps FlagScale support multiple chips on the vLLM framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation. + - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4.
Expand hardware adaptation support +# License +The model weights are derived from MiniMaxAI/MiniMax-M3 and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt + diff --git a/docs/flagrelease_en/model_readmes/FlagRelease_deepseek-r1-1.5b-nvidia-FlagOS.md b/docs/flagrelease_en/model_readmes/FlagRelease_deepseek-r1-1.5b-nvidia-FlagOS.md new file mode 100644 index 00000000..371a286a --- /dev/null +++ b/docs/flagrelease_en/model_readmes/FlagRelease_deepseek-r1-1.5b-nvidia-FlagOS.md @@ -0,0 +1,108 @@ +# Introduction +DeepSeek-R1-Distill-Qwen-1.5B is a lightweight, reasoning-focused model from the DeepSeek team. Its core idea is to distill the reasoning ability of the much larger DeepSeek-R1 model into the Qwen2.5-Math-1.5B base model, giving a 1.5B-parameter model that can be deployed locally and is particularly strong at math and code reasoning. + +### Integrated Deployment +- Out-of-the-box inference scripts with pre-configured hardware and software parameters +- Released **FlagOS-Nvidia** container image supporting deployment within minutes +### Consistency Validation +- Rigorously evaluated through benchmark testing: Performance and results from the FlagOS software stack are compared against native stacks on multiple public benchmarks.
+ + +# Evaluation Results +## Benchmark Result +| Metrics | DeepSeek-R1-Distill-Qwen-1.5B-Nvidia-Origin | DeepSeek-R1-Distill-Qwen-1.5B-Nvidia-FlagOS | +|--------------|--------------------------------|--------------------------------------| +| GPQA_Diamond | - | - | +| ERQA | - | - | +| Aime24 | - | - | + +# User Guide +Environment Setup + +| Item | Version | +|------------------|----------------------| +| Docker Version | Docker version 24.0.0, build 98fdcd7 | +| Operating System | Ubuntu 22.04.4 LTS (Jammy Jellyfish) | + +## Operation Steps + +### Download FlagOS Image +```bash +docker pull harbor.baai.ac.cn/external-cooperation/deepseek-t1-distill-qwen-1.5b-nvidia-tree_0.5.0-gems_0.5.1rc0_vllm_0.13.0-plugin_v0.1.0_vllm0.13.0-cx_none-python_3.12.3-torch_2.9.0.dev20250804_cu128-pcp_cuda12.9-gpu_nvidia003-arc_amd64-driver_570.124.06:2606171415 +``` + +### Download Open-source Model Weights +```bash +pip install modelscope +modelscope download --model FlagRelease/DeepSeek-R1-Distill-Qwen-1.5B --local_dir /data/DeepSeek-R1-Distill-Qwen-1.5B +``` + +### Start the Container +```bash +docker run -it --name ds_check --gpus all -v /mnt/data:/data --network host harbor.baai.ac.cn/external-cooperation/deepseek-t1-distill-qwen-1.5b-nvidia-tree_0.5.0-gems_0.5.1rc0_vllm_0.13.0-plugin_v0.1.0_vllm0.13.0-cx_none-python_3.12.3-torch_2.9.0.dev20250804_cu128-pcp_cuda12.9-gpu_nvidia003-arc_amd64-driver_570.124.06:2606171415 bash +``` +### Start the Server +```bash +CUDA_VISIBLE_DEVICES=6 VLLM_PLUGINS=fl USE_FLAGGEMS=1 VLLM_FL_ALLOW_VENDORS=cuda VLLM_FL_FLAGOS_WHITELIST=embedding,rms_norm,addmm,rotary_embedding,silu_and_mul,gather,cos vllm serve /data/vllm-plugin-fl/deepseek-r1-1.5b --served-model-name deepseek-r1-1.5b --port 46840 --enforce-eager --max-model-len 8192 +``` + +## Service Invocation +### Invocation Script +```bash +curl http://localhost:46840/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-r1-1.5b", + "messages": [{"role": "user", "content":
"你好"}] + }' +``` + + +### AnythingLLM Integration Guide + +#### 1. Download & Install + +- Visit the official site: https://anythingllm.com/ +- Choose the appropriate version for your OS (Windows/macOS/Linux) +- Follow the installation wizard to complete the setup + +#### 2. Configuration + +- Launch AnythingLLM +- Open settings (bottom left, fourth tab) +- Configure core LLM parameters +- Click "Save Settings" to apply changes + +#### 3. Model Interaction + +- After model loading is complete: +- Click **"New Conversation"** +- Enter your question (e.g., “Explain the basics of quantum computing”) +- Click the send button to get a response +# Technical Overview +**FlagOS** is a fully open-source system software stack designed to unify the "model–system–chip" layers and foster an open, collaborative ecosystem. It enables a “develop once, run anywhere” workflow across diverse AI accelerators, unlocking hardware performance, eliminating fragmentation among vendor-specific software stacks, and substantially lowering the cost of porting and maintaining AI workloads. With core technologies such as the **FlagScale** distributed training/inference framework (together with vllm-plugin-fl), the **FlagGems** universal operator library, the **FlagCX** communication library, and the **FlagTree** unified compiler, the **FlagRelease** platform leverages the **FlagOS** stack to automatically produce and release various combinations of \. This enables efficient and automated model migration across diverse chips, opening a new chapter for large model deployment and application. +## FlagGems +FlagGems is a high-performance, generic operator library implemented in the [Triton](https://github.com/openai/triton) language. It is built on a collection of backend-neutral kernels and aims to accelerate LLM (large language model) training and inference across diverse hardware platforms.
+## FlagTree +FlagTree is an open-source, unified compiler project for multiple AI chips, dedicated to developing a diverse ecosystem of AI chip compilers and related tooling platforms, thereby fostering and strengthening the upstream and downstream Triton ecosystem. Currently in its initial phase, the project aims to maintain compatibility with existing adaptation solutions while unifying the codebase to rapidly implement single-repository multi-backend support. For upstream model users, it provides unified compilation capabilities across multiple backends; for downstream chip manufacturers, it offers examples of Triton ecosystem integration. +## FlagScale and vllm-plugin-fl +FlagScale is a comprehensive toolkit designed to support the entire lifecycle of large models. It builds on the strengths of several prominent open-source projects, including [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [vLLM](https://github.com/vllm-project/vllm), to provide a robust, end-to-end solution for managing and scaling large models. +vllm-plugin-fl is a vLLM plugin built on the FlagOS unified multi-chip backend; it helps FlagScale support multiple chips on the vLLM framework. +## **FlagCX** +FlagCX is a scalable and adaptive cross-chip communication library. It serves as a platform where developers, researchers, and AI engineers can collaborate on various projects, contribute to the development of cutting-edge AI solutions, and share their work with the global community. + +## **FlagEval Evaluation Framework** + FlagEval is a comprehensive evaluation system and open platform for large models launched in 2023. It aims to establish scientific, fair, and open benchmarks, methodologies, and tools to help researchers assess model and training algorithm performance. It features: + - **Multi-dimensional Evaluation**: Supports 800+ model evaluations across NLP, CV, Audio, and Multimodal fields, covering 20+ downstream tasks including language understanding and image-text generation.
+ - **Industry-Grade Use Cases**: Has completed horizontal evaluations of mainstream large models, providing authoritative benchmarks for chip-model performance validation. + +# Contributing + +We warmly welcome global developers to join us: + +1. Submit Issues to report problems +2. Create Pull Requests to contribute code +3. Improve technical documentation +4. Expand hardware adaptation support +# License +The model weights are derived from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and are open-sourced under the Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0.txt