LongCat-Flash: 560B AI From a Delivery App?!

by Jainil Prajapati
September 3, 2025

HOLD UP. A 560B Model from… a Food Delivery App?! 🤯

All right team, gather ’round, because the AI world just got WEIRD. A new Chinese model, LongCat-Flash-Chat, is here. But before you dismiss this as just another model drop, you need to hear the backstory. This is the most critical part.

The model’s name is a bit of a mouthful, I’ll give you that. LongCat-Flash-Chat… try saying that three times fast! But the real shocker isn’t the name; it’s who built it. This absolute beast of a model comes from a company called

Meituan.

Now, if you’re thinking “Mei-who?”, you’re not alone. This isn’t your usual suspect like Alibaba or Baidu. Meituan is a Chinese delivery giant: they’re into food delivery, grocery tech, and all that jazz. Imagine if Uber or DoorDash suddenly dropped a foundation model that started competing with the big dogs. That’s the level of plot twist we’re talking about here. To add a little more spice to the story, Meituan is a direct business arch-rival to Alibaba, the parent company of Qwen. This isn’t just a tech release; it’s a corporate showdown spilling into the AI arena.

This whole situation points to a massive shift in the AI landscape. It used to be that only a handful of specialized AI labs had the resources to build state-of-the-art models. But what we’re seeing now is the rise of “verticalized AI.” Companies with massive real-world datasets (like Meituan’s logistics and user-interaction data) and deep pockets now have both the reason and the resources to build their own foundation models. LongCat-Flash feels like the first big signal of this new era.

Now for the central mystery, what I’m calling the “Efficiency Paradox.” The headlines will tell you this is a 560 BILLION parameter model. HUGE. But here’s the kicker that changes everything: on average, it only activates about 27 BILLION parameters for any given token. It dynamically flexes its muscles, using anywhere from 18.6B to 31.3B parameters depending on how tough the task is. So, how is this even possible? Let’s get into the secret sauce.

The Secret Sauce: How LongCat is SO FAST and SO SMART 🧠⚡

This model’s insane efficiency isn’t just a happy accident; it’s the result of some seriously clever architectural design. While everyone else has been caught up in the “bigger is better” scaling race, the LongCat team focused on making the architecture smarter. This proves that efficiency itself is a new kind of scaling law. They baked two key innovations right into the model’s DNA.

Zero-Computation Experts – Not All Tokens Are Created Equal!

Think of it like this: why use a supercomputer to calculate 2+2? It’s a waste of power. LongCat-Flash gets this. The technical report puts it perfectly: “As not all tokens are equal, we introduce the zero-computation experts mechanism… to allocate a dynamic computation budget to important tokens based on their significance”.

This is the core of its Mixture-of-Experts (MoE) architecture. Alongside its standard, hard-working FFN (Feed-Forward Network) experts, it has a pool of special “zero-computation experts”. These experts are lazy in the best way possible: they just pass the token’s data straight through without doing any heavy lifting, costing ZERO extra compute.

For simple tokens (like “the,” “a,” or punctuation), the model can just route them to these free experts. For more complex, meaningful tokens, it activates the powerful FFN experts. This dynamic system allows it to flex its computational power precisely where it’s needed, activating between 18.6B and 31.3B parameters per token. And to make sure it doesn’t get carried away, a clever PID controller keeps the average workload at a lean ~27B parameters, ensuring consistent, efficient performance.
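To make that concrete, here’s a toy sketch of the idea. This is my own illustration in plain PyTorch, not Meituan’s code, and the controller is a stripped-down, proportional-only stand-in for the PID controller described in the report. The point is just the shape of the mechanism: a router that can pick either real FFN experts or identity “zero-computation” experts, with a bias that gets nudged toward a target compute budget.

```python
# Toy sketch (NOT Meituan's implementation): zero-computation experts + a
# proportional controller that keeps the average "real" compute near a target.
import torch
import torch.nn as nn

class ToyZeroComputeMoE(nn.Module):
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.n_ffn = n_ffn_experts
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.ffn_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )
        self.zero_bias = 0.0            # logit bonus for the "lazy" experts
        self.target_ffn_fraction = 0.5  # hypothetical compute-budget knob

    def forward(self, x):  # x: [num_tokens, d_model]
        logits = self.router(x)
        bias = torch.zeros(logits.size(-1))
        bias[self.n_ffn:] = self.zero_bias
        logits = logits + bias
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(top_vals, dim=-1)

        out = torch.zeros_like(x)
        ffn_hits = 0
        for t in range(x.size(0)):
            for slot in range(self.top_k):
                e = int(top_idx[t, slot])
                if e < self.n_ffn:   # real FFN expert: does the heavy lifting
                    out[t] += weights[t, slot] * self.ffn_experts[e](x[t])
                    ffn_hits += 1
                else:                # zero-computation expert: identity pass-through
                    out[t] += weights[t, slot] * x[t]

        # Proportional controller (a PID minus the I and D): if too many tokens
        # hit real experts, make the zero-computation experts more attractive.
        ffn_fraction = ffn_hits / (x.size(0) * self.top_k)
        self.zero_bias += 0.1 * (ffn_fraction - self.target_ffn_fraction)
        return out

moe = ToyZeroComputeMoE()
with torch.no_grad():
    print(moe(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```

The real model does this at 560B scale with fused kernels, but the budget-keeping feedback loop is the same basic idea.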

Shortcut-Connected MoE – NO MORE WAITING AROUND!

The classic problem with MoE models is the traffic jam. Before the model can do any math, it has to figure out which expert gets which token, and that involves a lot of communication between GPUs. This communication overhead is a massive bottleneck, leaving expensive hardware just sitting around waiting.

LongCat-Flash’s solution is GENIUS. It’s called Shortcut-connected MoE (ScMoE). It introduces a “cross-layer shortcut” that completely reorders the execution pipeline. In simple terms, it lets the model do the math for the last step while it’s simultaneously figuring out the communication for the next step. It’s doing two things at once! This creates a huge “computation-communication overlap” window, which basically eliminates the traffic jam.
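Here’s a rough way to picture that overlap on a single GPU. This is definitely not Meituan’s ScMoE kernel, just a minimal PyTorch sketch where an async memory copy stands in for the expert-dispatch communication and a big matmul stands in for the dense-path math that runs underneath it.

```python
# Minimal overlap demo (assumption-laden stand-in, not ScMoE itself):
# run a "communication" step and a "computation" step on separate CUDA streams.
import torch

if torch.cuda.is_available():
    compute_stream = torch.cuda.Stream()
    comm_stream = torch.cuda.Stream()

    x = torch.randn(4096, 4096, device="cuda")
    w = torch.randn(4096, 4096, device="cuda")
    # Pinned host memory stands in for the tokens an all-to-all would shuffle between GPUs.
    tokens_to_dispatch = torch.randn(4096, 4096, pin_memory=True)

    torch.cuda.synchronize()
    with torch.cuda.stream(comm_stream):
        # "Communication": an async copy playing the role of expert dispatch
        dispatched = tokens_to_dispatch.to("cuda", non_blocking=True)
    with torch.cuda.stream(compute_stream):
        # "Computation": dense-path math for the previous block runs in parallel
        y = x @ w
    torch.cuda.synchronize()  # wait for both; most of the copy time is hidden behind the matmul
    print(y.shape, dispatched.shape)
else:
    print("This overlap demo needs a CUDA GPU.")
```

In the real ScMoE layer the “copy” is an all-to-all across GPUs and the “matmul” is the previous block’s dense FFN, but the trick is the same: keep the compute units busy while the data is in flight.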

This architectural masterstroke is the direct reason for the model’s mind-boggling stats. It’s how they managed to train this 560B monster on 20 TRILLION tokens in just 30 days. It’s also why the inference speed is a blistering 100 tokens per second (TPS) on H800 GPUs. This model represents a potential paradigm shift. The era of just throwing more parameters at the problem might be ending, and the era of “smarter, not just bigger” architectural design is just beginning.


The official architecture of LongCat-Flash. That ‘Shortcut’ is the key to its insane speed! (Source: Meituan LongCat Team)

OK, But Is It Any GOOD? Let’s Talk BENCHMARKS! 📊🏆

Okay, it’s fast, it’s efficient… but can it actually compete with the big dogs? Does all that clever engineering translate to real performance? Let’s see the receipts.

The team put LongCat-Flash head-to-head with other top models like DeepSeek V3.1, Kimi-K2, Claude 4 Sonnet, and even GPT-4.1 (in non-thinking mode for a fair fight). And the results are seriously impressive.

First off, it’s a solid all-rounder. It scores a very strong 86.5 on ArenaHard-V2, ranking second among the models tested and showing it can hold its own in tough, head-to-head comparisons. It also puts up a great score of 89.71 on MMLU, proving its general knowledge is top-notch.

But here’s where it gets scary good. The model was specifically designed for agentic capabilities, and the benchmarks prove it.

  • CODING: This thing is a BEAST on the command line. On TerminalBench, which tests how well a model can handle shell commands, it scores a massive 39.51. That’s right up there with Claude Sonnet (40.74) and leaves models like DeepSeek V3.1 (31.30) in the dust.
  • AGENTIC TOOL USE: This is its home turf. In tasks that require complex reasoning and tool use, it’s in a league of its own. On the τ²-Bench benchmark, it consistently outperforms almost everyone. And on VitaBench, Meituan’s own super-tough, real-world agent benchmark, LongCat-Flash takes the #1 spot with a score of 24.30.

It’s clear they didn’t position this to take on the absolute flagship reasoning models, but in the “Flash” or “Sonnet” tier it’s not just competing, it’s dominating in the areas that matter most for building next-gen applications.


LongCat-Flash vs. The World: Benchmark Showdown

| Benchmark | LongCat-Flash | DeepSeek V3.1 | Kimi-K2 | Claude 4 Sonnet | GPT-4.1 |
|---|---|---|---|---|---|
| ArenaHard-V2 (acc) | 86.50 | 84.10 | 85.70 🏆 | 62.10 | 61.50 |
| TerminalBench (acc) | 39.51 | 31.30* | 28.40 | 40.74 🏆 | 25.93 |
| τ²-Bench (telecom) | 73.68 | 38.50 | 35.20 | 16.50 | 46.20 |
| VitaBench (avg@4) | 24.30 🏆 | 20.30 | 19.00 | 8.00 | 23.00 |

Note: Scores are from Table 3 of the LongCat-Flash Technical Report. 🏆 denotes the highest score in the row.


A quick look at the leaderboards. LongCat-Flash is holding its own, and then some! (Source: Meituan LongCat Team)

Unleashing Its “Agentic Superpowers” 🦸‍♂️

So how did a food delivery company build an AI that’s a master of complex agentic tasks? They didn’t just train it on a pile of text from the internet. They put it through a custom-built, multi-stage agentic bootcamp.

The process started with general pre-training to build a solid knowledge base. Then came a special “mid-training” phase, where they specifically enhanced its reasoning and coding skills while extending the context window to a massive 128k. This was all to prepare it for the final stage: agentic post-training.

This is where the real magic happens. Recognizing that there just isn’t enough high-quality, difficult training data for agent tasks out there, the LongCat team decided to make their own. They built a “multi-agent data synthesis framework”, which is a fancy way of saying they built an army of AI agents whose only job was to create super-hard, complex problems for LongCat to solve. An AI to train an AI!

This framework included specialized agents for every part of the problem-generation process (there’s a toy sketch of the idea right after this list):

  • UserProfileAgent: Created realistic user personas with different communication styles and goals.
  • ToolSetAgent: This one is wild: it generated 80,000 mock tools across 40 different domains to create a massive, complex tool graph for LongCat to navigate.
  • InstructionAgent: Generated the actual tasks, carefully controlling the difficulty across three axes: information processing complexity, tool set complexity, and user interaction complexity.
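To make that less abstract, here’s a hypothetical, heavily simplified sketch of what such a synthesis loop could look like. Every function name and field below is invented for illustration; only the three difficulty axes come from the report’s description.

```python
# Hypothetical sketch of a multi-agent task-synthesis pipeline (illustrative only).
import random

DOMAINS = ["food-delivery", "travel", "retail", "logistics"]

def make_persona():
    # UserProfileAgent stand-in: a persona with a communication style and a goal
    return {"style": random.choice(["terse", "chatty", "confused"]),
            "goal": random.choice(["refund", "reschedule", "compare options"])}

def make_toolset(n_tools=5):
    # ToolSetAgent stand-in: mock tools the agent must learn to chain together
    domain = random.choice(DOMAINS)
    return [{"name": f"{domain}_tool_{i}", "domain": domain} for i in range(n_tools)]

def make_instruction(persona, tools, difficulty):
    # InstructionAgent stand-in: a task whose difficulty is controlled along the
    # three axes named above
    return {
        "persona": persona,
        "tools": [t["name"] for t in tools],
        "info_complexity": difficulty,         # information processing complexity
        "tool_complexity": len(tools),         # tool set complexity
        "interaction_complexity": difficulty,  # user interaction complexity
        "prompt": (f"As a {persona['style']} user, achieve: {persona['goal']} "
                   f"using up to {len(tools)} tools."),
    }

tasks = [make_instruction(make_persona(), make_toolset(), d) for d in range(1, 4)]
for t in tasks:
    print(t["prompt"])
```

The real framework presumably uses LLM-powered agents rather than random.choice, but the structure (persona × tool graph × difficulty-controlled instruction) is the part that matters.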

This reveals a profound shift in how top-tier AI is developed. The bottleneck is no longer just compute power; it’s the quality and sophistication of the training data. The LongCat team couldn’t find the data they needed, so they built a factory to create it. This kind of “data engineering”, building a synthetic curriculum from scratch, is a powerful competitive advantage. They’ve essentially built a “Curriculum as a Moat.”

ENOUGH TALK. LET’S RUN THIS THING! (Copy-Paste Ready) 💻

Alright, the hype is real, the benchmarks are solid. Time to get your hands dirty. The best part? The team has made it incredibly easy to get started.

First, the all-important links:

  • Hugging Face: Get the model weights here: meituan-longcat/LongCat-Flash-Chat
  • GitHub: Check out the official code and report: github.com/meituan-longcat
  • Live Chat: Want to try it without any setup? Go here: longcat.chat

And the biggest news of all… the license. It’s released under the MIT License. Yes, you read that right. FULLY open, fully permissive for commercial use. GO WILD!

Here’s a simple, copy-paste-ready Python script to run it yourself using the transformers library.

```python
# LET'S GOOOO! Time to run LongCat Flash Chat 🚀
# Make sure you have transformers, accelerate, and torch installed!
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID from Hugging Face - easy peasy
model_id = "meituan-longcat/LongCat-Flash-Chat"

print("Loading tokenizer... 🗣️")
# trust_remote_code=True is needed for custom architectures like this one
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

print("Loading the BEAST... this might take a sec. 🏋️‍♂️")
# Using device_map="auto" to automagically use your GPU(s) if you have them!
# torch_dtype="auto" will use bfloat16 for even more speed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",  # or torch.bfloat16
    trust_remote_code=True,
)

print("Model loaded! Let's chat. 👇")

# Standard chat-format messages; the chat template handles the official formatting.
# Swap in whatever prompt you like.
messages = [
    {"role": "user", "content": "Explain zero-computation experts in one paragraph."}
]

# Apply the template and get it ready for the model
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# GENERATE! 🔥
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)

# Decode only the newly generated tokens (skip the prompt we fed in)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print("\n--- LongCat's Response ---")
print(response)
print("--------------------------")
```
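One caveat before you hit run: even though only ~27B parameters are active per token, the full 560B set of weights still has to be loaded somewhere, so in practice this needs a serious multi-GPU node (or a hosted endpoint) rather than a single consumer card. If you just want to kick the tires, the live demo at longcat.chat is the zero-setup option.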

LongCat-Flash-Chat (Meituan) FAQ


1. Is LongCat-Flash-Chat actually from Meituan, the food delivery app?

Yes, absolutely! This is the real deal. Meituan, a Chinese tech giant best known for food delivery and local services, is the force behind LongCat-Flash-Chat. It’s surprising, but it highlights how companies with massive real-world data are entering the AI race.


2. Is LongCat-Flash-Chat open source? Can I use it commercially?

Yes, and yes! Meituan released LongCat-Flash-Chat under the MIT License, meaning you can use, modify, and even sell products built with it without heavy restrictions. The model weights and code are available on Hugging Face and GitHub.


3. How can a 560B parameter model be so fast? Isn’t that huge?

The secret is in its dynamic Mixture-of-Experts (MoE) design. While it has 560B total parameters, it only activates 18.6B to 31.3B per token, averaging ~27B. That’s why it achieves 100+ tokens per second while behaving like a much smaller, more efficient model.


4. What does “non-thinking” mode mean? Is it worse than GPT-4?

“Non-thinking” just means faster inference: it skips lengthy internal “chain-of-thought” reasoning before responding. While it may not beat the flagship reasoning models on deep reasoning tasks, it’s highly competitive within its speed tier and dominates coding and agentic benchmarks.


5. Is LongCat-Flash-Chat good at coding and tool use?

Yes, especially for agentic tasks. It scored 39.5 on TerminalBench (close to Claude Sonnet) and outperformed mainstream models like DeepSeek and Kimi on multi-step agent benchmarks such as VitaBench and τ²-Bench. It’s built for real-world agent workflows.


6. Where can I try or download LongCat-Flash-Chat?

  • Live Demo: longcat.chat
  • Model Weights: Hugging Face
  • Code & Technical Report: GitHub

7. What are zero-computation experts, and why do they matter?

These are “lazy experts” in the MoE setup. Instead of wasting compute on simple tokens like “the” or punctuation, LongCat routes them through free-pass experts, saving compute for the harder tokens. This is a key reason behind its efficiency and speed.


8. How does LongCat-Flash handle MoE communication bottlenecks?

It uses a clever innovation called Shortcut-connected MoE (ScMoE). It overlaps computation and communication, so the model crunches the math for one step while the data for the next step is still moving between GPUs. The result: the communication bottleneck largely disappears, and throughput stays at 100+ TPS even at this massive scale.


9. How was LongCat-Flash trained so fast?

Meituan trained it on 20 trillion tokens in just 30 days using a multi-stage curriculum:

  • Pre-training for general knowledge
  • Mid-training to boost reasoning, coding, and extend context to 128K
  • Agentic post-training using a multi-agent synthetic data framework that generated super-hard tasks for the model to master.

10. How does LongCat-Flash compare to GPT-4, Claude, and DeepSeek?

It’s not built to dethrone the flagship reasoning models, but here’s how it stacks up in benchmarks:

| Benchmark | LongCat-Flash | Claude Sonnet | DeepSeek V3.1 | GPT-4.1 |
|---|---|---|---|---|
| ArenaHard-V2 | 86.5 | 62.1 | 84.1 | 61.5 |
| MMLU | 89.7 | 80.2 | 82.3 | 83.4 |
| TerminalBench | 39.5 | 40.7 🏆 | 31.3 | 25.9 |
| VitaBench | 24.3 🏆 | 8.0 | 20.3 | 23.0 |
| τ²-Bench | 73.7 🏆 | 16.5 | 38.5 | 46.2 |

Bottom line: it dominates agentic tasks and is extremely competitive in coding and reasoning.


11. Why is this a big deal for the AI industry?

LongCat-Flash signals a paradigm shift: you don’t need to be an AI lab to build state-of-the-art models. Companies with huge real-world datasets like Meituan can leverage domain-specific data + compute to compete with OpenAI, Anthropic, and Google. It’s the rise of verticalized AI.


12. Is it really free to use for business?

Absolutely. The MIT license means you can build, deploy, and monetize products powered by LongCat-Flash without legal headaches. This makes it one of the most business-friendly open LLMs on the market today.


The Final Verdict: Is LongCat Flash Chat the REAL DEAL? 🤔

So, after all that, what’s the final word? Is this just a flash in the pan or a genuine game-changer?

I think it’s the real deal. Here’s the breakdown:

  • Surprise Origin: It comes from Meituan, a food delivery company, proving that the AI game is officially open to anyone with the data, resources, and brains to compete.
  • Insane Efficiency: It gives you the prestige of a 560B model with the performance cost of a ~27B model. This is the best of both worlds, made possible by the Zero-Computation Experts and ScMoE architecture.
  • Agentic KING: It was purpose-built for agentic tasks, and it absolutely dominates the benchmarks that matter for building the next generation of AI applications.
  • FULLY Open: It’s released under the MIT License, ready for the entire community to build on, experiment with, and push forward.

LongCat-Flash isn’t just another model on a leaderboard. It’s a statement. It proves that clever architecture can beat brute force, that the next big AI breakthrough can come from anywhere, and that the open-source community just got a powerful new weapon.

Now, the only question left is: what are you going to build with it?

Let me know what you think about this model in the comments! See you in the next one. Happy prompting!

Tags: 560B model, agentic AI, agentic AI model, Chinese AI, Chinese Technology, Foundation Models, LongCat Flash Chat, LongCat-Flash, Meituan, Meituan AI, Mixture of Experts, MoE Model, Open Language Models, Open-source, Open-Source AI, open-weight, Open-Source LLM, open-weight LLM