Zhipu AI’s new API reaches 400 tokens per second in speed race

Zhipu AI is escalating the speed race in enterprise artificial intelligence, launching a new API for its GLM-5.1 model that reaches 400 tokens per second, a new high-water mark for commercial large language model APIs. The move challenges established players and highlights a growing market focus on inference performance as a key factor for enterprise adoption.

"The GLM-5.1 high-speed version is designed for scenarios with extremely high requirements for response latency, such as AI programming, real-time interaction, and business decision-making," the company announced in a statement.

The GLM-5.1-highspeed API is initially available to select enterprise customers on Zhipu's MaaS (model-as-a-service) platform. The 400 tokens/second output speed is aimed squarely at low-latency enterprise use cases, such as real-time voice applications and automated business logic, that have been difficult to serve with slower, more conversational models.

This move puts pressure on global competitors by establishing a new performance benchmark for API-based inference. As companies like Kore.ai and Cerebras also push the boundaries of speed and efficiency, the focus shifts from pure model capability to production-grade performance, affecting billions in enterprise IT spending on AI infrastructure.

A Crowded Field Fights for Milliseconds

Zhipu’s announcement does not happen in a vacuum. The entire AI industry is in a fierce battle to reduce latency. While Zhipu’s 400 tokens/second sets a record for a commercial API, other companies are posting even higher speeds with specialized configurations. Chip startup Cerebras recently announced its platform runs the trillion-parameter Kimi K2.6 model at 981 tokens per second, nearly seven times faster than GPU-based clouds. However, this relies on Cerebras's unique wafer-scale engine, a specialized hardware architecture not accessible via a general API.
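To put the two reported figures in perspective, the following minimal sketch converts a decode rate into the time needed to stream a response. The 500-token response length and the function name are hypothetical choices for illustration, and the calculation assumes a steady output rate while ignoring network overhead and time to first token, so real-world latency would be higher.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady decode rate.

    Simplifying assumption: ignores time to first token and network overhead.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Reported throughput figures from the article (tokens per second).
REPORTED_RATES = {
    "GLM-5.1-highspeed API": 400.0,
    "Kimi K2.6 on Cerebras": 981.0,
}

RESPONSE_TOKENS = 500  # hypothetical response length, chosen for illustration

for name, rate in REPORTED_RATES.items():
    seconds = generation_seconds(RESPONSE_TOKENS, rate)
    print(f"{name}: {seconds:.2f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, a 500-token answer takes about 1.25 seconds at 400 tokens per second and roughly half a second at 981 tokens per second.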

The competition extends beyond pure hardware performance. Enterprise AI platform provider Kore.ai recently launched its Artemis platform, designed to let enterprises build and govern AI agents. The launch underscores that while speed is critical, factors like governance, security, and vendor neutrality are equally important for adoption in regulated industries like finance and healthcare. This places Zhipu’s speed benchmark in a broader context, one in which the company also competes with the ecosystems of giants like Microsoft, Google, and Salesforce.

From Raw Power to Enterprise-Ready

The chase for faster token generation is driven by a clear business need. For AI to become integral to core business processes, it must operate in real time. Use cases like real-time voice transcription, interactive data analysis for financial traders, or dynamic e-commerce recommendations require near-instantaneous responses that many current models cannot provide. Zhipu is directly targeting this market segment, where a few hundred milliseconds of latency can make a product non-viable.

For investors, this trend signals a maturation of the AI market. While model size and benchmark scores have historically grabbed headlines, the ability to serve these models quickly and cost-effectively is where value is captured. Zhipu's offering could lower the barrier for enterprises to deploy more sophisticated AI, potentially capturing market share from slower incumbents. The success of platforms from Zhipu, Kore.ai, and others will depend on their ability to deliver not just a fast model, but a complete, reliable, and secure enterprise solution.

This article is for informational purposes only and does not constitute investment advice.