At the most basic level, large language models are probabilistic systems trained on massive text corpora. They learn statistical patterns among words and symbols, and then generate, token by token, the continuation that is most probable for a given context.

This is, broadly speaking, how systems such as ChatGPT, Claude, and Gemini work. However, this logic comes with a very high computational cost. During training, billions of parameters are adjusted, enormous datasets are processed, and this process runs for weeks, sometimes months, across thousands of advanced chips. For that reason, the cost at which you can train a powerful model largely determines your position in the race.
In this race, where those with more GPUs tend to pull ahead, Chinese companies that cannot access the chips they want, in the quantities they want, have developed engineering approaches showing that more efficiency can be extracted from the same hardware.
One of these approaches is the Mixture of Experts (MoE) architecture. In traditional models, the entire model is activated for every query. In MoE, by contrast, the model is divided into different expert subnetworks, and only the relevant parts are activated for each input. This allows the model to remain large while reducing the effective computational cost required for each operation.
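The routing idea behind MoE can be sketched in a few lines. The sketch below is purely illustrative, not any particular model's implementation: a small router scores the input, only the top-k experts run, and the rest are skipped entirely, which is where the compute savings come from. All sizes, weights, and the `moe_forward` name are invented for the example.

```python
import math
import random

random.seed(0)

def make_expert(dim):
    """A toy 'expert': a tiny linear layer with random weights."""
    w = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_weights, top_k=2):
    """Route the input to only the top_k highest-scoring experts.
    The remaining experts are never evaluated for this input."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        # Combine expert outputs, weighted by the router's confidence.
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
    return out, chosen

dim, n_experts = 4, 8
experts = [make_expert(dim) for _ in range(n_experts)]
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
y, used = moe_forward([1.0, -0.5, 0.3, 0.8], experts, router, top_k=2)
print(f"activated {len(used)} of {n_experts} experts: {used}")
```

With top_k=2 out of 8 experts, only a quarter of the expert parameters do work for this input, even though all 8 contribute to the model's total capacity.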
Another important tool is 8-bit Floating Point (FP8) training. Large language models perform numerical operations during training, and the precision used to represent those numbers directly affects cost. FP8 reduces memory use and data transfer load by carrying out some operations at lower precision, at the 8-bit level. Rather than simplifying the model itself, this approach aims to lower training costs by carefully selecting where precision can be reduced.
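The memory-saving logic can be illustrated with a simplified stand-in. Real FP8 uses 8-bit floating-point formats (such as E4M3), but the sketch below uses scaled 8-bit integers, which is easier to show in pure Python and conveys the same trade: each value fits in one byte plus a shared scale factor, at the cost of some rounding error. The function names and values are invented for the example.

```python
def quantize_8bit(values):
    """Store each value as one signed byte plus a shared scale factor.
    (A simplified stand-in for FP8: real FP8 is a floating-point format,
    but the memory trade-off is analogous.)"""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate full-precision values."""
    return [qi * scale for qi in q]

weights = [0.83, -1.27, 0.04, 2.5, -0.61]
q, scale = quantize_8bit(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"stored in {len(q)} bytes + 1 scale, max error {max_err:.4f}")
```

The point of the "carefully selecting where" clause in the text is visible here: the rounding error is bounded by the scale, so low precision is acceptable for operations that tolerate small errors, while sensitive steps stay at higher precision.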
A third innovation that has come to the fore is MLA, or Multi-head Latent Attention. In large models, a significant part of the cost comes from the need to keep context in memory. As the context required to generate a response grows longer, memory load and data movement increase. MLA improves efficiency by managing this burden through a more compressed representation. The goal is to store the necessary information in a more compact form and achieve the same task with a lighter memory structure.
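The memory arithmetic behind this compression can be made concrete. The sizes below are hypothetical, chosen only to illustrate the bookkeeping: a standard attention cache stores a key and a value vector per head, per token, while a latent-attention scheme stores one shared compressed vector per token and re-derives per-head keys and values from it.

```python
# Hypothetical sizes, chosen for illustration only.
n_heads, head_dim, latent_dim, seq_len = 32, 128, 512, 4096

# Standard multi-head attention caches a key vector and a value
# vector for every head, for every token in the context:
standard_cache = seq_len * n_heads * head_dim * 2

# A latent-attention scheme instead caches one shared compressed
# vector per token, reconstructing per-head keys/values from it
# with small projection matrices at attention time:
mla_cache = seq_len * latent_dim

print(f"standard KV cache entries: {standard_cache:,}")
print(f"latent cache entries:      {mla_cache:,}")
print(f"reduction: {standard_cache // mla_cache}x")
```

At these illustrative sizes the cache shrinks 16-fold, trading a little extra computation (the reconstruction projections) for far less memory and data movement, which is exactly the bargain the text describes.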
A fourth area of improvement is communication optimization. Large models do not run on a single chip; they are distributed across many GPUs and servers. The problem is then no longer just computational power: how quickly and efficiently data moves among these components also becomes critical. Methods that overlap computation with data communication keep the hardware busy rather than idle, improving both throughput and efficiency.
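The overlap idea can be sketched with ordinary threads standing in for GPUs and network links. The sketch below uses `time.sleep` as a placeholder for real work; the functions and timings are invented. The key structural point is real, though: in the overlapped schedule, each layer's gradient exchange starts as soon as that layer's gradients exist, while the next layer's computation proceeds in parallel.

```python
import threading
import time

def all_reduce(layer):
    """Stand-in for exchanging one layer's gradients across GPUs."""
    time.sleep(0.05)

def compute_backward(layer):
    """Stand-in for the backward pass of one layer."""
    time.sleep(0.05)

layers = list(range(4))

# Sequential schedule: communicate only after all computation is done.
t0 = time.perf_counter()
for layer in reversed(layers):
    compute_backward(layer)
for layer in layers:
    all_reduce(layer)
sequential = time.perf_counter() - t0

# Overlapped schedule: start each layer's gradient exchange the
# moment its gradients exist, while the next layer still computes.
t0 = time.perf_counter()
pending = []
for layer in reversed(layers):
    compute_backward(layer)
    t = threading.Thread(target=all_reduce, args=(layer,))
    t.start()
    pending.append(t)
for t in pending:
    t.join()
overlapped = time.perf_counter() - t0

print(f"sequential: {sequential:.2f}s  overlapped: {overlapped:.2f}s")
```

In the sequential schedule the total time is compute plus communication; in the overlapped one, most of the communication hides behind computation, and only the last exchange adds to the critical path.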
The ability to apply engineering solutions that reduce costs and increase efficiency under hardware constraints partially offsets China's disadvantage in the generative AI race. MoE reduces unnecessary computation, FP8 lowers numerical cost, MLA lightens memory load, and communication optimization makes data traffic in distributed systems more manageable. In this way, Chinese labs can extract stronger performance from the lower-tier processors available to them.