The Tools Are Made Up

· July 30, 2025

(Updated: Apr 22, 2026)

It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder1.

Each of them have the mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verified Rewards2.

The Kimi-K2 technical reportgoes into more details about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model so doesn’t use much in the way of extended thinking. The fact that wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me — most of the recent3 agentic models have been built on a reasoning foundation.

What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pretraining: to extend high quality data sources they rewrite it with another LLM, giving the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

Their approach to tool training takes this kind of idea ever further:

We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

Once they have this tool zoo, the actual training loop involves:

  1. Randomly sample a subset of tools and give it to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
  2. Run an LLM-driven user simulator to drive the agent, while running the tools in sandbox that keeps state.
  3. Filter trajectories using another LLM as judge to keep only successful ones for SFT

They’re using models at every stage to generate data and evaluate options. When it comes to the actual RL training, they are baselining in verifiable rewards wherever possible for the RL stages: They, and the Qwen folks, talk about their simulator set up for code4: thousands of sandbox environments.

For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks

The combination of very sophisticated synthetic data and operationally intense sandboxes seem like table stakes for the current agentic game, and one which a lot of labs have figured out. Feels very promising for a growth in capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.

  1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! 

  2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. 

  3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! 

  4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing.