Agents using debugging tools drastically outperformed those that didn't, but their success rate still wasn't high enough.
Credit: Microsoft Research
This approach is much more successful than relying on models the way they're usually used, but even a best-case success rate of 48.4 percent isn't ready for prime time. The limitations are likely because the models don't fully understand how to make the best use of the tools, and because their current training data isn't tailored to this use case.
"We believe this is due to a lack of data representing sequential decision-making behavior (e.g., debug traces) in the current LLM training corpus," the blog post states. "However, the significant improvement in performance validates that this is a promising research direction."
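To make the setup concrete, here is a minimal sketch of the kind of loop involved; this is an illustration of the general idea, not the debug-gym API itself. The model is allowed to request debugger commands (here via Python's pdb) and sees their output before proposing a fix, rather than patching the code from a static prompt. The `query_model` function is a hypothetical stand-in for whatever LLM backend is used.

```python
# Sketch of a tool-using debugging agent loop (illustrative, not debug-gym's API).
import subprocess


def query_model(transcript: str) -> str:
    """Hypothetical LLM call; returns either a pdb command or a final patch."""
    raise NotImplementedError("plug in your model backend here")


def run_pdb_command(source_file: str, command: str) -> str:
    """Run a single pdb command against the buggy script and capture its output.
    (A real agent would keep one persistent debugger session instead of restarting.)"""
    proc = subprocess.run(
        ["python", "-m", "pdb", source_file],
        input=f"{command}\nquit\n",
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout


def debug_loop(source_file: str, failing_test_output: str, max_steps: int = 10) -> str:
    """Alternate between model actions and debugger observations until a patch is proposed."""
    transcript = f"Failing test output:\n{failing_test_output}\n"
    for _ in range(max_steps):
        action = query_model(transcript)
        if action.startswith("PATCH:"):  # model believes it has gathered enough information
            return action[len("PATCH:"):]
        # Otherwise treat the action as a debugger command, e.g. "b foo.py:42", "p x", "where"
        observation = run_pdb_command(source_file, action)
        transcript += f"\n> {action}\n{observation}\n"
    return ""  # gave up without producing a patch
```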
This initial report is just the beginning of the effort, the post claims. The next step is to "fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs." If the model is large, the best move to save on inference costs may be to "build a smaller info-seeking model that can provide relevant information to the larger one."
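A rough sketch of that two-model split might look like the following: a cheaper model decides which information-gathering actions to run, and only a condensed summary of the findings is handed to the larger, more expensive model that writes the fix. All of the callables here are hypothetical placeholders under that assumption, not a published API.

```python
# Sketch of a small info-seeker feeding a larger bug-fixing model (illustrative only).
from typing import Callable, List


def resolve_bug(
    small_model: Callable[[str], str],  # cheap model: proposes info-gathering actions
    large_model: Callable[[str], str],  # expensive model: writes the patch
    run_action: Callable[[str], str],   # executes an action (debugger command, test run, file read)
    bug_report: str,
    budget: int = 8,
) -> str:
    findings: List[str] = []
    for _ in range(budget):
        action = small_model(bug_report + "\n" + "\n".join(findings))
        if action == "DONE":  # the small model thinks it has gathered enough evidence
            break
        findings.append(f"{action} -> {run_action(action)}")
    # Only one call to the large model, over a compact summary of the evidence.
    return large_model(f"Bug report:\n{bug_report}\n\nEvidence:\n" + "\n".join(findings))
```

The point of the split is that the many exploratory debugger calls are handled by the cheap model, so the expensive model is invoked only once with the distilled context.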
This is not the first time I've seen results suggesting that some of the more ambitious ideas about AI agents directly replacing developers are quite far from reality. Even though AI tools can sometimes let users create applications that seem acceptable for narrow tasks, the models tend to generate code with bugs and security vulnerabilities, and they generally aren't capable of fixing those issues.
While this is an early step on the path toward useful AI coding agents, most researchers agree that the best outcome is an agent that saves a human developer a considerable amount of time, not one that can do everything a developer can do.