Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve in coding, the benchmarks used to evaluate their performance are steadily becoming less useful.

That’s because even as many LLMs achieve similarly high scores on these benchmarks, it can be difficult to tell which ones to use for specific software development projects and enterprise needs.

A new paper by researchers at Yale University and Tsinghua University presents a novel method to test the ability of models to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code to solve a problem.

Self-invoking code generation is much more similar to realistic programming scenarios and provides a better understanding of current LLMs’ ability to solve real-world coding problems.

Self-invoking code generation

Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks. 

However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code—they must also understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently leverage one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities to code generation that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on top of an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke the solution to solve a more complex problem. 

Self-invoking code generation (source: arXiv)

For example, the original problem can be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character.

The extended problem would be to write a function that replaces occurrences of multiple characters in a string with their given replacements. Solving it requires the model to write a new function that invokes the function it generated for the simple problem.
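A minimal sketch of this base/extended pairing in Python; the function names and signatures below are hypothetical illustrations, not the benchmark’s actual problems:

```python
def replace_char(s: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of `old` in `s` with `new`."""
    return s.replace(old, new)


def replace_chars(s: str, replacements: dict[str, str]) -> str:
    """Self-invoking problem: apply several single-character replacements
    by invoking the base solution once per pair."""
    for old, new in replacements.items():
        s = replace_char(s, old, new)
    return s


# Example usage
print(replace_chars("banana", {"a": "o", "n": "m"}))  # -> "bomomo"
```

The point of the pairing is that a correct answer to the harder problem must call the model’s own earlier solution, rather than solve everything from scratch.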

“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly at self-invoking code generation

The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and proprietary models, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series.

Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively utilize their own generated code for solving more complex problems,” the researchers write.

For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
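For context, pass@1 measures the fraction of problems a model solves with a single sampled solution. The standard unbiased pass@k estimator introduced alongside HumanEval can be computed with a short sketch like this (a general illustration, not code from the paper):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k given n sampled solutions, of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With a single sample (n=1, k=1), pass@1 is simply whether that sample passed.
print(pass_at_k(1, 1, 1))  # 1.0
print(pass_at_k(1, 0, 1))  # 0.0
```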

Another interesting finding is that while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.

To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems, then generates candidate solutions and verifies their correctness by executing the code against test cases. The pipeline minimizes the need for manual code review, helping generate more examples with less effort.

Automatically generating self-invoking code generation problems (source: arXiv)
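The sketch below shows one way such a pipeline might be wired up. It is an assumption-laden outline rather than the authors’ code: the `generate` callable, prompt wording, and helper names are all hypothetical.

```python
from typing import Callable


def build_self_invoking_problem(
    generate: Callable[[str], str],   # frontier LLM: prompt -> completion
    base_problem: str,                # original benchmark problem statement
    base_solution: str,               # reference solution for the base problem
    test_cases: list[str],            # executable assertions for the new problem
    max_attempts: int = 3,
) -> str | None:
    """Ask an LLM for an extended, self-invoking solution, then keep it
    only if it passes the test cases when executed."""
    prompt = (
        "Extend this problem so that solving it requires calling the base solution:\n"
        f"{base_problem}\n\nBase solution:\n{base_solution}\n"
        "Return only the Python code for the new solution."
    )
    for _ in range(max_attempts):
        candidate = generate(prompt)
        program = base_solution + "\n" + candidate + "\n" + "\n".join(test_cases)
        try:
            exec(program, {})          # run the candidate against the test cases
            return candidate           # all assertions passed: keep this example
        except Exception:
            continue                   # failed: sample another candidate
    return None                        # give up; flag for manual review
```

Execution-based verification is what keeps manual review to a minimum: only candidates that actually pass their test cases make it into the benchmark.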

A complex landscape

This new family of benchmarks comes at a time when old coding benchmarks are quickly being conquered by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already have very high scores on HumanEval and MBPP as well as their more advanced versions, HumanEval+ and MBPP+. 

At the same time, there are more complex benchmarks, such as SWE-Bench, which evaluate models’ capabilities in end-to-end software engineering tasks requiring a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.

Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench. It helps evaluate a very specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers are in control and AI copilots help them accomplish specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers write.
