Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks



As large language models (LLMs) continue to improve in coding, the benchmarks used to evaluate their performance are steadily becoming less useful.

That’s because many LLMs now post similarly high scores on these benchmarks, which makes it difficult to tell which one to use for a specific software development project or enterprise setting.

A new paper by researchers at Yale University and Tsinghua University presents a novel method to test models’ ability to tackle “self-invoking code generation” problems, which require reasoning, generating code, and reusing existing code in problem-solving.

Self-invoking code generation is much more similar to realistic programming scenarios and provides a better understanding of current LLMs’ ability to solve real-world coding problems.

Self-invoking code generation

Two popular benchmarks used to evaluate the coding abilities of LLMs are HumanEval and MBPP (Mostly Basic Python Problems). These are datasets of handcrafted problems that require the model to write code for simple tasks. 

However, these benchmarks only cover a subset of the challenges software developers face in the real world. In practical scenarios, software developers don’t just write new code—they must also understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently leverage one’s own generated code, namely self-invoking code generation, plays an important role for LLMs to leverage their reasoning capabilities to code generation that current benchmarks fail to capture,” the researchers write.

To test the ability of LLMs in self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on top of an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke the solution to solve a more complex problem. 

Self-invoking code generation (source: arXiv)

For example, the original problem can be something simple, like writing a function that replaces all occurrences of a given character in a string with a new character.

The extended problem would be to write a function that changes occurrences of multiple characters in a string with their given replacements. This would require the model to write a new function that invokes the previous function it generated in the simple problem. 
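To make the structure concrete, here is a minimal sketch of what such a problem pair could look like in Python; the function names and test values are illustrative and not taken from the benchmark itself.

```python
# Illustrative sketch of a base problem and its self-invoking extension.
# Names and test values are hypothetical, not drawn from HumanEval Pro or MBPP Pro.

def replace_char(text: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of `old` in `text` with `new`."""
    return text.replace(old, new)

def replace_chars(text: str, replacements: dict[str, str]) -> str:
    """Extended problem: apply several single-character replacements by
    invoking the base solution once per mapping."""
    for old, new in replacements.items():
        text = replace_char(text, old, new)
    return text

# Example usage
assert replace_char("banana", "a", "o") == "bonono"
assert replace_chars("banana", {"a": "o", "n": "m"}) == "bomomo"
```

The second function only works if the model correctly wrote, understood, and then called the first one, which is exactly the ability the new benchmarks try to isolate.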

“This evaluation of self-invoking code generation offers deeper insights into the programming capabilities of LLMs, extending beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly at self-invoking code generation

The researchers evaluated more than 20 open and proprietary models on HumanEval Pro and MBPP Pro, including GPT-4o, OpenAI o1-mini, and Claude 3.5 Sonnet, as well as models from the Qwen, DeepSeek, and Codestral series.

Their findings show a significant disparity between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively utilize their own generated code for solving more complex problems,” the researchers write.

For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval but only 76.2% on HumanEval Pro.
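For context, pass@1 with one sample per problem reduces to the fraction of problems whose single generated solution passes all test cases. The snippet below is an illustrative sketch of that calculation, not the authors’ evaluation harness.

```python
from typing import Iterable

def pass_at_1(results: Iterable[bool]) -> float:
    """`results` holds one boolean per problem: True if the model's single
    generated solution passed all of that problem's test cases."""
    results = list(results)
    return sum(results) / len(results)

# e.g. 96 problems solved out of 100 -> 0.96, reported as 96%
print(pass_at_1([True] * 96 + [False] * 4))  # 0.96
```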

Another interesting finding is that while instruction fine-tuning provides significant improvements on simple coding tasks, it shows diminishing returns on self-invoking code generation. The researchers note that “current instruction-based fine-tuning approaches are insufficiently effective for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train base models for coding and reasoning tasks.

To help advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems. They then generate candidate solutions and verify their correctness by executing the code and running test cases on them. The pipeline minimizes the need for manual code review to help generate more examples with less effort.

Automatically generating self-invoking code generation problems (source: arXiv)
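To picture the verification step of such a pipeline, the sketch below filters candidate solutions by executing them against their test cases in a subprocess. It is a simplified, hypothetical illustration rather than the authors’ actual code, and the LLM calls that draft the extended problems, candidate solutions, and tests are omitted.

```python
# Hedged sketch of execution-based verification: a candidate solution is kept
# only if it runs cleanly against its generated test cases. The LLM-generation
# steps that produce `candidate` and `tests` are out of scope here.

import subprocess
import sys
import tempfile

def passes_tests(candidate_solution: str, test_code: str, timeout: int = 10) -> bool:
    """Run the candidate solution plus its tests in a subprocess; a clean exit
    (no failed assertions, no exceptions) counts as a pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Tiny hard-coded stand-ins for LLM-generated artifacts.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True -> keep; False -> discard or review
```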

A complex landscape

This new family of benchmarks comes at a time when old coding benchmarks are quickly being conquered by frontier models. Current frontier models such as GPT-4o, o1, and Claude 3.5 Sonnet already have very high scores on HumanEval and MBPP as well as their more advanced versions, HumanEval+ and MBPP+. 

At the same time, there are more complex benchmarks such as SWE-Bench, which evaluate models’ capabilities on end-to-end software engineering tasks that require a wide range of skills, such as working with external libraries and files and managing DevOps tools. SWE-Bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-Bench Verified.

Self-invoking code generation sits somewhere between the simple benchmarks and SWE-Bench. It helps evaluate a very specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical proxy for the usefulness of LLMs in real-world settings, where human programmers are in control and AI copilots help them accomplish specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and to inspire future LLM development by shedding light on current model shortcomings and encouraging innovation in training methodologies,” the researchers write.
