Human Software Developers vs ChatGPT - Who's Better at Fixing GitHub Pull Requests?

In the age of alien technology — large language models (LLMs), advanced AI and machine learning — it’s easy to fall into the belief that human roles, especially in tasks like software engineering, are on the verge of being replaced.

I’m definitely guilty of prematurely announcing various job roles dead, even though they are not and may even survive another decade or so. 😅

In defense of the more traditionalist argument that ChatGPT is overhyped, the recent paper titled “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” sheds light on the limitations of even the most advanced language models in the field of software development.

Too Long Didn’t Read

The paper underscores that although language models have grown significantly in capability (see our Finxter blog article on Claude 2 with a 200k token context window), gauging their effectiveness remains a challenge.

Using real-world software engineering as a testbed, the study introduces “SWE-bench”. This framework comprises 2,294 software engineering problems sourced directly from actual GitHub issues and their corresponding pull requests across 12 popular Python repositories.

*All Midjourney images in this article are created by my daughter*

In the framework, a language model is presented with a codebase and a detailed description of a software issue. The model’s job? Modify the codebase to rectify the problem.

However, it’s not as straightforward as it sounds. Addressing these issues often means understanding and amending multiple functions, classes, and files concurrently. This requires the models to engage with execution environments, process extensive contexts, and employ intricate reasoning – a leap from the standard code generation.

Yet, the results are telling. The advanced proprietary models and the SWE-Llama model, fine-tuned for this task, were only successful in resolving the most basic issues. Even the revered Claude 2 and GPT-4 models managed to solve just 4.8% and 1.7% of problems, respectively, despite the aid of an oracle retriever.

Here are three observations and key takeaways from the paper:

The Real World Is Complex: The findings emphasize the intricacies involved in real-world software development problems. While language models like GPT-4 and GPT-4V are impressive in many applications, they are still nascent when it comes to understanding and fixing multifaceted software issues.
Human Expertise Remains Unparalleled: The cognitive processes, creative reasoning, and nuanced understanding of a human brain, especially a trained software developer, are unmatched. These tasks often require a deep comprehension of the software’s intent, user experience, and system interactions, areas where language models are still catching up.
Language Models Are Tools More Than Replacements: The results hint at the potential of using language models as supplementary tools for software devs. While they might not replace the core tasks of a developer, they can assist in automating repetitive tasks or offering coding suggestions.

SWE-bench

Let’s start with a visual from the paper:

The figure showcases the workflow of the SWE-bench framework.

Step 1: It begins with a real-world GitHub issue related to a Python repository.
Step 2: Once this issue is fed to a language model, the model attempts to generate a patch or solution for the reported problem in the codebase.
Step 3: This generated patch is represented as a pull request (PR) indicating changes in specific files.
Step 4: The efficacy of the model’s solution is then validated against unit tests. The results show which tests failed before the PR and which passed after the model’s intervention, providing a direct assessment of the model’s problem-solving abilities in real-world software engineering scenarios.

SWE-bench is a benchmark created from GitHub issues and pull requests. It sources from about 90,000 pull requests in well-known repositories.

The process involves:

Selecting PRs that address issues and include tests.
Assessing the tests, ensuring they transition from failing to passing.
Making sure there are no errors during installation or runtime.
This results in 2,294 tasks that define SWE-bench.
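The filtering steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `CandidatePR` record and its field names are hypothetical and are not taken from the SWE-bench code.

```python
from dataclasses import dataclass, field


@dataclass
class CandidatePR:
    """Hypothetical record for one scraped pull request."""
    resolves_issue: bool                               # PR is linked to a GitHub issue
    touches_tests: bool                                # PR adds or edits test files
    tests_before: dict = field(default_factory=dict)   # test name -> passed before the PR?
    tests_after: dict = field(default_factory=dict)    # test name -> passed after the PR?
    install_ok: bool = True                            # no installation or runtime errors


def fail_to_pass(pr):
    """Tests that pass after the PR but failed (or did not exist) before it."""
    return sorted(
        name
        for name, passed in pr.tests_after.items()
        if passed and not pr.tests_before.get(name, False)
    )


def keep(pr):
    """A PR becomes a task only if it clears every filter."""
    return (
        pr.resolves_issue
        and pr.touches_tests
        and pr.install_ok
        and bool(fail_to_pass(pr))
    )


example = CandidatePR(
    resolves_issue=True,
    touches_tests=True,
    tests_before={"test_a": False, "test_b": True},
    tests_after={"test_a": True, "test_b": True},
)
print(keep(example), fail_to_pass(example))
```

A test that was missing before the PR is treated as failing here, which is a simplifying choice of this sketch.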

Distribution of SWE-bench tasks by Python framework (source)

In the tasks, models are given an issue and a codebase. Their job is to fix the issue in the code. They’re graded on whether their fixes pass the tests. 👉 GitHub Repository
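To make the grading rule concrete, here is a minimal sketch. It assumes each task comes with two lists of test names: tests the fix must turn from failing to passing, and tests that must keep passing. The names `fail_to_pass` and `pass_to_pass` are labels chosen for this illustration.

```python
def is_resolved(results, fail_to_pass, pass_to_pass):
    """results maps test name -> True if it passed after applying the model's patch.

    A task counts as resolved only if every required test passes;
    a test missing from the results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def resolution_rate(outcomes):
    """Percentage of tasks resolved, given one boolean per task."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


outcomes = [
    is_resolved({"test_bug": True, "test_old": True}, ["test_bug"], ["test_old"]),
    is_resolved({"test_bug": True, "test_old": False}, ["test_bug"], ["test_old"]),
]
print(f"{resolution_rate(outcomes):.1f}% resolved")
```

Note that a patch which fixes the target test but breaks an existing one does not count as a resolution.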

SWE-bench stands out because:

It’s built on real, complex coding challenges.
It can be continually updated with new GitHub issues.
Tasks require sorting through long descriptions and large codebases.
Every task has a crucial test it must pass, plus many additional tests.

The benchmark doesn’t limit models, pushing them to create unique solutions. This makes it a valuable tool for evaluating software engineering models.

Approaches to Fit Code into Context Window

🧑‍💻 Challenge: The context window of GPT-4 is only 32k tokens (ChatGPT-3.5 gets 16k). With Claude 2, you can pass 100k tokens into the context window. However, codebases in the real world are still too large for these models, with 438k lines of code on average. So, the LLM brains are not yet able to fit everything into their context memories!
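A quick back-of-the-envelope calculation shows the size of the gap. The figure of 10 tokens per line of code is an assumption made only for this sketch; real tokenizers and real code will differ.

```python
AVG_LINES = 438_000      # average codebase size quoted above
TOKENS_PER_LINE = 10     # assumption: rough average for Python source


def fraction_that_fits(window_tokens, lines=AVG_LINES, tokens_per_line=TOKENS_PER_LINE):
    """Share of the codebase that fits into a context window of the given size."""
    total_tokens = lines * tokens_per_line
    return min(1.0, window_tokens / total_tokens)


for window in (16_000, 100_000):
    share = fraction_that_fits(window)
    print(f"{window:>7,} tokens -> {share:.2%} of an average codebase")
```

Under this assumption, even the larger window holds only a small slice of the repository, which is why some form of retrieval is needed.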

When faced with the challenge of processing vast codebases, two distinct retrieval approaches were introduced to determine the most relevant context for the model:

Sparse Retrieval:

Why It’s Used: The length of the codebases and the distinct task of matching natural language queries to lengthy code documents make dense retrieval methods unsuitable.
Method: BM25, a well-known retrieval method, is used to fetch the most relevant files to present as context for each task.
Context Limits: Three different maximum context limits were tried. Files are retrieved until the specified limit is reached, and each model is reported at the context limit on which it performs best.
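The sparse retrieval idea can be sketched as follows: score every file against the issue text with the standard Okapi BM25 formula, then add files in rank order until the token budget is used up. This is an illustration, not the authors' pipeline. The regex tokenizer, the `k1` and `b` defaults, and the word-count token estimate are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude identifier-level tokenizer; a stand-in for a real model tokenizer."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())


def bm25_rank(query, files, k1=1.5, b=0.75):
    """Rank file paths by Okapi BM25 score against the query (best first)."""
    term_counts = {path: Counter(tokenize(src)) for path, src in files.items()}
    lengths = {path: sum(tf.values()) for path, tf in term_counts.items()}
    avg_len = (sum(lengths.values()) / len(lengths)) if lengths else 0.0
    avg_len = avg_len or 1.0  # avoid division by zero for empty inputs
    n_docs = len(files)
    query_terms = set(tokenize(query))
    doc_freq = {
        term: sum(1 for tf in term_counts.values() if term in tf)
        for term in query_terms
    }
    scores = {}
    for path, tf in term_counts.items():
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * lengths[path] / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return sorted(scores, key=scores.get, reverse=True)


def fill_context(ranked_paths, files, token_limit):
    """Take files in rank order until the next one would exceed the limit."""
    chosen, used = [], 0
    for path in ranked_paths:
        cost = len(tokenize(files[path]))  # assumption: word count ~ token count
        if used + cost > token_limit:
            break
        chosen.append(path)
        used += cost
    return chosen


files = {
    "a.py": "def parse_date(value): return value",
    "b.py": "def render(template): return template",
    "c.py": "import os",
}
ranked = bm25_rank("parse_date crashes on empty value", files)
print(fill_context(ranked, files, token_limit=5))
```

Stopping at the first file that does not fit is one possible policy; skipping it and trying smaller files further down the ranking would be another.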

Oracle Retrieval:

Principle: This method solely involves files that were edited by the reference patch that addressed the issue on GitHub.
Pros: It provides direct context by focusing on the actual changes made to resolve an issue.
Cons: There are inherent limitations:
- Software engineers wouldn’t usually know in advance which files might need alterations.
- Just considering edited files might not offer a comprehensive understanding, omitting other potential interactions or dependencies in the code.
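For illustration, the oracle file list can be read straight from the reference patch. The sketch below assumes the patch is in the usual git unified diff format and looks only at the `+++` header lines. The sample patch and its paths are made up.

```python
def files_edited_by_patch(patch_text):
    """Return the paths a unified diff modifies, in order of first appearance.

    Simplifications of this sketch: deleted files (target /dev/null) are
    skipped, and an added content line that itself starts with "++ " would
    be mistaken for a header.
    """
    paths = []
    for line in patch_text.splitlines():
        if not line.startswith("+++ "):
            continue
        target = line[4:].split("\t")[0].strip()
        if target == "/dev/null":
            continue
        if target.startswith("b/"):
            target = target[2:]  # strip git's "b/" prefix
        if target not in paths:
            paths.append(target)
    return paths


SAMPLE_PATCH = """\
diff --git a/pkg/core.py b/pkg/core.py
--- a/pkg/core.py
+++ b/pkg/core.py
@@ -1,2 +1,2 @@
-x = 1
+x = 2
diff --git a/pkg/util.py b/pkg/util.py
--- a/pkg/util.py
+++ b/pkg/util.py
@@ -3 +3 @@
-y = 1
+y = 2
"""

print(files_edited_by_patch(SAMPLE_PATCH))
```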

Comparison:

The BM25 retrieval approach, when limited to a 27,000 token context, retrieves a superset of the oracle files in roughly 40% of the cases.
However, in almost half of the cases, BM25 retrieves none of the files that oracle retrieval brings into focus.
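One way to quantify how well BM25 covers the oracle files is to check, task by task, whether the retrieved set contains all oracle files or none of them. The sketch below does that on made-up data. Whether this matches the paper's exact bookkeeping is an assumption.

```python
def compare_retrieval(instances):
    """instances: list of (bm25_files, oracle_files) pairs, one per task.

    Returns the share of tasks where BM25 retrieved every oracle file
    ("superset") and the share where it retrieved none of them ("none").
    Assumes every task has at least one oracle file.
    """
    total = len(instances)
    if total == 0:
        return {"superset": 0.0, "none": 0.0}
    superset = sum(1 for bm25, oracle in instances if set(oracle) <= set(bm25))
    none = sum(1 for bm25, oracle in instances if not set(oracle) & set(bm25))
    return {"superset": superset / total, "none": none / total}


toy_instances = [
    (["a.py", "b.py"], ["a.py"]),          # all oracle files retrieved
    (["c.py"], ["a.py", "b.py"]),          # none retrieved
    (["a.py", "c.py"], ["a.py", "b.py"]),  # partial overlap
    (["d.py"], ["e.py"]),                  # none retrieved
]
print(compare_retrieval(toy_instances))
```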

I’m convinced that, eventually, neither approach will be needed. We’ll see context windows of 1M tokens and more, so this particular problem will resolve itself just by throwing more scale at it.

Results: LLMs on Their Own Are Poor Software Developers

The study tested various LLMs across different settings using different retrieval mechanisms and prompting styles.

The performance of these models was analyzed across the BM25 and “oracle” retrieval settings. Generally, all models faced significant challenges in resolving issues. Among them, Claude 2 was the best performer but achieved a mere 4.8% success rate when using the “oracle” retrieval context. However, its performance dropped to 1.96% in the BM25 retrieval setting, underscoring the significance of selecting the right context.

When examining the results across different repositories, trends remained consistent for all models. Nevertheless, the problems solved by each model didn’t overlap much. For instance, Claude 2 and SWE-Llama 13b had comparable results in the “oracle” setting, but there wasn’t much overlap in the problems they addressed.
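The overlap between two models can be measured with a simple set comparison. The sketch below uses the Jaccard index, which is a choice made for this illustration, and the task IDs are invented.

```python
def overlap(solved_a, solved_b):
    """Jaccard index: shared solved tasks divided by tasks solved by either model."""
    a, b = set(solved_a), set(solved_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


model_one = {"task-1", "task-2", "task-3"}
model_two = {"task-3", "task-4"}
print(overlap(model_one, model_two))
```

A low value means the two models solve mostly different problems even when their overall success rates look similar.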

The authors also observe that some repositories have more instances with images embedded in the issue text, which implies that solving such instances might require multi-modal language models or external tools to handle the images.

As for context length, it was observed that models performed poorly when presented with longer sequences of code. For instance, Claude 2’s performance declined substantially with increasing context length, which was consistent with other models. Additional context often distracted models, making it challenging for them to pinpoint the exact problematic code.

Performance of the models was also gauged by the type of retrieval setting used. In a setting where only the lines modified by a genuine pull request were considered, performance improved. GPT-4’s performance, for instance, jumped from 1.3% to 3.4%, and Claude 2’s from 4.8% to 5.9%.
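The idea of showing the model only the edited regions can be sketched like this: keep the lines touched by the real pull request plus a few lines of surrounding context, and collapse everything else. The `buffer` size and the `...` placeholder are assumptions of this sketch.

```python
def collapse_file(lines, edited_line_numbers, buffer=15):
    """Keep edited lines (1-indexed) plus `buffer` lines around each; elide the rest."""
    keep = set()
    for number in edited_line_numbers:
        start = max(1, number - buffer)
        end = min(len(lines), number + buffer)
        keep.update(range(start, end + 1))

    collapsed, previous = [], 0
    for number in sorted(keep):
        if number != previous + 1:
            collapsed.append("...")  # marks a skipped region
        collapsed.append(lines[number - 1])
        previous = number
    if previous < len(lines):
        collapsed.append("...")
    return collapsed


source = [f"line {i}" for i in range(1, 11)]
print(collapse_file(source, edited_line_numbers=[5], buffer=1))
```

The model then sees far fewer tokens per file, which fits the observation that extra context tends to distract it.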

The date an issue was created didn’t seem to significantly influence how hard it was to resolve. There was minimal performance difference between issues created before and after 2023. That is promising, because it suggests the models are unlikely to “cheat” by simply reproducing a more recent version of the codebase they may have seen during training.

Models that had been fine-tuned, such as SWE-Llama 7b and 13b, displayed sensitivity to shifts in context distribution, affecting their performance. There was also a focus on the format in which models generate their solutions. While they usually struggle with creating well-structured patch files, it was noted that asking them to recreate an entire file with the suggested changes led to even poorer performance.

It was observed that the patches generated by models were generally shorter and simpler compared to the gold standard. For instance, model-generated patches typically added or removed fewer lines.
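Patch size is easy to measure from a unified diff by counting added and removed lines. The following sketch does this for a single patch. The sample patch is made up, and content lines that themselves start with `++` or `--` would be miscounted.

```python
def patch_size(patch_text):
    """Return (added, removed) line counts for a unified diff."""
    added = removed = 0
    for line in patch_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content changes
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


TOY_PATCH = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,4 @@
 import os
-value = 1
+value = 2
+print(value)
"""

print(patch_size(TOY_PATCH))
```

Running this over model-generated and reference patches for the same tasks would give the kind of comparison described above.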

What Does It Mean for Software Devs?

The digital horizon often brings forth fears of redundancy, especially in professions that heavily rely on logic and algorithmic processes.

But as “SWE-bench” reveals, the value of human software engineers remains intact, and their role in shaping the digital world is far from over.

Instead of seeing advanced language models as threats, we should view them as tools in our ever-expanding toolkit, tools that make the process of creation smoother but don’t replace the creator.

🧑‍💻 Recommended: Software Engineering 2.0 – We Need to Talk About LLMs