OpenAI Benchmarks Economic Impacts of AI
Written by Sue Gee   
Wednesday, 26 February 2025

Using a new benchmark, OpenAI researchers have revealed a significant shortcoming in the ability of the latest LLMs to perform real-world software engineering tasks. These AI tools may improve productivity, but they are nowhere near ready to take over our jobs.


OpenAI has devised a new benchmark to evaluate and forecast the software engineering (SWE) capabilities of the emerging AI frontier models. It maps model performance to monetary value in order to explore the economic impact of AI model development.

With the title SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? the study takes a novel approach to assessing the practical capabilities of LLMs.

The SWE-Lancer benchmark comprises 1,488 freelance software engineering tasks from Upwork, an online platform where freelancers find clients to work on projects remotely. All the tasks had real payouts to freelancers, which provides:

"a natural market-derived difficulty gradient (ranging from $250 to $32,000)". 

There are two distinct types of task:

  •  Individual Contributor (IC) SWE tasks, where models generate code patches to resolve real-world issues. These range in difficulty from 15-minute bug fixes to new feature requests that took weeks to close. Each task is evaluated with end-to-end tests created by a team of professional software engineers; the tests use browser automation to verify application behavior, mirror the real-world freelance review process, and have been triple-verified for quality by experienced software engineers (see the illustrative sketch after this list).

  • SWE Manager tasks, where models act as technical leads by selecting the best implementation proposal for a given problem. The models are directed to review competing proposals submitted by freelancers in response to job postings, and select the best one. They are then assessed against the choices of the original engineering managers. These tasks require a deep technical understanding of both the issue and the proposals; it is often the case that multiple proposals are technically correct, and identifying the winning proposal requires considering context from the entire repository.
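
The paper does not publish the grading scripts themselves, but the IC SWE evaluation it describes amounts to applying the model's patch, running the application, and checking user-visible behavior in a real browser. A minimal sketch of that idea, using Playwright in Python, might look like the following; the URL, selectors, and expected values are purely illustrative and not taken from the benchmark:

    # Hypothetical sketch of an SWE-Lancer-style end-to-end check: drive the patched
    # application in a real browser and assert on user-visible behaviour.
    # The URL, selectors and expected values are illustrative, not from the benchmark.
    from playwright.sync_api import sync_playwright

    def test_expense_total_updates_after_save():
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("http://localhost:8080")   # locally served build with the model's patch applied
            page.click("text=New expense")       # reproduce the user actions from the task description
            page.fill("#amount-input", "42.50")
            page.click("text=Save")
            # The patch passes only if the behaviour a reviewer would see is correct.
            assert page.inner_text("#total") == "$42.50"
            browser.close()
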

In total, the tasks included in the full SWE-Lancer benchmark were valued at $1 million. SWE-Lancer Diamond, the open-sourced split released for public evaluation, contains $500,800 worth of tasks.

In terms of difficulty, SWE-Lancer tasks are considered to be challenging:

The average task in the Diamond set takes 26 days to resolve on GitHub and has 47 comments, which include discussion of multiple proposals. On average, IC SWE tasks in the Diamond set require modifying 2 files and 69 lines of code, while SWE Manager tasks require the model to choose between 4-5 proposals.

The OpenAI models tested with the benchmark were its flagship model, GPT-4o, seen as a very strong general-purpose model, and its more recent o1 model, designed to be strong at complex reasoning tasks by virtue of "thinking" more deeply before responding. The third model was Anthropic's Claude 3.5 Sonnet, which is considered very strong in logical reasoning and complex task handling.

The results reveal that Claude 3.5 Sonnet outperformed OpenAI's own models on the SWE-Lancer benchmark, but still left a lot to be desired: it would have earned only just over $400,000 of the potential $1,000,000 paid out to human freelancers.

[Chart: total SWE-Lancer earnings by model]

All the models performed better on SWE Manager tasks than on IC SWE tasks, as shown in the chart below, which gives performance on the first attempt:

[Chart: first-attempt performance on IC SWE and SWE Manager tasks]

Performance did, however, improve when the two OpenAI models were allowed multiple attempts on IC SWE tasks within the SWE-Lancer Diamond set:

[Chart: IC SWE performance on the SWE-Lancer Diamond set with multiple attempts]
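
The article and charts describe "multiple attempts" scoring without spelling out the estimator, so whether SWE-Lancer uses the standard pass@k formula is an assumption here. As a general illustration of how multiple-attempt results are commonly scored, the widely used unbiased estimator can be computed like this:

    # Illustrative only: the common unbiased pass@k estimator (Chen et al., 2021).
    # Whether SWE-Lancer scores multiple attempts exactly this way is not stated here.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Probability that at least one of k sampled attempts passes,
        given c passing attempts out of n generated in total."""
        if n - c < k:
            return 1.0   # too few failures to fill k draws, so a pass is certain
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 3 of 10 generated patches pass the end-to-end tests; score at k = 5.
    print(round(pass_at_k(n=10, c=3, k=5), 3))   # 0.917
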

 

In their concluding Impact Statement the researchers state:

 “By quantifying AI progress in software engineering, we aim to help inform the world about the potential economic impacts of AI model development while underscoring the need for careful and responsible deployment.”

By open-sourcing the SWE-Lancer Diamond public eval set, OpenAI is providing further support for measuring the economic and social implications of AI models.

 



More Information

Introducing the SWE-Lancer benchmark

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? by Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke

SWELancer-Benchmark on GitHub

 

Related Articles

OpenAI o1 Thinks Before It Answers

JetBrains Adds Claude Support To AI Assistant

Claude Engineer Amplifies Your Code

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.





 


Last Updated ( Wednesday, 26 February 2025 )