OpenAI has released SWE-bench Verified, a human-validated subset of SWE-bench that more reliably evaluates AI models' ability to solve real-world software issues.
Reports state that OpenAI has developed a range of metrics to track, evaluate, and forecast models' ability to act autonomously. Because software engineering tasks are complex, the main difficulty lies in assessing the generated code accurately. The Preparedness approach therefore requires a careful examination of evaluations to reduce the risk of underestimating or overestimating performance in the tracked risk categories.
One of the most popular evaluation suites for software engineering is SWE-bench, a benchmark for evaluating large language models' ability to resolve real-world software issues sourced from GitHub. The benchmark provides agents with a code repository and an issue description, and challenges them to produce a fix. Coding agents have made impressive progress on SWE-bench: as of August 5, 2024, the top-scoring agents reached 20% on the full SWE-bench test set and 43% on SWE-bench Lite, according to the SWE-bench leaderboards.
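For readers who want to inspect the data directly, both datasets are distributed through the Hugging Face Hub. The sketch below is illustrative: it assumes the commonly used dataset IDs (`princeton-nlp/SWE-bench` and `princeton-nlp/SWE-bench_Verified`) and the `datasets` library.

```python
# Minimal sketch: loading SWE-bench and SWE-bench Verified for inspection.
# Assumes the datasets are published under these Hugging Face Hub IDs.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
swe_bench_verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(f"SWE-bench test set: {len(swe_bench)} samples")
print(f"SWE-bench Verified: {len(swe_bench_verified)} samples")

# Each sample carries the original issue text shown to the agent.
print(swe_bench_verified[0]["problem_statement"][:300])
```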
However, testing identified that some SWE-bench tasks may be hard or impossible to solve, leading to a systematic underestimation of models' software engineering capabilities. OpenAI collaborated with the authors of SWE-bench to address these issues in a new release of the benchmark that provides more accurate evaluations.
Background on SWE-bench
Each sample in the SWE-bench test set is created from a resolved GitHub issue in one of 12 open-source Python repositories. Each sample has an associated pull request (PR), which includes both the solution code and the unit tests used to verify code correctness. For each sample, agents are provided with the original text from the GitHub issue, known as the problem statement, and are given access to the codebase.
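For concreteness, a single sample can be pictured as a record like the one below. The field names follow the publicly distributed dataset, but this dataclass is only an illustrative view, not the exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class SWEBenchSample:
    """Illustrative view of one SWE-bench test sample."""
    repo: str                # e.g. "django/django", one of 12 open-source Python repos
    base_commit: str         # commit the agent's repository checkout starts from
    problem_statement: str   # original GitHub issue text shown to the agent
    patch: str               # reference solution code from the linked PR (hidden from the agent)
    test_patch: str          # unit tests from the PR used to verify correctness
    FAIL_TO_PASS: list[str] = field(default_factory=list)  # tests that must flip from failing to passing
    PASS_TO_PASS: list[str] = field(default_factory=list)  # tests that must keep passing
```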
A proposed edit is evaluated by running two sets of tests: the FAIL_TO_PASS tests (if these pass, the edit solves the issue) and the PASS_TO_PASS tests (if these still pass, the edit has not inadvertently broken unrelated sections of the codebase).
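Conceptually, grading a proposed edit reduces to two boolean checks, as in the minimal sketch below. The function and parameter names are illustrative and not the official evaluation harness API.

```python
def is_resolved(fail_to_pass_results: dict[str, bool],
                pass_to_pass_results: dict[str, bool]) -> bool:
    """A sample counts as resolved only if the edit fixes the issue
    (all FAIL_TO_PASS tests now pass) without breaking anything else
    (all PASS_TO_PASS tests still pass)."""
    fixes_issue = all(fail_to_pass_results.values())
    nothing_broken = all(pass_to_pass_results.values())
    return fixes_issue and nothing_broken

# Example: the edit fixes the issue but breaks an existing test -> not resolved.
print(is_resolved({"test_fix_bug": True},
                  {"test_existing_a": True, "test_existing_b": False}))  # False
```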
Adapting SWE-bench as a Preparedness Evaluation
Given SWE-bench's potential as a Preparedness evaluation, OpenAI aimed to find ways to improve the robustness and reliability of the benchmark. The major issues identified were:
- The unit tests used to evaluate the correctness of a solution are often overly specific, which can cause correct solutions to be rejected (see the illustrative sketch after this list).
- Many samples have issue descriptions that are underspecified, leaving ambiguity about what the problem is and how it should be solved.
- The SWE-bench development environments can sometimes be difficult to set up reliably for the agents, causing unit tests to fail regardless of the quality of the solution; in such cases, perfectly valid solutions might be graded as incorrect.
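To illustrate the first issue, consider the hypothetical pytest example below. The issue only asks that negative ages be rejected, but the test also pins the exact error message, so a semantically correct fix with different wording is graded as a failure. Both the function and the test are invented for illustration and do not come from SWE-bench.

```python
import pytest

# Hypothetical fix: a perfectly reasonable implementation that rejects negative ages.
def parse_age(raw: str) -> int:
    value = int(raw)
    if value < 0:
        raise ValueError("age must be non-negative")  # correct behaviour, different wording
    return value

# Overly specific test: it checks the exact error message rather than the behaviour
# the issue asked for, so the correct fix above fails this test.
def test_negative_age_rejected():
    with pytest.raises(ValueError, match="age cannot be negative: -5"):
        parse_age("-5")
```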