Understanding and Improving Flaky Test Classification (SPLASH 2025 - OOPSLA)

Sun 12 - Sat 18 October 2025 Singapore

co-located with ICFP/SPLASH 2025

Who

Shanto Rahman, Saikat Dutta, August Shi

Track

SPLASH 2025 OOPSLA

Time Zone

The program is currently displayed in (GMT+08:00) Perth.

When

Fri 17 Oct 2025 15:15 - 15:30 at Orchid Plenary Ballroom - Testing 1 Chair(s): Karine Even-Mendoza

Abstract

Regression testing is an essential part of software development, but it suffers from the presence of flaky tests: tests that pass or fail non-deterministically on the same codebase. These unpredictable failures can waste developers’ time and hide real bugs. While prior work has explored using fine-tuned large language models (LLMs) to classify flaky test categories with near-perfect accuracy, we find that these results are significantly overestimated due to flawed experimental design and unrealistic datasets.
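To make the definition of a flaky test concrete, here is a minimal Python sketch of rerun-based detection: a test is flagged when repeated runs on unchanged code give different outcomes. This is an illustration only, not the approach from the talk. The names `rerun_outcomes`, `is_flaky`, `make_flaky_test`, and `stable_test` are hypothetical, and the seeded random number generator merely stands in for a real source of non-determinism such as timing or thread scheduling.

```python
import random


def rerun_outcomes(test_fn, runs=20):
    """Run test_fn repeatedly on the same code; record pass (True) or fail (False)."""
    outcomes = []
    for _ in range(runs):
        try:
            test_fn()
            outcomes.append(True)
        except AssertionError:
            outcomes.append(False)
    return outcomes


def is_flaky(outcomes):
    """A test is flaky here if it both passed and failed across the reruns."""
    return len(set(outcomes)) > 1


def make_flaky_test(rng):
    """Hypothetical test whose result depends on a random draw (assumed stand-in)."""
    def test():
        assert rng.random() < 0.5
    return test


def stable_test():
    """Deterministic test: same outcome on every run."""
    assert sum([1, 2, 3]) == 6


rng = random.Random(0)  # seeded so the illustration is reproducible
flaky_outcomes = rerun_outcomes(make_flaky_test(rng))
stable_outcomes = rerun_outcomes(stable_test)
```

Reruns only reveal that a test is flaky. Assigning it to a flakiness category is the classification task the abstract goes on to discuss.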

In this paper, we first show that prior flaky-test classifiers overestimate prediction accuracy due to 1) flawed experiment design and 2) misrepresentation of the real distribution of flaky (and non-flaky) tests in their datasets. After we fix the design flaws and construct a more realistic dataset, we observe a large drop in F1-score, from 85.38% to 56.62%. Motivated by these observations, we develop a new training strategy to fine-tune a new flaky-test classifier, FlakyLens, that improves the classification F1-score to 67.50% (10.88pp higher than the state of the art). We also compare FlakyLens against recent state-of-the-art pre-trained LLMs on the same classification task. Our results show that FlakyLens consistently outperforms these models, highlighting that general-purpose LLMs still fall short on this specialized task.
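As a reading aid for the numbers above, the following Python sketch shows how an F1-score is derived from predictions and how a percentage-point (pp) difference is taken. It is not the paper's evaluation code. The abstract does not say which averaging scheme is used, so macro-averaging is an assumption here, and the category labels (`async`, `time`, `non-flaky`) and the helper names are hypothetical.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


def pp_difference(new_pct, old_pct):
    """Difference between two percentages, in percentage points."""
    return round(new_pct - old_pct, 2)


# Toy labels, invented for illustration only.
y_true = ["async", "async", "time", "non-flaky"]
y_pred = ["async", "time", "time", "non-flaky"]
toy_score = macro_f1(y_true, y_pred)

# Figures quoted in the abstract.
gain = pp_difference(67.50, 56.62)   # FlakyLens vs. prior state of the art
drop = pp_difference(85.38, 56.62)   # drop after fixing design and dataset
```

Percentage points measure the absolute gap between two percentages, which is why the reported gain is 67.50 minus 56.62 rather than a relative improvement.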

Using our improved flaky-test classifier, we identify the important tokens in the test code that influence the models in making correct or incorrect predictions. By leveraging attribution scores computed per code token in each test, we investigate the tokens that have a higher impact on the flaky-test classifier’s decision-making per flaky test category. To assess the influence of these important tokens, we introduce adversarial perturbations into the tests and observe whether the model’s predictions change. Our findings show that, when perturbing the most important tokens, the classification accuracy can drop by as much as 17.04pp. This highlights the critical role of these tokens in guiding the model’s predictions.
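The attribute-then-perturb idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not FlakyLens: the abstract does not name the attribution method, so simple occlusion (leave-one-token-out) attribution is swapped in, and a hypothetical keyword-weight scorer (`WEIGHTS`, `score`, `predict`) stands in for the fine-tuned model. The token list and category names are likewise invented.

```python
# Hypothetical keyword weights standing in for a trained classifier.
WEIGHTS = {"sleep": 2.0, "Thread": 1.5, "await": 1.0}


def score(tokens):
    """Toy model score: sum of the weights of known tokens."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def predict(tokens, threshold=2.5):
    """Toy two-way decision; category names are illustrative."""
    return "async-flaky" if score(tokens) >= threshold else "other"


def occlusion_attributions(tokens):
    """Attribution of token i = score drop when token i is removed."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def perturb_top_k(tokens, k, placeholder="x"):
    """Replace the k highest-attribution tokens with a neutral placeholder."""
    attrs = occlusion_attributions(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: attrs[i], reverse=True)
    top = set(ranked[:k])
    return [placeholder if i in top else t for i, t in enumerate(tokens)]


tokens = ["Thread", ".", "sleep", "(", "100", ")", ";",
          "assertTrue", "(", "done", ")"]
before = predict(tokens)
perturbed = perturb_top_k(tokens, k=1)
after = predict(perturbed)
```

In this toy setup, replacing the single highest-attribution token is enough to change the prediction. Measuring how often that happens across a test set is the kind of accuracy drop the abstract reports.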

Shanto Rahman

The University of Texas at Austin

United States

Saikat Dutta

Cornell University

United States

August Shi

The University of Texas at Austin

United States

Session Program

Fri 17 Oct
Displayed time zone: Perth

13:45 - 15:30	Testing 1 OOPSLA at Orchid Plenary Ballroom Chair(s): Karine Even-Mendoza King’s College London

13:45 15m Talk		An Empirical Evaluation of Property-Based Testing in Python OOPSLA Savitha Ravi UC San Diego, Michael Coblenz University of California, San Diego
14:00 15m Talk		Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM OOPSLA Ao Li Carnegie Mellon University, Byeongjee Kang Carnegie Mellon University, Vasudev Vikram Carnegie Mellon University, Isabella Laybourn Carnegie Mellon University, Samvid Dharanikota Efficient Computer, Shrey Tiwari Carnegie Mellon University, Rohan Padhye Carnegie Mellon University
14:15 15m Talk		Fuzzing C++ Compilers via Type-Driven Mutation OOPSLA Bo Wang Beijing Jiaotong University, Chong Chen Beijing Jiaotong University, Ming Deng Beijing Jiaotong University, Junjie Chen Tianjin University, Xing Zhang Peking University, Youfang Lin Beijing Jiaotong University, Dan Hao Peking University, Jun Sun Singapore Management University
14:30 15m Talk		Interleaving Large Language Models for Compiler Testing OOPSLA Yunbo Ni The Chinese University of Hong Kong, Shaohua Li The Chinese University of Hong Kong
14:45 15m Talk		Model-guided Fuzzing of Distributed Systems OOPSLA Ege Berkay Gulcan Delft University of Technology, Burcu Kulahcioglu Ozkan Delft University of Technology, Rupak Majumdar MPI-SWS, Srinidhi Nagendra IRIF, Chennai Mathematical Institute
15:00 15m Talk		Tuning Random Generators: Property-Based Testing as Probabilistic Programming OOPSLA Ryan Tjoa University of Washington; Jane Street, Poorva Garg University of California, Los Angeles, Harrison Goldstein University at Buffalo, the State University of New York at Buffalo, Todd Millstein University of California at Los Angeles, Benjamin C. Pierce University of Pennsylvania, Guy Van den Broeck University of California at Los Angeles
15:15 15m Talk		Understanding and Improving Flaky Test Classification OOPSLA Shanto Rahman The University of Texas at Austin, Saikat Dutta Cornell University, August Shi The University of Texas at Austin