SPLASH 2025
Sun 12 - Sat 18 October 2025 Singapore
co-located with ICFP/SPLASH 2025
Fri 17 Oct 2025, 15:15 - 15:30, at Orchid Plenary Ballroom (Testing 1). Chair(s): Karine Even-Mendoza

Regression testing is an essential part of software development, but it suffers from the presence of flaky tests: tests that pass or fail non-deterministically on the same codebase. These unpredictable failures waste developers' time and can hide real bugs. While prior work has explored using fine-tuned large language models (LLMs) to classify flaky-test categories with near-perfect accuracy, we find that these results are significantly overestimated due to flawed experimental design and unrealistic datasets.

In this paper, we first show that prior flaky-test classifiers overestimate prediction accuracy due to (1) flawed experimental design and (2) misrepresentation of the true distribution of flaky (and non-flaky) tests in their datasets. After fixing the design flaws and constructing a more realistic dataset, we observe a dramatic drop in F1-score, from 85.38% to 56.62%. Motivated by these observations, we develop a new training strategy to fine-tune a new flaky-test classifier, FlakyLens, which improves the classification F1-score to 67.50% (10.88pp higher than the state of the art). We also compare FlakyLens against recent state-of-the-art pre-trained LLMs on the same classification task. Our results show that FlakyLens consistently outperforms these models, highlighting that general-purpose LLMs still fall short on this specialized task.

Using our improved flaky-test classifier, we identify the tokens in test code that most influence the model's correct or incorrect predictions. By computing an attribution score for each code token in a test, we investigate which tokens have the greatest impact on the classifier's decision for each flaky-test category. To assess the influence of these important tokens, we introduce adversarial perturbations into the tests and observe whether the model's predictions change. Our findings show that perturbing the most important tokens can reduce classification accuracy by as much as 17.04pp, highlighting the critical role these tokens play in guiding the model's predictions.
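The abstract does not spell out the attribution method, but the analysis it describes can be sketched roughly as follows. The snippet below is a minimal illustration, assuming a fine-tuned HuggingFace sequence classifier (the checkpoint name flaky-lens-checkpoint is hypothetical) and plain gradient-times-input attribution standing in for whatever attribution scores the authors actually compute: it scores each token, perturbs the highest-scoring one, and checks whether the predicted flaky-test category flips.

```python
# Hedged sketch of token attribution + adversarial perturbation for a
# flaky-test classifier. MODEL_PATH is a hypothetical checkpoint name;
# gradient-x-input is an assumed attribution method, not the paper's.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "flaky-lens-checkpoint"  # hypothetical, for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

def token_attributions(test_code):
    """Score each token by gradient-x-input w.r.t. the predicted class."""
    enc = tokenizer(test_code, return_tensors="pt", truncation=True)
    # Work on embeddings directly so we can take gradients per token.
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds = embeds.detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds,
                   attention_mask=enc["attention_mask"]).logits
    pred = logits.argmax(dim=-1).item()
    logits[0, pred].backward()
    # One attribution score per token: gradient dotted with its embedding.
    scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return pred, list(zip(tokens, scores.tolist()))

def perturb_top_token(test_code):
    """Replace the highest-attribution token and check if the label flips."""
    pred, scored = token_attributions(test_code)
    top_token, _ = max(
        (ts for ts in scored if ts[0] not in tokenizer.all_special_tokens),
        key=lambda ts: abs(ts[1]),
    )
    # Crude perturbation: overwrite the token's surface text once.
    perturbed = test_code.replace(top_token.lstrip("Ġ"), "x", 1)
    enc = tokenizer(perturbed, return_tensors="pt", truncation=True)
    with torch.no_grad():
        new_pred = model(**enc).logits.argmax(dim=-1).item()
    return pred, new_pred  # differing labels mean the perturbation flipped it
```

Aggregating how often such perturbations flip predictions across a test suite would give an accuracy drop of the kind the abstract reports; the single-token replacement here is the simplest possible stand-in for the paper's perturbation scheme.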

Fri 17 Oct

Displayed time zone: Perth

13:45 - 15:30
Testing 1 (OOPSLA) at Orchid Plenary Ballroom
Chair(s): Karine Even-Mendoza King’s College London
13:45
15m
Talk
An Empirical Evaluation of Property-Based Testing in Python
OOPSLA
Savitha Ravi UC San Diego, Michael Coblenz UC San Diego
14:00
15m
Talk
Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM
OOPSLA
Ao Li Carnegie Mellon University, Byeongjee Kang Carnegie Mellon University, Vasudev Vikram Carnegie Mellon University, Isabella Laybourn Carnegie Mellon University, Samvid Dharanikota Efficient Computer, Shrey Tiwari Carnegie Mellon University, Rohan Padhye Carnegie Mellon University
14:15
15m
Talk
Fuzzing C++ Compilers via Type-Driven Mutation
OOPSLA
Bo Wang Beijing Jiaotong University, Chong Chen Beijing Jiaotong University, Ming Deng Beijing Jiaotong University, Junjie Chen Tianjin University, Xing Zhang Peking University, Youfang Lin Beijing Jiaotong University, Dan Hao Peking University, Jun Sun Singapore Management University
14:30
15m
Talk
Interleaving Large Language Models for Compiler Testing
OOPSLA
Yunbo Ni The Chinese University of Hong Kong, Shaohua Li The Chinese University of Hong Kong
14:45
15m
Talk
Model-guided Fuzzing of Distributed Systems
OOPSLA
Ege Berkay Gulcan Delft University of Technology, Burcu Kulahcioglu Ozkan Delft University of Technology, Rupak Majumdar MPI-SWS, Srinidhi Nagendra IRIF; Chennai Mathematical Institute
15:00
15m
Talk
Tuning Random Generators: Property-Based Testing as Probabilistic Programming
OOPSLA
Ryan Tjoa University of Washington; Jane Street, Poorva Garg University of California, Los Angeles, Harrison Goldstein University at Buffalo, The State University of New York, Todd Millstein University of California, Los Angeles, Benjamin C. Pierce University of Pennsylvania, Guy Van den Broeck University of California, Los Angeles
15:15
15m
Talk
Understanding and Improving Flaky Test Classification
OOPSLA
Shanto Rahman The University of Texas at Austin, Saikat Dutta Cornell University, August Shi The University of Texas at Austin