Understanding and Improving Flaky Test Classification
Regression testing is an essential part of software development, but it suffers from the presence of flaky tests: tests that pass or fail non-deterministically on the same codebase. These unpredictable failures can waste developers’ time and hide real bugs. While prior work has explored using fine-tuned large language models (LLMs) to classify flaky test categories with near-perfect accuracy, we find that these results are significantly overestimated due to flawed experimental design and unrealistic datasets.
In this paper, we first show that prior flaky-test classifiers overestimate their prediction accuracy due to 1) flawed experimental design and 2) misrepresentation of the real distribution of flaky (and non-flaky) tests in their datasets. After we fix the design flaws and construct a more realistic dataset, we observe that the F1-score drops by 28.76pp, from 85.38% to 56.62%. Motivated by these observations, we develop a new training strategy to fine-tune a new flaky-test classifier, FlakyLens, that improves the classification F1-score to 67.50% (10.88pp higher than the state-of-the-art). We also compare FlakyLens against recent state-of-the-art pre-trained LLMs on the same classification task. Our results show that FlakyLens consistently outperforms these models, highlighting that general-purpose LLMs still fall short on this specialized task.
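The F1 figures in this paragraph are compared in percentage points (pp), that is, as absolute differences between two percentages rather than relative changes. The short sketch below reproduces that arithmetic from the three reported scores; the helper names are illustrative and not taken from the paper.

```python
def pp_change(before: float, after: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    return (after - before) / before * 100.0


# F1-scores (in %) as reported in the abstract.
prior_reported_f1 = 85.38   # prior classifier, original setup
prior_corrected_f1 = 56.62  # same classifier, corrected design and dataset
flakylens_f1 = 67.50        # FlakyLens on the corrected setup

drop_pp = pp_change(prior_reported_f1, prior_corrected_f1)
gain_pp = pp_change(prior_corrected_f1, flakylens_f1)
gain_rel = relative_change(prior_corrected_f1, flakylens_f1)

print(f"drop after correction: {drop_pp:.2f}pp")
print(f"FlakyLens gain: {gain_pp:.2f}pp ({gain_rel:.1f}% relative)")
```

Keeping the two notions separate matters here because the same 10.88pp gain corresponds to a noticeably larger relative improvement over the corrected baseline.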
Using our improved flaky-test classifier, we identify the important tokens in the test code that influence the model in making correct or incorrect predictions. By leveraging attribution scores computed per code token in each test, we investigate which tokens have the greatest impact on the flaky-test classifier’s decision-making for each flaky test category. To assess the influence of these important tokens, we introduce adversarial perturbations into the tests and observe whether the model’s predictions change. Our findings show that, when perturbing the most important tokens, the classification accuracy can drop by as much as 17.04pp. This highlights the critical role of these tokens in guiding the model’s predictions.
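The attribute-then-perturb procedure described in this paragraph can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a hypothetical keyword-weight scorer in place of the fine-tuned model, uses simple leave-one-out (occlusion) attribution since the abstract does not name the attribution method, and the category names and weights are made up for illustration.

```python
from typing import Dict, List

# Hypothetical keyword weights standing in for a fine-tuned classifier.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "async-wait": {"sleep": 2.0, "await": 1.5, "timeout": 1.0},
    "time": {"now": 2.0, "date": 1.5, "timezone": 1.0},
}
MASK = "<mask>"


def scores(tokens: List[str]) -> Dict[str, float]:
    """Toy classifier: sum the keyword weights per category."""
    return {cat: sum(w.get(t, 0.0) for t in tokens) for cat, w in WEIGHTS.items()}


def predict(tokens: List[str]) -> str:
    """Return the category with the highest score."""
    s = scores(tokens)
    return max(s, key=s.get)


def occlusion_attributions(tokens: List[str], label: str) -> List[float]:
    """Per-token attribution: drop in the label's score when that token is masked."""
    base = scores(tokens)[label]
    result = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        result.append(base - scores(masked)[label])
    return result


def perturb_top_k(tokens: List[str], attributions: List[float], k: int) -> List[str]:
    """Mask the k tokens with the highest attribution scores."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    top = set(order[:k])
    return [MASK if i in top else t for i, t in enumerate(tokens)]


test_tokens = ["thread", "sleep", "100", "assert", "now", "await"]
label = predict(test_tokens)
attr = occlusion_attributions(test_tokens, label)
perturbed = perturb_top_k(test_tokens, attr, k=2)
flipped = predict(perturbed) != label
print(label, attr, perturbed, flipped)
```

In this toy setting, masking the two highest-attribution tokens changes the predicted category, which is the kind of prediction change the perturbation experiment looks for; with a real model, the accuracy over many perturbed tests would be compared against the unperturbed accuracy.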
Fri 17 Oct (displayed time zone: Perth)
13:45 - 15:30
13:45 | 15m | Talk | An Empirical Evaluation of Property-Based Testing in Python | OOPSLA
14:00 | 15m | Talk | Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM | OOPSLA | Ao Li (Carnegie Mellon University), Byeongjee Kang (Carnegie Mellon University), Vasudev Vikram (Carnegie Mellon University), Isabella Laybourn (Carnegie Mellon University), Samvid Dharanikota (Efficient Computer), Shrey Tiwari (Carnegie Mellon University), Rohan Padhye (Carnegie Mellon University)
14:15 | 15m | Talk | Fuzzing C++ Compilers via Type-Driven Mutation | OOPSLA | Bo Wang (Beijing Jiaotong University), Chong Chen (Beijing Jiaotong University), Ming Deng (Beijing Jiaotong University), Junjie Chen (Tianjin University), Xing Zhang (Peking University), Youfang Lin (Beijing Jiaotong University), Dan Hao (Peking University), Jun Sun (Singapore Management University)
14:30 | 15m | Talk | Interleaving Large Language Models for Compiler Testing | OOPSLA
14:45 | 15m | Talk | Model-guided Fuzzing of Distributed Systems | OOPSLA | Ege Berkay Gulcan (Delft University of Technology), Burcu Kulahcioglu Ozkan (Delft University of Technology), Rupak Majumdar (MPI-SWS), Srinidhi Nagendra (IRIF, Chennai Mathematical Institute)
15:00 | 15m | Talk | Tuning Random Generators: Property-Based Testing as Probabilistic Programming | OOPSLA | Ryan Tjoa (University of Washington; Jane Street), Poorva Garg (University of California, Los Angeles), Harrison Goldstein (University at Buffalo, the State University of New York at Buffalo), Todd Millstein (University of California at Los Angeles), Benjamin C. Pierce (University of Pennsylvania), Guy Van den Broeck (University of California at Los Angeles)
15:15 | 15m | Talk | Understanding and Improving Flaky Test Classification | OOPSLA | Shanto Rahman (The University of Texas at Austin), Saikat Dutta (Cornell University), August Shi (The University of Texas at Austin)