Statically Analyzing the Dataflow of R Programs (SPLASH 2025 - OOPSLA)

Sun 12 - Sat 18 October 2025 Singapore

co-located with ICFP/SPLASH 2025

Who

Florian Sihler, Matthias Tichy

Track

SPLASH 2025 OOPSLA

Time Zone

The program is currently displayed in (GMT+08:00) Perth.

Use conference time zone: (GMT+08:00) PerthSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Fri 17 Oct 2025 11:15 - 11:30 at Orchid Plenary Ballroom - Compilation 3 Chair(s): Hidehiko Masuhara

Abstract

The R programming language is primarily designed for statistical computing and mostly used by researchers without a background in computer science. R provides a wide range of dynamic features and peculiarities that are difficult to analyze statically like dynamic scoping and lazy evaluation with dynamic side-effects. At the same time, the R ecosystem lacks sophisticated analysis tools that support researchers in understanding and improving their code. In this paper, we present a novel static dataflow analysis framework for the R programming language that is capable of handling the dynamic nature of R programs and produces the dataflow graph of given R programs. This graph can be essential in a range of analyses, including program slicing, which we implement as a proof of concept. The core analysis works as a stateful fold over a normalized version of the abstract syntax tree of the R program, which tracks (re-)definitions, values, function calls, side effects, external files, and a dynamic control flow to produce one dataflow graph per program. We evaluate the correctness of our analysis using output equivalence testing on a manually curated dataset of 779 sensible slicing points from executable real-world R scripts. Additionally, we use a set of systematic test cases based on the capabilities of the R language and the implementation of the R interpreter and measure the runtime as well as the memory consumption on a set of 4,230 real-world R scripts and 20,815 packages available on R’s package manager CRAN. Furthermore, we evaluate the recall of our program slicer, its accuracy using shrinking, and its improvement over the state of the art. We correctly analyze almost all programs in our equivalence test suite, preserving the identical output for 99.7 % of the manually curated slicing points. On average, we require 153.2 ms to analyze the dataflow of a given file and around 211.6 kB for the dataflow graph. This shows that our analysis is capable of analyzing real-world sources quickly and correctly. Our slicer achieves an average reduction of 84.8 % of tokens indicating its potential to improve program comprehension.

File attachments

Slides (oopsla-presentation.pdf)	2.22MiB

Florian Sihler

Ulm University

Germany

Matthias Tichy