Correct Black-Box Monitors for Distributed Deadlock Detection: Formalisation and Implementation
This program is tentative and subject to change.
Many software applications rely on concurrent and distributed (micro)services that interact via message- passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the “core” of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging.
In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and can detect deadlocks by only observing the incoming and outgoing messages. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and we formalise a distributed black-box monitoring algorithm that we prove sound and complete (i.e., identifies deadlocked services without false positives nor false negatives). We also implement our result in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance.
This is the first work that formalises, proves the correctness, and implements distributed black-box monitors for deadlock detection. Our formalisation and theoretical results are mechanised in Coq. DDMon is the companion artifact of this paper.