TailTracer: Continuous Tail Tracing for Production Use
Despite extensive in-house testing, bugs often escape to deployed software. Whenever a failure occurs in production software, it is desirable to collect as much execution information as possible so as to help developers reproduce, diagnose and fix the bug. To reconcile the tension between trace capability, runtime overhead, and trace scale, we propose continuous tail tracing for production use. Instead of capturing only crash stacks, we produce the complete sequence of function calls and returns. Importantly, to avoid the overwhelming stress to I/O, storage, and network transfer caused by the tremendous amount of trace data, we only retain the final segment of trace. To accomplish it, we design a novel trace decoder to support precise tail trace decoding, and an effective path-based instrumentation-site selection algorithm to reduce overhead. We implemented our approach as a tool called TailTracer on top of LLVM, and conducted the evaluations over the SPEC CPU 2017 benchmark suite, the open-source database system, and real-world bugs. The experimental results validate that TailTracer achieves low-overhead tail tracing, while providing more informative trace data than the baseline.