Parallel and concurrent software sometimes exhibit incorrect behavior because of
unintended interference between different threads of execution. Common classes
of concurrency bugs include data races, deadlocks, and atomicity violations.
These bugs are often non-deterministic and hard to find without sophisticated
tools. We present Active Testing, a methodology to effectively find concurrency
bugs that scales to large distributed memory parallel systems.
Active Testing combines the coverage and predictive power of program analysis
with the familiarity of testing. It works in two phases: in the predictive
analysis phase, a program is executed and monitored for potential concurrency
bugs and in the testing phase, Active Testing re-executes the program while
controlling the thread schedules in an attempt to reproduce the bug predicted in
the first phase.
We have implemented Active Testing for multi-threaded Java programs in the
CalFuzzer framework. We have also developed UPC-Thrille, an Active Testing framework
for Unified Parallel C (UPC) programs written in the Single Program Multiple
Data (SPMD) programming model combined with the Partitioned Global Address Space
(PGAS) memory model. We explain in detail the design decisions and optimizations
that were necessary to scale Active Testing to thousands of cores. We present
extensions to UPC-Thrille that support hybrid memory models as well.
We evaluate the effectiveness of Active Testing by running our tools on several
Java and UPC benchmarks, showing that it can predict and confirm real
concurrency bugs with low overhead. We demonstrate the scalability of Active
Testing by running benchmarks with UPC-Thrille on large clusters with thousands
of cores.