GUI-based data processing systems simplify and accelerate data tasks through user-friendly interfaces, eliminating the need for extensive coding skills. This accessibility allows analysts to design, modify, and execute workflows using intuitive drag-and-drop operations and visual representations. Incorporating visualization operators into these systems to present processed results enables analysts to quickly gain insights, understand patterns, and make informed decisions from complex data. As analysts observe the results, they may uncover new trends that raise further questions or hypotheses, prompting modifications to the workflow. Given the iterative nature of data analytics, such modifications are common, and each change generates a new version of the workflow. The results produced by executing these versions are materialized, enabling users to refer back to them to reproduce and replicate past experiments and thereby ensure the validity of reported outcomes. Although each iteration strives for improved results, the results of a new iteration are in many cases equivalent to those of previous runs. Given the significant time required to execute analytical tasks on large datasets, it becomes imperative to reduce redundant computations by reusing previously stored results. Doing so, however, requires identifying and verifying the equivalence of results across different runs.
This dissertation addresses these pressing needs by enhancing iterative data analytics within GUI-based data processing systems through the integration of visualization, version control, and result reuse. It is structured into four main parts.
The first part addresses the challenge of incrementally visualizing large spatial networks while minimizing visual clutter. To tackle this issue, we introduce GSViz, a general-purpose middleware-based solution consisting of two modules: edge-aware vertex clustering and incremental edge bundling.
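To illustrate the flavor of this two-stage pipeline, the sketch below clusters vertices on a spatial grid and then aggregates the edges between clusters into weighted super-edges. The grid-based scheme, function names, and data layout are illustrative assumptions, not GSViz's actual algorithms.

```python
# A minimal sketch of a two-stage cluster-then-aggregate pipeline;
# all names and the grid-based scheme are hypothetical.
from collections import defaultdict

def cluster_vertices(vertices, cell_size):
    """Snap each (x, y) vertex to a grid cell, one cluster per cell.
    An edge-aware scheme would also weigh incident edges when merging."""
    clusters = defaultdict(list)
    for vid, (x, y) in vertices.items():
        cell = (int(x // cell_size), int(y // cell_size))
        clusters[cell].append(vid)
    # Map every vertex to its cluster's centroid for rendering.
    assignment = {}
    for members in clusters.values():
        cx = sum(vertices[v][0] for v in members) / len(members)
        cy = sum(vertices[v][1] for v in members) / len(members)
        for v in members:
            assignment[v] = (cx, cy)
    return assignment

def aggregate_edges(edges, assignment):
    """Collapse parallel edges between the same pair of clusters, keeping
    a count that a later bundling step could use as an edge weight."""
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = assignment[u], assignment[v]
        if cu != cv:
            weights[tuple(sorted((cu, cv)))] += 1
    return weights

vertices = {1: (0.2, 0.3), 2: (0.4, 0.1), 3: (5.5, 5.1), 4: (5.9, 5.8)}
edges = [(1, 3), (2, 3), (2, 4)]
assignment = cluster_vertices(vertices, cell_size=1.0)
print(aggregate_edges(edges, assignment))  # one super-edge of weight 3
```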
The second part presents Drove, a framework that tracks changes to workflows, their environment dependencies, their executions, and the generated results. With Drove, researchers and analysts can trace the evolution of a workflow and understand the impact of modifications on its final outcomes.
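As a rough illustration of what such tracking entails, the sketch below captures one workflow version together with its environment and a pointer to its materialized result. All field and class names are hypothetical and do not reflect Drove's actual schema.

```python
# A minimal sketch of a per-version provenance record; names are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib, json

@dataclass
class VersionRecord:
    workflow: dict      # the workflow DAG (operators and links)
    environment: dict   # e.g., package names mapped to versions
    executed_at: str    # execution timestamp
    result_uri: str     # pointer to the materialized result

    def content_hash(self) -> str:
        """Hash the workflow and environment so identical versions dedupe."""
        payload = json.dumps({"w": self.workflow, "e": self.environment},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

record = VersionRecord(
    workflow={"operators": ["scan", "filter", "viz"], "links": [[0, 1], [1, 2]]},
    environment={"python": "3.11", "pandas": "2.2"},
    executed_at=datetime.now(timezone.utc).isoformat(),
    result_uri="s3://results/run-001",
)
print(record.content_hash()[:12])
```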
In the third part, we present Veer, an algorithm for verifying the equivalence of two complex workflow versions, along with a series of optimization techniques that improve the performance of the baseline algorithm.
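The sketch below reduces equivalence checking to comparing canonical forms of two workflow DAGs, which only catches syntactic equivalence; Veer's actual decision procedure and optimizations go well beyond this, and the DAG representation here is an assumption for illustration.

```python
# A minimal sketch: two versions are deemed equivalent if their DAGs have
# the same canonical link set. A real verifier would also reason about
# semantics-preserving edits; this syntactic check is an assumption.
def canonicalize(workflow: dict) -> frozenset:
    """Order-insensitive canonical form: the set of
    (source operator, destination operator) links in the DAG."""
    return frozenset(tuple(link) for link in workflow["links"])

def equivalent(v1: dict, v2: dict) -> bool:
    return canonicalize(v1) == canonicalize(v2)

old = {"links": [("scan", "filter"), ("filter", "viz")]}
new = {"links": [("filter", "viz"), ("scan", "filter")]}  # reordered listing
print(equivalent(old, new))  # True: same DAG despite different ordering
```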
Lastly, we introduce Raven, an optimization framework that ranks previously executed workflow versions and tests their equivalence against a newly requested workflow version. By reusing the results these versions generated, Raven avoids redundant computations and significantly improves the performance of handling new workflow execution requests. Raven retrieves the previous versions from Drove and delegates equivalence testing to Veer.
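Putting the pieces together, the loop below sketches how such a framework could rank stored versions and fall back to fresh execution when no equivalent version is found. The ranking signal, data shapes, and function names are assumptions for illustration, not Raven's actual interfaces.

```python
# A minimal sketch of a rank-then-verify reuse loop; names are hypothetical.
from dataclasses import dataclass

@dataclass
class PastVersion:
    workflow: frozenset   # canonical link set, as in the sketch above
    result_uri: str
    edit_distance: int    # hypothetical ranking signal vs. the new version

def execute_with_reuse(new_workflow, past_versions, is_equivalent, run):
    """Reuse a prior result if an equivalent version exists, else execute."""
    # Rank candidates so the most promising versions are verified first.
    for old in sorted(past_versions, key=lambda v: v.edit_distance):
        if is_equivalent(old.workflow, new_workflow):
            return old.result_uri          # reuse the materialized result
    return run(new_workflow)               # no match: execute fresh

history = [PastVersion(frozenset({("scan", "viz")}), "s3://results/run-7", 1)]
new = frozenset({("scan", "viz")})
print(execute_with_reuse(new, history, lambda a, b: a == b,
                         lambda w: "s3://results/new-run"))
```

In this sketch, the store and checker stand in for the roles the text assigns to Drove and Veer, respectively: version retrieval and equivalence testing are delegated, while the framework itself only orders candidates and decides between reuse and re-execution.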