The consistent growth of DRAM memory bandwidth and capacity has enabled the computation of increasingly larger workloads in high-performance computing. However, the memory latency improvement over time is nominal, which severely bottlenecks the performance of modern systems. Modern computers rely on the exploitation of data locality using large cache hierarchies to keep the throughput high. Additionally, the effect of Dennard scaling is getting more pronounced in processor technologies, prompting the architects to employ multicore and multithreaded architectures to boost throughput by extracting parallelism. However, irregular applications do not enjoy the same performance gain from these techniques. The poor data locality in these applications renders the cache ineffective, and the dynamic data dependence causes sub-optimal parallel processing. This motivates the researchers to look beyond conventional CPU and GPU platforms to dedicated hardware accelerators for these applications.
In this dissertation, we examine two irregular applications – Parallel Discrete Events Simulation (PDES) and Graph Processing – to study their limitations in conventional systems and develop strategies for solving these challenges using hardware primitives to ultimately build generalized accelerator frameworks for these applications. We design dataflow architectures founded upon an event-driven execution model to increase the parallelism and memory bandwidth utilization in these applications.
The first application, PDES, is inherently event-driven. It demonstrates ordered irregular parallelism, which is characterized by strict ordering between tasks requiring stricter synchronization that constrains parallelism. First, we design an efficient hardware architecture for priority queues, which is the core component of an event-driven system. Then, we build a decoupled datapath with relaxed synchronization for robust transmission of events between the task processing logics and the queue in an optimistic approach that masks long memory access latency and promotes parallelism. This accelerator, named PDES-A, achieves 3.2x speedup over conventional 12-core Intel Xeon CPU with less than 15% of power consumption.
The second application, graph processing, possesses similar dynamic data dependence but with partial ordering between the tasks. We convert the traditional iterative execution model into an event-driven execution model. We demonstrate that fine-grained control over dynamic memory access patterns is achievable through strategic manipulation of events to maximize spatial locality and, hence, memory bandwidth utilization. We design GraphPulse, an asynchronous graph processing framework in hardware, exhibiting 28 times speedup over software framework running on a 12-core Intel Xeon CPU.
Finally, we study a more complex application, streaming graph analytics, to illustrate the extensibility of the event-driven model. We employ incremental recomputation of streaming data to reduce redundant computation. The application has multiple types of computation tasks and requires multi-phase execution. We develop an incremental graph computation algorithm suitable for the event-driven paradigm and subsequently extend GraphPulse to support streaming graphs using this algorithm. The streaming graph analytics accelerator, JetStream, can evaluate queries on streaming graphs 13x faster than a complete recomputation using GraphPulse. It also outperforms state-of-the-art software streaming graph frameworks by 18x on average.