Qualifications: B.Tech, Computer Science, IIT Delhi; M.S., Engg. Mgmt., Santa Clara University; Ph.D., Computer Science, IIT Delhi
Title: Senior Research Scientist
Affiliation: IBM Research Labs, New Delhi
Contact Details: email@example.com
Title of Talk 1: Distributed Scheduling for Massively Parallel Systems
Further, movement of massive amounts (terabytes to petabytes) of data is very expensive, which necessitates affinity-driven computations. Therefore, distributed scheduling of parallel computations on multiple places needs to optimize multiple performance objectives: follow affinity maximally and keep space, time and message complexity low. Moreover, achieving good load balance can conflict with ensuring affinity, which leads to challenging trade-offs in distributed scheduling. In addition, parallel computations have data-dependent execution patterns, which require online scheduling to effectively optimize the orchestration of the computation as it unfolds. With the continuous demand for processing larger and larger data volumes (from petabytes to exabytes and beyond), one needs to ensure data scalability along with scalability with respect to the number of compute nodes and cores in the target system. Thus, the scheduling framework needs to consider I/O bottlenecks along with compute and memory bandwidth bottlenecks in the system to enable strong scalability and performance. Simultaneous consideration of these objectives makes distributed scheduling a particularly challenging problem.
With the advent of distributed memory architectures, a lot of recent research on distributed scheduling looks at multi-core and many-core clusters. A dynamic tasking library (HotSLAW) was developed for many-core clusters that uses a topology-aware hierarchical work-stealing strategy for both NUMA and distributed memory systems. All these recent efforts primarily achieve load balancing using (locality-aware) work stealing across the nodes in the system. Although this strategy works well for slightly irregular computations such as UTS on a geometric tree, it can result in parallel inefficiencies when the computation is highly irregular (UTS on a binomial tree) or when there are complicated trade-offs between affinity and load balance, as in sparse matrix benchmarks such as the Conjugate Gradient benchmark. Certain other approaches consider only limited control dependencies and no data dependencies in the parallel computation, which limits the scope of applicability of the scheduling framework.
In this talk, we present a novel distributed scheduling framework and algorithm (LDS) for multi-place parallel computations. It uses a unique combination of remote (inter-place) spawns and remote work steals to reduce scheduler overheads, which helps to dynamically maintain load balance across the compute nodes of the system while ensuring affinity maximally. Our design was implemented using the GASNet API and POSIX threads. On affinity and load-balance oriented benchmarks such as CG (Conjugate Gradient) and Kmeans clustering, we demonstrate strong performance and scalability on a 2048-node BG/P system. Using benchmarks such as UTS, we show that LDS has a lower space requirement than hierarchical work-stealing based approaches such as HotSLAW, and better performance than Charm++. We also explore how distributed machine learning can help in performance tuning of the distributed scheduling framework.
Title of Talk 2: Research Directions in Large-Scale Inverse Problems
This talk will present an overview of state-of-the-art techniques as well as the research challenges that must be overcome to realize the promise of inferring large-scale complex models from large-scale complex data.