Qualifications: PhD, Computer Science, Rice University and BTech, Computer Science, IIT Kharagpur
Title: Senior Researcher
Affiliation: IBM Research India, Bangalore
Contact Details: Animesh.Nandi@in.ibm.com
Short CV: Animesh Nandi is a senior researcher at IBM Research India. His research agenda centres on building scalable Big Data platforms for IT Operational Analytics (ITOA) that collect and analyze large volumes of historical machine-data in a datacenter/cloud to enable root-cause diagnosis of failures. His recent research focuses on building efficient indexing/storage techniques in Big Data platforms to support time-travel queries on the historic state of the datacenter, and on building scalable data-mining algorithms that pinpoint the root cause of failures using historical log data and fine-grained machine data.
Before this, he was a researcher at Bell Labs Research India, Alcatel-Lucent (2009–2013), where he conceptualized and led the P3 project, which built a novel decentralized cloud middleware over federated non-colluding proxies to enable privacy-preserving personalized services. He was awarded the prestigious MIT TR35-India Young Innovators award in 2012 for the cutting-edge technology innovations in the P3 project, and gave invited talks on this work at multiple venues, including the Alan Turing Centenary Event organized by the ACM India Chapter. He has also won multiple awards from Alcatel-Lucent.
Animesh's research interests and expertise lie at the intersection of Internet-scale networked/distributed systems and distributed data mining, and in applying these to build scalable platforms and algorithms for data analytics and data privacy. He has more than 18 patents accepted for filing, and more than 12 research papers in top international conferences. He received his PhD and Master's from Rice University (with significant time spent at the Max Planck Institute for Software Systems, Germany), and a BTech from the Indian Institute of Technology (IIT), Kharagpur.
Title of Talk 1: Enabling Efficient Time-Travel Primitives in Big Data Platforms
Synopsis: Big Data platforms for IT Operational Analytics (ITOA) collect historical machine-data from a datacenter in order to diagnose problems, and must therefore be able to efficiently analyze the historical state of the datacenter using "time-travel" queries. Although work on temporal databases enabled a limited form of time-travel queries over relational data, the newer generation of Big Data platforms must support new types of queries, and therefore needs new capabilities for efficient time-travel. Our long-term goal in this research is to identify techniques for enabling efficient time-travel in a wide variety of today's Big Data platforms - from temporal text search-indexes, to temporal graph databases, to temporal NoSQL key/value stores, and more. We have kickstarted this research by developing a technique for supporting time-travel over a keyword/text search-index, and have built a datacenter search application that offers a time-travel text-search query interface over historical machine data collected from a datacenter. We have implemented our technology as extensions to Solr, the open-source Lucene-based search-index. Beyond time-travel over a text/keyword search-index, we are currently working on enabling efficient historical time-travel graph queries on graph databases that store temporally evolving graphs.
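The core idea can be illustrated with a toy in-memory sketch: attach a validity interval to every posting in an inverted index, so a query can be answered "as of" any past time. This is only an illustration of the concept, not the actual Solr/Lucene extension described in the talk; the class and entity names are hypothetical.

```python
from collections import defaultdict

class TimeTravelIndex:
    """Toy inverted index where each posting carries a validity interval,
    illustrating the idea behind time-travel text search (a sketch only,
    not the Solr-based system from the talk)."""

    def __init__(self):
        # term -> list of (doc_id, valid_from, valid_to)
        self.postings = defaultdict(list)

    def add(self, doc_id, terms, valid_from, valid_to=float("inf")):
        for term in terms:
            self.postings[term].append((doc_id, valid_from, valid_to))

    def expire(self, doc_id, at_time):
        # close every still-open validity interval for doc_id
        for term in self.postings:
            self.postings[term] = [
                (d, f, at_time) if d == doc_id and t == float("inf") else (d, f, t)
                for d, f, t in self.postings[term]
            ]

    def search(self, term, as_of):
        """Return doc ids whose version containing `term` was live at `as_of`."""
        return {d for d, f, t in self.postings[term] if f <= as_of < t}

# Hypothetical machine-data example: host42's software inventory over time.
idx = TimeTravelIndex()
idx.add("host42", ["nginx", "mysql"], valid_from=100)
idx.expire("host42", at_time=200)          # state changed at t=200
idx.add("host42", ["nginx"], valid_from=200)

print(idx.search("mysql", as_of=150))      # {'host42'}
print(idx.search("mysql", as_of=250))      # set()
```

A production index would of course store intervals compactly inside the posting lists and prune expired segments, which is where the real engineering in the Solr extensions lies.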
Title of Talk 2: Detecting Anomalous System Behaviour Using Historical Log Data
Synopsis: The IT Operational Analytics (ITOA) platforms of today collect system/application log data and display the errors/warnings logged by faulty modules as an aid to problem diagnosis. Relying on errors/warnings in log messages, however, suffers from two limitations. First, errors/warnings are often benign and are thrown even when the system is healthy. Second, faults often do not manifest themselves by printing errors/warnings at all. Our research goal is to develop robust techniques for detecting and pinpointing anomalous system behaviour. Our approach is to use log data to mine a model of healthy system behaviour - the typical program control flow graph (CFG) exhibited within and across modules of a distributed system - and to detect anomalous behaviour as violations of this healthy CFG. In contrast to prior work that requires program instrumentation to embed parameters, or relies on application-specific knowledge of parameters passed across method/API calls, our approach focuses on mining such CFGs in a generic way by exploiting statistical temporal correlations of log messages. The challenge in using temporal correlations, however, is dealing with the high degree of interleaving among log traces. In this research we develop a solution to this problem, along with techniques that use the mined CFG to raise anomaly alerts by detecting the location in the CFG where the expected sequence of events was violated.
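A minimal sketch of the detection side of this idea: learn the set of event-template transitions seen in healthy traces (a crude CFG proxy), then flag the first position where a new trace takes a transition never observed in healthy runs. This toy assumes traces are already separated per request; untangling interleaved traces statistically is exactly the hard part the talk addresses. All event names are hypothetical.

```python
from collections import defaultdict

def mine_transitions(healthy_traces):
    """Learn observed event-template transitions from healthy,
    pre-separated log traces (a simplification of the CFG mining
    in the talk, which works on interleaved logs)."""
    transitions = defaultdict(set)
    for trace in healthy_traces:
        for a, b in zip(trace, trace[1:]):
            transitions[a].add(b)
    return transitions

def find_anomaly(trace, transitions):
    """Return the index of the first event that violates the mined CFG,
    or None if the whole trace matches healthy behaviour."""
    for i, (a, b) in enumerate(zip(trace, trace[1:])):
        if b not in transitions.get(a, set()):
            return i + 1   # position of the unexpected event
    return None

healthy = [
    ["REQ_RECV", "AUTH_OK", "DB_QUERY", "RESP_SENT"],
    ["REQ_RECV", "AUTH_OK", "CACHE_HIT", "RESP_SENT"],
]
cfg = mine_transitions(healthy)

print(find_anomaly(["REQ_RECV", "AUTH_OK", "CACHE_HIT", "RESP_SENT"], cfg))  # None
print(find_anomaly(["REQ_RECV", "DB_QUERY", "RESP_SENT"], cfg))              # 1
```

Reporting the violating position, rather than just a binary alert, is what lets the mined CFG localize where in the control flow the fault first manifested.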
Title of Talk 3: Towards Automating Root-Cause Diagnosis in a Datacenter via Analysis of Historical Fine-grained Machine-State Data
Synopsis: The eventual goal of IT Operational Analytics (ITOA) platforms is to enable fast and accurate root-cause analysis (RCA) in a cloud/datacenter. Despite being a decade-old problem, RCA is still miles away from being fully automated. Although current RCA systems can track faulty components using metric data and then zoom into the problem using errors/warnings in the log data, they leave much of the heavy lifting of identifying the actual root cause to manual investigation by a subject-matter expert. In our research, we are revisiting the RCA problem from a new perspective: we make the case that periodically collecting fine-grained machine-state data at the operating-system level (processes, packages, connections, files, etc.) can enable accurate pinpointing of the root cause without any human intervention. Although fine-grained machine-state data represents an opportunity to fully automate RCA, the number of fine-grained machine-state entities is several orders of magnitude larger than the number of metrics or logs monitored by current systems. In this talk, we describe our ongoing research and initial results towards fully automating RCA using fine-grained machine-state data.