Many online data sets grow incrementally over time as new entries are slowly added and existing entries are deleted or modified. Taking advantage of this incrementality, systems for incremental bulk data processing, such as Google's Percolator, can achieve efficient updates. This efficiency, however, comes at the price of losing compatibility with the simple programming models offered by non-incremental systems, e.g., MapReduce, and, more importantly, requires the programmer to implement application-specific dynamic/incremental algorithms, ultimately increasing algorithm and code complexity.
This project describes the architecture, implementation, and evaluation of a generic MapReduce framework, named Incoop, for incremental computations. Incoop detects changes to the inputs and enables the automatic update of the outputs by employing an efficient, fine-grained result re-use mechanism. To achieve efficiency without sacrificing transparency, we adopt recent advances in the area of programming languages to systematically identify the shortcomings of task-level memoization approaches, and address them using several novel techniques: a storage system that stores the inputs of consecutive runs, a contraction phase that makes the incremental computation of the reduce tasks more efficient, and a scheduling algorithm for Hadoop that is aware of the location of previously computed results.
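To illustrate the core ideas of task-level memoization and the contraction phase, here is a minimal sketch in Python. It is not the actual Incoop/Hadoop implementation; the names (`IncrementalWordCount`, `chunk_key`) and the in-memory cache are illustrative assumptions. Map results are memoized by input-chunk fingerprint, and partial reduce results are combined in a balanced tree (the contraction idea), so a change to one chunk only recomputes that chunk's map task plus the O(log n) combiners on its path to the root.

```python
import hashlib

def chunk_key(obj):
    """Content fingerprint used as a memoization key (illustrative)."""
    return hashlib.sha1(repr(obj).encode()).hexdigest()

class IncrementalWordCount:
    def __init__(self):
        self.memo = {}                                # cache: key -> partial result
        self.stats = {"map_runs": 0, "combine_runs": 0}

    def map_chunk(self, chunk):
        """Memoized map task: word counts for one input chunk."""
        key = ("map", chunk_key(chunk))
        if key not in self.memo:
            self.stats["map_runs"] += 1
            counts = {}
            for word in chunk.split():
                counts[word] = counts.get(word, 0) + 1
            self.memo[key] = counts
        return self.memo[key]

    def combine(self, left, right):
        """Memoized combiner: merges two partial word-count dicts."""
        key = ("combine", chunk_key((sorted(left.items()), sorted(right.items()))))
        if key not in self.memo:
            self.stats["combine_runs"] += 1
            merged = dict(left)
            for w, c in right.items():
                merged[w] = merged.get(w, 0) + c
            self.memo[key] = merged
        return self.memo[key]

    def run(self, chunks):
        """Map all chunks, then contract pairwise in a balanced tree."""
        partials = [self.map_chunk(c) for c in chunks]
        while len(partials) > 1:
            nxt = []
            for i in range(0, len(partials), 2):
                if i + 1 < len(partials):
                    nxt.append(self.combine(partials[i], partials[i + 1]))
                else:
                    nxt.append(partials[i])        # odd leftover passes through
            partials = nxt
        return partials[0] if partials else {}

wc = IncrementalWordCount()
chunks = ["a b", "b c", "c d", "d e"]
wc.run(chunks)                 # first run: 4 map tasks, 3 combines
chunks[3] = "d f"              # modify one input chunk
wc.run(chunks)                 # incremental run: only 1 new map task,
                               # and only the 2 combines on its root path
```

The contraction tree is what keeps incremental reduce cheap: with a flat reduce, any input change would invalidate the single reduce task over all map outputs, whereas the tree localizes recomputation to a logarithmic number of combiners.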
- Umut Acar
- Istemi Ekin Akkus
- Pramod Bhatotia
- Manos Kapritsos (UT Austin)
- Rafael Pasquini (Unicamp, Brazil)
- Rodrigo Rodrigues
- Alexander Wieder
- Large-scale Incremental Data Processing with Change Propagation [pdf]
Pramod Bhatotia, Alexander Wieder, Istemi Ekin Akkus, Rodrigo Rodrigues, and Umut Acar.
HotCloud’11: 3rd Usenix Workshop on Hot Topics in Cloud Computing.
- Incoop: MapReduce for Incremental Computations [pdf]
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut Acar, and Rafael Pasquini.
SOCC’11: 2nd ACM/USENIX Symposium on Cloud Computing.