In Proceedings of the Workshop on Big Data Benchmarking, pages 20-27, 2013.
Big data challenges are end-to-end problems. Big data typically has to be preprocessed, moved, loaded, processed, and stored many times before it yields value. This
has led to the creation of big data pipelines. Current benchmarks related to
big data only focus on isolated aspects of this pipeline, usually the
processing, storage, and loading aspects. To date, no benchmark has been
presented that covers the end-to-end aspect of big data pipelines.
In this paper, we discuss the necessity of ETL-like tasks in big data benchmarking and propose the Parallel Data Generation Framework (PDGF) for their data generation. PDGF is a generic data generator that was implemented at the University of Passau and is currently used in TPC benchmarks.