Apache Tez

A unifying framework for modeling and building data processing applications

Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, Carlo Curino

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

76 Citations (Scopus)

Abstract

The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines that build upon the YARN resource management layer. The open-source implementation of MapReduce is slowly being replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated effort, with each new vertical engine re-implementing fundamental features (e.g. fault tolerance, security, straggler mitigation) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow-driven processing runtimes. Tez provides scaffolding and library components that can be used to quickly build scalable and efficient data-flow-centric engines. Central to our design is fostering component re-use without hindering customizability of the performance-critical data plane. This is in fact the key differentiator with respect to the previous generation of systems (e.g. Dryad, MapReduce) and even emerging ones (e.g. Spark), which provided and mandated a fixed data plane implementation. Furthermore, Tez provides native support for building runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementations on popular benchmarks (TPC-DS, TPC-H) and production workloads.

Original language: English
Title of host publication: SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery (ACM)
Pages: 1357-1369
Number of pages: 13
Volume: 2015-May
ISBN (Electronic): 9781450327589
DOIs: https://doi.org/10.1145/2723372.2742790
Publication status: Published - 27-05-2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31-05-2015 to 04-06-2015

Conference

Conference: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country: Australia
City: Melbourne
Period: 31-05-15 to 04-06-15

Fingerprint

Engines
Electric sparks
Fault tolerance
Ecosystems
Processing

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Cite this

Saha, B., Shah, H., Seth, S., Vijayaraghavan, G., Murthy, A., & Curino, C. (2015). Apache Tez: A unifying framework for modeling and building data processing applications. In SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Vol. 2015-May, pp. 1357-1369). Association for Computing Machinery (ACM). https://doi.org/10.1145/2723372.2742790
@inproceedings{e7438bf21d724a17b33d505c88acc48d,
title = "Apache Tez: A unifying framework for modeling and building data processing applications",
author = "Bikas Saha and Hitesh Shah and Siddharth Seth and Gopal Vijayaraghavan and Arun Murthy and Carlo Curino",
year = "2015",
month = "5",
day = "27",
doi = "10.1145/2723372.2742790",
language = "English",
volume = "2015-May",
pages = "1357--1369",
booktitle = "SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",
}
