Close

2019-04-24

Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber

Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber

Data serves little purpose if we cannot find it. Looking up individual records in the 100-plus petabytes of data accumulated at Uber lets us perform updates and gather valuable insights to help improve our services, such as delivering more accurate ETAs to riders and showing eaters their favorite food options. Querying data at this scale and having results in a timely fashion is no simple task. Still, it is essential so that teams at Uber can get the insights they need to deliver seamless and magical experiences to our customers.

We built Uber’s Big Data platform to support these insights by decoupling storage and query layers to scale each independently. We store analytical datasets on HDFS, register them as external tables, and serve them using query engines such as Apache Hive, Presto, and Apache Spark. This Big Data platform enables reliable and scalable analytics for the teams overseeing our services’ accuracy and continuous improvement.

Over the lifetime of a trip at Uber, new information gets updated to a trip datum during events such as trip creation, trip duration update, and rider review updates. Supporting an update requires looking up the location of data before modifying and persisting it. As the scale of these lookups increased to millions of operations per second, we found that open-source key-value stores could not meet our scalability requirements out of the box–they either compromise on throughput or correctness.

To reliably and consistently find the location of data, we developed a component called the Global Index. This component performs bookkeeping and lookup of the place of data in Hadoop tables. It provides high throughput, firm consistency, and horizontal scalability and facilitates our ability to update petabytes of data in Hadoop tables. In this article, we expand upon our existing Big Data series by explaining the challenges involved in solving this problem at a large scale and sharing how we leverage open-source software.

For the full article, please visit the Uber Blog at https://eng.uber.com/data-partitioning-global-indexing/

The article is by Nishith Agarwal and Kaushik Devarajaiah.