Friday, November 17 • 2:30pm - 2:50pm
Druid Lookups for High Cardinality Dimensions

Druid is a high-performance, column-oriented, distributed data store. Lookups are a concept in Druid where dimension values are (optionally) replaced with new values. The common use case of query-time lookups is to replace one dimension value (e.g. an ID) with another value (e.g. a human-readable name). This is similar to a star-schema join, and query-time lookups are how Druid provides its limited support for joins.

Very small lookups (a few dozen to a few hundred keys) can be passed at query time as a "map" lookup in the dimension spec. For large lookups, Druid has an extension called namespaced lookups. Namespaced lookups are appropriate when the lookup is too large to pass at query time, or when the data should reside in and be handled by the Druid servers rather than travel with each query. But Druid's namespaced lookups have the following limitations:

• They are not suitable for high cardinality dimensions
• They do not scale to large data on the order of hundreds of millions of rows
• Support is limited to one key column with a corresponding value column
• Real-time updates to the lookup data are not possible

These limitations encouraged us to develop a highly scalable, multi-column, configurable Druid lookup framework that supports real-time updates to lookup data. The framework uses an embeddable persistent key-value data store, Kafka for messaging, and HDFS for deep storage.
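
For readers unfamiliar with the query-time "map" lookup mentioned in the abstract, here is a minimal sketch of what such a query body might look like, built in Python. The data source, dimension names, and mapping values are hypothetical, and the field names follow the `lookup` dimension spec as documented by Druid, so they should be checked against the Druid version in use. The `apply_map_lookup` helper is only a local imitation of the replacement semantics, not Druid code.

```python
import json


def map_lookup_dimension(dimension, output_name, mapping, retain_missing=True):
    """Build a 'lookup' dimension spec that carries a small inline map."""
    return {
        "type": "lookup",
        "dimension": dimension,        # column stored in Druid (e.g. an ID)
        "outputName": output_name,     # name of the column in the result
        "retainMissingValue": retain_missing,
        "lookup": {"type": "map", "map": dict(mapping), "isOneToOne": False},
    }


def apply_map_lookup(spec, value):
    """Local imitation of how the spec rewrites a single dimension value."""
    mapped = spec["lookup"]["map"].get(value)
    if mapped is not None:
        return mapped
    # Unmapped keys are either kept as-is or replaced, depending on the spec.
    if spec["retainMissingValue"]:
        return value
    return spec.get("replaceMissingValueWith")


# Hypothetical ID -> name mapping, small enough to ship with the query.
spec = map_lookup_dimension(
    "advertiser_id", "advertiser_name", {"101": "Acme", "102": "Globex"}
)

# Hypothetical groupBy query that uses the lookup as one of its dimensions.
query = {
    "queryType": "groupBy",
    "dataSource": "ad_events",
    "granularity": "all",
    "intervals": ["2017-11-01/2017-11-18"],
    "dimensions": [spec],
    "aggregations": [{"type": "count", "name": "rows"}],
}

print(json.dumps(query, indent=2))
```

Because the whole map is serialized into every query, this approach only makes sense for the very small lookups the abstract describes; larger mappings need to live on the Druid servers.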

Pavan


Principal Software Development Engineer, Oath

Friday November 17, 2017 2:30pm - 2:50pm PST