Democratizing data with an internal data pipeline platform


This talk is about why and how we built our internal data pipeline platform.

At Indix, we have data in many formats: HTML pages, Thrift records, Avro records, and the usual culprits, CSVs and other plain-text formats. Our datasets range from a few KBs to many TBs, and from a few hundred rows to billions of records. All of this data, in one form or another, is consumed by engineers, product managers, the customer success team, and even our CEO.

Our biggest challenge was knowing what data existed and where, and how to access it efficiently while balancing cost and the productivity of the people involved. We had to make do with ad hoc Scalding jobs. There was no single place where people could discover the datasets we had, what format they were in, where they were stored, and how frequently new versions were published. Running jobs was also not straightforward, since even finding a cluster to use was not trivial. To democratize access to data and make it easy for anyone in the organization to work and play with it, we set about building a data pipeline platform for our internal users.

Built on Spark, the platform lets users define datasets (along with their schemas) and create pipelines to work with them. Pipelines can be configured through a wizard-based UI or a JSON config, and all jobs run on dedicated, auto-scaled Spark clusters. Predefined transformations to filter, project, and sample data, plus the ability to type in SQL queries, make the platform powerful yet simple for any type of user. Support for S3, SFTP, and even Google Sheets makes it usable for a variety of internal and customer use cases. The platform also lets us load the same data and perform similar operations on it from notebooks with just a couple of lines of client code, as sketched below. Today we run over 300 pipelines across more than 100 datasets, and thousands of versions of those datasets, on this platform.
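To give a feel for the notebook workflow, here is a minimal sketch of what such a thin client might look like. The names (PipelineClient, load) and the s3://datasets/... storage layout are illustrative assumptions, not the actual Indix API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical thin client: resolves a dataset by name and version from the
// platform's catalog and hands it back as a Spark DataFrame.
class PipelineClient(spark: SparkSession) {
  def load(dataset: String, version: String = "latest"): DataFrame =
    // Assumed layout: the catalog maps (dataset, version) to an S3 prefix.
    spark.read.parquet(s"s3://datasets/$dataset/$version")
}

// In a notebook (where `spark` is the ambient SparkSession), loading and
// querying a dataset then takes a couple of lines:
//   val products = new PipelineClient(spark).load("products")
//   products.createOrReplaceTempView("products")
//   spark.sql("SELECT brand, count(*) FROM products GROUP BY brand").show()
```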

The data pipeline platform has truly changed the way we ingest, manipulate, analyze, and egress data across the organization, and is on course to become a self-serve platform for our (external) customers too.

Speakers

Manoj Mahalingam

Principal Engineer, Indix


Saturday November 18, 2017 9:50am - 10:30am PST
Data