Thursday, November 16 • 3:00pm - 3:40pm
Real Time ML Pipelines in Multi-Tenant Environments


Serving machine learning results in real time has always been difficult. The sophistication required by ML models is naturally at odds with the low-latency requirements of a real-time pipeline. Multi-tenancy adds yet another layer of complexity: instead of a few global models, each tenant requires its own model trained on its own dataset, resulting in potentially hundreds of thousands of models (at Salesforce scale) that current big data frameworks are not designed for. On top of that, a common pitfall of ML pipelines is the divergence of feature engineering logic between the online and offline worlds: even though both accept the same type of data, the format and the processes handling it can differ drastically, leading to models being applied incorrectly. To address these concerns, we designed and implemented a system that isolates feature engineering of incoming data into a separate process that updates a global feature store while keeping the computation consistent with the offline batch training process. The pre-computed features in the feature store, together with the multi-tenant models, are then fed into a machine learning framework we built on top of Spark Streaming, generating scores and insights for thousands of tenants concurrently in real time.
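
To make the scoring path concrete, here is a minimal, hypothetical Spark Streaming sketch of the pattern the abstract describes (per-event feature lookup from a shared feature store, followed by scoring with the owning tenant's model). The `FeatureStore`, `ModelRegistry`, and `Model` names are placeholder interfaces invented for illustration, not APIs from the talk or from Salesforce's system.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.Queue

case class Event(tenantId: String, entityId: String)

// Placeholder interfaces standing in for the global feature store and a
// per-tenant model registry; real implementations would back these with
// external services or broadcast state.
trait Model extends Serializable { def score(features: Map[String, Double]): Double }
object FeatureStore { def lookup(tenantId: String, entityId: String): Map[String, Double] = Map.empty }
object ModelRegistry { def forTenant(tenantId: String): Option[Model] = None }

object MultiTenantScoring {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-tenant-scoring")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // In practice the events would come from a message bus (e.g. Kafka);
    // a queue stream keeps this sketch self-contained.
    val events = ssc.queueStream(new Queue[RDD[Event]]())

    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { event =>
          // Features are pre-computed by the separate feature-engineering
          // process and read from the shared feature store, which keeps
          // online scoring consistent with offline batch training.
          val features = FeatureStore.lookup(event.tenantId, event.entityId)
          ModelRegistry.forTenant(event.tenantId).foreach { model =>
            val score = model.score(features)
            println(s"tenant=${event.tenantId} entity=${event.entityId} score=$score")
          }
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```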

Speakers

Karl Skucha

Director of Engineering, Einstein, Salesforce

Yan Yang

Lead Data Engineer, Einstein, Salesforce


Thursday November 16, 2017 3:00pm - 3:40pm PST
Data