Data
Thursday, November 16

9:50am PST

Turning a Relational RDBMS Table into a Spark Datasource
This session presents a Spark DataSource implementation for integrating (joining) Big Data in HDFS or a NoSQL DBMS with Master Data in RDBMS tables. The session describes how to allow parallel and direct access to RDBMS tables from Spark, generate partitions of Spark JDBCRDDs based on the split pattern, and rewrite Spark SQL queries into the RDBMS SQL dialect. It also describes performance optimizations, including hooks in the JDBC driver for faster type conversions, pushing down predicates to the RDBMS, pruning partitions based on the WHERE clause, and projecting columns to the RDBMS table to reduce the amount of data returned to and processed by Spark.
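As a rough illustration of the optimizations listed in the abstract, here is a minimal Python sketch (not the speaker's implementation and not Spark's API) of how a data source might turn a column projection, a set of pushed-down predicates and a numeric split column into one SQL query per partition. The names `Predicate`, `split_ranges` and `partition_queries`, and the `customers` table in the usage note, are hypothetical. Partition pruning and dialect rewriting are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Predicate:
    """A simple comparison that the database can evaluate itself (hypothetical type)."""
    column: str
    op: str        # e.g. "=", "<", "<=", ">", ">="
    value: object


def sql_literal(value: object) -> str:
    """Render a Python value as a SQL literal; strings are quoted and escaped."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def split_ranges(lower: int, upper: int, num_partitions: int) -> List[Tuple[int, int]]:
    """Split [lower, upper) into contiguous half-open ranges, one per partition."""
    step, extra = divmod(upper - lower, num_partitions)
    ranges, start = [], lower
    for i in range(num_partitions):
        # The first `extra` partitions take one additional key each.
        end = start + step + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def partition_queries(
    table: str,
    columns: Sequence[str],
    predicates: Sequence[Predicate],
    split_column: str,
    lower: int,
    upper: int,
    num_partitions: int,
) -> List[str]:
    """Build one query per partition with projection and predicates pushed down."""
    select_list = ", ".join(columns) if columns else "*"
    pushed = [f"{p.column} {p.op} {sql_literal(p.value)}" for p in predicates]
    queries = []
    for lo, hi in split_ranges(lower, upper, num_partitions):
        where = pushed + [f"{split_column} >= {lo}", f"{split_column} < {hi}"]
        queries.append(f"SELECT {select_list} FROM {table} WHERE " + " AND ".join(where))
    return queries
```

Calling `partition_queries("customers", ["id", "name"], [Predicate("country", "=", "US")], "id", 0, 10, 3)` yields three queries, each selecting only `id` and `name`, each carrying the `country = 'US'` filter, and each restricted to its own `id` range, so the database returns only the rows and columns that Spark would actually need.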

Kuassi Mensah

Director Product Management, Oracle Corporation
Kuassi is Director of Product Management at Oracle. He covers the following product areas: (i) Java connectivity to DB (Cloud, on-premises): JDBC, JDBC Reactive Extensions, in-place DB processing with embedded JVM; (ii) Zero downtime, multi-tenancy, and sharding for Java apps...

Thursday November 16, 2017 9:50am - 10:30am PST

10:40am PST

An introduction to Xtract
Xtract is a simple, easy-to-use Scala XML extraction/deserialization library modeled after the JSON library in the Play framework. It uses functional-style composition to combine simple parsers into more complex parsers. Lucid Software has used it successfully to implement importing InDesign documents into Lucidpress.
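To give a feel for what "combining simple parsers into more complex parsers" can look like, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. It is an analogy only: it does not reproduce Xtract's Scala API, and the helper names `child_text`, `map_reader` and `combine` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Optional

# A "reader" takes an XML element and returns a parsed value, or None on failure.
Reader = Callable[[ET.Element], Optional[Any]]


def child_text(tag: str) -> Reader:
    """Reader for the text of the first child element named `tag`."""
    def read(elem: ET.Element) -> Optional[Any]:
        found = elem.find(tag)
        return found.text if found is not None else None
    return read


def map_reader(reader: Reader, fn: Callable[[Any], Any]) -> Reader:
    """Apply `fn` to a reader's result, passing failures (None) through unchanged."""
    def read(elem: ET.Element) -> Optional[Any]:
        value = reader(elem)
        return None if value is None else fn(value)
    return read


def combine(*readers: Reader, build: Callable[..., Any]) -> Reader:
    """Run several readers on the same element and build one value from the results."""
    def read(elem: ET.Element) -> Optional[Any]:
        values = [r(elem) for r in readers]
        if any(v is None for v in values):
            return None  # one missing field fails the whole composite reader
        return build(*values)
    return read


# A composite reader for a hypothetical <page> element.
page_reader = combine(
    child_text("title"),
    map_reader(child_text("width"), int),
    build=lambda title, width: {"title": title, "width": width},
)
```

With this reader, `page_reader(ET.fromstring("<page><title>Cover</title><width>612</width></page>"))` returns `{"title": "Cover", "width": 612}`, while a `<page>` without a `<width>` child yields `None`. A real library would typically report which field failed rather than returning a bare `None`.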

Thayne McCombs

Senior Software Engineer, Lucid Software
I am currently a DevOps engineer at Lucid Software, Inc., but previously worked predominantly as a front-end engineer for Lucidpress, where I led the implementation of importing InDesign documents into Lucidpress. I majored in Astrophysics at Brigham Young University.

Thursday November 16, 2017 10:40am - 11:00am PST

11:10am PST

Scala DSL for ML Training Set Stratification
Building flexible machine learning libraries adapted for Netflix’s use cases is paramount in our continued efforts to better model user behaviors and provide great personalized recommendations. This talk introduces one such Scala-based DSL library to aid “User Training Set Stratification” in our offline machine learning workflows. Originally created to improve user stratification while building our fact store, the library has evolved to cater to other general-purpose stratification use cases in our ML applications. We will talk about how, using the library’s Scala-based DSL and its underlying Apache Spark-based implementation, one can easily express and dynamically generate the required training data sets for different ML experiment needs by specifying the desired distributions of user attributes such as country, tenure, play frequency, etc. The demo section of the talk will showcase how we were able to utilize idiomatic Scala, with several API examples in a Zeppelin notebook.
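The idea of generating a training set from a desired distribution of a user attribute can be sketched in a few lines of plain Python. This is an illustrative, single-machine stand-in, not the Netflix DSL or its Spark implementation; the function name `stratified_sample`, its parameters and the `country` attribute in the usage note are assumptions.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def stratified_sample(
    users: List[Dict[str, Any]],
    attribute: str,
    target_distribution: Dict[Any, float],
    sample_size: int,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Sample users so each attribute value gets its target share of the sample.

    `target_distribution` maps attribute values to fractions of `sample_size`.
    If a stratum has fewer users than its quota, all of its users are taken.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    # Group users by the value of the stratification attribute.
    by_stratum: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for user in users:
        by_stratum[user[attribute]].append(user)

    sample: List[Dict[str, Any]] = []
    for value, fraction in target_distribution.items():
        quota = round(sample_size * fraction)
        pool = by_stratum.get(value, [])
        # Sample without replacement, capped by the stratum's size.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

For example, given 100 users of whom 80 have `country` set to `"US"` and 20 to `"BR"`, calling `stratified_sample(users, "country", {"US": 0.5, "BR": 0.5}, 20)` returns 10 users from each country, even though the source population is skewed 80/20.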

Shiva Chaitanya

Senior Software Engineer, Netflix

Thursday November 16, 2017 11:10am - 11:30am PST

11:40am PST

Building a High-Performance Database with Scala, Akka, and Spark
#distributedsystems #scala #akka #spark #FiloDB #cassandra

Scala and its large ecosystem of libraries are increasingly being used to build highly scalable and performant data systems. In this talk, I share years of experience building high-performance data systems using Scala, Akka, and Spark, plus recent experience building FiloDB, a high-performance analytics database built on these technologies. How do we balance Scala and functional programming with very high performance demands? What should you watch out for when building very fast Scala code?
  • Why build a new database for streaming applications?
  • Why Scala and Akka make a great foundation for building a database
  • When to use Futures, Actors, Reactive Streams
  • Using Akka Cluster to coordinate and implement distributed ingestion
  • Monix and use of reactive streams
  • Reactive/async tracing and production metrics
  • Filo: summing integers at billions of ops per second, taking advantage of processor cache and SIMD with super fast vector operations
  • Serialization, GC, and off-heap: how to leverage binary data structures for the win
  • JVM method dispatch, inlining, and writing lots of small methods

Evan Chan

Senior Data Engineer, UrbanLogiq
Evan is currently Senior Data Engineer at UrbanLogiq, where he is using Rust, among other tools, in building robust data platforms to help public servants build better communities. Evan has been a distributed systems / data / software engineer for twenty years. He led a team developing...

Thursday November 16, 2017 11:40am - 12:20pm PST

1:10pm PST

Avoiding Spark Pitfalls at Scale
There’s no doubt that Apache Spark is a very powerful tool for scalable data processing, but beware: forces lurk in the shadows to bring about your downfall, especially at scale! In this talk, we’ll cover some of the challenges and pitfalls encountered when writing data pipelines with Spark and how we’ve learned to deal with them. Our tales will involve battles with memory management, Dataset type safety, lazy versus strict evaluation, and beyond. This talk will use only the Scala API of Spark, but the tips and tricks presented will apply to Spark in general.

Long Cao

Software Engineer, Coatue Management
Long is a software engineer on the data science team at Coatue Management, where he builds scalable data pipelines in Scala and Spark that consume alternative data to provide insight and market signals. He has been based in New York for the last 5 years by way of Texas and obsesses...

Thursday November 16, 2017 1:10pm - 1:30pm PST

1:40pm PST

Satellite data monitoring for analytical models
The Astro Digital API streams satellite data for persistent change monitoring of living resources such as agriculture, forests and algae, and of urban construction such as buildings, roads and pipelines. The API lets you set up monitoring of any location in the world, or of the whole world, and get a time series of daily-updated satellite images at scientific quality, ready for mapping and for plugging into analytical models. All the necessary cross-calibration and imagery processing is done by the Astro Digital platform, so any developer can stream the data directly to users via imagery files and cloud-based web maps. The API's output dataset is ready to be built into analytical algorithms; it enables ML and AI developers to add context and a history of changes to the process of building models.
The API is the interface to the cloud-based Astro Digital platform for searching, processing and distributing satellite data, allowing territories of any size in any location in the world to be monitored. The data feed is based on public-domain sources and Astro Digital's own constellation of Landmapper satellites, with the launch of the first batch in June 2017. The API lets you get data points up to 17 years back in history, get newly available data points less than 24 hours after the image is taken, and set up forward-looking "alerts" that are updated once a new image for the location is available.

Alex Kudriashova

Integration Lead, Astro Digital

Thursday November 16, 2017 1:40pm - 2:00pm PST

2:10pm PST

Featran77 - Generic Feature Transformer for Data Pipelines
Featran, a.k.a. Feature Transformer, Featran77 or F77, is a generic feature engineering library for Scala data pipeline frameworks, including Scio, Spark, Scalding and Flink. We'll talk about the design and implementation of the library, including uses of Algebird Semigroups, Aggregators, Breeze, and ScalaCheck. We'll also cover other relevant topics that make our data/ML pipelines scalable and type safe.

Neville Li

Software Engineer, Spotify
Neville is a software engineer at Spotify who works mainly on data infrastructure and tools for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and...

Thursday November 16, 2017 2:10pm - 2:50pm PST

3:00pm PST

Real Time ML Pipelines in Multi-Tenant Environments
Serving machine learning results in real time has always been difficult. The sophistication that ML models require is naturally at odds with the low-latency requirements of a real-time pipeline. Multi-tenancy adds yet another level of complexity: instead of a few global models, each tenant requires its own model trained on its own dataset, resulting in potentially hundreds of thousands of models (at Salesforce scale), which current big data frameworks are not designed for. On top of that, a common pitfall of ML pipelines is that feature engineering logic diverges between the online and offline worlds: even though both accept the same type of data, the formats and the processes that handle them can be drastically different, resulting in incorrect application of models. To address these concerns, we designed and implemented a system that isolates feature engineering of incoming data into a separate process that updates a global feature store while keeping the computation consistent with the offline batch training process. The pre-computed features in the feature store, together with the multi-tenant models, are then fed into a machine learning framework we developed on top of Spark Streaming, generating scores and insights for thousands of tenants concurrently in real time.

Karl Skucha

Director of Engineering, Einstein, Salesforce

Yan Yang

Lead Data Engineer, Einstein, Salesforce

Thursday November 16, 2017 3:00pm - 3:40pm PST

4:00pm PST

Introduction to Machine Learning
Machine Learning is all the rage today with many different options and paradigms. This session will walk through the basics of Machine Learning and show how to get started with the open source Spark ML framework. Through Scala code examples you will learn how to build and deploy learning systems like recommendation engines.

James Ward

Engineering and Open Source Ambassador, Salesforce
James Ward (www.jamesward.com) is the Engineering and Open Source Ambassador at Salesforce.com. James frequently presents at conferences around the world such as JavaOne, Devoxx, and many other Java get-togethers. Along with Bruce Eckel, James co-authored First Steps in Flex. He has...

Thursday November 16, 2017 4:00pm - 4:40pm PST
Friday, November 17

9:50am PST

Apache SystemML: State of the Project and Future Plans
Apache SystemML is a system and language that supports rapid development of custom machine learning algorithms for large scale problems. SystemML allows data scientists to write code once in terms of high-level linear algebra operations, then automatically generate low-level parallel versions of the program that are tuned to the characteristics of the data and different parallel execution frameworks. The system consists of two major components: An optimizer that automatically parallelizes high-level code; and a runtime that evaluates the resulting execution plans at scale on Apache Hadoop, on Apache Spark, on large multi-core systems, and, more recently, on GPUs. This talk will start by describing the history of the project. I'll explain how the original research team from IBM advanced the state of the art in automatic parallelization and scalable linear algebra to build the optimizer and runtime, and how we turned the resulting research code into Apache SystemML. I'll describe how Apache SystemML has been used to implement state-of-the-art algorithms in the field. Finally, I'll talk about recent work on enhancing the system with compressed linear algebra, automatic generation of custom linear algebra kernels, and support for deep learning.

Fred Reiss

Chief Architect, IBM Spark Technology Center

Friday November 17, 2017 9:50am - 10:30am PST

10:40am PST

Does Your Privacy Scale?
The state and nature of privacy are changing: it is getting more difficult for companies, but better for consumers. This is not a Product Management issue. New regulations are putting restrictions on how engineers build systems. You hear terms like Privacy by Design from regulators. What does this mean? In this talk, I will survey overall privacy trends and dig into specific regulations that affect how systems are designed and built.

Devin Loftis

VP Engineering, ValiMail

Friday November 17, 2017 10:40am - 11:00am PST

11:10am PST

Futures: Twitter vs Scala
There has been enough confusion around Twitter Futures. Let's clear the air and talk frankly about the historical and technical reasons they exist. We'll see how differences in API, behavior, and performance make Twitter Futures not only competitive with Scala Futures but also the obvious choice for IO systems with demanding throughput requirements (i.e., Finagle).

Vladimir Kostyukov

Software Engineer, Twitter, Inc
Hacking Finagle @Twitter.

Friday November 17, 2017 11:10am - 11:30am PST

11:40am PST

Fantastic ML apps and how to build them
Building efficient machine learning applications is not a simple task. The typical engineering process is an iteration of data wrangling, feature generation, model selection, hyperparameter tuning and evaluation. The number of possible variations of input features, algorithms and parameters makes the process too complex for even experts to perform efficiently. Automating this process is especially important when building machine learning applications for thousands of customers. In this talk I demonstrate how we build effective ML models using AutoML capabilities we develop at Salesforce. Our AutoML capabilities include techniques for automatic data processing, feature generation, model selection, hyperparameter tuning and evaluation. I present several of the solutions we have implemented with Scala and Spark.

Matthew Tovbin

Principal Engineer, Salesforce Einstein, Salesforce
Matthew Tovbin is a Principal Member of Technical Staff at Salesforce, engineering Salesforce Einstein AI platform, which powers the world’s smartest CRM. Before joining Salesforce, he acted as a Director of Engineering at Badgeville, implementing scalable and highly available real-time...

Friday November 17, 2017 11:40am - 12:20pm PST

1:10pm PST

VEGAS: The missing Matplotlib for Spark
In this talk, we'll present techniques for visualizing large-scale machine learning systems in Spark. These techniques are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems, which are used to personalize the Netflix experience for its 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.

Roger Menezes

Senior Research Engineer, Netflix

Friday November 17, 2017 1:10pm - 1:30pm PST

1:40pm PST

Lawful AI
Modern practical data science, NLP, and AI have almost zero overlap with pure functional programming. Why would this be a good thing and what can the Scala community do to help?

Adam Pingel

Senior Director of Software Engineering, LexisNexis (via Ravel Law)

Friday November 17, 2017 1:40pm - 2:00pm PST

2:10pm PST

A Tale of Two Graph Engines on Spark: GraphFrames and TinkerPop OLAP
Graph is on the rise and it's time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: TinkerPop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrames Motif API as we show examples of both side by side. No need to be familiar with graphs or Spark for this presentation, as we'll be explaining everything from the ground up!

Russell Spitzer

Software Engineer, DataStax
Spark, Cassandra, or Dogs.

Friday November 17, 2017 2:10pm - 2:50pm PST

3:00pm PST

Continuous Delivery Principles for Machine Learning
Real-world software engineering is an iterative process, and one of its main objectives is to get changes of all types (new features, configuration changes, bug fixes and experiments) into production and into the hands of users safely, quickly and sustainably. Continuous Delivery (CD), a software engineering discipline, offers a principled approach to exactly this problem. The core idea of CD is to create a repeatable, reliable and incrementally improving process for taking software from concept to the end user. Like software development, building real-world machine learning (ML) algorithms is also an iterative process with a similar objective: how do I get my ML algorithms into production and into the hands of users in a safe, quick and sustainable way? In most companies, the current process of building models, testing them and deploying them into production is ad hoc at best.

At Indix, while building the Google of Products, we have had good success applying continuous delivery best practices to our machine learning pipelines using open source tools and frameworks. The talk will not focus on the theory of ML or on choosing the right ML algorithm, but specifically on the last-mile problem of taking models to production and the lessons learned while applying CD concepts to ML. Here are some of the key questions that the talk will try to answer.

  1. ML Models Repository as analogous to Software Artifacts Repository - Similar to a software repository, what are the features of a Models Repository to aid traceability and reproducibility? Specifically, how do you manage models end to end, including model metadata, visualization and lineage?
  2. ML Pipelines to orchestrate and visualize the end-to-end flow - A typical ML workflow has multiple stages. How do you model your entire workflow as a pipeline (similar to a Build Pipeline in CD) to automate the whole process and help visualize the end-to-end flow?
  3. Model Quality Assurance - What quality gates and evaluation metrics, either manual or automated, should be used before exporting (promoting) models for serving in production? What happens when several different models are in play? How do you measure the models individually and then also in combination?
  4. Serving Models in Production - How do you serve and scale these models in production? What happens when these models are heterogeneous (built using different languages - Scala, Python etc.)?
  5. Regression Testing of Models - When exporting a new model, what's the best way to compare the performance of the newer model to the one already deployed on real-world (production) data?
  6. Maintenance and Monitoring of Models in Production - Deploying models to production is only half the job done. How do you measure the performance of your model while it's running in production?

Rajesh Muppalla

Co-Founder & Senior Director of Engineering, Indix
Rajesh Muppalla is a co-founder and Senior Director of Engineering at Indix, where he leads the data platform team that is responsible for collecting, organizing and structuring all the product related data collected from the web.

Friday November 17, 2017 3:00pm - 3:40pm PST

4:00pm PST

Fireworks - lighting up the sky with millions of Sparks
The Salesforce Einstein platform is used by internal developers to create predictive applications for Salesforce customers. The platform uses Spark as its data processing engine and runs a very large number of data flows that vary widely in size and complexity. It must support both large, complex flows, such as running modeling for a customer who needs tens of millions of entities scored, and small, time-sensitive flows, such as incrementally processing and scoring object changes for a customer with only thousands of entities in total. These diverse and complex flows arise for every application added to Salesforce, and the platform handles many applications, magnifying the importance of appropriate scaling and time sensitivity. In this talk, I'll present how we handle that diversity in data flows while keeping cost to serve to a minimum. I will detail where and how we chose to leverage open source and where we decided it was important to implement our own solutions.

Thomas Gerber

Director of Engineering, Salesforce

Friday November 17, 2017 4:00pm - 4:40pm PST
Saturday, November 18

9:50am PST

Democratizing data with an internal data pipeline platform
This talk is about why and how we built our internal data pipeline platform.

At Indix we have data in different formats - HTML pages, Thrift records, Avro records and the usual culprits - CSVs and other plain text formats. We have data in TBs and in a few KBs, data consisting of billions of records and data consisting of a few hundred rows. And all this data - in one form or another - is consumed by the engineers, the product managers, the customer success team and even our CEO.

Our biggest challenge was in knowing which data exists and where, and how to access it efficiently while balancing costs and the productivity of the people involved. We had to make do with ad hoc Scalding jobs. There was no single place where people could discover the different "datasets" that we had, what format they were in, where they were stored and how frequently a new version was published. Running jobs was also not straightforward, since things like finding a cluster to use were not trivial. In order to democratize access to data and make it easy for anyone within the organization to work and play with the data we had, we went about building a data pipeline platform for our internal users.

Leveraging the power of Spark, the platform allows users to define datasets (along with their schemas) and create pipelines to work with the datasets. The pipelines can be configured via a wizard-based UI or a JSON config, and all the jobs run on dedicated, auto-scaled Spark clusters. Predefined transformations to filter, project, sample and even type in SQL queries have made it powerful but simple to use for any type of user. Support for S3, SFTP and even Google Sheets has made it usable for different internal and customer use cases. The platform also enables us to load the same data and perform similar operations on it via notebooks with just a couple of lines of client code. Today we run over 300 pipelines across over 100 datasets and thousands of versions of the datasets using this platform.

The data pipeline platform has truly changed the way we ingest, manipulate, analyze and egress data across the organization, and is on course to be converted into a self-serve platform for our (external) customers too.

Manoj Mahalingam

Principal Engineer, Indix

Saturday November 18, 2017 9:50am - 10:30am PST

10:40am PST

Deep distributed decision trees on Apache Spark
Deep distributed decision trees and tree ensembles have grown in importance due to the need to model increasingly large datasets. We present Yggdrasil, a new tree learning method implemented in Scala on Apache Spark which scales favorably as data dimensionality and tree depth grow. By partitioning the dataset by columns rather than rows, training directly on compressed data, and minimizing communication using sparse bitvectors, Yggdrasil outperforms existing distributed tree learning algorithms by up to 24x. On a high-dimensional dataset at Yahoo, Yggdrasil is shown to be faster by up to an order of magnitude.

Feynman Liang

Director of Engineering, Gigster
Feynman is the engineering manager at Gigster and a statistics PhD student at UC Berkeley. His research lies at the intersection between industry and academia, focusing on distributed machine learning and practical systems for deploying machine learning in production. He is a contributor...

Saturday November 18, 2017 10:40am - 11:00am PST

11:10am PST

End-to-End Computation on the GPU with a GPU Data Frame
A revolution is occurring across the GPU software stack, driven by the disruptive performance gains GPUs have seen generation after generation. The modern field of deep learning would not have been possible without GPUs, and as a database we are often seeing performance gains of two or more orders of magnitude compared to CPU systems. But for all of the innovation occurring in the GPU software ecosystem, the systems and platforms themselves still remain isolated from each other. Even though the individual components are seeing significant acceleration from running on the GPU, they must intercommunicate over the relatively thin straw of PCIe and then through CPU memory. In this session, Todd Mostak will make a case for the open source community to enable efficient intra-GPU communication between different processes running on the GPUs. He will discuss, with examples, how this integration will allow developers to build new functions to cluster or perform analysis on queries, and will make possible seamless workflows that combine data processing, machine learning (ML), and visualization without ever needing to leave the GPU.

Todd Sundsted

CTO, SumAll
Todd Sundsted is a hands-on technical leader with 25 years of professional experience covering all aspects of software development, machine learning and engineering. He currently serves as the CTO of SumAll, an award-winning analytics and business intelligence tool used by brands...

Saturday November 18, 2017 11:10am - 11:30am PST

11:40am PST

Stream All The Things!
While stream processing is now popular, streaming architectures must be highly reliable and scalable as never before, more like microservice architectures. Using specific use cases, I'll define the requirements for streaming systems and how they are met by popular tools like Kafka, Spark, Flink, and Akka. I'll argue that streaming and microservice architectures are actually converging.

Dean Wampler

VP of Rocket Surgery, Lightbend
Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend. He leads the development of Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean...

Saturday November 18, 2017 11:40am - 12:20pm PST

1:10pm PST

Strato: Twitter’s Virtual Database Powered by Microservices
Developing software against a large collection of services is often harder than it needs to be. Inconsistencies between service interfaces mean each type of data is queried in a slightly different way, making abstraction difficult and leading to boilerplate. By enforcing a consistent data and access model over these heterogeneous microservices, we can simplify feature development. Strato exposes data from other services according to a unified data model and a single logical interface, enabling generic infrastructure for automatically generated GraphQL, REST, and Thrift APIs; drop-in caching; simplified access control; deploy-free updates; and more! Come see how we simplify life for data owners and data consumers alike!

Michael Solomon

Software Engineer, Twitter
Mike Solomon is a software engineer on Twitter's Strato team where he uses Scala to generate uniform GraphQL, REST, and Scala APIs, and tries to make building new API services unnecessary. In his spare time he makes an audio-based choose-your-own adventure mobile game called Road Trip...

Saturday November 18, 2017 1:10pm - 1:30pm PST

1:40pm PST

Magnolia: Generic Derivation 2.0

In the last few years, typeclasses have become an increasingly popular tool for solving a wide variety of problems Scala developers encounter every day. But while typeclasses can be composed in entirely predictable ways from smaller primitive types to support larger abstract datatypes, there's no support offered from the language to do this.

For a while, Shapeless has been an admirable enabler, leveraging implicit search and a couple of macros to provide automatic derivations, but its approach to the problem is the cause of very slow compile times; the definitions needed to derive a typeclass are often verbose, type-heavy and complex; and, when derivation fails, the user gets no debugging feedback.

Magnolia basically fixes all of these problems.
The talk will explore how Impromptu is implemented, and show how dependent types allow the framework to be written in just 30 lines of code. I will then demonstrate how a similar approach may be used to concisely implement typed actors.

Furthermore, we take advantage of current research into implicit functions in Dotty to remove the last remaining boilerplate from Impromptu's API.

Jon Pretty

Software Engineer, Propensive

Saturday November 18, 2017 1:40pm - 2:00pm PST

2:10pm PST

Complex Machine Learning Pipelines Made Easy

What if you had to build more machine-learned models than there are data scientists in the world? At enterprise companies like Salesforce, customer data comes in vastly different shapes and forms, making it impossible to build one catch-all model even when focusing on a single problem. Instead, it becomes necessary to build thousands of personalized, per-customer models for any single data-driven application. At Salesforce, we have built solutions to these problems into a project called Optimus Prime, which we are using to develop robust, production-quality machine learning applications much more quickly than with Spark alone.

In this talk, we will demonstrate two applications of this platform. The first is AutoML which enables building simple yet powerful models for any use case even without having any background in data science. We will describe the underlying challenges of automating machine learning ranging from the user interface to data extraction and model building, touching more deeply on how we automate feature selection and model selection. The result is a system where users only need domain expertise to build production-ready machine learning applications.

The second demonstration will be of a data product more finely tuned to a specific application: Case Classification, a product currently in development that automatically classifies service cases. This application is built not only to train and predict on each customer’s individual data, but also to scale the ML pipeline dynamically to accommodate any number of prediction fields; it supports multi-tenant, multi-label, multi-model, multi-class predictions. We’ll contrast our implementation using Optimus Prime against one in pure Spark and then show the resulting pipeline performance on real customer data.

Till Bergmann

Sr. Data Scientist, Salesforce

Chris Rupley

Sr. Data Scientist, Salesforce

Saturday November 18, 2017 2:10pm - 2:50pm PST

3:00pm PST

Deep Dive: Continuous Delivery for AI Applications with ECS
Deep learning (DL) is a computer science field derived from the Artificial Intelligence discipline. DL systems are usually developed by data scientists, who are good at mathematics and computer science. But to deploy and operationalize these models for broader use, you need the DevOps mindset and tools. In this tech talk, we’ll show you how to connect the workflow between the data scientists and DevOps. We’ll explore basic continuous integration and delivery concepts and how they can be applied to deep learning models. Using a number of AWS services, we will showcase how you can take the output of a deep learning model and deploy it to perform predictions in real time with low latency and high availability. In particular, we will showcase the ease of deploying DL predict functions using Apache MXNet (a deep learning library), Amazon ECS, Amazon S3, Amazon ECR, Amazon developer tools, and AWS CloudFormation.

Asif Khan

Containers, Deep learning, Amazon Web Services
Asif Khan is a Cloud Architect with Amazon Web Services. He provides technical guidance, design advice and thought leadership to some of the largest and most successful AWS customers and partners on the planet. His deepest expertise spans application architecture, containers, devops...

Saturday November 18, 2017 3:00pm - 3:40pm PST

4:00pm PST

Scaling From Research to Production with Skymind DL4J and ScalNet

DeepLearning4J (Deep Learning for Java - DL4J, inception 2013) was specifically designed with Enterprise and Production in mind, as a first-class citizen of the JVM. Skymind develops and maintains the complete DL4J stack and the abstraction for Scala (ScalNet) with a focal point on scalability and vendor integrations.

This session will focus on the challenges in migrating a research prototype to a more production-ready system within the JVM. Specifically, it covers migrating/importing from an alternative deep learning framework based on Python bindings (e.g. Keras via TensorFlow) to DL4J/ScalNet within a distributed environment using Apache Spark.

A walkthrough of a temporal IoT use case modeling an LSTM network will demonstrate the different phases of a project, as well as the different workflow capabilities for crossing language boundaries.


Ari Kamlani

Principal Data Scientist, ThoughtWorks
Data Scientist and Technology Strategist & Advisor, currently employed as a Deep Learning Consultant with Skymind and Technologist in Residence (TIR) with Techstars IoT. Previously a Data Scientist & Engineering Consultant at Otto (Tyto) for the Connected Home and Research Assistant...

Saturday November 18, 2017 4:00pm - 4:40pm PST