Building real-world machine learning (ML) algorithms is an iterative process. A data scientist typically builds tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad hoc, and there is no practical way for a data scientist to manage the models built over time, track insights, or run meta-analyses across models. The ad-hoc, iterative nature of modeling leads to an important and little-studied problem for machine learning systems, namely model management.
Model management involves the capture, storage, and querying of machine learning models. It provides essential support for various activities, including making sense of and guiding the modeling process (e.g., “feature F1 seems to improve performance for this subset of data”), tracking model versions (e.g., “what changed between model v1 and v2”), and enabling meta-analyses across models (e.g., “what hyperparameter settings work best for these features”).
We are building a novel end-to-end system called ModelDB to manage ML models. ModelDB clients automatically track machine learning models in their native environments (e.g., scikit-learn, spark.ml), the ModelDB backend introduces a common layer of abstractions to represent models and pipelines, and the ModelDB frontend allows visual exploration and analyses via a web-based interface. Figure 1 shows the high-level architecture of ModelDB.
To understand the model building process, we interviewed data scientists from a host of different industries. We found that one of the key challenges in model management is capturing the models that are being built during offline experimentation. Often, data scientists will build many models by updating and overwriting the same script (or, if lucky, a config file), soon losing track of previously built models and the insights gained from them. Because ML models co-exist with the pipelines or workflows that created them, it was often unclear whether a change to the pipeline or to the model had produced an observed change in performance.
Moreover, we found that data scientists usually had an ML environment of choice (e.g., scikit-learn, R, spark.ml), and the use of a separate workflow manager created extra overhead and limited modeling flexibility. As a result, we implemented native logging libraries for different ML environments that capture models built by a data scientist along with the pre-processing operations performed on the data (e.g., one-hot encoding, scaling). As of now, we have written logging libraries for scikit-learn and spark.ml. Libraries for different ML environments implement a ModelDB Thrift interface that is used to communicate with the backend.
Figure 2 shows an example of a script in spark.ml that uses the ModelDB library for logging operations. The data scientist imports the library and initializes the ModelDB syncer. Then, by using “sync” variants of pre-processing or modeling functions (e.g., fitSync in place of fit, randomSplitSync in place of randomSplit), the relevant operations and associated data are logged to ModelDB. Note that in addition to logging pipelines and models, ModelDB also allows data scientists to log annotations or insights about models (e.g., “pipeline with no normalization”).
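The “sync” pattern is easy to see in miniature. Below is a hedged Python sketch of the idea, not the actual ModelDB client API: the toy estimator, fit_sync, and event_log are illustrative stand-ins showing how a wrapper can behave like fit while also recording the operation as a side effect.

```python
# Minimal sketch of a "sync"-style logging wrapper. In ModelDB the logged
# record would be sent to the backend via Thrift; here we append to a list.
event_log = []

class MeanRegressor:
    """Toy estimator standing in for a scikit-learn / spark.ml model."""
    def __init__(self, clip=None):
        self.clip = clip
        self.mean_ = None

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def get_params(self):
        return {"clip": self.clip}

def fit_sync(model, X, y):
    """Behaves like fit, but also logs the model type and hyperparameters."""
    model.fit(X, y)
    event_log.append({
        "event": "fit",
        "model_type": type(model).__name__,
        "params": model.get_params(),
    })
    return model

model = fit_sync(MeanRegressor(), [[1], [2], [3]], [2.0, 4.0, 6.0])
print(model.mean_)             # 4.0
print(event_log[0]["event"])   # fit
```

Because the wrapper returns the fitted model unchanged, swapping fit for fit_sync leaves the rest of the script untouched, which is what keeps the required changes minimal.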
As we can see, using ModelDB requires only minimal changes to the script. Moreover, as the data scientist changes and re-runs the script to produce new models, changes to the model and pipeline are logged automatically and become available in ModelDB for further analysis. In the future, we can further minimize the API changes by incorporating logging directly into the ML environments.
Once models and pipelines are captured in a consistent format, the data scientist can run diverse queries, ranging from simple selection queries (impossible without ModelDB) such as “Find all models containing a particular feature” to more complex, model-specific ones, e.g., “What feature is most important in this random forest?” We store data in ModelDB using a combination of relational and custom storage formats to enable compact storage and easy querying.
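As an illustration of the simple selection query above, here is a sketch against a hypothetical relational layout. The actual ModelDB schema is not described in this post, so the table and column names below are assumptions for illustration only.

```python
import sqlite3

# Hypothetical relational layout for logged models; illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT, accuracy REAL);
CREATE TABLE model_features (model_id INTEGER, feature TEXT);
""")
conn.executemany("INSERT INTO models VALUES (?, ?, ?)",
                 [(1, "lr_v1", 0.81), (2, "lr_v2", 0.84), (3, "rf_v1", 0.88)])
conn.executemany("INSERT INTO model_features VALUES (?, ?)",
                 [(1, "F1"), (1, "F2"), (2, "F2"), (3, "F1")])

# "Find all models containing a particular feature" as a plain SQL selection.
rows = conn.execute("""
    SELECT m.name
    FROM models m
    JOIN model_features f ON f.model_id = m.id
    WHERE f.feature = 'F1'
    ORDER BY m.name
""").fetchall()
print([r[0] for r in rows])  # ['lr_v1', 'rf_v1']
```

Without a store like this, answering even such a simple question would require grepping through old scripts and overwritten config files.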
While data in ModelDB can be queried via SQL, we designed a user interface specifically tailored for exploring machine-learning models and pipelines.
Figures 3 and 4 show screenshots of the ModelDB frontend. ModelDB supports two main views for exploring data: the models view and the pipelines view. The models view (Figure 3) tabulates the models generated by all pipelines built for a project. It is best suited for obtaining a summary of models and performing comparisons across models. We support a Tableau-like interface to visually analyze models using their metadata and metrics. The functionality provided in this view, for example, makes it easy to graph how changes in a hyperparameter impact the accuracy of a model.
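To make that kind of meta-analysis concrete, here is a small Python sketch of grouping logged models by a hyperparameter and comparing a metric. The records are invented examples, and in ModelDB this grouping and charting happens in the frontend rather than in user code.

```python
from collections import defaultdict

# Invented examples of logged model metadata (hyperparameter + metric).
runs = [
    {"max_depth": 5,  "accuracy": 0.78},
    {"max_depth": 5,  "accuracy": 0.80},
    {"max_depth": 10, "accuracy": 0.86},
    {"max_depth": 10, "accuracy": 0.84},
]

# Group accuracies by hyperparameter value, then average each group.
by_depth = defaultdict(list)
for r in runs:
    by_depth[r["max_depth"]].append(r["accuracy"])

mean_acc = {depth: sum(accs) / len(accs) for depth, accs in by_depth.items()}
print(mean_acc)
```

Plotting mean_acc (hyperparameter on the x-axis, mean accuracy on the y-axis) gives exactly the kind of chart the models view produces without any code.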
The pipelines view (Figure 4) allows the user to explore a small number of pipelines in detail. For a given pipeline, this view depicts the input, output and parameters for every stage of the pipeline. It also provides the ability to align and compare multiple pipelines to find similarities and differences.
With this work, we introduce ModelDB, a novel end-to-end system for managing machine learning models. ModelDB provides a set of native client libraries that automatically log models and pipelines as the data scientist is building them, producing a rich dataset that can later be queried. We also provide a visual interface for data scientists to query the information in ModelDB.
We will be presenting this work at two upcoming workshops: HILDA (Human-in-the-loop Data Analytics) co-located with SIGMOD 2016 (June 26) and ML Sys (Machine Learning Systems) co-located with ICML 2016 (June 24).
We will be making ModelDB available to the public in Fall 2016. Please reach out (mvartak _at_ csail.mit.edu) with suggestions or feedback on improving ModelDB or if you’d like to try it out early.
Update, February 8, 2017: You can visit the new ModelDB project page here.