Optimize your data science workflow within Jupyter Notebook using mlflow and pandas-profiling!

Yasmine Maricar
Towards ML engineering
8 min read · Oct 27, 2022


Article banner image generated with DALL·E and modified

Introduction

· There are many benefits to optimizing the data science experimentation workflow: it simplifies Data Science work, which in turn leads to more experiments being carried out and a higher probability of success for the project.

· Building ML pipelines during experimentation saves time, makes code more readable, improves reproducibility, and reduces the time-to-market for turning an ML model into production-ready code (#mlengineering).

· Using a tool such as mlflow to consistently log metrics from experiments moves the results out of the notebook environment, providing better viewing and reliable tracking.

I) Sample project: Kickstarter campaigns’ prediction of success

Cleaning up data and doing exploratory data analysis is a crucial part of any Data Science project, as your results will only be as trustworthy and conclusive as the data (i.e., the examples) the model is trained on.

Furthermore, Data Scientists often have to experiment in a short amount of time to get preliminary results, depending on the project's duration and the return on investment of quick experimentation. That work involves training and evaluating multiple models on one dataset.

Note: Andrew Ng recently challenged this idea by urging the community to be more data-centric and less model-centric; indeed, data quality matters as well.

For this example, we will be using a dataset of Kickstarter campaign characteristics. Kickstarter is a funding platform for creative projects. We selected the following features for the study:

- Name of the project

- (sub) Category of the project

- Main Category of the project

- Currency of pledged money

- Deadline timestamp

- Goal amount of money

- Launch date timestamp

- Pledged money from backers (in currency)

- State of the project

- Number of backers

- Country of project

- Pledged amount of money from backers (in usd)

The goal here is to predict the success of a campaign given its initial launch features; in other words, to predict whether a project will get sufficient funding from backers before the deadline. We are going to remove some features (like the number of backers) that describe the end state of a campaign and cannot be used for prediction at the start of a project.

We are going to clean up the features, generate the features needed for training, and then train and evaluate classification models to demonstrate how you can easily improve your exploratory data analysis and experimentation process for data science projects.
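As a starting point, here is a minimal sketch of that filtering step; the file name and column names are assumptions based on the public Kickstarter dataset export and may differ from the exact code in the repository:

import pandas as pd

# Load the Kickstarter campaigns dataset (file name assumed).
df = pd.read_csv("ks-projects-201801.csv")

# Keep only finished campaigns and build a binary target:
# 1 if the campaign succeeded, 0 if it failed.
df = df[df["state"].isin(["successful", "failed"])]
df["target"] = (df["state"] == "successful").astype(int)

# Drop end-state features that would leak the outcome (column names assumed).
df = df.drop(columns=["backers", "pledged", "usd pledged", "state"])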

You can find the git repository, including all the code for this project, here: https://github.com/Jasminyas/kickstarter-campaign-prediction

Figure 1 — Rows from the dataset
Figure 2 — Machine Learning development process

II) Optimizing the exploratory data analysis (EDA) part using data visualization libraries

At the beginning of this project, we are going to explore the data and plot multiple graphs that help us understand how the categories and other features associated with Kickstarter projects are distributed between failed and successful states (the target variable). Generating the appropriate graphs with plotly and seaborn requires a lot of code.

I really like using plotly to generate dynamic graphs. However, it can sometimes be troublesome to calculate the right values to plot with plotly functions, and the rendered charts are heavy (many of them in one notebook can make it lag).
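As an illustration, a single plotly express call can already show how campaign outcomes split across categories; the column names below are assumptions based on the dataset described above:

import plotly.express as px

# Grouped bar chart of campaign outcomes per main category (column names assumed).
fig = px.histogram(df, x="main_category", color="state", barmode="group")
fig.show()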

That is why I also use seaborn for fast visualization. If you compute some numbers or draw some graphs often enough, it is always a good idea to think about reusability and write utility visualization functions (you could even package them as a Python library to use in multiple scripts), so you can reuse those functions every time you need to do EDA for a project!

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

plt.style.use('ggplot')

def plot_continuous_vars(data, column_name):
    """Plot the raw and log-transformed distributions of a continuous column side by side."""
    plot_dims = (14, 8)
    fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=plot_dims)
    sns.distplot(data[column_name], ax=ax1)
    sns.distplot(np.log1p(data[column_name]), ax=ax2)
Figure 3 — Plotting continuous and log-transform of usd_goal variable

This work, done once, can be reused in future projects. With that same mindset, we can further optimize our workflow with the library introduced in the next section.

-> Pandas-profiling for outlier detection, correlation, and feature distributions in one place

Another library that is helpful for this exploratory and clean-up phase is pandas-profiling. It can easily generate graphs with associated useful statistics for all the variables in your dataset, and it also renders correlation heatmaps for all the features. These can be integrated as widgets inside the notebook or exported as standalone html files; you can find the html file here: https://github.com/Jasminyas/kickstarter-campaign-prediction/blob/main/kickstarterds.html

Figure 4 — Generating the html report using pandas-profiling

You just need two lines to generate this rich html file, which contains all the graphs you need to make decisions about outlier removal, dropping correlated features, and further investigating problematic ones. I suggest you do this at the beginning, when building the training dataset.
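For reference, those two lines would look roughly like this (the DataFrame name and report title are placeholders):

from pandas_profiling import ProfileReport

# Build the profiling report and export it as a standalone html file.
profile = ProfileReport(df, title="Kickstarter campaigns")
profile.to_file("kickstarterds.html")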

III) Optimizing the ML pipeline for model training

Now that we have seen how we could improve the data analysis phase, let us move onto the model training part.

Let’s see how sklearn pipelines can come in handy for preparing the data and applying different models with good modularity and reusability.

1) Preprocessing pipeline

Here we have numeric variables we want to scale and categorical features we are going to encode with the OneHotEncoder provided by scikit-learn:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

# Numeric features: impute missing values, then scale.
numeric_features = ['usd_goal_corrected', 'duration', 'len_name']
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", RobustScaler()),
    ]
)

# Categorical features: one-hot encode, ignoring categories unseen at training time.
categorical_features = ['country', 'cat_full', 'month', 'name_non_chars', 'name_has_symbol', 'name_upper']
categorical_transformer = Pipeline(
    steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessing = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)

As you can see, we made a preprocessing step that will take care of the feature transformations for us.

Now we can easily reuse this step and apply different classifiers:

# precision-recall curve and f1 for evaluation purposes
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot

lr = LogisticRegression(solver='liblinear')
model = Pipeline([('preprocessing', preprocessing),
                  ('lr', lr)])
model.fit(x_train, y_train)

# predict_plot outputs the precision-recall curve with the associated auc and precision score.
predict_plot(x_train, y_train, model, "Logistic Regression")
predict_plot(x_test, y_test, model, "Logistic Regression")

You just need to replace the classifier with another one. Here is another example with a decision tree-based algorithm from the LightGBM library:

from lightgbm import LGBMClassifier

clf = make_pipeline(
    preprocessing,
    LGBMClassifier(learning_rate=0.7, boosting_type="gbdt", objective='binary', metric='precision', max_depth=-1)
)
clf.fit(x_train, y_train)

predict_plot(x_train, y_train, clf, 'GBM')

Now, we can also use this pipeline in grid search to find the optimal parameters.

You first need to create a parameters grid for your specific model. In the following code, you will note that we need to prefix each parameter name with the name of the pipeline step holding the classifier (here it is 'clf').

First I created this basic class that inherits from BaseEstimator :

from sklearn.base import BaseEstimatorclass ClfSwitcher(BaseEstimator):def __init__(
self,
estimator = LogisticRegression(),
):
"""
A Custom BaseEstimator that can switch between classifiers.
:param estimator: sklearn object - The classifier
"""
self.estimator = estimatordef fit(self, X, y=None, **kwargs):
self.estimator.fit(X, y)
return self
def predict(self, X, y=None):
return self.estimator.predict(X)
def predict_proba(self, X):
return self.estimator.predict_proba(X)
def score(self, X, y):
return self.estimator.score(X, y)
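The grid search below assumes a pipeline whose final step is named 'clf' and holds this switcher (the full definition is in the repository); a minimal sketch:

# Pipeline whose last step, named 'clf', is the classifier switcher.
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("clf", ClfSwitcher()),
])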

Then I can call it here:

parameters = [
    {
        'clf__estimator': [LogisticRegression()],
        'clf__estimator__solver': ["lbfgs", "liblinear"],
        "clf__estimator__penalty": ["l2"],
        "clf__estimator__C": [0.1, 0.2, 0.3, 0.5, 1.0],
        "clf__estimator__max_iter": [100, 1000, 2000],
    },
    {
        'clf__estimator': [LogisticRegression()],
        'clf__estimator__solver': ["liblinear"],
        "clf__estimator__penalty": ["l1"],
        "clf__estimator__C": [0.1, 0.2, 0.3, 0.5, 1.0],
        "clf__estimator__max_iter": [100, 1000, 2000],
    },
]

gscv = GridSearchCV(pipeline, parameters, cv=2, n_jobs=-1, verbose=3)
gs_model = gscv.fit(x_train, y_train)
print(gs_model.best_params_, gs_model.best_score_)
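Once the search finishes, the refitted best pipeline can be evaluated like any other model, for example with the same predict_plot helper used earlier:

# Evaluate the best pipeline found by the grid search on the held-out test set.
best_model = gs_model.best_estimator_
predict_plot(x_test, y_test, best_model, "Best grid search model")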

Without sklearn pipelines, the code would be messy and keeping track of experiments would be hard. Pipelines help automate the process of applying and evaluating models, and with this kind of modular, reusable code you are already close to production readiness. These functions can also be reused in all your experiments to accelerate training with different models.

IV) Use Jupyter Notebook with MLflow to keep a record of experimentations

We can leverage mlflow, an open-source tracking tool for data science experiments. Here I am going to show you the piece of code that saves my experiments and useful visuals locally; you will find them in a folder named 'mlruns' next to your Notebook:

from pprint import pprint

import mlflow

clf = make_pipeline(
    preprocessing,
    LGBMClassifier(learning_rate=0.7, boosting_type="gbdt", objective='binary', metric='precision', max_depth=-1)
)

def fetch_logged_data(run_id):
    """Retrieve the params, metrics, tags, and artifacts logged for a given run."""
    client = mlflow.tracking.MlflowClient()
    data = client.get_run(run_id).data
    tags = {k: v for k, v in data.tags.items() if not k.startswith("mlflow.")}
    artifacts = [f.path for f in client.list_artifacts(run_id, "model")]
    return data.params, data.metrics, tags, artifacts

# enable autologging for sklearn based models
mlflow.sklearn.autolog()

with mlflow.start_run() as run:
    clf.fit(x_train, y_train)

# fetch logged data
params, metrics, tags, artifacts = fetch_logged_data(run.info.run_id)

# sanity checks
pprint(params)
pprint(metrics)
pprint(tags)
pprint(artifacts)
Figure 5 — Model metrics saved with mlflow

It will save all the metrics, the model, and useful plots in the local mlflow folder. You can find more information about how to set it up with a tracking server in the documentation: where are runs recorded?
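If you do set up a tracking server later on, you would only need to point mlflow at it before starting a run; a minimal sketch (the URL is a placeholder):

import mlflow

# Log to a remote tracking server instead of the local ./mlruns folder (placeholder URL).
mlflow.set_tracking_uri("http://localhost:5000")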

You can access the files through the UI by running “mlflow ui” in a terminal at the root path that contains the mlflow directory.
In the UI, we can sort runs by logged variables; for instance, here I could sort the runs by training_accuracy in descending order and see which model is at the top.
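The same ordering can also be done programmatically with mlflow.search_runs; note that the exact metric name logged by sklearn autologging may differ (e.g. training_accuracy_score):

import mlflow

# Retrieve runs as a pandas DataFrame, ordered by training accuracy (metric name assumed).
runs = mlflow.search_runs(order_by=["metrics.training_accuracy_score DESC"])
print(runs[["run_id", "metrics.training_accuracy_score"]].head())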

Figure 6 — mlflow UI interface

By clicking on an experiment, you can access all the graphs and metrics that were registered.

Figure 7 — mlflow UI interface (specific run)

You can then select the model according to the evaluation metrics chosen for each project.

Conclusion

We saw how one can optimize the Data Science experimentation process for rapid prototyping and evaluation of models. One should bear in mind that Notebooks are for experimenting: while writing modular, reusable code in notebooks helps with sharing and with preparing the move to production, notebooks are not meant to be production-ready in themselves. All these tools are meant to facilitate the experimentation phase and pinpoint the right algorithms to be used in a production environment.

If you have found it interesting:

You can check out my other articles, subscribe to get notified when I publish new ones, or connect and reach out to me on LinkedIn. Thank you for reading :)
