How to measure the relationship between variables

An attempt to explain covariance and correlation in the simplest form.


Analyzing and visualizing variables one at a time is not enough. To draw conclusions during exploratory data analysis, we need to understand how the variables in a dataset relate to each other. There are numerous ways to analyze this relationship visually; one of the most common is the scatterplot. But scatterplots come with certain limitations, which we will see in later sections. Quantitatively, covariance and correlation are used to describe the relationship between variables.


A scatterplot is one of the most common visual forms when it comes to comprehending the relationship between variables at a glance. In the simplest form, this is nothing but a plot of Variable A against Variable B: one is plotted on the x-axis and the other on the y-axis.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6+
df = pd.read_csv('weight-height.csv')
df.head()
plt.plot(df.Height, df.Weight,'o',markersize=2, color='brown')
Image 1: Scatterplot Height vs Weight: Positive Relationship

In the above graph, it’s easy to see a positive relationship between the two variables, i.e. as one increases, the other increases as well. A scatterplot with a negative relationship, i.e. where one variable decreases as the other increases, may take the form of Image 2.

#Just for demonstration purposes I have taken 'a' and 'b'
import numpy as np
import random
import matplotlib.pyplot as plt
a = np.random.rand(100)*70
b = 100-a
plt.plot(a, b,'o',markersize=2, color='brown')
Image 2: Scatterplot a Vs b: Negative Relationship

A scatterplot with no apparent relationship between the two variables would take the form of Image 3:

import numpy as np
import random
import matplotlib.pyplot as plt
a = np.random.rand(1000)*70
b = np.random.rand(1000)*100
plt.plot(a, b,'o',markersize=2, color='brown')
Image 3: Scatterplot a Vs b: No apparent relationship

In general, scatterplots are best for analyzing two continuous variables. Visualizing two discrete variables using a scatterplot may cause the data points to overlap. Let’s see what a scatterplot looks like in the case of discrete variables.

x = [1,1,1,1,2,2,2,2,3,3,3,3]
y = [10,15,15,15,16,16,20,20,20,25,25,25]
plt.plot(x,y,'o',markersize=5, color='brown')
Image 4: Scatterplot X vs Y: Discrete Variables

In the above image, not all points are visible because identical (x, y) pairs are drawn on top of each other. To overcome this, we add a small amount of random noise, called “jitter”, to the data. This process, jittering, makes the overlapping points easier to tell apart.

def Jitter(values, jitter):
    # Add uniform random noise in [-jitter, +jitter] to every value
    n = len(values)
    return np.random.uniform(-jitter, +jitter, n) + values

y1 = Jitter(y,0.9)
plt.plot(x,y1,'o',markersize=4, color='brown')
Image 5: Scatterplot X vs Y1: With Jitter added

As seen in Image 5, more data points are now visible. However, jitter should be used only for visualization purposes and should be avoided for analysis purposes.

There can be an overlap of data in the case of continuous variables as well, where overlapping points can hide in the dense part of the data and outliers may be given disproportionate emphasis as seen in Image 1. This is called Saturation.

Scatterplots come with their own disadvantages: they don’t provide a quantitative measure of the relationship, only a visual impression of how one variable changes with the other. We also can’t use a single scatterplot to display the relationship between more than two variables. Covariance and correlation solve both these problems.


Covariance measures how variables vary together. A positive covariance means that the variables tend to vary in the same direction, a negative covariance means they tend to vary in opposite directions, and a covariance of 0 means there is no linear relationship between them. Note that zero covariance does not by itself mean the variables are independent: independent variables have zero covariance, but the reverse is not guaranteed. In other words, if there are two variables X & Y, positive covariance means larger values of X tend to go with larger values of Y, and negative covariance means larger values of X tend to go with smaller values of Y.

Mathematically, Cov(x,y) is given by the following formula, where dxi = xi-xmean, dyi = yi-ymean, and n is the number of observations:

Cov(x,y) = (Sum of (dxi*dyi))/n

Note that this is the formula for the covariance of a population; when calculating the covariance of a sample, 1/n is replaced by 1/(n-1). Why that is so is beyond the scope of this article.

Let’s understand this with an example: 
x = [34,56,78,23]
y = [20,45,91,16]
=> xmean = 47.75
=> ymean = 43
=> Sum of (dxi*dyi) = (34-47.75)*(20-43) + (56-47.75)*(45-43) + (78-47.75)*(91-43) + (23-47.75)*(16-43) = 2453
=> Cov(x,y) = 2453/4 = 613.25
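
The arithmetic above can be checked with NumPy. This is only a minimal sketch: the variable names are illustrative, and np.cov is called with bias=True to get the population formula, since by default it divides by n-1.

```python
import numpy as np

x = np.array([34, 56, 78, 23])
y = np.array([20, 45, 91, 16])

# Deviations from the mean (dxi and dyi in the text)
dx = x - x.mean()
dy = y - y.mean()

# Population covariance: average of the products of deviations
cov_population = (dx * dy).sum() / len(x)

# Sample covariance: divide by n-1 instead of n
cov_sample = (dx * dy).sum() / (len(x) - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(x, y)
cov_numpy_population = np.cov(x, y, bias=True)[0, 1]
cov_numpy_sample = np.cov(x, y)[0, 1]

print(cov_population)  # 613.25
print(cov_sample)
```

The sample version is always a little larger in magnitude than the population version, because the same sum is divided by a smaller number.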

In the above example, we can clearly see that as x increases, y increases too and hence we get a positive covariance. Now, let’s consider that x and y have units. x is height in ‘cm’ and y is weight in ‘lbs’. The unit for covariance would then be cm-lbs. Whatever that means!

Covariance can take practically any value, so its magnitude is hard to interpret: it tells us the direction of the relationship but not how strong it is. Correlation overcomes this by rescaling the measure to the range -1 to 1.


Correlation again provides quantitative information regarding the relationship between variables. Measuring correlation can be challenging if the variables have different units or if the data distributions of the variables are different from each other. Two methods of calculating correlation can help with these issues: 1) Pearson Correlation 2) Spearman Rank Correlation.

Both these methods of calculating correlation involve transforming the data in the variables being compared to some standard comparable format. Let’s see what transformations are done in both these methods.

Pearson Correlation

Pearson correlation involves transforming each value of a variable to a standard score or Z score, i.e. the number of standard deviations the value is away from the mean, and then taking the mean of the products of the corresponding standard scores.

Z score = (Xi-Xmean)/Sigma, where Sigma is the standard deviation

Suppose we have 2 variables 'x' and 'y' 
Z score of x i.e. Zx = (x-xmean)/Sx
Where xmean is the mean and Sx is the standard deviation of x
Translating this info to our understanding of Pearson Correlation (p):
=> pi = Zxi*Zyi
=> pi = ((xi-xmean)*(yi-ymean))/(Sx*Sy)
=> p = mean of pi values
=> p = (sum of all values of pi)/n
=> p = (summation (xi-xmean)*(yi-ymean))/(Sx*Sy*n)
As seen above: (summation (xi-xmean)*(yi-ymean))/n is actually Cov(x,y).
So we can rewrite Pearson correlation (p) as Cov(x,y)/(Sx*Sy)
NOTE: Here, pi is the i-th product of standard scores, not the mathematical constant Pi (approximately 3.14159)
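
To see that the two forms of the formula agree, here is a small sketch that reuses the x and y from the covariance example. The variable names are illustrative, and np.std is used with its default (population) standard deviation so that it matches the 1/n covariance.

```python
import numpy as np

x = np.array([34, 56, 78, 23])
y = np.array([20, 45, 91, 16])

# Standard scores, using the population standard deviation (np.std default, ddof=0)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Pearson correlation as the mean of the products of z-scores
p_from_z = (zx * zy).mean()

# The same value written as Cov(x, y) / (Sx * Sy)
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()
p_from_cov = cov_xy / (x.std() * y.std())

# NumPy's built-in correlation matrix, for comparison
p_numpy = np.corrcoef(x, y)[0, 1]

print(p_from_z, p_from_cov, p_numpy)
```

All three printed values should match up to floating-point rounding, and unlike the covariance of 613.25 they carry no units.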

Pearson correlation ‘p’ will always be in the range of -1 to 1. A positive value of ‘p’ means as ‘x’ increases ‘y’ increases too, negative means as ‘x’ increases ‘y’ decreases and 0 means there is no apparent linear relationship between ‘x’ and ‘y’. Note that a zero Pearson correlation doesn’t imply ‘no relationship’; it simply means that there isn’t a linear relationship between ‘x’ and ‘y’.
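
A quick sketch of that last point, with made-up data: y below is completely determined by x, yet the Pearson correlation comes out as effectively zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# x is symmetric around zero and y is fully determined by x
x = np.arange(-5, 6)  # -5, -4, ..., 5
y = x ** 2

p = np.corrcoef(x, y)[0, 1]
print(p)
```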

Pearson correlation ‘p’ = 1 means a perfect positive relationship, however, a value of 0.5 or 0.4 implies there is a positive relationship but the relationship may not be as strong. The magnitude or the value of Pearson correlation determines the strength of the relationship.

But again, Pearson correlation does come with certain disadvantages. It doesn’t work well if there are outliers in the data, as a few extreme values can strongly affect it. Pearson correlation works well if the change in variable x with respect to variable y is linear, i.e. the change happens at a constant rate, and when x and y are both somewhat normally distributed or the data is on an interval scale.

These disadvantages of Pearson correlation can be overcome using the Spearman Rank Correlation.

Spearman Rank Correlation

In the Spearman method, we transform each of the values in both variables to its corresponding rank in the given variable and then calculate the Pearson correlation of the ranks.

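Consider x = [23,98,56,1,0,56,1999,12], 
Corresponding Rankx = [4,7,5,2,1,6,8,3]
(The two tied values of 56 are given ranks 5 and 6 here; the usual convention assigns tied values their average rank, 5.5.)
Similarly, for y = [5,92,88,45,2,54,90,1],
Corresponding Ranky = [3,8,6,4,2,5,7,1]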

Looking at Rankx and Ranky, the advantage of this method is apparent. Neither Rankx nor Ranky contains any outliers: even if the actual data has outliers, each outlier is converted into a rank, which is nothing but the relative position of the number in the dataset. Hence, this method is robust against outliers. This method also solves the problem of data distributions: the distribution of the ranks is always uniform (apart from ties). We then calculate the Pearson correlation of Rankx and Ranky using the formula seen in the Pearson correlation section.

The Spearman rank method works well:

  1. When x changes as y does, but not necessarily at a constant rate, i.e. when the relationship between x and y is monotonic but not necessarily linear
  2. When x and y have different data distributions or non-normal distribution
  3. If you want to avoid the effect of outliers
  4. When data is on an ordinal scale
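
The rank transformation can be sketched with pandas and NumPy on the x and y from the example above. This is an illustration, not a reference implementation: it relies on Series.rank, which by default gives tied values their average rank, and then takes the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series([23, 98, 56, 1, 0, 56, 1999, 12])
y = pd.Series([5, 92, 88, 45, 2, 54, 90, 1])

# Rank each variable; by default pandas gives tied values their average rank
rank_x = x.rank()
rank_y = y.rank()

# Spearman correlation = Pearson correlation of the ranks
spearman = np.corrcoef(rank_x, rank_y)[0, 1]

# Pearson correlation of the raw values, which include the outlier 1999
pearson = np.corrcoef(x, y)[0, 1]

print(rank_x.tolist())
print(spearman, pearson)
```

On this data the rank-based value comes out noticeably higher than the raw Pearson value, because the single very large x no longer dominates the calculation.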

PyCaret — the library for low-code ML

Train, visualize, evaluate, interpret, and deploy models with minimal code

When we approach supervised machine learning problems, it can be tempting to just see how a random forest or gradient boosting model performs and stop experimenting if we are satisfied with the results. What if you could compare many different models with just one line of code? What if you could reduce each step of the data science process from feature engineering to model deployment to just a few lines of code?

This is exactly where PyCaret comes into play. PyCaret is a high-level, low-code Python library that makes it easy to compare, train, evaluate, tune, and deploy machine learning models with only a few lines of code. At its core, PyCaret is basically just a large wrapper over many data science libraries such as Scikit-learn, Yellowbrick, SHAP, Optuna, and spaCy. Yes, you could use these libraries for the same tasks, but if you don’t want to write a lot of code, PyCaret could save you a lot of time.

In this article, I will demonstrate how you can use PyCaret to quickly and easily build a machine learning project and prepare the final model for deployment.

Installing PyCaret

PyCaret is a large library with a lot of dependencies. I would recommend creating a virtual environment specifically for PyCaret using Conda so that the installation does not impact any of your existing libraries. To create and activate a virtual environment in Conda, run the following commands:

conda create --name pycaret_env python=3.6
conda activate pycaret_env

To install the default, smaller version of PyCaret with only the required dependencies, you can run the following command.

pip install pycaret

To install the full version of PyCaret, you should run the following command instead.

pip install pycaret[full]

Once PyCaret has been installed, add the environment to Jupyter as a kernel while it is still active, and then deactivate it, with the following commands.

python -m ipykernel install --user --name pycaret_env --display-name "pycaret_env"
conda deactivate

Now, after launching a Jupyter Notebook in your browser, you should be able to see the option to change your environment to the one you just created.

Changing the Conda virtual environment in Jupyter.

Import Libraries

You can find the entire code for this article in this GitHub repository. In the code below, I simply imported Numpy and Pandas for handling the data for this demonstration.

import numpy as np
import pandas as pd

Read the Data

For this example, I used the California Housing Prices Dataset available on Kaggle. In the code below, I read this dataset into a dataframe and displayed the first ten rows of the dataframe.

housing_data = pd.read_csv('./data/housing.csv')
housing_data.head(10)
First ten rows of the housing dataset.

The output above gives us an idea of what the data looks like. The data contains mostly numerical features with one categorical feature for the proximity of each house to the ocean. The target column that we are trying to predict is the median_house_value column. The entire dataset contains a total of 20,640 observations.

Initialize Experiment

Now that we have the data, we can initialize a PyCaret experiment, which will preprocess the data and enable logging for all of the models that we will train on this dataset.

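from pycaret.regression import *
reg_experiment = setup(housing_data, 
                       target = 'median_house_value',
                       log_experiment = True)  # assumed: the original call is truncated here; log_experiment enables the logging mentioned above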

As demonstrated in the GIF below, running the code above preprocesses the data and then produces a dataframe with the options for the experiment.

Pycaret setup function output.

Compare Baseline Models

We can compare different baseline models at once to find the model that achieves the best K-fold cross-validation performance with the compare_models function as shown in the code below. I have excluded XGBoost in the example below for demonstration purposes.

best_model = compare_models(exclude=['xgboost'], fold=5)
Results of comparing different models.

The function produces a data frame with the performance statistics for each model and highlights the metrics for the best performing model, which in this case was the CatBoost regressor.
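
To give a rough idea of what a call like compare_models automates, here is a sketch written directly against scikit-learn. It is not PyCaret’s implementation: the synthetic dataset, the three candidate models, and the choice of MAE as the ranking metric are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption; not the housing dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hypothetical shortlist of candidate models
candidates = {
    "lr": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validation; scikit-learn reports MAE as a negative score
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    rows.append({"model": name, "MAE": -scores.mean()})

# Rank models from lowest to highest mean absolute error
results = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(results)
```

Writing this loop by hand gives more control over the candidates and metrics, at the cost of the extra code that PyCaret hides behind a single call.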

Creating a Model

We can also train a model in just a single line of code with PyCaret. The create_model function simply requires a string corresponding to the type of model that you want to train. You can find a complete list of acceptable strings and the corresponding regression models on the PyCaret documentation page for this function.

catboost = create_model('catboost')

The create_model function produces the dataframe above with cross-validation metrics for the trained CatBoost model.

Hyperparameter Tuning

Now that we have a trained model, we can optimize it even further with hyperparameter tuning. With just one line of code, we can tune the hyperparameters of this model as demonstrated below.

tuned_catboost = tune_model(catboost, n_iter=50, optimize = 'MAE')
Results of hyperparameter tuning with 10-fold cross-validation.

The most important results, in this case, the average metrics, are highlighted in yellow.

Visualizing the Model’s Performance

There are many plots that we can create with PyCaret to visualize a model’s performance. PyCaret uses another high-level library called Yellowbrick for building these visualizations.

Residual Plot

The plot_model function will produce a residual plot by default for a regression model as demonstrated below.

plot_model(tuned_catboost)
Residual plot for the tuned CatBoost model.

Prediction Error

We can also visualize the predicted values against the actual target values by creating a prediction error plot.

plot_model(tuned_catboost, plot = 'error')
Prediction error plot for the tuned CatBoost regressor.

The plot above is particularly useful because it gives us a visual representation of the R² coefficient for the CatBoost model. In a perfect scenario (R² = 1), where the predicted values exactly matched the actual target values, this plot would simply contain points along the dashed identity line.
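
For readers who want the number behind the picture, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up actual and predicted values, not output from the CatBoost model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = ((y_true - y_pred) ** 2).sum()         # squared error left unexplained
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation around the mean
r2 = 1 - ss_res / ss_tot

# Predictions that sit exactly on the identity line give R² = 1
r2_perfect = 1 - ((y_true - y_true) ** 2).sum() / ss_tot

print(r2, r2_perfect)
```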

Feature Importances

We can also visualize the feature importances for a model as shown below.

plot_model(tuned_catboost, plot = 'feature')
Feature importance plot for the CatBoost regressor.

Based on the plot above, we can see that the median_income feature is the most important feature when predicting the price of a house. Since this feature corresponds to the median income in the area in which a house was built, this evaluation makes perfect sense. Houses built in higher-income areas are likely more expensive than those in lower-income areas.

Evaluating the Model Using All Plots

We can also create multiple plots for evaluating a model with the evaluate_model function.

evaluate_model(tuned_catboost)
The interface created using the evaluate_model function.

Interpreting the Model

The interpret_model function is a useful tool for explaining the predictions of a model. This function uses a library for explainable machine learning called SHAP, which I covered in the article “How to make your machine learning models more explainable”.

With just one line of code, we can create a SHAP beeswarm plot for the model.

interpret_model(tuned_catboost)
SHAP plot produced by calling the interpret_model function.

Based on the plot above, we can see that the median_income field has the greatest impact on the predicted house value.


PyCaret also has a function for running automated machine learning (AutoML). We can specify the loss function or metric that we want to optimize and then just let the library take over as demonstrated below.

automl_model = automl(optimize = 'MAE')

In this example, the AutoML model also happens to be a CatBoost regressor, which we can confirm by printing out the model.

print(automl_model)

Running the print statement above produces the following output:

<catboost.core.CatBoostRegressor at 0x7f9f05f4aad0>

Generating Predictions

The predict_model function allows us to generate predictions by either using data from the experiment or new unseen data.

pred_holdouts = predict_model(automl_model)

The predict_model function above produces predictions for the hold-out set that was set aside when the experiment was initialized. The code also gives us a dataframe with performance statistics for the predictions generated by the AutoML model.

Predictions generated by the AutoML model.

In the output above, the Label column represents the predictions generated by the AutoML model. We can also produce predictions on the entire dataset as demonstrated in the code below.

new_data = housing_data.copy()
new_data.drop(['median_house_value'], axis=1, inplace=True)
predictions = predict_model(automl_model, data=new_data)

Saving the Model

PyCaret also allows us to save trained models with the save_model function. This function saves the transformation pipeline for the model to a pickle file.

save_model(automl_model, model_name='automl-model')

We can also load the saved AutoML model with the load_model function.

loaded_model = load_model('automl-model')

Printing out the loaded model produces the following output:

display_types=True, features_todrop=[],
id_columns=[], ml_usecase='regression',
('cluster_all', 'passthrough'),
('dummy', Dummify(target='median_house_value')),
('fix_perfect', Remove_100(target='median_house_value')),
('clean_names', Clean_Colum_Names()),
('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
('dfs', 'passthrough'), ('pca', 'passthrough'),
<catboost.core.CatBoostRegressor object at 0x7fb750a0aad0>]],

As we can see from the output above, PyCaret not only saved the trained model at the end of the pipeline but also the feature engineering and data preprocessing steps at the beginning of the pipeline. Now, we have a production-ready machine learning pipeline in a single file and we don’t have to worry about putting the individual parts of the pipeline together.
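
The idea of preprocessing and model traveling together in one serialized object can be sketched with a plain scikit-learn Pipeline and pickle. This is an analogy rather than what save_model does internally: the data, the two pipeline steps, and the in-memory round trip are assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

# Preprocessing and model bundled into one object
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

# Serialize and restore the whole pipeline in one step
restored = pickle.loads(pickle.dumps(pipeline))

# The restored object scales the inputs and predicts without extra setup
same = np.allclose(pipeline.predict(X), restored.predict(X))
print(same)
```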

Model Deployment

Now that we have a model pipeline that is ready for production, we can also deploy the model to a cloud platform such as AWS with the deploy_model function. Before running this function, you must run the following command to configure your AWS command-line interface if you plan on deploying the model to an S3 bucket:

aws configure

Running the code above will trigger a series of prompts for information like your AWS Secret Access Key that you will need to provide. Once this process is complete, you are ready to deploy the model with the deploy_model function.

deploy_model(automl_model, model_name = 'automl-model-aws', 
authentication = {'bucket' : 'pycaret-ca-housing-model'})

In the code above, I deployed the AutoML model to an S3 bucket named pycaret-ca-housing-model in AWS. From here, you can write an AWS Lambda function that pulls the model from S3 and runs in the cloud. PyCaret also allows you to load the model from S3 using the load_model function.

MLflow UI

Another nice feature of PyCaret is that it can log and track your machine learning experiments with a machine learning lifecycle tool called MLflow. Running the command below will launch the MLflow user interface in your browser from localhost.

!mlflow ui
MLflow dashboard.

In the dashboard above, we can see that MLflow keeps track of the runs for different models for your PyCaret experiments. You can view the performance metrics as well as the running times for each run in your experiment.

Pros and Cons of Using PyCaret

If you’ve read this far, you now have a basic understanding of how to use PyCaret. While PyCaret is a great tool, it comes with its own pros and cons that you should be aware of if you plan to use it for your data science projects.


Pros

  • Low-code library.
  • Great for simple, standard tasks and general-purpose machine learning.
  • Provides support for regression, classification, natural language processing, clustering, anomaly detection, and association rule mining.
  • Makes it easy to create and save complex transformation pipelines for models.
  • Makes it easy to visualize the performance of your model.


Cons

  • As of now, PyCaret is not ideal for text classification because the NLP utilities are limited to topic modeling algorithms.
  • PyCaret is not ideal for deep learning and doesn’t use Keras or PyTorch models.
  • You can’t perform more complex machine learning tasks such as image classification and text generation with PyCaret (at least with version 2.2.0).
  • By using PyCaret, you are sacrificing a certain degree of control for simple and high-level code.


In this article, I demonstrated how you can use PyCaret to complete all of the steps in a machine learning project ranging from data preprocessing to model deployment. While PyCaret is a useful tool, you should be aware of its pros and cons if you plan to use it for your data science projects. PyCaret is great for general-purpose machine learning with tabular data but as of version 2.2.0, it is not designed for more complex natural language processing, deep learning, and computer vision tasks. But it is still a time-saving tool and who knows, maybe the developers will add support for more complex tasks in the future?

Shapash: Making ML Models Understandable by Everyone

In this article, we will present Shapash, an open-source Python library that helps data scientists make their machine learning models more transparent and understandable by all!

Shapash Web App

Shapash by MAIF is a Python Toolkit that facilitates the understanding of Machine Learning models to data scientists. It makes it easier to share and discuss the model interpretability with non-data specialists: business analysts, managers, end-users…

Concretely, Shapash provides easy-to-read visualizations and a web app. Shapash displays results with appropriate wording (preprocessing inverse/postprocessing). Shapash is useful in an operational context as it enables data scientists to use explicability from exploration to production: you can easily deploy local explainability in production to complete each of your forecasts/recommendations with a summary of the local explainability.

In this post, we will present the main features of Shapash and how it operates. We will illustrate the implementation of the library on a concrete use case.

Elements of context:

Interpretability and explicability of models are hot topics. There are many articles, publications, and open-source contributions about it. All these contributions do not deal with the same issues and challenges.

Most data scientists use these techniques for many reasons: to better understand their models, to check that they are consistent and unbiased, as well as for debugging.

However, there is more to it:

Intelligibility matters for pedagogic purposes. Intelligible Machine Learning models can be debated with people that are not data specialists: business analysts, final users…

Concretely, there are two steps in our Data Science projects that involve non-specialists:

  • Exploratory step & Model fitting:

At this step, data scientists and business analysts discuss what is at stake and define the essential data they will integrate into the project. This requires a good understanding of the subject and of the main drivers of the problem being modeled.

To do this, data scientists study global explicability, features importance, and which role the model’s top features play. They can also locally look at some individuals, especially outliers. A Web App is interesting at this phase because they need to look at visualizations and graphics. Discussing these results with business analysts is interesting to challenge the approach and validate the model.

  • Deploying the model in a production environment

That’s it! The model is validated, deployed, and gives predictions to the end-users. Local explicability can bring them a lot of value, only if there is a way to provide them with a good, useful, and understandable summary. It will be valuable to them for two reasons:

  • Transparency brings trust: Users will trust models if they understand them.
  • Human stays in control: No model is 100% reliable. When they can understand the algorithm’s outputs, users can overturn the algorithm suggestions if they think they rest on incorrect data.

Shapash has been developed to help data scientists to meet these needs.


Shapash key features:

  • Easy-to-read visualizations, for everyone.
  • A web app: To understand how a model works, you have to look at multiple graphs, features importance and global contribution of a feature to a model. A web app is a useful tool for this.
  • Several methods to show results with appropriate wording (preprocessing inverse, post-processing). You can easily add your data dictionaries, category-encoders object or sklearn ColumnTransformer for more explicit outputs.
  • Functions to easily save Pickle files and to export results in tables.
  • Explainability summary: the summary is configurable to fit with your need and to focus on what matters for local explicability.
  • Ability to easily deploy in a production environment and to complete every prediction/recommendation with a local explicability summary for each operational app (Batch or API)
  • Shapash is open to several ways of proceeding: it can be used to get to results quickly or to work on better wording. Very few arguments are required to display results. But the more you work on cleaning and documenting the dataset, the clearer the results will be for the end-user.

Shapash works for Regression, Binary Classification or Multiclass problems.
It is compatible with many models: CatBoost, XGBoost, LightGBM, Sklearn Ensemble, linear models, SVM.

Shapash is based on local contributions calculated with Shap (shapley values), Lime, or any technique which allows computing summable local contributions.
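
To make “summable local contributions” concrete, here is a sketch for the simplest case, a linear model, where each feature’s contribution to a row can be written as coefficient × (value − feature mean). It only illustrates the additivity property; it is not how Shap, Lime, or Shapash compute contributions for arbitrary models, and the data and names are made up.

```python
import numpy as np

# Made-up data and a hand-written linear model (assumptions for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
weights = np.array([2.0, -1.0, 0.5])
intercept = 4.0

predictions = X @ weights + intercept
baseline = predictions.mean()  # average prediction over the dataset

# Local contribution of each feature for each row
contributions = (X - X.mean(axis=0)) * weights

# Additivity: baseline + sum of a row's contributions recovers its prediction
reconstructed = baseline + contributions.sum(axis=1)
print(np.allclose(reconstructed, predictions))
```

Because the contributions of a row add up to its prediction (relative to the baseline), they can be ranked, filtered, and summarized feature by feature.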


You can install the package through pip:

$ pip install shapash

Shapash Demonstration

Let’s use Shapash on a concrete dataset. In the rest of this article, we will show you how Shapash can explore models.

We will use the famous “House Prices” dataset from Kaggle to fit a regressor … and predict house prices! Let’s start by loading the Dataset:

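import pandas as pd
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
y_df = house_df['SalePrice'].to_frame()  # target column of the House Prices dataset
X_df = house_df[house_df.columns.difference(['SalePrice'])]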

Encode the categorical features:

from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']
encoder = OrdinalEncoder(cols=categorical_features).fit(X_df)
X_df = encoder.transform(X_df)  # the model below needs numeric inputs

Train, test split and model fitting.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75)
reg = RandomForestRegressor(n_estimators=200, min_samples_leaf=2).fit(Xtrain,ytrain)

And predict test data…

y_pred = pd.DataFrame(reg.predict(Xtest), columns=['pred'], index=Xtest.index)

Let’s discover and use Shapash SmartExplainer.

Step 1 — Import

from shapash.explainer.smart_explainer import SmartExplainer

Step 2 — Initialise a SmartExplainer Object

xpl = SmartExplainer(features_dict=house_dict) # Optional parameter 
  • features_dict: dict that specifies the meaning of each column name of the x pd.DataFrame.

Step 3 — Compile

xpl.compile(x=Xtest, model=reg,
    preprocessing=encoder, # Optional: use inverse_transform method
    y_pred=y_pred) # Optional

The compile method also accepts another optional parameter: postprocess. It lets you apply additional functions (regex, mapping dict, …) to the values to get better wording.

Now, we can display results and understand how the regression model works!

Step 4 — Launching the Web App

app = xpl.run_app()

The web app link appears in Jupyter output (access the demo here).

There are four parts in this Web App:


The four parts interact with each other to make exploring the model easy.

Features Importance: you can click on each feature to update the contribution plot below.

Contribution plot: How does a feature influence the prediction? Display violin or scatter plot of each local contribution of the feature.

Local Plot:

  • Local explanation: which features contribute the most to the predicted value.
  • You can use several buttons/sliders/lists to configure the summary of this local explainability. The parameters you can use to shape this summary are described below, with the filter method.
  • This web app is a useful tool to discuss with business analysts the best way to summarize the explainability to meet operational needs.

Selection Table: It allows the Web App user to select:

  • A subset to focus the exploration on this subset
  • A single row to display the associated local explanation

How to use the Data table to select a subset? At the top of the table, just below the name of the column that you want to use to filter, specify:

  • =Value, >Value, <Value
  • If you want to select every row containing a specific word, just type that word without “=”

There are a few options available on this web app (top right button). The most important one is probably the size of the sample (default: 1000). To avoid latency, the web app relies on a sample to display the results. Use this option to modify this sample size.

To kill the app:

app.kill()


Step 5 — The plots

All the plots are available in Jupyter notebooks. The paragraphs below describe the key points of each plot.

Feature Importance

The selection parameter of the features_importance plot allows comparing the feature importance of a subset. It is useful for detecting specific behavior in a subset.

subset = [ 168, 54, 995, 799, 310, 322, 1374,
1106, 232, 645, 1170, 1229, 703, 66,
886, 160, 191, 1183, 1037, 991, 482,
725, 410, 59, 28, 719, 337, 36 ]
xpl.plot.features_importance(selection=subset)

Contribution plot

Contribution plots are used to answer questions like:

How does a feature impact my prediction? Does it contribute positively? Is the feature increasingly contributing? Decreasingly? Are there any threshold effects? For a categorical variable, how does each modality contribute? …

This plot complements the feature importance plot for interpretability, giving a more global view of the model that helps to better understand the influence of a feature on it.

There are several parameters on this plot. Note that the plot displayed adapts depending on whether you are interested in a categorical or continuous variable (Violin or Scatter) and depending on the type of use case you address (regression, classification).

Contribution plot applied to a continuous feature

Classification Case: Titanic Classifier — Contribution plot applied to categorical feature.

Local plot

You can use local plots for local explainability of models.

The filter() and local_plot() methods allow you to test and choose the best way to summarize the signal that the model has picked up. You can use them during the exploratory phase. You can then deploy this summary in a production environment so that the end-user can understand in a few seconds which criteria are the most influential for each recommendation.

We will publish a second article to explain how to deploy local explainability in production.

Combine the filter and local_plot method

Use the filter method to specify how to summarize local explainability. You have four parameters to configure your summary:

  • max_contrib: maximum number of criteria to display
  • threshold: minimum value of the contribution (in absolute value) necessary to display a criterion
  • positive: display only positive contribution? Negative? (default None)
  • features_to_hide: list of features you don’t want to display

After defining these parameters, we can display the results with the local_plot() method, or export them with to_pandas().


Export to pandas DataFrame:

summary_df = xpl.to_pandas()

Compare plot

With the compare_plot() method, the SmartExplainer object makes it possible to understand why two or more individuals do not have the same predicted values. The most decisive criterion appears at the top of the plot.

xpl.plot.compare_plot(row_num=[0, 1, 2, 3, 4], max_features=8)

We hope that Shapash will be useful in building trust in AI.

Tips to Make the Most of Pandas Groupby Function

Boost your exploratory data analysis process


Pandas is a highly popular data analysis and manipulation library. It provides numerous functions to perform efficient data analysis. Furthermore, its syntax is simple and easy-to-understand.

In this article, we focus on a particular function of Pandas, the groupby. It is used to group the data points (i.e. rows) based on the categories or distinct values in a column. We can then calculate a statistic or apply a function on a numerical column with regards to the grouped categories.

The process will be clear as we go through the examples. Let’s start by importing the libraries.

import numpy as np
import pandas as pd

We also need a dataset for the examples. We will use a small sample from the Melbourne housing dataset available on Kaggle.

df = pd.read_csv("/content/melb_data.csv", usecols = ['Price','Landsize','Distance','Type', 'Regionname'])
df = df[(df.Price < 3_000_000) & (df.Landsize < 1200)].sample(n=1000).reset_index(drop=True)
df.head()

I have only read a small part of the original dataset. The usecols parameter of the read_csv function allows for reading only the given columns of the csv file. I have also filtered out the outliers with regards to the price and land size. Finally, a random sample of 1000 observations (i.e. rows) is selected using the sample function.

Before starting on the tips, let’s use the groupby function to calculate the average distance for each category in the type column.
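
A minimal sketch of such a call, run on a small made-up DataFrame (the values are hypothetical stand-ins, not rows from the Melbourne dataset):

```python
import pandas as pd

# Hypothetical stand-in for the sampled housing data
df = pd.DataFrame({
    "Type": ["h", "h", "u", "u", "t", "t"],
    "Distance": [12.0, 14.0, 6.0, 8.0, 9.0, 11.0],
})

# Average distance for each category in the Type column
avg_by_type = df[["Type", "Distance"]].groupby("Type").mean()
print(avg_by_type)
```

The result is a dataframe indexed by the categories of the type column, with one aggregated value per group.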


The houses (h) are further away from the central business district than the other two types on average.

We can now start with the tips to use the groupby function more effectively.

1. Customize the column names

The groupby function does not change or customize the column names so we do not really know what the aggregated values represent. For instance, in the previous example, it would be more informative to change the column name from “distance” to “avg_distance”.

One way to accomplish this is to use the agg function instead of the mean function.

df[['Type','Distance']].groupby('Type').agg(avg_distance = ('Distance', 'mean'))

We can always change the column name afterwards but this method is more practical.

Customizing the column names becomes more important if we aggregate multiple columns or apply different functions to one column. The agg function accepts multiple aggregations. We just need to specify the column name and the function.

For instance, we can calculate the average and median distance values for each category in the type column as below.

df[['Type','Distance']].groupby('Type').agg(
    avg_distance = ('Distance', 'mean'),
    median_distance = ('Distance', 'median')
)

2. Lambda expressions

A lambda expression is a special form of function in Python. In general, lambda expressions are used without a name, so we do not define them with the def keyword like normal functions.

The main motivations behind lambda expressions are simplicity and practicality. They are one-liners and are usually used only once.

The agg function accepts lambda expressions. Thus, we can perform more complex calculations and transformations along with the groupby function.

For instance, we can calculate the average price for each type and convert it to millions with one lambda expression.

df[['Type','Price']].groupby('Type').agg(avg_price_million = ('Price', lambda x: x.mean() / 1_000_000))

3. As_index parameter

The groupby function assigns the groups to the index of the returned dataframe. In the case of nested groups, the resulting multi-level index is hard to read.

df[['Type','Regionname', 'Distance']]\
.groupby(['Type','Regionname']).mean().head()

If we want to perform analysis on this dataframe later on, it is not practical to have the type and region name columns as index. We can always use the reset_index function but there is a better way.

If the as_index parameter of the groupby function is set to False, the grouped columns are represented as columns instead of index.

df[['Type','Regionname', 'Distance']]\
.groupby(['Type','Regionname'], as_index=False).mean().head()

4. Missing values

The groupby function ignores the missing values by default. Let’s first update some of the values in the type column as missing.

df.iloc[100:150, 0] = np.nan

The iloc function selects row-column combinations by using indices. The code above sets rows 100 through 149 (the end of the slice is exclusive) of the first column (index 0) to missing values (np.nan).

If we try to calculate the average distance for each category in the type column, we will not get any information about the missing values.


In some cases, we also need to get an overview of the missing values. It may affect how we aim to handle them. Setting the dropna parameter of the groupby function to False also calculates the aggregations for the missing values.

df[['Type','Distance']].groupby('Type', dropna=False).mean()
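
To make the difference concrete, here is a small sketch on a made-up DataFrame (the toy name and its values are hypothetical). It assumes a pandas version that supports the dropna parameter of groupby (1.1 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing Type
toy = pd.DataFrame({
    "Type": ["h", "h", "u", np.nan, np.nan],
    "Distance": [10.0, 12.0, 6.0, 20.0, 30.0],
})

# Default behaviour: rows with a missing group key are dropped
default_means = toy.groupby("Type").mean()

# dropna=False keeps the missing values as their own group
with_missing = toy.groupby("Type", dropna=False).mean()

print(default_means)
print(with_missing)
```

The second result has one extra row, labelled NaN, holding the average distance of the rows whose type is missing.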


The groupby function is one of the most frequently used functions in the exploratory data analysis process. It provides valuable insight into the relationships between variables.

It is important to use the groupby function efficiently to boost the data analysis process with Pandas. The 4 tips we have covered in this article will help you make the most of the groupby function.

Hyperplane in SVM Algorithm

In this post, we are going to introduce you to the Support Vector Machine (SVM) machine learning algorithm. We will follow a similar process to our recent post Naive Bayes for Dummies; A Simple Explanation by keeping it short and not overly-technical. The aim is to give those of you who are new to machine learning a basic understanding of the key concepts of this algorithm.

Support Vector Machines – What are they?

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.


Support Vectors

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

What is a hyperplane?

As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.

How do we find the right hyperplane?

Or, in other words, how do we best segregate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
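
As a rough illustration of the margin, the sketch below measures the distance of a few points to a given hyperplane. The hyperplane (w, b) and the points are made up for this example; it does not train an SVM, it only shows how the margin of one candidate hyperplane would be computed.

```python
import numpy as np

# Hypothetical hyperplane w.x + b = 0 in two dimensions (the line x1 + x2 = 5)
w = np.array([1.0, 1.0])
b = -5.0

# Two small made-up classes on either side of the line
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 3.0],    # class -1
              [4.0, 4.0], [5.0, 3.0], [3.0, 4.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Signed distance of each point to the hyperplane
signed = (X @ w + b) / np.linalg.norm(w)

# Every point is on the correct side when the sign matches its label
correct_side = bool(np.all(np.sign(signed) == y))

# Margin: distance from the hyperplane to the nearest point
margin = np.min(np.abs(signed))
nearest = X[np.argmin(np.abs(signed))]

print(correct_side, margin, nearest)
```

An SVM searches over candidate hyperplanes for the one that makes this margin as large as possible; the nearest points are the support vectors.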


But what happens when there is no clear hyperplane?

This is where it can get tricky. Data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below, which represent a dataset that is not linearly separable.

In order to classify a dataset like the one above, it’s necessary to move away from a 2D view of the data to a 3D view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This ‘lifting’ of the balls represents the mapping of data into a higher dimension. This is known as kernelling. You can read more on kernelling here.


Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
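
The ‘lifting’ idea can be sketched with an explicit mapping. The data below is made up (an inner cluster surrounded by a ring), and the extra dimension z = x1² + x2² is one hypothetical choice of mapping; kernel methods achieve a similar effect implicitly, without computing the new coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: an inner circle (class 0) surrounded by a ring (class 1)
angles = rng.uniform(0, 2 * np.pi, size=100)
inner = np.column_stack([0.5 * np.cos(angles[:50]), 0.5 * np.sin(angles[:50])])
outer = np.column_stack([2.0 * np.cos(angles[50:]), 2.0 * np.sin(angles[50:])])
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# No straight line separates the classes in 2D, so lift each point
# into a third dimension: z = x1^2 + x2^2
z = (X ** 2).sum(axis=1)

# In the lifted space, the flat plane z = 1 separates the two classes
predicted = (z > 1.0).astype(int)
accuracy = (predicted == y).mean()
print(accuracy)
```

In the original two dimensions, the plane z = 1 corresponds to a circle of radius 1 around the origin, which is why the boundary looks curved when projected back.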

Pros & Cons of Support Vector Machines


Pros

  • Accuracy
  • Works well on smaller cleaner datasets
  • It can be more efficient because it uses a subset of training points


Cons

  • Isn’t suited to larger datasets as the training time with SVMs can be high
  • Less effective on noisier datasets with overlapping classes

SVM Uses

SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.

There you have it, a very high level introduction to Support Vector Machines.

Improve Your ML Model Accuracy

6 Ways to Improve Your ML Model Accuracy

Simple steps for much better results


One of the most frustrating things that happen — more often than data scientists like to admit — after they spend hours upon hours gathering data, cleaning it, labeling it, and using it to train and develop a machine learning model is ending up with a model with low accuracy or large error range.

In machine learning, the term model accuracy refers to the measurements made to decide whether or not a certain model is the best to describe the relationship between the different problem variables. We often use training data (sample data) to train a model for new, unused data.

If our model has good accuracy, it will perform well on both the training data and the new one. Having a model with high accuracy is essential to the overall project’s success, and if you’re building it for a client, it’s important for your paycheck!

From a business perspective, performance equals money; if a model’s accuracy is low, it will result in more errors, which can be very costly. And I am not just talking about the financial aspect; imagine a model used to diagnose cancer or any other terminal diseases; a wrong diagnosis will not only cost the hospital money but will cost the patient and their family unnecessary emotional trauma.

So, how can we avoid all of that and improve the accuracy of our machine learning model? There are different ways a data scientist can improve their model’s accuracy; in this article, we will go through 6 of them. Let’s jump right in…

Most ML engineers are familiar with the quote, “Garbage in, garbage out”. Your model can perform only so much when the data it is trained upon is poorly representative of the actual scenario. What do I mean by ‘representative’? It refers to how well the training data population mimics the target population; the proportions of the different classes, or the point estimates (like mean, or median), and the variability (like variance, standard deviation, or interquartile range) of the training and target populations.

Generally, the larger the data, the more likely it is to be representative of the target population to which you want to generalize. If you want to generalize to the population of students in Grades 1 to 12 of a school, you cannot use a sample in which 80% of the students are from Grade 8, because predictions built on such a skewed dataset will be faulty. It is crucial to have a good understanding of the distribution of your target population in order to devise the right data collection techniques. Once you have the data, study the data (the exploratory data analysis phase) in order to determine its distribution and representativeness.

Outliers, missing values, and outright wrong or false data are some of the other considerations that you might have. Should you cap outliers at a certain value? Or remove them entirely? How about normalizing the values? Should you include data with some missing values? Or use the mean or median values instead to replace the missing values? Does the data collection method support the integrity of the data? These are some of the questions that you must evaluate before thinking about the model. Data cleaning is probably the most important step after data collection.

Method 1: Add more data samples

Data tells a story only if you have enough of it. Every data sample adds some input and perspective to the overall story your data is trying to tell you. Perhaps the easiest and most straightforward way to improve your model’s performance and increase its accuracy is to add more data samples to the training data.

Doing so will add more detail to your data and finetune your model, resulting in more accurate performance. Remember, the more information you give your model, the more it will learn and the more cases it will be able to identify correctly.

Method 2: Look at the problem differently

Sometimes adding more data may not be the answer to your model’s inaccuracy problem. You’re providing your model with a good technique and the correct dataset, but you’re not getting the results you hope for. Why?

Maybe you’re just asking the wrong questions or trying to hear the wrong story. Looking at the problem from a new perspective can add valuable information to your model and help you uncover hidden relationships between the story variables. Asking different questions may lead to better results and, eventually, better accuracy.

Method 3: Add some context to your data.

Context is important in any situation, and training a machine learning model is no different. Sometimes, one point of data can’t tell a story, so you need to add more context for any algorithm we intend to apply to this data to have a good performance.

More context can always lead to a better understanding of the problem and, eventually, better performance of the model. Imagine I tell you I am selling a car, a BMW. That alone doesn’t give you much information about the car. But, if I add the color, model and distance traveled, then you’ll start to have a better picture of the car and its possible value.

Method 4: Finetune your hyperparameters

Training a machine learning model is a skill that you can only hone with practice. Yes, there are rules you can follow to train your model, but these rules don’t give you the answer you’re seeking, only the way to reach that answer.

To get the answer, you will need to do some trial and error. When I first started learning the different machine learning algorithms, such as K-means, I was lost on choosing the best number of clusters to reach the optimal results. The way to optimize the results is to tune the algorithm’s hyperparameters. Tuning them carefully will often lead to better accuracy.

Method 5: Train your model using cross-validation

In machine learning, cross-validation is a technique used to enhance the model training process by dividing the overall training set into smaller chunks (folds). The model is trained on all but one of the chunks and evaluated on the chunk that was held out, and the process is repeated so that each chunk is used for evaluation once.

Using this approach, we can enhance the algorithm’s training process by training it on the different chunks and averaging over the results. Cross-validation is used to estimate and optimize the model’s performance. This approach is very popular because it’s simple and easy to implement.
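
A minimal sketch of 5-fold cross-validation with scikit-learn follows. The iris dataset and the decision tree are stand-ins chosen only for illustration; any estimator and dataset could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Example data and model (assumptions made for this sketch)
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: train on four folds, score on the held-out fold,
# and repeat until every fold has been held out once
scores = cross_val_score(model, X, y, cv=5)

print(scores)
print(scores.mean())
```

The averaged score is usually a more reliable estimate of how the model will behave on new data than a single train/test split.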

Method 6: Experiment with a different algorithm.

What if you tried all the approaches we talked about so far and your model still results in a low or average accuracy? What then?

Sometimes we choose an algorithm to implement that doesn’t really apply to the data we have, and so we don’t get the results we expect. In that case, consider changing the algorithm you’re using to implement your solution. Trying out different algorithms will lead you to uncover more details about your data and the story it’s trying to tell.


One of the most difficult things to learn as a new data scientist and to master as a professional one is improving your machine learning model’s accuracy. If you’re a freelance developer, own your own company, or have a role as a data scientist, having a high accuracy model can make or break your entire project.

A machine learning model with low accuracy can cause more than just financial loss. If the model is used in a sensitive scope, such as any medical application, an error in that model can lead to trauma and emotional loss for people involved with that model’s results.

Luckily, there are various simple yet efficient ways to increase the accuracy of a model and save much of the time, money, and effort that would otherwise be wasted on mitigating errors when the model’s accuracy is low.

Improving the accuracy of a machine learning model is a skill that can only improve with practice. The more projects you build, the better your intuition will get about which approach you should use next time to improve your model’s accuracy. With time, your models will become more accurate and your projects more concrete.

Hyperparameters and Parameters

A machine learning model is defined as a mathematical model with a number of parameters that need to be learned from the data. By training a model with existing data, we are able to fit the model parameters.

However, there is another kind of parameter, known as hyperparameters, that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model such as its complexity or how fast it should learn.

Some examples of model hyperparameters include:

  1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
  2. The learning rate for training a neural network.
  3. The C and sigma hyperparameters for support vector machines.
  4. The k in k-nearest neighbors.

The aim of this article is to explore various strategies to tune hyperparameters for a machine learning model.

Models can have many hyperparameters, and finding the best combination of them can be treated as a search problem. Two common strategies for hyperparameter tuning are GridSearchCV and RandomizedSearchCV.

In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values. This approach is called GridSearchCV because it searches for the best set of hyperparameters in a grid of hyperparameter values.

For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic Regression Classifier model, each with a different set of values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters, and will return the best one.

As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4].
For a combination C=0.3 and Alpha=0.2, performance score comes out to be 0.726(Highest), therefore it is selected.
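
To see how quickly the grid grows, the short sketch below enumerates the combinations for the values listed above, using only the standard library. The variable names are chosen for this illustration.

```python
from itertools import product

# Candidate values from the example above
C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.1, 0.2, 0.3, 0.4]

# Every (C, Alpha) pair that a grid search would have to evaluate
grid = list(product(C_values, alpha_values))

print(len(grid))
print(grid[:3])
```

With 5 values of C and 4 values of Alpha there are 5 × 4 = 20 models to fit, and with 5-fold cross-validation each of them is trained 5 times.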

Following code illustrates how to use GridSearchCV

# Necessary imports
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv = 5)

# Fitting on the data (X and y are assumed to be the feature matrix and labels, loaded beforehand)
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))


Tuned Logistic Regression Parameters: {‘C’: 3.7275937203149381}
Best score is 0.7708333333333334

Drawback: GridSearchCV goes through every combination of hyperparameters in the grid, which makes grid search computationally very expensive.

RandomizedSearchCV addresses this drawback by going through only a fixed number of hyperparameter settings. It samples settings at random from the given values or distributions to find the best set of hyperparameters. This approach reduces unnecessary computation.

Following code illustrates how to use RandomizedSearchCV

# Necessary imports
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Creating the hyperparameter grid
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating RandomizedSearchCV object
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 5)

# Fitting on the data (X and y are assumed to be the feature matrix and labels, loaded beforehand)
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))


Tuned Decision Tree Parameters: {‘min_samples_leaf’: 5, ‘max_depth’: 3, ‘max_features’: 5, ‘criterion’: ‘gini’}
Best score is 0.7265625

Decision Tree Classifier in Python and Scikit-Learn

Decision Tree Classifier for building a classification model using Python and Scikit

Decision Tree Classifier is a classification model that can be used for simple classification tasks where the data space is not huge and can be easily visualized. Despite being simple, it shows very good results on simple tasks and can outperform other, more complicated models.

Article Overview:

  • Decision Tree Classifier Dataset
  • Decision Tree Classifier in Python with Scikit-Learn
  • Decision Tree Classifier – preprocessing
  • Training the Decision Tree Classifier model
  • Using our Decision Tree model for predictions
  • Decision Tree Visualisation

Decision Tree Classifier Dataset

Recently I’ve created a small dummy dataset to use for simple classification tasks. I’ll paste the dataset here again for your convenience.

Decision Tree Classifier – training data

The purpose of this data is, given 3 facts about a certain moment (the weather, whether it is a weekend or a workday, and whether it is morning, lunch or evening), can we predict if there’s a traffic jam in the city?

Decision Tree Classifier in Python with Scikit-Learn

We have 3 dependencies to install for this project, so let’s install them now. Obviously, the first thing we need is the scikit-learn library, and then we need 2 more dependencies which we’ll use for visualization.

pip3 install scikit-learn
pip3 install matplotlib
pip3 install pydotplus

Decision Tree Classifier – installing dependencies

Now let’s import what we need from these packages.

from sklearn import preprocessing
from sklearn import tree
from IPython.display import Image
import pydotplus

Decision Tree Classifier – importing dependencies

def getWeather():
    return ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']

def getTimeOfWeek():
    return ['Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend']

def getTimeOfDay():
    return ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening']

def getTrafficJam():
    return ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes']

Decision Tree Classifier – loading the data

Decision Tree Classifier – preprocessing

We know that computers have a really hard time when dealing with text and we can make their lives easier by converting the text to numerical values.

Label Encoder

We will use this encoder provided by scikit to transform categorical data from text to numbers. If we have n possible values in our dataset, then LabelEncoder model will transform it into numbers from 0 to n-1 so that each textual value has a number representation.

For example, let’s encode our time of day values.

    timeOfDay = ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening']
    labelEncoder = preprocessing.LabelEncoder()
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    print (encodedTimeOfDay)
    # Prints [2 1 0 2 1 0 2 1 0 2 1 0 2 1 0 2 1 0]

Decision Tree Classifier – encoding our data
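
The numbers are assigned in alphabetical order of the labels, which is worth checking before reading predictions later on. A short sketch, using a three-value list for brevity:

```python
from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()
encoded = labelEncoder.fit_transform(['Morning', 'Lunch', 'Evening'])

# classes_ lists the labels in encoded order: position 0, 1, 2
print(labelEncoder.classes_)

# The encoded values for Morning, Lunch, Evening
print(encoded)

# inverse_transform maps numbers back to the original labels
decoded = labelEncoder.inverse_transform([2, 1, 0])
print(decoded)
```

Because a LabelEncoder only remembers the last column it was fitted on, reusing one encoder for several columns means inverse_transform works only for the most recently fitted column.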

Training the Decision Tree Classifier model

Now let’s train our model. So remember, since all our features are textual values, we need to encode all our values and only then we can jump to training.

if __name__=="__main__":
    # Get the data
    weather = getWeather()
    timeOfWeek = getTimeOfWeek()
    timeOfDay = getTimeOfDay()
    trafficJam = getTrafficJam()

    labelEncoder = preprocessing.LabelEncoder()

    # Encode the features and the labels
    encodedWeather = labelEncoder.fit_transform(weather)
    encodedTimeOfWeek = labelEncoder.fit_transform(timeOfWeek)
    encodedTimeOfDay = labelEncoder.fit_transform(timeOfDay)
    encodedTrafficJam = labelEncoder.fit_transform(trafficJam)

    # Build the features
    features = []
    for i in range(len(encodedWeather)):
        features.append([encodedWeather[i], encodedTimeOfWeek[i], encodedTimeOfDay[i]])

    classifier = tree.DecisionTreeClassifier()
    classifier = classifier.fit(features, encodedTrafficJam)

Decision Tree Classifier – training our model

Using our Decision Tree model for predictions

Now we can use the model we have trained to make predictions about the traffic jam.

    # ["Snowy", "Workday", "Morning"]
    print(classifier.predict([[2, 1, 2]]))
    # Prints [1], meaning "Yes"
    # ["Clear", "Weekend", "Lunch"]
    print(classifier.predict([[0, 0, 1]]))
    # Prints [0], meaning "No"

Decision Tree Classifier – making predictions

And it seems to be working! It correctly predicts the traffic jam situations given our data.

Decision Tree Visualisation

Scikit also provides us with a way of visualizing a Decision Tree model. Here’s a quick helper method I wrote to generate a png image from our decision tree.

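def printTree(classifier):
    feature_names = ['Weather', 'Time of Week', 'Time of Day']
    # Class names follow the encoded label order: 0 = 'No', 1 = 'Yes'
    target_names = ['No', 'Yes']
    # Build the data
    dot_data = tree.export_graphviz(classifier, out_file=None,
                                    feature_names=feature_names,
                                    class_names=target_names)
    # Build the graph
    graph = pydotplus.graph_from_dot_data(dot_data)

    # Show the image
    return Image(graph.create_png())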

Decision Tree Classifier – visualizing the decision tree

And here’s the result from that.

Decision Tree Classifier – visualization

Descriptive Statistics: Expectations vs. Reality (Exploratory Data Analysis)

An easy descriptive statistics approach to summarize the numeric and categoric data variables through the Measures of Central Tendency and Measures of Spread for every Exploratory Data Analysis process.


About the Exploratory Data Analysis (EDA)

EDA is the first step in the data analysis process. It allows us to understand the data we are dealing with by describing and summarizing the dataset’s main characteristics, often through visual methods like bar and pie charts, histograms, boxplots, scatterplots, heatmaps, and many more.

Why is EDA important?

  • Maximize insight into a dataset (be able to listen to your data)
  • Uncover underlying structure/patterns
  • Detect outliers and anomalies
  • Extract and select important variables
  • Increase computational efficiency
  • Test underlying assumptions (e.g. business intuition)

Moreover, to be capable of exploring and explaining the dataset’s features with all their attributes, and of getting insights and efficient numeric summaries of the data, we need help from Descriptive Statistics.

Statistics is divided into two major areas:

  • Descriptive statistics: describe and summarize data;
  • Inferential statistics: methods for using sample data to make general conclusions (inferences) about populations.

This tutorial focuses on descriptive statistics of both numerical and categorical variables and is divided into two parts:

  • Measures of central tendency;
  • Measures of spread.

Descriptive statistics

Also named Univariate Analysis (one feature analysis at a time), descriptive statistics, in short, help describe and understand the features of a specific dataset, by giving short numeric summaries about the sample and measures of the data.

Descriptive statistics are mere exploration, as they do not allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made.

Numerical and categorical variables, as we will see shortly, have different descriptive statistics approaches.

Let’s review the type of variables:

Type of variables — Image by author
  • Numerical continuous: The values are not countable and have an infinite number of possibilities (Someone’s age: 25 years, 4 days, 11 hours, 24 minutes, 5 seconds and so on to the infinite).
  • Numerical discrete: The values are countable and have a finite number of possibilities (It is impossible to count 27.52 countries in the EU).
  • Categorical ordinal: There is an order implied in the levels (January comes always before February and after December).
  • Categorical nominal: There is no order implied in the levels (Female/male, or the wind direction: north, south, east, west).

Numerical variables

  • Measures of central tendency: Mean, median
  • Measures of spread: Standard deviation, variance, percentiles, maximum, minimum, skewness, kurtosis
  • Others: Size, unique, number of uniques

One way to display the data is through a boxplot. It gives you the five-number summary: the minimum, the 1st quartile (25th percentile), the median, the 3rd quartile (75th percentile), and the maximum.


Categorical variables

Bar plot of the categorical ordinal variable. Image by author
  • Measures of central tendency: Mode (most common)
  • Measures of spread: Number of uniques
  • Others: Size, % Highest unique


Measures of central tendency

  • Mean (average): The total sum of values divided by the total observations. The mean is highly sensitive to the outliers.
  • Median (center value): The middle value of an ordered sequence of numbers (for an even count, the average of the two middle values). The median is robust to outliers.
  • Mode (most common): The values most frequently observed. There can be more than one modal value in the same variable.
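
To make the outlier sensitivity concrete, here is a minimal sketch using Python's standard statistics module. The list of ages is invented for illustration and is not the article's dataset.

```python
import statistics

# Hypothetical ages, invented for illustration only
ages = [22, 23, 23, 25, 27, 30, 31]
ages_with_outlier = ages + [98]

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)
modes = statistics.multimode(ages)  # all values tied for most frequent

print(mean_before, mean_after)      # the mean jumps once 98 is added
print(median_before, median_after)  # the median barely moves
print(modes)
```

A single extreme value pulls the mean up by several years, while the median shifts by only one.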

Measures of spread

  • Variance (variability from the mean): The square of the standard deviation. It is also affected by outliers.
  • Standard deviation (concentrated around the mean): The standard amount of deviation (distance) from the mean. The std is affected by the outliers. It is the square root of the variance.
  • Percentiles: The value below which a percentage of data falls. The 0th percentile is the minimum value, the 100th is the maximum, the 50th is the median.
  • Minimum: The smallest or lowest value.
  • Maximum: The greatest or highest value.
  • The number of uniques (total distinct): The total amount of distinct observations.
  • Uniques (distinct): The distinct values or groups of values observed.
  • Skewness (symmetry): How much a distribution deviates from the symmetry of a normal distribution.
    >> The concept of skewness is explained in the next section.
  • Kurtosis (volume of outliers): How long the tails are and how sharp the peak of the distribution is.
    >> The concept of kurtosis is explained in the next section.
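
As a rough illustration of the percentile, minimum, maximum and distinct-value measures listed above, the following sketch uses only the standard library. The values are made up, and the "inclusive" method is chosen here because it interpolates linearly between data points.

```python
import statistics

# Hypothetical observations, invented for illustration only
values = [3, 7, 7, 12, 18, 21, 21, 21, 40]

# Quartiles: 25th, 50th (median) and 75th percentiles
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

minimum, maximum = min(values), max(values)  # the 0th and 100th percentiles
distinct = sorted(set(values))               # the unique values
n_distinct = len(distinct)                   # the number of uniques

std = statistics.pstdev(values)     # population standard deviation
var = statistics.pvariance(values)  # population variance, i.e. std squared

print(q1, q2, q3, minimum, maximum, n_distinct)
```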


  • Count (size): The total number of observations. Counting is also necessary for calculating the mean, median, and mode.
  • % highest unique (relativity): The share of all observations that belongs to the most frequent unique value or group of values.


In a perfect world, the data’s distribution assumes the form of a bell curve (Gaussian or normally distributed), but in the real world, data distributions usually are not symmetric (= skewed).

Therefore, the skewness indicates how much our distribution deviates from the normal distribution (which has a skewness of zero or very close to it).

Skewness curves. Image by author

There are three generic types of distributions:

  • Symmetrical [median = mean]: In a normal distribution, the mean (average) divides the data symmetrically at the median value or close.
  • Positive skew [median < mean]: The distribution is asymmetrical, and the tail is skewed/longer towards the right-hand side of the curve. In this type, the majority of the observations are concentrated on the left-hand side, and the value of skewness is positive.
  • Negative skew [median > mean]: The distribution is asymmetrical, and the tail is skewed/longer towards the left-hand side of the curve. In this type of distribution, the majority of the observations are concentrated on the right-hand side, and the value of skewness is negative.

Rules of thumb:

  • Symmetric distribution: values between -0.5 and 0.5.
  • Moderate skew: values between -1 and -0.5, or between 0.5 and 1.
  • High skew: values < -1 or > 1.
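
The rules of thumb can be tried out with a small sketch. It assumes the simple population definition of skewness (the average of the cubed standardized values); library functions may apply a bias correction and return slightly different numbers. The helper names and data are hypothetical.

```python
import statistics

def skewness(data):
    """Population skewness: mean of the cubed standardized values."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in data])

def describe_skew(value):
    """Classify a skewness value with the rules of thumb above."""
    magnitude = abs(value)
    if magnitude <= 0.5:
        return "symmetric"
    if magnitude <= 1:
        return "moderate skew"
    return "high skew"

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]           # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    value = skewness(data)
    print(name, round(value, 2), describe_skew(value))
```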


Kurtosis is another useful tool when it comes to quantifying the shape of a distribution. It measures how long the tails are and, most importantly, how sharp the peak of the distribution is.

If the distribution has a sharper, taller peak and heavier tails, then it has a higher kurtosis, while a low kurtosis can be observed when the peak of the distribution is flatter with thinner tails. There are three types of kurtosis:

  • Leptokurtic: The distribution is tall and thin, with heavier tails. The kurtosis of a leptokurtic distribution is > 3.
  • Mesokurtic: The distribution looks the same as, or very similar to, a normal distribution. The kurtosis of a mesokurtic distribution is = 3.
  • Platykurtic: The distribution has a flatter and wider peak and thinner tails, meaning that the data is moderately spread out. The kurtosis of a platykurtic distribution is < 3.

The kurtosis value is driven mainly by the outliers.

Kurtosis is calculated by raising each standardized value to the fourth power and averaging the results. If we raise any standardized number smaller than 1 in absolute value to the 4th power, the result is a very small number, somewhere close to zero. Such a small value does not contribute much to the kurtosis. The conclusion is that the values that make a difference to the kurtosis are the ones far away from the region of the peak, in other words, the outliers.
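
A minimal sketch of that calculation, assuming the plain population definition (the average of the standardized values raised to the fourth power). The two datasets are invented to contrast an evenly spread sample with one dominated by two outliers.

```python
import statistics

def kurtosis(data):
    """Population kurtosis: mean of the standardized values to the 4th power."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean([((x - mu) / sigma) ** 4 for x in data])

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]              # evenly spread, no outliers
with_outliers = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]   # tight centre plus two extremes

print(kurtosis(flat))           # below 3: platykurtic
print(kurtosis(with_outliers))  # above 3: leptokurtic
```

Note that scipy.stats.kurtosis and the pandas kurt() method return excess kurtosis by default, that is, this value minus 3, so a normal distribution scores 0 rather than 3 there.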

The Jupyter notebook — IPython

In this section, we will be giving short numeric stats summaries concerning the different measures of central tendency and dispersion of the dataset.

Let’s work on some practical examples of descriptive statistics in Pandas.

Start by importing the required libraries:

import pandas as pd
import numpy as np
import scipy
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the dataset:
df = pd.read_csv("sample.csv", sep=";")

Print the data:

Image for post

Before any statistical calculations, let’s just take a quick look at the data:

Image for post
Image by author

The dataset consists of 310 observations and 2 columns. One of the attributes is numerical, and the other categorical. Both columns have no missing values.

Numerical variable

The numerical variable we are going to analyze is age. The first step is to visually observe the variable, so let’s plot a histogram and a boxplot.

plt.hist(df.age, bins=20)
plt.ylabel("Absolute Frequency")
plt.show()
sns.boxplot(x="age", data=df, orient="h").set(xlabel="Age", title="Numeric variable 'Age'");

It is also possible to visually observe the variable with both a histogram and a boxplot combined. I find it a useful graphical combination and use it a lot in my reports.

age = df.age
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.8, 1.2)})
mean = np.array(age).mean()
median = np.median(age)
sns.boxplot(x=age, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
# histplot(kde=True) replaces the deprecated distplot
sns.histplot(age, kde=True, stat="density", ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
plt.legend()
plt.title("'Age' histogram + boxplot")
ax_box.set(xlabel='')
Image for post

Measures of central tendency

1. Mean:


2. Median:


Measures of spread

3. Standard deviation:


4. Variance:


5. a) Percentiles 25%:


b) Percentile 75%:


c) In one go:
df.age.quantile(q=[.25, .75])

0.25    23.0
0.75    45.0
Name: age, dtype: float64

6. Minimum and maximum:
df.age.min(), df.age.max()

(3, 98)

7. Skewness (with scipy):


8. Kurtosis (with scipy):



9. Size (number of rows):


10. Number of uniques (total distinct)


11. Uniques (distinct):

array([46, 22, 54, 33, 69, 35, 11, 97, 50, 34, 67, 43, 21, 12, 23, 45, 89, 76, 5, 55, 65, 24, 27, 57, 38, 28, 36, 60, 56, 53, 26, 25, 42, 83, 16, 51, 90, 10, 70, 44, 20, 31, 47, 30, 91, 7, 6, 41, 66, 61, 96, 32, 58, 17, 52, 29, 75, 86, 98, 48, 40, 13, 4, 68, 62, 9, 18, 39, 15, 19,  8, 71, 3, 37])
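
Several of the numbered steps above list only a heading. As a rough sketch of pandas calls that produce those statistics, the block below uses a small made-up DataFrame as a stand-in for df. It also uses the pandas skew() and kurt() methods instead of scipy, which is a substitution rather than the article's approach.

```python
import pandas as pd

# Hypothetical stand-in for the article's df; sample.csv is not reproduced here
df = pd.DataFrame({"age": [3, 23, 23, 34, 45, 45, 67, 98]})

summary = {
    "mean": df.age.mean(),
    "median": df.age.median(),
    "std": df.age.std(),           # sample standard deviation (ddof=1)
    "var": df.age.var(),           # sample variance (ddof=1)
    "q25": df.age.quantile(0.25),
    "q75": df.age.quantile(0.75),
    "skew": df.age.skew(),         # bias-corrected sample skewness
    "kurtosis": df.age.kurt(),     # excess kurtosis: a normal distribution scores 0
    "size": len(df),
    "n_unique": df.age.nunique(),
}
uniques = df.age.unique()          # distinct values in order of appearance

for name, value in summary.items():
    print(f"{name}: {value}")
```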

Categorical variable

The categorical variable we are going to analyze is city. Let’s plot a bar chart and get a visual observation of the variable.
df.city.value_counts().plot.bar()  # assumed plotting call; the original line did not survive
plt.ylabel("Absolute Frequency")
plt.title("Categoric variable 'City'")
Image for post

Measures of central tendency

1. Mode:
df.city.mode()[0]  # assumed call; only the trailing [0] survived in the source


Measures of spread

2. Number of uniques:


3. Uniques (distinct):

array(['Lisbon', 'Paris', 'Madrid', 'London', 'Luxembourg', 'Berlin'], dtype=object)

4. Most frequent unique (value count):

Paris     67
Name: city, dtype: int64


5. Size (number of rows):


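6. % of the highest unique (share of the most common value among all observations):
p = df.city.value_counts(normalize=True).iloc[0]  # assumed call; only the "p =" and "[0]" fragments survived in the source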
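
For the categorical steps above whose code is missing, one possible set of pandas calls is sketched below. The city values are invented, so the numbers differ from the article's output.

```python
import pandas as pd

# Hypothetical city column, invented for illustration only
df = pd.DataFrame({"city": ["Paris", "Lisbon", "Paris", "Madrid", "Paris", "Lisbon"]})

counts = df.city.value_counts()     # sorted from most to least frequent
mode = df.city.mode()[0]            # first modal value
n_unique = df.city.nunique()        # number of distinct cities
uniques = df.city.unique()          # distinct cities in order of appearance
top = counts.head(1)                # most frequent city with its count
size = len(df)                      # number of rows
share_top = counts.iloc[0] / size   # fraction held by the most common city

print(mode, n_unique, size, f"{share_top:.1%}")
```

Applied to the figures quoted above (Paris with 67 of 310 rows), the same ratio is 67 / 310, or roughly 21.6%.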


The describe() method shows the descriptive statistics gathered in one table. By default, it reports stats for numeric data only. The result is returned as a pandas dataframe.

Image for post

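Adding other non-standard values, for instance, the ‘variance’:
describe_var = df.describe()
# DataFrame.append was removed in pandas 2.0, so the extra row is added with pd.concat
pd.concat([describe_var, df.var(numeric_only=True).rename('variance').to_frame().T])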

Image for post

Displaying categorical data. Either of the following calls works:
df.describe(exclude=['float64','int64'])
df.describe(include=[object])  # np.object was removed from NumPy; the built-in object is equivalent

Image for post

Passing the parameter include='all' displays both numeric and categorical variables at once.

Image for post


These are the basics of descriptive statistics when developing an exploratory data analysis project with the help of Pandas, Numpy, Scipy, Matplotlib and/or Seaborn. When well performed, these stats help us to understand and transform the data for further processing.

10 Must-Know Statistical Concepts for Data Scientists

Statistics is a building block of data science


Data science is an interdisciplinary field. One of the building blocks of data science is statistics. Without a decent level of statistics knowledge, it would be highly difficult to understand or interpret the data.

Statistics helps us explain the data. We use statistics to infer results about a population based on a sample drawn from that population. Furthermore, machine learning and statistics have plenty of overlaps.

Long story short, one needs to study and learn statistics and its concepts to become a data scientist. In this article, I will try to explain 10 fundamental statistical concepts.

1. Population and sample

Population is all elements in a group. For example, college students in the US is a population that includes all of the college students in the US. 25-year-old people in Europe is a population that includes all of the people that fit the description.

It is not always feasible or possible to do analysis on a population because we cannot collect all the data of a population. Therefore, we use samples.

Sample is a subset of a population. For example, 1000 college students in the US is a subset of the “college students in the US” population.

2. Normal distribution

Probability distribution is a function that shows the probabilities of the outcomes of an event or experiment. Consider a feature (i.e. column) in a dataframe. This feature is a variable and its probability distribution function shows the likelihood of the values it can take.

Probability distribution functions are quite useful in predictive analytics or machine learning. We can make predictions about a population based on the probability distribution function of a sample from that population.

Normal (Gaussian) distribution is a probability distribution function that looks like a bell.

A typical normal distribution curve (image by author)

The peak of the curve indicates the most likely value the variable can take. As we move away from the peak, the probability of the values decreases.

3. Measures of central tendency

Central tendency is the central (or typical) value of a probability distribution. The most common measures of central tendency are mean, median, and mode.

  • Mean is the average of the values in a series.
  • Median is the value in the middle when values are sorted in ascending or descending order.
  • Mode is the value that appears most often.

4. Variance and standard deviation

Variance is a measure of the variation among values. It is calculated by adding up squared differences of each value and the mean and then dividing the sum by the number of samples.

σ² = (1/n) · Σ (x_i - µ)²

where x_i are the values, µ is their mean, and n is the number of values.

Standard deviation is a measure of how spread out the values are. To be more specific, it is the square root of variance.

Note: Mean, median, mode, variance, and standard deviation are basic descriptive statistics that help to explain a variable.
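
The variance recipe described above (squared differences from the mean, summed, divided by the number of values) can be followed step by step. The values below are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import math
import statistics

# Hypothetical values, invented for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)
squared_diffs = [(x - mean) ** 2 for x in values]
variance = sum(squared_diffs) / len(values)  # divide by n, as described above
std_dev = math.sqrt(variance)                # standard deviation = sqrt(variance)

print(mean, variance, std_dev)

# Cross-check against the standard library's population variance
assert variance == statistics.pvariance(values)
```

Dividing by n gives the population variance. Many libraries, pandas included, divide by n - 1 by default, which gives the slightly larger sample variance.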

5. Covariance and correlation

Covariance is a quantitative measure that represents how much the variations of two variables match each other. To be more specific, covariance compares two variables in terms of the deviations from their mean (or expected) value.

The figure below shows some values of the random variables X and Y. The orange dot represents the mean of these variables. The values change similarly with respect to the mean value of the variables. Thus, there is positive covariance between X and Y.

Image for post
(image by author)

The formula for covariance of two random variables:

Cov(X, Y) = E[(X - µ_X)(Y - µ_Y)]

where E is the expected value and µ is the mean.

Note: The covariance of a variable with itself is the variance of that variable.

Correlation is a normalization of covariance by the standard deviation of each variable.

Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

where σ is the standard deviation.

This normalization cancels out the units, and the correlation value is always between -1 and 1: between 0 and 1 for a positive correlation, and between -1 and 0 for a negative one. If we are comparing the relationship among three or more variables, it is better to use correlation because the value ranges or units may cause false assumptions.
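
A small sketch of both measures, written out by hand so the effect of the normalization is visible. The height and weight figures are invented; the point is that changing the unit of height rescales the covariance but leaves the correlation untouched.

```python
import math

def covariance(a, b):
    """Population covariance: mean product of the deviations from each mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    """Covariance normalized by both standard deviations."""
    # covariance(a, a) is the variance of a, so its square root is the std
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))

# Hypothetical measurements, invented for illustration only
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 66, 77, 84]
height_m = [h / 100 for h in height_cm]   # same heights, different unit
inverse = [200 - h for h in height_cm]    # falls exactly as height rises

print(covariance(height_cm, weight_kg), covariance(height_m, weight_kg))
print(correlation(height_cm, weight_kg), correlation(height_m, weight_kg))
print(correlation(height_cm, inverse))
```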

6. Central limit theorem

In many fields including natural and social sciences, when the distribution of a random variable is unknown, normal distribution is used.

Central limit theorem (CLT) justifies why normal distribution can be used in such cases. According to the CLT, as the sample size grows, the distribution of the sample averages tends towards a normal distribution regardless of the population distribution (as long as the population has a finite variance).

Consider a case that we need to learn the distribution of the heights of all 20-year-old people in a country. It is almost impossible and, of course not practical, to collect this data. So, we take samples of 20-year-old people across the country and calculate the average height of the people in samples. CLT states that as we take larger samples from the population, the sampling distribution of the average gets close to a normal distribution.
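
The theorem can be checked with a quick simulation. This is only a sketch under assumptions: the population is taken to be exponential (strongly right-skewed), the sample sizes of 2 and 50 are arbitrary, and skewness is used as a crude indicator of how far the sample means are from a symmetric bell shape.

```python
import random
import statistics

random.seed(42)  # fixed seed so reruns give the same numbers

def sample_mean(sample_size):
    # Exponential population with mean 1: far from normal
    return statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])

def skewness(data):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in data])

# 5000 sample means for a small and a larger sample size
means_small = [sample_mean(2) for _ in range(5000)]
means_large = [sample_mean(50) for _ in range(5000)]

skew_small = skewness(means_small)
skew_large = skewness(means_large)

print("skewness of means, n=2: ", round(skew_small, 2))
print("skewness of means, n=50:", round(skew_large, 2))
print("average of means, n=50: ", round(statistics.fmean(means_large), 3))
```

With the larger sample size, the skewness of the sample means moves towards zero and their average stays close to the population mean of 1.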

Why is it so important to have a normal distribution? Normal distribution is described in terms of mean and standard deviation which can easily be calculated. And, if we know the mean and standard deviation of a normal distribution, we can compute pretty much everything about it.

7. P-value

P-value is a measure of the likelihood of a value that a random variable takes. Consider we have a random variable A and the value x. The p-value of x is the probability that A takes the value x or any value that has the same or less chance to be observed. The figure below shows the probability distribution of A. It is highly likely to observe a value around 10. As the values get higher or lower, the probabilities decrease.

Image for post
Probability distribution of A (image by author)

We have another random variable B and want to see if B is greater than A. The average of the sample means obtained from B is 12.5. The p value for 12.5 is the green area in the graph below. The green area indicates the probability of getting 12.5 or a more extreme value (higher than 12.5 in our case).

Image for post
(image by author)

Let’s say the p value is 0.11, but how do we interpret it? A p value of 0.11 means that, if B were in fact no greater than A, there would still be an 11% chance of observing a value of 12.5 or a more extreme one purely by random chance. Similarly, a p value of 0.05 means that there is a 5% chance of seeing a result at least that extreme by random chance alone.

Note: Lower p values indicate stronger evidence that the result is not due to random chance.

If the average of sample means from the random variable B turns out to be 15 which is a more extreme value, the p value will be lower than 0.11.

Image for post
(image by author)
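
The green area can be computed directly if we are willing to make two assumptions that the article does not state: that A is normally distributed, and that its standard deviation is 2. Only the mean of about 10 comes from the text. Under those assumptions, a minimal sketch is:

```python
import math

def upper_tail_p_value(x, mean, std):
    """P(A >= x) for a normally distributed A."""
    z = (x - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

MEAN_A = 10.0  # the text says values around 10 are the most likely
STD_A = 2.0    # assumption: the spread of A is not given in the article

p_12_5 = upper_tail_p_value(12.5, MEAN_A, STD_A)
p_15 = upper_tail_p_value(15.0, MEAN_A, STD_A)

print(round(p_12_5, 3))  # about 0.11 with the assumed spread
print(round(p_15, 4))    # much smaller: 15 is a more extreme value
```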

8. Expected value of random variables

The expected value of a random variable is the weighted average of all possible values of the variable. The weight here means the probability of the random variable taking a specific value.

The expected value is calculated differently for discrete and continuous random variables.

  • Discrete random variables take finitely many or countably infinitely many values. The number of rainy days in a year is a discrete random variable.
  • Continuous random variables take uncountably infinitely many values. For instance, the time it takes from your home to the office is a continuous random variable. Depending on how you measure it (minutes, seconds, nanoseconds, and so on), it takes uncountably infinitely many values.

The formula for the expected value of a discrete random variable is:

E[X] = Σ x_i · P(X = x_i), summed over all possible values x_i
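
For the discrete case, the weighted average can be computed in a couple of lines. The fair six-sided die used here is an illustrative choice, and exact fractions keep the arithmetic free of rounding.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# The probabilities of all possible values must add up to 1
total_probability = sum(distribution.values())

# Weighted average: each value multiplied by its probability
expected_value = sum(value * prob for value, prob in distribution.items())

print(expected_value)  # 7/2
```

The expected value of 3.5 is not a face of the die: it describes the long-run average of many rolls rather than any single outcome.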

The expected value of a continuous random variable is calculated with the same logic but using different methods. Since continuous random variables can take uncountably infinitely many values, we cannot talk about a variable taking a specific value. We rather focus on value ranges.

In order to calculate the probability of value ranges, probability density functions (PDF) are used. The area under the PDF over a particular range gives the probability of the random variable taking a value within that range.

Image for post
(image by author)

9. Conditional probability

Probability simply means the likelihood of an event to occur and always takes a value between 0 and 1 (0 and 1 inclusive). The probability of event A is denoted as p(A) and calculated as the number of the desired outcome divided by the number of all outcomes. For example, when you roll a die, the probability of getting a number less than three is 2 / 6. The number of desired outcomes is 2 (1 and 2); the number of total outcomes is 6.

Conditional probability is the likelihood of an event A to occur given that another event that has a relation with event A has already occurred.

Suppose that we have 6 blue balls and 4 yellow balls placed in two boxes as seen below. I ask you to randomly pick a ball. The probability of getting a blue ball is 6 / 10 = 0.6. What if I ask you to pick a ball from box A? The probability of picking a blue ball clearly decreases. The condition here is to pick from box A which clearly changes the probability of the event (picking a blue ball). The probability of event A given that event B has occurred is denoted as p(A|B).

Image for post
(image by author)

10. Bayes’ theorem

According to Bayes’ theorem, probability of event A given that event B has already occurred can be calculated using the probabilities of event A and event B and probability of event B given that A has already occurred.

p(A|B) = p(B|A) · p(A) / p(B)
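
To see the theorem at work on the two-box example, the sketch below assumes a particular split of the 6 blue and 4 yellow balls (box A: 2 blue and 3 yellow, box B: 4 blue and 1 yellow) and assumes each box is equally likely to be chosen. The article's figure may show a different arrangement.

```python
from fractions import Fraction

# Assumed contents; the article's figure is not reproduced here
boxes = {
    "A": {"blue": 2, "yellow": 3},
    "B": {"blue": 4, "yellow": 1},
}
p_box = {"A": Fraction(1, 2), "B": Fraction(1, 2)}  # assumption: equally likely

def p_blue_given(box):
    """Conditional probability p(blue | box)."""
    contents = boxes[box]
    return Fraction(contents["blue"], sum(contents.values()))

# Overall probability of blue, combining both boxes
p_blue = sum(p_blue_given(box) * p_box[box] for box in boxes)

# Bayes' theorem: p(A | blue) = p(blue | A) * p(A) / p(blue)
p_a_given_blue = p_blue_given("A") * p_box["A"] / p_blue

print(p_blue_given("A"))  # 2/5, lower than the overall 3/5
print(p_blue)             # 3/5
print(p_a_given_blue)     # 1/3
```

Knowing that the ball is blue lowers the probability that it came from box A from 1/2 to 1/3, because blue balls are rarer in that box.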

Bayes’ theorem is so fundamental and ubiquitous that a field called “Bayesian statistics” exists. In Bayesian statistics, the probability of an event or hypothesis is updated as evidence comes into play. Therefore, prior probabilities and posterior probabilities differ depending on the evidence.

Naive Bayes algorithm is structured by combining Bayes’ theorem and some naive assumptions. It assumes that, given the class, features are independent of each other and there is no correlation between them.


We have covered some basic yet fundamental statistical concepts. If you are working or plan to work in the field of data science, you are likely to encounter these concepts.

There is, of course, much more to learn about statistics. Once you understand the basics, you can steadily build your way up to advanced topics.