Exploratory Data Analysis or EDA in Data Science made easy with SWEETVIZ, PANDAS PROFILING and Streamlit

If, like me, you are not a statistician but want to get to grips with data science, here are the two simple pieces of advice I have applied:

It is sometimes easier to restate statistical definitions in your own words. But, more than that…
It is strongly advised to lean on DATAVIZ to clarify these “hermetic” concepts and to leave the calculations to libraries such as PANDAS PROFILING and SWEETVIZ. As it happens, that is the topic of this post.

As usual, I will be using Streamlit to explore the CSV files and generate the reports with the help of SWEETVIZ or PANDAS PROFILING.

All the files are available on my GitHub account https://github.com/bflaven/BlogArticlesExamples/tree/master/streamlit-sweetviz-pandas-profiling-eda-made-easy

I. Other remarks regarding Streamlit

I have noticed that most applications made with Streamlit rely on file upload to provide the source for an analysis such as an EDA. Here is a quick reminder of why I do not use the upload feature in my Streamlit applications. I would rather parse CSV files already present on disk than upload them as source files, because I run these apps on a local computer, for personal use, as productivity tools.

Let’s start with a simple definition of both Dataset and EDA.

II. Oversimplified Dataset’s definition

First, I am deliberately oversimplifying this definition because the word “Dataset” is key. So, what is a dataset? Put simply, a dataset is a table with columns. Each of these columns has a title and holds a vertical series of values, most often numbers.
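To make the "table with columns" idea concrete, here is a minimal sketch using a pandas DataFrame. The column titles and values are made up for illustration and do not come from the files of this post.

```python
import pandas as pd

# A tiny, made-up dataset: each key is a column title,
# each list is the vertical series of values stored under that title.
dataset = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "bmi": [22.5, 27.1, 19.8, 30.2],
        "bp": [80, 95, 72, 101],
    }
)

print(dataset.shape)          # (number of rows, number of columns)
print(list(dataset.columns))  # the column titles
print(dataset.head())         # the first rows of the table
```

Every library discussed in this post takes a table of this kind as its input.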

III. Have you ever heard of EDA?

EDA is to the Data Scientist what “Pan, Shovel and Pickaxe”* was to the prospector during the Gold Rush! ESSENTIAL.
*an old version of PSP

A mid-life data scientist equipped with his tools: Pan, Shovel and Pickaxe assisted by his young trainee.

More seriously, the acronym EDA stands for “Exploratory Data Analysis” (EDA in Data Science). Basically, “the initial analysis of data supplied or extracted, to understand the trends, underlying limitations, quality, patterns, and relationships between various entities within a dataset, using descriptive statistics and visualization tools is called Exploratory Data Analysis (EDA).”

This preliminary exploration of the data is essential. It is a prerequisite before submitting this same data to Machine Learning and Artificial Intelligence algorithms. Depending on the nature and amount of data you have, the exploration work can be tedious and time-consuming! You bet. This is where libraries like PANDAS PROFILING or SWEETVIZ can speed up the process a bit and help you detect outliers and anomalies, assess data quality, judge which statistical models can fit the data… and so on.

Source : https://www.jigsawacademy.com/blogs/data-science/eda-in-data-science

IV. EDA can be tedious… (tedious_manual_eda)

Let’s start with an example of very basic data exploration. It can be time-consuming if you want to extract relevant information. Nevertheless, doing it yourself gives you a good chance to fully understand all the operations an EDA requires.
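As a rough idea of what "doing it yourself" involves, here is a minimal sketch of a few manual checks with pandas. The DataFrame is a hypothetical sample standing in for a CSV file loaded with pd.read_csv(); it is not the data used in the tedious_manual_eda directory.

```python
import pandas as pd

# Hypothetical sample; in practice this would come from pd.read_csv(...).
df = pd.DataFrame(
    {
        "price": [12.5, 8.0, None, 15.0, 9.5],
        "rating": [4, 5, 3, None, 4],
        "category": ["red", "white", "red", "rose", "white"],
    }
)

print(df.shape)    # number of rows and columns
print(df.dtypes)   # type of each column

# Missing values per column.
missing = df.isna().sum()
print(missing)

# count, mean, std, min, quartiles and max for the numeric columns.
print(df.describe())

# Frequency of each value in a categorical column.
category_counts = df["category"].value_counts()
print(category_counts)
```

Each of these steps has to be written, run and read one by one, which is exactly the work the reporting libraries automate.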

Check out the directory tedious_manual_eda

The example is extracted from this article : https://towardsdatascience.com/exploring-a-data-set-with-simple-pandas-and-plot-visualizations-features-73901ee76c6c

For PANDAS PROFILING: I made a quick search but found nothing noteworthy beyond the official documentation.

Check out the directory streamlit_eda_made_easy_pandas_profiling_4

For SWEETVIZ: there is a great article, very educational and accessible, that deserves to be highlighted: https://coderzcolumn.com/tutorials/data-science/sweetviz-automate-exploratory-data-analysis-eda
Check out the directory streamlit_eda_made_easy_sweetviz_3

V. EDA with PANDAS PROFILING (streamlit_eda_made_easy_pandas_profiling_4)

A great and robust library that does the job once and for all. It automatically generates a tremendous amount of information about the selected dataset. At the end, it outputs a nice, interactive and structured report named “Pandas Profiling Report” with the following sections: Overview, Variables, Interactions, Correlations, Missing values, Sample. A handy “Swiss Army knife” to quickly investigate a dataset and generate a clean report.
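Here is a minimal sketch of how such a report can be produced outside Streamlit. It assumes the profiling library is installed; recent releases are published as ydata_profiling, older ones as pandas_profiling, so the sketch tries both. The helper name build_profile, the output file name and the sample DataFrame are hypothetical.

```python
import pandas as pd


def build_profile(df: pd.DataFrame, output_path: str = "pandas_profiling_report.html") -> str:
    """Write a profiling report for df to output_path and return the path."""
    # Imported lazily so the rest of the script still loads when the
    # library is missing. Newer releases ship as "ydata_profiling",
    # older ones as "pandas_profiling".
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    report = ProfileReport(df, title="Pandas Profiling Report")
    report.to_file(output_path)  # writes a standalone HTML file
    return output_path


# Hypothetical sample data; in the apps of this post the DataFrame comes from a CSV file.
sample_df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2.0, 4.1, 5.9, 8.2]})

# Uncomment once the profiling library is installed:
# build_profile(sample_df)
```

The generated HTML file can then be opened in a browser or embedded in a Streamlit page.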

VI. EDA with SWEETVIZ (streamlit_eda_made_easy_sweetviz_3)

Check out 009_streamlit_webapp_sweetviz.py and 010_streamlit_webapp_sweetviz.py

All my knowledge comes from this article: https://coderzcolumn.com/tutorials/data-science/sweetviz-automate-exploratory-data-analysis-eda. Again, I strongly invite you to read it; it is the best I have found on the web. I have simply extended some of its elements with additional information and migrated everything it presents into the Streamlit framework.

This library is a bit more complex. As I said before, I found an in-depth article on “coderzcolumn.com” that presents most of the insights you can draw from a dataset with the help of SWEETVIZ. I will try to explain, in my own simple words, my understanding of both the library and the article.

So we start by describing each function available in the SWEETVIZ library.

(i) analyze function

With SWEETVIZ, the main idea is to compare a chosen column (target) to all the other columns from your dataset. The objective is to measure and visualize if there is a link (correlation) between the chosen column (target) and each of the columns of the table.

Intuitively, because we all have a bit of a statistician in us, we more or less know what a correlation is. As an example, you probably know by now that there is a strong correlation between cigarette consumption and the probability of developing lung cancer or, if you ride a bicycle, that there is a correlation between the number of head injuries and NOT wearing a helmet!
That being said, let’s dig into SWEETVIZ…

Source : https://data-driven-everything.com/2021/10/07/does-smoking-cause-lung-cancer-a-case-on-causation-v-s-correlation/

(ii) Leveraging on scikit-learn toy datasets
In its great generosity, scikit-learn provides toy datasets that are helpful for illustrating both kinds of comparison: column vs column and table vs table…

Command to import Toy datasets provided by scikit-learn

from sklearn import datasets

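The toy datasets used in this article, as in the source, are the Boston house prices dataset and the Diabetes dataset. Note that the Boston loader (load_boston) was removed in scikit-learn 1.2, so it is only available with older versions of the library.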

    7.1. Toy datasets
    7.1.1. Boston house prices dataset
    7.1.2. Iris plants dataset
    7.1.3. Diabetes dataset
    7.1.4. Optical recognition of handwritten digits dataset
    7.1.5. Linnerrud dataset
    7.1.6. Wine recognition dataset
    7.1.7. Breast cancer wisconsin (diagnostic) dataset

Source: https://scikit-learn.org/stable/datasets/toy_dataset.html
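As a minimal sketch, two of the toy datasets listed above can be loaded straight into pandas DataFrames. It assumes scikit-learn 0.23 or later, where the as_frame option is available; the variable names are illustrative.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn for pandas objects instead of NumPy arrays;
# the .frame attribute holds the feature columns plus a "target" column.
diabetes_df = datasets.load_diabetes(as_frame=True).frame
wine_df = datasets.load_wine(as_frame=True).frame

print(diabetes_df.shape)
print(wine_df.shape)
print(list(diabetes_df.columns))
```

Both DataFrames can be passed as-is to the reporting functions described in the rest of this section.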

(iii) Some information about the Boston house prices dataset
The “columns” from the Boston house prices dataset

    CRIM per capita crime rate by town
    ZN proportion of residential land zoned for lots over 25,000 sq.ft.
    INDUS proportion of non-retail business acres per town
    CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    NOX nitric oxides concentration (parts per 10 million)
    RM average number of rooms per dwelling
    AGE proportion of owner-occupied units built prior to 1940
    DIS weighted distances to five Boston employment centres
    RAD index of accessibility to radial highways
    TAX full-value property-tax rate per $10,000
    PTRATIO pupil-teacher ratio by town
    B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
    LSTAT % lower status of the population
    MEDV Median value of owner-occupied homes in $1000’s

(iv) Some information about the Diabetes dataset
The “columns” from the Diabetes dataset

    age age in years
    sex
    bmi body mass index
    bp average blood pressure
    s1 tc, total serum cholesterol
    s2 ldl, low-density lipoproteins
    s3 hdl, high-density lipoproteins
    s4 tch, total cholesterol / HDL
    s5 ltg, possibly log of serum triglycerides level
    s6 glu, blood sugar level

(v) compare function
From the comparison of one column to another, we can move on to the comparison of table to table. This method can be useful for performing EDA on a combination of train/test, train/validation, and test/validation datasets.

1. Single Dataset Analysis

To follow the explanations, load one of these files in your browser: sweetviz_wine_report.html, sweetviz_diabetes_report.html, sweetviz_boston_report.html

These reports are generated by 009_streamlit_webapp_sweetviz.py. In that file, the code below each of the following lines outputs the corresponding report:

# --- SCREEN_1 :: WINES ---
 
# --- SCREEN_2 :: DIABETES ---
 
# --- SCREEN_3 :: BOSTON ---

1.1 Using analyze()
Calling analyze() produces general information about the dataframe. As with any summary statistics, SWEETVIZ gives you an overview of the dataset you want to analyze.
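A minimal sketch of an analyze() call outside Streamlit, assuming sweetviz and scikit-learn are installed. The helper name make_report is hypothetical, and the target column is renamed to "Progression" only to match the name used in the compare() calls of this post.

```python
from typing import Optional

import pandas as pd
from sklearn import datasets


def make_report(df: pd.DataFrame, html_path: str, target: Optional[str] = None) -> str:
    """Analyze df (optionally against a target column) and write an HTML report."""
    import sweetviz  # imported lazily; requires `pip install sweetviz`

    report = sweetviz.analyze(df, target_feat=target)
    # open_browser=False writes the file without opening a browser tab.
    report.show_html(filepath=html_path, open_browser=False)
    return html_path


# Diabetes toy dataset as a DataFrame, with its target column renamed.
diabetes_df = datasets.load_diabetes(as_frame=True).frame.rename(
    columns={"target": "Progression"}
)

# Uncomment once sweetviz is installed:
# make_report(diabetes_df, "sweetviz_diabetes_report.html", target="Progression")
```

Leaving target at None gives the plain summary; passing a column name adds the target analysis to every feature tab.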

1.2 Associations button

This button is very interesting and helpful. When you click on it, you see a “heatmap” showing the association between every pair of features in the dataset.
How can you read this heatmap at a glance? Each tile contains either a square or a circle. The circles represent the Pearson correlation, in the range [-1, 1], so the next question is: what is the Pearson correlation? See the next point.

1.3 Pearson correlation’s most simple definition
How can you interpret a correlation? What do the terms positive and negative mean? A positive correlation implies that as one variable increases, the other increases as well. Conversely, a negative correlation implies that as one variable increases, the other decreases.

For the Pearson correlation, see that source: https://datagy.io/python-pearson-correlation/
In order to calculate and plot a Correlation Matrix in Python and Pandas, see the source https://datagy.io/python-correlation-matrix/
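To see a positive and a negative correlation side by side, here is a minimal sketch with pandas. The numbers are made up so that the direction of each relationship is obvious; they are not taken from any dataset of this post.

```python
import pandas as pd

# Made-up values chosen to make the direction of each relationship obvious.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5],
        "exam_score": [52, 58, 65, 70, 78],  # rises with hours_studied
        "errors_made": [10, 8, 6, 4, 2],     # falls as hours_studied rises
    }
)

# Pearson coefficient between two columns; it always lies within [-1, 1].
r_pos = df["hours_studied"].corr(df["exam_score"])
r_neg = df["hours_studied"].corr(df["errors_made"])
print(round(r_pos, 3), round(r_neg, 3))

# Full correlation matrix: the numbers a heatmap turns into colors.
corr_matrix = df.corr(method="pearson")
print(corr_matrix)
```

A coefficient near +1 means the two columns rise together, a coefficient near -1 means one falls as the other rises, and a value near 0 means no linear relationship.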

I cannot say it better than the original article, so here is an extended extract that gives an in-depth interpretation of the “heatmap”.

The squares represent categorical associations. The categorical associations go row-wise and show how much association a feature represented by row name on left has with all other features of data. The categorical associations range from [0,1]. The heatmap will have a circle whenever showing the relation between numerical features and squares when showing the relation between categorical features or numerical and categorical features. The diagonal of the chart is left blank as each feature has a total relationship with itself. In our example, the WineType feature is categorical hence row and column representing WineType has squares whereas all other cells have circles because all other features are numerical.

1.4 Individual Column Stats

For each feature (column) of the dataset, you get a quick summary tab… The tab shows basic stats about the feature such as total values, missing count, min, max, median, average, quantiles, range, standard deviation, etc.
There is also a histogram showing the distribution of the feature’s values across the dataset. If you click on the tab, you get a more detailed histogram and further details about the feature (column).

2. Target Variable Analysis

The underlying idea is to designate a target variable (target) then to test the correlation of all the variables of the sample (dataset) with respect to this variable designated as target.

2.1 Target Variable Details

2.1.1 EXAMPLE_3 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_3_wine_df_proline_magnesium_streamlit_webapp_sweetviz.html

This analysis is specific: the main purpose is to “skip columns ‘proline’ and ‘magnesium’ from original dataset and instructed to use WineType column as numerical using FeatureConfig constructor.”

2.1.2 EXAMPLE_4 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_4_wine_df_pairwise_off_streamlit_webapp_sweetviz.html

This time, the goal is “not to include pairwise relationships between features.” As a consequence, the “Associations” button is left out of the report.
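The two variations described in EXAMPLE_3 and EXAMPLE_4 can be sketched as follows, assuming sweetviz and scikit-learn are installed. The two helper names are hypothetical, and the wine target column is renamed to "WineType" to match the name quoted above; the actual code in 010_streamlit_webapp_sweetviz.py may differ.

```python
import pandas as pd
from sklearn import datasets


def report_skipping_columns(df: pd.DataFrame, html_path: str) -> str:
    """EXAMPLE_3 idea: skip two columns and treat WineType as numerical."""
    import sweetviz  # imported lazily; requires `pip install sweetviz`

    feature_config = sweetviz.FeatureConfig(
        skip=["proline", "magnesium"], force_num=["WineType"]
    )
    report = sweetviz.analyze(df, feat_cfg=feature_config)
    report.show_html(filepath=html_path, open_browser=False)
    return html_path


def report_without_pairwise(df: pd.DataFrame, html_path: str) -> str:
    """EXAMPLE_4 idea: turn off the pairwise analysis behind "Associations"."""
    import sweetviz

    report = sweetviz.analyze(df, pairwise_analysis="off")
    report.show_html(filepath=html_path, open_browser=False)
    return html_path


# Wine toy dataset with its target column renamed to WineType.
wine_df = datasets.load_wine(as_frame=True).frame.rename(
    columns={"target": "WineType"}
)

# Uncomment once sweetviz is installed:
# report_skipping_columns(wine_df, "example_3_wine.html")
# report_without_pairwise(wine_df, "example_4_wine.html")
```

Turning the pairwise analysis off can also shorten the generation time on wide datasets, since the associations are computed for every pair of columns.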

2.1.3 EXAMPLE_5 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_5_diabetes_df_compare_2_datasets_streamlit_webapp_sweetviz.html

The target variable tab is “painted in black” to differentiate it from the other columns. It is handy because it helps you understand “the relationship between the target variable and feature based on feature values.”

3. Compare Two datasets

3.1 EXAMPLE_5 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_5_diabetes_df_compare_2_datasets_streamlit_webapp_sweetviz.html

This time, the idea is to leverage the ability of SWEETVIZ to compare two datasets. The report shows how the distribution of the data differs between the two datasets.

report = sweetviz.compare(source=train_df, compare=test_df, target_feat="Progression")
report.show_html()

3.2 EXAMPLE_6 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_6_diabetes_df_compare_2_datasets_streamlit_webapp_sweetviz.html

You can also compare train/test, train/validation, and test/validation datasets. SWEETVIZ lets us generate an EDA for two datasets using the compare() method and shows the results for both datasets next to each other.

The idea is to divide the diabetes dataset into train (80%) and test (20%) sets using scikit-learn’s train_test_split() method and then it will compare these two datasets.

report = sweetviz.compare(source=[train_df,"Train Set"], compare=[test_df, "Validation Set"], target_feat="Progression")
report.show_html()
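The train_df and test_df passed to compare() can be produced along these lines. This is a minimal sketch assuming scikit-learn 0.23 or later; the random_state value is an arbitrary choice made here to keep the split repeatable, and the target column is renamed to "Progression" to match the calls above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Diabetes toy dataset as a DataFrame, with its target column renamed.
diabetes_df = datasets.load_diabetes(as_frame=True).frame.rename(
    columns={"target": "Progression"}
)

# 80% of the rows go to train_df, the remaining 20% to test_df.
# random_state is an arbitrary seed that makes the split repeatable.
train_df, test_df = train_test_split(diabetes_df, test_size=0.2, random_state=42)

print(train_df.shape, test_df.shape)
```

Because the rows are drawn at random, the two sets should show similar distributions in the report; a large gap between them is a hint that the split is not representative.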

4. Divide Dataset using boolean variable and Compare them

4.1 EXAMPLE_7 in 010_streamlit_webapp_sweetviz.py. You can see the result in example_7_boston_df_compare_2_datasets_streamlit_webapp_sweetviz.html

It is sometimes useful to dig deeper into a dataset by using one of its boolean columns, for instance to compare the EDA for all rows with gender male against all rows with gender female.

For a newbie like me, this sounds close to the categorical variables used in a Categorical Summary. These variables represent types of data that can be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.

You can then change the method and perform a different comparison EDA.
For instance, that is the purpose of the EDA with the Boston housing dataset. The CHAS variable of the Boston housing dataset holds boolean information about whether houses border the river or not.

Then, with compare_intra(), it is possible to divide the dataset into two based on the boolean values of the CHAS column and generate an EDA comparing those two datasets.

For the record, CHAS is the “Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)”.
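The idea of splitting on a boolean column can be sketched as follows. The DataFrame holds a few hypothetical rows that only borrow three Boston column names (CHAS, RM, MEDV); the helper name compare_by_river and the group labels are also assumptions, and the sweetviz call requires the library to be installed.

```python
import pandas as pd

# Hypothetical rows borrowing three Boston column names.
housing_df = pd.DataFrame(
    {
        "CHAS": [0, 1, 0, 0, 1, 0],
        "RM": [6.5, 7.1, 5.9, 6.0, 6.8, 6.2],
        "MEDV": [24.0, 34.7, 21.6, 18.9, 36.2, 22.9],
    }
)

# Boolean Series: True where the tract bounds the river.
on_river = housing_df["CHAS"] == 1

# The same split done by hand with pandas, to show what is being compared.
river_df = housing_df[on_river]
inland_df = housing_df[~on_river]
print(len(river_df), len(inland_df))


def compare_by_river(df: pd.DataFrame, condition: pd.Series, html_path: str) -> str:
    """Let sweetviz split df on the boolean condition and compare both halves."""
    import sweetviz  # imported lazily; requires `pip install sweetviz`

    report = sweetviz.compare_intra(df, condition, ["River", "Not river"])
    report.show_html(filepath=html_path, open_browser=False)
    return html_path


# Uncomment once sweetviz is installed:
# compare_by_river(housing_df, on_river, "example_7_compare_intra.html")
```

Any boolean condition works the same way, so a numeric column can also be used by comparing it to a threshold.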

All these methods share a single purpose: to quickly and easily generate HTML reports with relevant key indicators on any type of dataset. For sure, SWEETVIZ does the job very well! It is one of the best tools for EDA generation.

VII. EDA made with a Class (streamlit_eda_made_easy_1)

On GitHub, you can find different approaches to building an EDA. Here is one made by Marina Ramalhete from Brazil that is pretty advanced, as it is designed with a Class. I have reworked a few elements to make it more readable and to deliver more contextual information when checking the data. You can check the original repository at https://github.com/marinaramalhete/Exploratory-Data-Analysis-App

Check out the directory streamlit_eda_made_easy_1

VIII. EDA example sample (streamlit_eda_made_easy_2)

A few examples made with Streamlit, gathered from different sources, mostly books or GitHub.

Check out the directory streamlit_eda_made_easy_2

IX. Extra Stuff

A few things or commands that I used while writing this post and want to keep track of.
1. Command to create a requirements.txt for your Python project.

    # freeze requirements python
    pip freeze > requirements_1_heroku_python_getting_started_3.txt

2. In python, grab filename and split it

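# grab filename and split it in python
import os

# Keep only the last component of the path.
base = os.path.basename('/root/dir/sub/file.ext')
print(base)                       # file.ext

# Split the name into (root, extension).
print(os.path.splitext(base))     # ('file', '.ext')

# The name without its extension.
print(os.path.splitext(base)[0])  # file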

3. Split url in javascript using the console

// split url in js using the console
let url = "https://flaven.fr/2021/12/quick-poc-for-a-all-in-one-that-provides-an-seo-dashboard-made-with-streamlit-managing-screaming-frog-automation-storing-results-in-a-database-sqlite-and-create-data-analysis-graphics-for-seo-repo/";
console.log(url);
 
// filter(Boolean) drops the empty strings produced by "//" and by the trailing slash
let pathArray = url.split('/').filter(Boolean);
console.log(pathArray);
 
// last non-empty segment of the path, i.e. the post slug
let last = pathArray.pop();
console.log(last);

Source:

6. Videos

Three additional videos to accompany this post

Part 1 EDA in Data Science made easy with SWEETVIZ, PANDAS PROFILING and Streamlit
Part 2 EDA in Data Science made easy with SWEETVIZ, PANDAS PROFILING and Streamlit
Part 3 EDA in Data Science made easy with SWEETVIZ, PANDAS PROFILING and Streamlit