Boost Your Exploratory Data Analysis with Pandas Profiling
Exploratory Data Analysis (EDA) is one of the most important parts of any data science work. It is quite hard to imagine building a model without EDA. EDA is a general approach to identifying the characteristics of the data we are working with by summarizing and visualizing the dataset. There is a package called ‘Pandas Profiling’ that produces much of this analysis with just a single line of code.
Pandas Profiling is a simple and fast way to perform exploratory data analysis of a pandas DataFrame. The pandas functions df.describe(), df.info(), df.isnull(), etc. are great and give you a compact summary of your data, but they are very basic for serious exploratory data analysis.
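For comparison, here is a minimal sketch of what those built-in summaries give you. The DataFrame and its column names are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame standing in for a real dataset (hypothetical columns).
df = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 80.0],
    "reviews": [12, 0, 7, 3],
})

summary = df.describe()      # count, mean, std, min, quartiles, max per numeric column
missing = df.isnull().sum()  # number of missing values per column

print(summary)
print(missing)
df.info()                    # column dtypes and non-null counts, printed to stdout
```

Each call answers one narrow question, so a fuller picture means stitching several of them together by hand.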
pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis. This library generates a complete report for your dataset, which includes:
- Basic data type information (which columns contain what)
- Descriptive statistics (mean, standard deviation, etc.)
- Quantile statistics (tells you about how your data is distributed)
- Histograms for your data (again, for visualizing distributions)
- Correlations (lets you see what’s related)
- And more!
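To get a feel for what the report automates, a few of the items above can be computed by hand in plain pandas. This is only a rough sketch with made-up data and hypothetical column names; it does not show how the library computes them internally.

```python
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({
    "price": [50, 80, 100, 120, 150, 200, 400, 1000],
    "minimum_nights": [1, 2, 1, 3, 2, 7, 30, 1],
})

dtypes = df.dtypes  # data type of each column

# Quantile statistics for a single column.
quantiles = df["price"].quantile([0.05, 0.25, 0.5, 0.75, 0.95])

# Histogram as counts per equal-width bin (4 bins chosen arbitrarily).
hist_counts = pd.cut(df["price"], bins=4).value_counts(sort=False)

print(dtypes)
print(quantiles)
print(hist_counts)
```

Doing this for every column, and then adding plots and correlations, is the repetitive work a single profiling report replaces.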
To install it with the pip package manager, run:
pip install pandas-profiling[notebook]
You can also install using the conda package manager by running
conda install -c conda-forge pandas-profiling
Start by loading in your pandas DataFrame
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('my_data.csv')
To generate the report, run:
profile = ProfileReport(df, title="Pandas Profiling Report")
Saving the report
If you want to generate an HTML report file, save the ProfileReport to an object and write that object out to a file.
Alternatively, you can obtain the data as JSON:
# As a string
json_data = profile.to_json()
# As a file
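Writing the report to disk can be sketched as follows. This assumes the to_file() method of ProfileReport, which picks the output format from the file extension. The helper name save_report and the file names are placeholders for illustration only. The import sits inside the function so that the library is needed only when the helper is actually called.

```python
import pandas as pd


def save_report(df: pd.DataFrame,
                html_path: str = "your_report.html",
                json_path: str = "your_report.json") -> None:
    """Profile `df` and write the report as HTML and as JSON (hypothetical helper)."""
    # Imported lazily so pandas_profiling is only required when the helper runs.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title="Pandas Profiling Report")
    profile.to_file(html_path)  # .html extension -> HTML report
    profile.to_file(json_path)  # .json extension -> JSON export


# Usage (assumes my_data.csv exists in the working directory):
# save_report(pd.read_csv("my_data.csv"))
```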
Version 2.4 introduces minimal mode. This is a default configuration that disables expensive computations (such as correlations and dynamic binning). Use the following syntax:
profile = ProfileReport(large_dataset, minimal=True)
Now let's explore what a Pandas Profiling report looks like and what it shows for different datasets.
The generated report contains a general overview, followed by separate sections for different characteristics of the dataset's attributes. This Airbnb dataset has the columns we would expect, such as price, number of reviews and minimum nights. Very quickly we get a sense of what we are dealing with. For example, the column “neighbourhood_group” is rejected, since it contains no values (every entry is NaN).
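The same kind of check can be done by hand. The sketch below finds columns that contain no values at all. The data is made up, and only the column names echo the Airbnb example above. It illustrates the idea and is not the library's own rejection logic.

```python
import numpy as np
import pandas as pd

# Made-up rows; "neighbourhood_group" is entirely empty, as in the report above.
df = pd.DataFrame({
    "neighbourhood_group": [np.nan, np.nan, np.nan],
    "price": [120, 85, 60],
    "minimum_nights": [2, 1, 3],
})

# isna().all() is True for a column only when every entry is missing.
empty_columns = df.columns[df.isna().all()].tolist()
print(empty_columns)
```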
The report also shows which attributes have missing values. Any variable may have missing values, and this tab shows how many are missing for each one. Here, a couple of thousand listings have no reviews. Otherwise, the dataset looks complete.
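A rough pandas equivalent of that missing-values view is a small table of counts and percentages per column. The data and column names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: two of four listings have no review information.
df = pd.DataFrame({
    "price": [120, 85, 60, 200],
    "reviews_per_month": [0.5, np.nan, 1.2, np.nan],
})

is_missing = df.isna()
missing = pd.DataFrame({
    "n_missing": is_missing.sum(),           # absolute count per column
    "pct_missing": is_missing.mean() * 100,  # share of rows, in percent
})
print(missing)
```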
Below we have a different dataset with different variables. With the report we can see all the variables in the dataset and their properties.
In the report, the Variables section gives information about each feature individually, unlike the Overview section, which provides information on the dataset as a whole.
We can also view the interaction of different attributes of the dataset with each other. For example, in this Grad Acceptance dataset we can see the interaction between CGPA score and the chance of admission.
The generated report contains several types of correlation, which help you understand the relationships between the features. You can toggle between Pearson, Spearman, Kendall, and phik correlations.
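Two of these measures can be reproduced directly with pandas, which is handy for spot-checking a value from the report. The sketch below uses a few made-up rows, with hypothetical column names loosely modelled on the admissions example. phik is left out because pandas does not provide it.

```python
import pandas as pd

# Made-up rows; column names are hypothetical.
df = pd.DataFrame({
    "CGPA": [7.5, 8.0, 8.5, 9.0, 9.5],
    "GRE Score": [300, 310, 315, 325, 335],
    "Chance of Admit": [0.50, 0.62, 0.70, 0.81, 0.92],
})

pearson = df.corr(method="pearson")    # linear relationship
spearman = df.corr(method="spearman")  # rank-based (monotonic) relationship

print(pearson.round(3))
print(spearman.round(3))
```

Pearson measures how linear a relationship is, while Spearman only asks whether one variable consistently rises with the other, so comparing the two can hint at non-linear but monotonic relationships.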
The Sample section displays the first 10 data points (head of 10) and the last 10 data points (tail of 10).
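In plain pandas, the same two views come from head() and tail(). The DataFrame below is a made-up stand-in with 25 rows.

```python
import pandas as pd

# Made-up 25-row DataFrame.
df = pd.DataFrame({
    "id": range(1, 26),
    "value": [i * i for i in range(1, 26)],
})

first_rows = df.head(10)  # first 10 rows
last_rows = df.tail(10)   # last 10 rows

print(first_rows)
print(last_rows)
```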
Exploratory Data Analysis (EDA) is one of the first steps performed by anyone doing data analysis. It is important to understand the data first rather than building models on it directly. The ‘Pandas Profiling’ package is a powerful tool for data analysis. With just a few lines of code, you get a very comprehensive report about the dataset.