What are Distribution Plots?#

  • distplot() flexibly plots a univariate distribution of observations.

  • It combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions. It can also fit scipy.stats distributions and plot the estimated PDF over the data. Recent seaborn releases deprecate distplot() in favor of displot() and histplot(), which is why the outputs on this page show a FutureWarning.

Let’s discuss some plots that allow us to visualize the distribution of a dataset. These plots are:#

  • distplot()

  • jointplot()

  • pairplot()

  • rugplot()

  • kdeplot()

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
num = np.random.randn(150)
sns.distplot(num,color ='green')
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
<AxesSubplot:ylabel='Density'>
../../../_images/17730eb451be2304eb9650ff2343ff5ec30b28e2417ddcd61542fea587f4e607.png
label_dist = pd.Series(num,name = " Variable x")
sns.distplot(label_dist,color = "red")
<AxesSubplot:xlabel=' Variable x', ylabel='Density'>
../../../_images/bdafbb240c7747b509528aa2fcf2065f1d29bf7e0d76268d2fed3dc26b4fdb84.png
# Plot the distribution with a kernel density estimate only (no histogram):

sns.distplot(label_dist,hist = False,color = "red")
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).
  warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel=' Variable x', ylabel='Density'>
../../../_images/7f043c07fa7afb66690d5f4d7ae8915c11aaab73a1a42caf257850f74a4f6276.png
# Plot the distribution with a kernel density estimate and rug plot:

sns.distplot(label_dist,rug = True,hist = False,color = "red")
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2056: FutureWarning: The `axis` variable is no longer used and will be removed. Instead, assign variables directly to `x` or `y`.
  warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel=' Variable x', ylabel='Density'>
../../../_images/1c112fb20c6162effcb95a0ece20853416d1999acd8e62505eca6967900a4bf7.png
# Plot the distribution with a histogram and a maximum likelihood Gaussian distribution fit:

from scipy.stats import norm
sns.distplot(label_dist, fit=norm, kde=False)
<AxesSubplot:xlabel=' Variable x'>
../../../_images/650b0376b9450fda4224df1d2ffcb1cddbc13bda1d85e716d3854d1e0ead140f.png

Plot the distribution on the vertical axis:#

sns.distplot(label_dist, vertical =True)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:1647: FutureWarning: The `vertical` parameter is deprecated and will be removed in a future version. Assign the data to the `y` variable instead.
  warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='Density', ylabel=' Variable x'>
../../../_images/0adfa774dc1ac67c01489ef036ef9f685a38f15b869a43ab2f9f105cfa80c4b7.png

Let’s work with a dataset#

Data#

Seaborn comes with built-in data sets!

tips = sns.load_dataset('tips')
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

1 distplot()#

The distplot() shows the distribution of a univariate set of observations.

sns.distplot(tips['total_bill'])
# Safe to ignore warnings
<AxesSubplot:xlabel='total_bill', ylabel='Density'>
../../../_images/b9c3b03b0021e8ed5a31fbeea2160fc1d0b35467987ce821e56594f6fc67264f.png
sns.distplot(tips['total_bill'],kde=False,bins=30)
<AxesSubplot:xlabel='total_bill'>
../../../_images/3b85dc636bf167bfce67e365cde8bc8e2d525cc753de0890195a3ff4fa6c4bbb.png

2 jointplot()#

jointplot() pairs two univariate distribution plots, drawn on the margins, with a bivariate plot of the two variables. The kind parameter selects how the joint relationship is drawn:

  • scatter

  • reg

  • resid

  • kde

  • hex

# 'scatter'

sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')
<seaborn.axisgrid.JointGrid at 0x1bfbe8c42b0>
../../../_images/0cd0216c05c505da166849ed299ca0de0c1f04317d1b6dafc34ae3702ce8e280.png
# 'hex'

sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')
<seaborn.axisgrid.JointGrid at 0x1bfbd546670>
../../../_images/b1cec573c3bd2cf09cc7c083b19c72fec0bb8200ca0f6ebd3d987515809c419f.png
# 'reg'

sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')
<seaborn.axisgrid.JointGrid at 0x1bfbeacc8b0>
../../../_images/b6d1c0aaa5a1b7a6a60bf6f68c5146569aa0bbad33a675805c15cb2c636ff19d.png

3 pairplot()#

pairplot() plots pairwise relationships across an entire dataframe (for the numerical columns) and supports a hue argument that colors the points by a categorical column.

sns.pairplot(tips)
<seaborn.axisgrid.PairGrid at 0x1bfbe8d2640>
../../../_images/427da1dd6aef5933448eb883240de74a5423518aaae1b8c568e71dffc8a91911.png
sns.pairplot(tips,hue='sex',palette='coolwarm')
<seaborn.axisgrid.PairGrid at 0x1bfbf65d610>
../../../_images/e1694d1be8d0c714cc32d9acb4e11f4583df54937cb950473da4ad0e7cd2ddae.png

4 rugplot()#

A rug plot is a very simple concept: rugplot() draws a dash mark for every point of a univariate distribution. Rug plots are the building block of a KDE plot:

sns.rugplot(tips['total_bill'])
<AxesSubplot:xlabel='total_bill'>
../../../_images/59f987be149250f32ac62db2572427a0fca7541eb9b2e2f1c8f496e71b228fac.png

5 kdeplot()#

kdeplot() draws Kernel Density Estimation (KDE) plots. A KDE plot replaces every single observation with a Gaussian (Normal) distribution centered on that value and then sums these curves to estimate the density. For example:

# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

#Create dataset
dataset = np.random.randn(25)

# Create another rugplot
sns.rugplot(dataset);

# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)

# Set up the bandwidth with a rule of thumb; for info on this, see:
# http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth

bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2


# Create an empty kernel list
kernel_list = []

# Plot each basis function
for data_point in dataset:
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    
    #Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)

plt.ylim(0,1)
(0.0, 1.0)
../../../_images/05f9a1fb591f96147b3b3a691138848b0af97f83d8bab220441ac9ff00e1caa1.png
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)

# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')

# Add the initial rugplot
sns.rugplot(dataset,color = 'indianred')

# Get rid of y-tick marks
plt.yticks([])

# Set title
plt.suptitle("Sum of the Basis Functions")
Text(0.5, 0.98, 'Sum of the Basis Functions')
../../../_images/aa8e93d88f2126a941554946c2ccccc16bfc06c8d30262a1a311b80d5a2c0804.png
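The summed curve above has the right shape, but its area equals the number of observations, which is one reason the y-ticks are hidden. A minimal sketch of the normalized version follows: averaging the kernels instead of summing them gives a curve that integrates to 1. The seed, the wider grid, and the variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative data (seeded for reproducibility)
rng = np.random.default_rng(3)
dataset = rng.standard_normal(25)
n = len(dataset)

# Same rule-of-thumb bandwidth as in the text: (4 * sigma^5 / (3 * n)) ** (1/5)
bandwidth = ((4 * dataset.std() ** 5) / (3 * n)) ** 0.2

# Grid wide enough that the kernel tails are negligible at the edges
x_axis = np.linspace(dataset.min() - 4, dataset.max() + 4, 400)

# One Gaussian kernel per observation; each kernel is itself a density
kernels = np.array([stats.norm(loc=p, scale=bandwidth).pdf(x_axis)
                    for p in dataset])

# Averaging (sum divided by n) keeps the total area at 1
kde = kernels.mean(axis=0)

# Check the area under the curve with the trapezoidal rule
area = np.sum((kde[1:] + kde[:-1]) / 2 * np.diff(x_axis))
```

Here `area` should come out very close to 1, whereas the plain sum of the kernels would integrate to roughly `n`.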
sns.kdeplot(tips['total_bill'])
sns.rugplot(tips['total_bill'])
<AxesSubplot:xlabel='total_bill', ylabel='Density'>
../../../_images/71f902b50aee3b0a57594696edc79219f855ab95b551a33fc51698c86969c2be.png
sns.kdeplot(tips['tip'])
sns.rugplot(tips['tip'])
<AxesSubplot:xlabel='tip', ylabel='Density'>
../../../_images/dd3a81373d3773ece50f68a36d747e1d75aba65c56f079d9ca93bb33b6fef9f3.png

Alright! That wraps up Distribution Plots. In our next lecture we shall discuss a few other commonly seen plots, which deal with categorical data.