Loading Datasets in Seaborn


When working with Seaborn, we can either use one of its built-in datasets or load our own data into a Pandas DataFrame. Seaborn is part of the PyData stack and hence accepts Pandas data structures.

Let us begin by loading a few built-in datasets, but before that we shall import a few other libraries that Seaborn depends upon:

# Importing the required libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now that we have imported the required libraries, it is time to load a built-in dataset. The dataset we shall be dealing with in this illustration is the Iris flower dataset.

# Loading built-in Datasets:
iris = sns.load_dataset("iris")

Similarly, we may load other datasets as well, and for illustration’s sake I shall load a few of them here (though we won’t be referring to them again):

# Refer to 'Dataset Source Reference' for list of all built-in Seaborn datasets.
tips = sns.load_dataset("tips")
exercise = sns.load_dataset("exercise")
titanic = sns.load_dataset("titanic")
flights = sns.load_dataset("flights")

Let us take a sneak peek at what this Iris dataset looks like, and we shall be using Pandas to do so:

iris.head(10)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa

The Iris dataset has 50 samples from each of three species of Iris flower (Setosa, Virginica and Versicolor). Four features were measured (in centimetres) from each sample: the length and width of the sepals and petals. Let us look at a summarized view of this dataset:

iris.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

.describe() is a very useful method in Pandas as it generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. Without getting in-depth into analysis here, let us try to plot something simple from this dataset:
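
To make the NaN handling concrete, here is a minimal sketch on a tiny hand-made DataFrame. The name `toy_iris` and its values are invented for illustration and are not taken from the real Iris data. It suggests how the count row reflects only non-missing values, and how `include="all"` extends the summary to non-numeric columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value, one text column.
toy_iris = pd.DataFrame({
    "petal_length": [1.4, 1.3, np.nan, 4.7],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# By default only numeric columns are summarized.
numeric_summary = toy_iris.describe()

# include="all" adds count/unique/top/freq for the text column as well.
full_summary = toy_iris.describe(include="all")

# The NaN entry is skipped, so count is smaller than the number of rows.
print(numeric_summary.loc["count", "petal_length"], "of", len(toy_iris), "rows counted")
print(full_summary.loc["unique", "species"], "distinct species")
```

Because missing values are dropped before the statistics are computed, comparing the count row with the number of rows is a quick way to spot columns with gaps.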

sns.set()
%matplotlib inline
# Later in the course I shall explain why the above 2 lines of code have been added.

sns.swarmplot(x="species", y="petal_length", data=iris)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 14.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)
<AxesSubplot:xlabel='species', ylabel='petal_length'>
[Figure: swarm plot of petal_length by species]

The plot we see above is known as a Swarm Plot, drawn here with minimal parameters. I shall be covering it in detail later on, but for now I just wanted you to get a feel for what we’re getting into.
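
The UserWarning in the output above says that some points could not be placed and suggests smaller markers or a strip plot. A minimal sketch of both options follows, using synthetic data rather than the real Iris values. The DataFrame `crowded`, the random seed, and the choice of `size=3` are assumptions made only for this illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a crowded categorical column (not the real Iris data).
rng = np.random.default_rng(0)
crowded = pd.DataFrame({
    "species": np.repeat(["a", "b"], 100),
    "petal_length": np.round(rng.normal(1.5, 0.1, 200), 1),
})

fig, (ax_swarm, ax_strip) = plt.subplots(1, 2, figsize=(8, 4))

# Option 1: smaller markers (the default size is 5) leave more room for the swarm.
sns.swarmplot(x="species", y="petal_length", data=crowded, size=3, ax=ax_swarm)

# Option 2: a strip plot uses random jitter, so points may overlap but all are drawn.
sns.stripplot(x="species", y="petal_length", data=crowded, ax=ax_strip)
```

Which option reads better depends on the data: the swarm keeps every point visible without overlap when there is room, while the strip plot scales to more observations.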

Let us now try to load an external dataset; the one I’ve picked for this illustration is the PoliceKillingsUS dataset. This dataset has been prepared by The Washington Post (which keeps updating it) and records every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015.

# Loading Pandas DataFrame:
df = pd.read_csv("datasets/PoliceKillingsUS.csv", encoding="windows-1252")
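
The `encoding` argument matters whenever a file holds non-ASCII characters saved in something other than UTF-8. The sketch below assumes such a case and builds a tiny in-memory CSV instead of reading the real file. The names and cities in it are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row CSV, encoded as Windows-1252 to mimic a non-UTF-8 file.
raw = "name,city\nJosé Peña,El Paso\nAnn Lee,Reno\n".encode("windows-1252")

# With the matching encoding, the accented characters are decoded correctly.
decoded = pd.read_csv(io.BytesIO(raw), encoding="windows-1252")
print(decoded.loc[0, "name"])

# The default UTF-8 decoder cannot interpret the Windows-1252 bytes for é and ñ.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print("UTF-8 decoding failed:", utf8_failed)
```

If a CSV raises a UnicodeDecodeError on load, trying the encoding the file was actually saved with is usually the first thing to check.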

Just the way we looked into the Iris dataset, let us now have a preview of this dataset as well. We won’t be getting into a deep analysis of this dataset because our agenda is only to visualize its content. So, let’s do this:

df.head(10)
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera
0 3 Tim Elliot 02/01/15 shot gun 53.0 M A Shelton WA True attack Not fleeing False
1 4 Lewis Lee Lembke 02/01/15 shot gun 47.0 M W Aloha OR False attack Not fleeing False
2 5 John Paul Quintero 03/01/15 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing False
3 8 Matthew Hoffman 04/01/15 shot toy weapon 32.0 M W San Francisco CA True attack Not fleeing False
4 9 Michael Rodriguez 04/01/15 shot nail gun 39.0 M H Evans CO False attack Not fleeing False
5 11 Kenneth Joe Brown 04/01/15 shot gun 18.0 M W Guthrie OK False attack Not fleeing False
6 13 Kenneth Arnold Buck 05/01/15 shot gun 22.0 M H Chandler AZ False attack Car False
7 15 Brock Nichols 06/01/15 shot gun 35.0 M W Assaria KS False attack Not fleeing False
8 16 Autumn Steele 06/01/15 shot unarmed 34.0 F W Burlington IA False other Not fleeing True
9 17 Leslie Sapp III 06/01/15 shot toy weapon 47.0 M B Knoxville PA False attack Not fleeing False

This dataset is fairly self-descriptive and has a limited number of features (that is, columns).

race:
W: White, non-Hispanic
B: Black, non-Hispanic
A: Asian
N: Native American
H: Hispanic
O: Other
None: unknown

gender:
M: Male
F: Female
None: unknown

The threat_level column includes incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases. Other includes many incidents where officers or others faced significant threats.

The threat_level column and the flee column are not necessarily related. Also, attacks represent a status immediately before the fatal shots by police, while fleeing could begin slightly earlier and involve a chase. Lastly, body_camera indicates whether an officer was wearing a body camera that may have recorded some portion of the incident.
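
The single-letter codes described above can be turned into readable labels before plotting. This is a minimal sketch, assuming a few hand-made rows in place of the real DataFrame. The names `sample`, `RACE_LABELS` and `GENDER_LABELS` and the new `*_label` columns are hypothetical, while the code-to-label pairs are copied from the descriptions above:

```python
import pandas as pd

# Code-to-label lookups taken from the column descriptions above.
RACE_LABELS = {
    "W": "White, non-Hispanic",
    "B": "Black, non-Hispanic",
    "A": "Asian",
    "N": "Native American",
    "H": "Hispanic",
    "O": "Other",
}
GENDER_LABELS = {"M": "Male", "F": "Female"}

# Hypothetical rows standing in for the real data; None marks an unknown value.
sample = pd.DataFrame({"race": ["A", "W", None], "gender": ["M", "F", "M"]})

# map() translates each code; codes with no match (including None) become NaN,
# which fillna() then labels as "Unknown".
sample["race_label"] = sample["race"].map(RACE_LABELS).fillna("Unknown")
sample["gender_label"] = sample["gender"].map(GENDER_LABELS).fillna("Unknown")
print(sample)
```

Keeping the original code columns and adding labelled ones alongside them means the raw data stays untouched while the plots get readable tick labels.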

Let us now look into the descriptive statistics:

df.describe()
id age
count 2535.000000 2458.000000
mean 1445.731755 36.605370
std 794.259490 13.030774
min 3.000000 6.000000
25% 768.500000 26.000000
50% 1453.000000 34.000000
75% 2126.500000 45.000000
max 2822.000000 91.000000

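These statistics are not very informative on their own: id is just a record identifier, which leaves age as the only meaningful numeric column. Instead, let us try to visualize the age of the people in this dataset against what they were reported to be armed with.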
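
One thing the summary does reveal is that age has fewer counted values (2458) than id (2535), so some ages are missing. A minimal sketch of how such gaps could be checked, using a small hand-made frame (`toy_records` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, with one missing age.
toy_records = pd.DataFrame({
    "id": [3, 4, 5, 8],
    "age": [53.0, np.nan, 23.0, 32.0],
})

# Number of missing values in each column.
missing_per_column = toy_records.isna().sum()
print(missing_per_column)

# Keep only the rows where age is known, for example before computing on age.
with_age = toy_records.dropna(subset=["age"])
print(len(with_age), "rows with a known age")
```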

Note: The two special lines of code that we added earlier won’t be required again. As promised, I shall explain why in upcoming lectures.

sns.stripplot(x="armed", y="age", data=df)
<AxesSubplot:xlabel='armed', ylabel='age'>
[Figure: strip plot of age by armed]

As you would have guessed by now, this plot is known as a Strip Plot and is well suited to categorical values. This too shall be dealt with at length later on.
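
A categorical column with many distinct values can crowd the x-axis of a strip plot. One possible way to keep it readable, sketched here on invented rows rather than the real data, is to plot only the most frequent categories. The names `records`, `top_n` and `subset` are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical rows standing in for the real DataFrame.
records = pd.DataFrame({
    "armed": ["gun", "gun", "gun", "unarmed", "unarmed", "toy weapon", "nail gun"],
    "age": [53, 47, 18, 23, 34, 32, 39],
})

# Find the top_n most frequent categories and keep only the rows that use them.
top_n = 2
top_categories = records["armed"].value_counts().nlargest(top_n).index
subset = records[records["armed"].isin(top_categories)]

ax = sns.stripplot(x="armed", y="age", data=subset)
```

The cutoff is a judgment call; a larger `top_n` shows more of the data at the cost of a busier axis.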

I hope these sample plots have intrigued you enough to dive deeper into statistical visual inference with Seaborn. In the next lecture, we shall learn to control the aesthetics of our plots, along with a few other important aspects.