Loading Datasets in Seaborn

Loading Datasets in Seaborn#

When working with Seaborn, we can either use one of the built-in datasets that Seaborn offers or we can load a Pandas DataFrame. Seaborn is part of the PyData stack hence accepts Pandas data structures.

Let us begin by importing few built-in datasets but before that we shall import few other libraries as well that our Seaborn would depend upon:

# Importing intrinsic libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Once we have imported the required libraries, now it is time to load built-in dataset. The dataset we would be dealing with in this illustration is Iris Flower Dataset.

# Loading built-in Datasets:
iris = sns.load_dataset("iris")

Similarly we may load other dataset as well and for illustration sake, I shall code few of them down here (though won’t be referencing to):

# Refer to 'Dataset Source Reference' for list of all built-in Seaborn datasets.
tips = sns.load_dataset("tips")
exercise = sns.load_dataset("exercise")
titanic = sns.load_dataset("titanic")
flights = sns.load_dataset("flights")

Let us take a sneak peek as to how this Iris dataset looks like and we shall be using Pandas to do so:

iris.head(10)

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5.0	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa

Iris dataset actually has 50 samples from each of three species of Iris flower (Setosa, Virginica and Versicolor). Four features were measured (in centimetres) from each sample: Length and Width of the Sepals and Petals. Let us try to have a summarized view of this dataset:

iris.describe()

	sepal_length	sepal_width	petal_length	petal_width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

.describe() is a very useful method in Pandas as it generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. Without getting in-depth into analysis here, let us try to plot something simple from this dataset:

sns.set()
%matplotlib inline
# Later in the course I shall explain why above 2 lines of code have been added.

sns.swarmplot(x="species", y="petal_length", data=iris)

C:\ProgramData\Anaconda3\lib\site-packages\seaborn\categorical.py:1296: UserWarning: 14.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.
  warnings.warn(msg, UserWarning)

<AxesSubplot:xlabel='species', ylabel='petal_length'>

../../../_images/830da5b1806ce352bac565b467ac45faef469625ff1d3f607aa1b54f388cebdf.png

This beautiful representation of data we see above is known as a Swarm Plot with minimal parameters. I shall be covering this in detail later on but for now I just wanted you to have a feel of serenity we’re getting into.

Let us now try to load a random dataset and the one I’ve picked for this illustration is PoliceKillingsUS dataset. This dataset has been prepared by The Washington Post (they keep updating it on runtime) with every fatal shooting in the United States by a police officer in the line of duty since Jan. 1, 2015.

# Loading Pandas DataFrame:
df = pd.read_csv("datasets/PoliceKillingsUS.csv", encoding="windows-1252")

Just the way we looked into Iris Data set, let us know have a preview of this dataset as well. We won’t be getting into deep analysis of this dataset because our agenda is only to visualize the content within. So, let’s do this:

df.head(10)

	id	name	date	manner_of_death	armed	age	gender	race	city	state	signs_of_mental_illness	threat_level	flee	body_camera
0	3	Tim Elliot	02/01/15	shot	gun	53.0	M	A	Shelton	WA	True	attack	Not fleeing	False
1	4	Lewis Lee Lembke	02/01/15	shot	gun	47.0	M	W	Aloha	OR	False	attack	Not fleeing	False
2	5	John Paul Quintero	03/01/15	shot and Tasered	unarmed	23.0	M	H	Wichita	KS	False	other	Not fleeing	False
3	8	Matthew Hoffman	04/01/15	shot	toy weapon	32.0	M	W	San Francisco	CA	True	attack	Not fleeing	False
4	9	Michael Rodriguez	04/01/15	shot	nail gun	39.0	M	H	Evans	CO	False	attack	Not fleeing	False
5	11	Kenneth Joe Brown	04/01/15	shot	gun	18.0	M	W	Guthrie	OK	False	attack	Not fleeing	False
6	13	Kenneth Arnold Buck	05/01/15	shot	gun	22.0	M	H	Chandler	AZ	False	attack	Car	False
7	15	Brock Nichols	06/01/15	shot	gun	35.0	M	W	Assaria	KS	False	attack	Not fleeing	False
8	16	Autumn Steele	06/01/15	shot	unarmed	34.0	F	W	Burlington	IA	False	other	Not fleeing	True
9	17	Leslie Sapp III	06/01/15	shot	toy weapon	47.0	M	B	Knoxville	PA	False	attack	Not fleeing	False

This dataset is pretty self-descriptive and has limited number of features (may read as columns).

race: W: White, non-Hispanic B: Black, non-Hispanic A: Asian N: Native American H: Hispanic O: Other None: unknown

And, gender indicates: M: Male F: Female None: unknown The threat_level column include incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases. Other includes many incidents where officers or others faced significant threats.

The threat column and the fleeing column are not necessarily related. Also, attacks represent a status immediately before fatal shots by police; while fleeing could begin slightly earlier and involve a chase. Lately, body_camera indicates if an officer was wearing a body camera and it may have recorded some portion of the incident.

Let us now look into the descriptive statistics:

df.describe()

	id	age
count	2535.000000	2458.000000
mean	1445.731755	36.605370
std	794.259490	13.030774
min	3.000000	6.000000
25%	768.500000	26.000000
50%	1453.000000	34.000000
75%	2126.500000	45.000000
max	2822.000000	91.000000

These stats in particular do not really make much sense. Instead let us try to visualize age of people who were claimed to be armed as per this dataset.

Note: Two special lines of code that we added earlier won’t be required again. As promised, I shall reason that in upcoming lectures.

sns.stripplot(x="armed", y="age", data=df)

<AxesSubplot:xlabel='armed', ylabel='age'>

../../../_images/c255c05b7e147ace6c66217384e91072b4a8c887041b73c5eac0cf5f00cfd44d.png

As you would have guessed by now, this plot is known as a Strip plot and pretty ideal for categorical values. Even this shall be dealt in length later on.

I hope these sample plots have intrigued you enough to dive deeper into statistical visual inference with Seaborn. And in next lecture, we shall learn to Control Aesthetics of our plot and few other important aspects.