{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Means Algorithm Demo\n", "\n", "\n", "\n", "> ☝Before moving on with this demo you might want to take a look at:\n", "> - 📗[Math behind the K-Means Algorithm](../../k-means.md)\n", "\n", "\n", "**K-means clustering** aims to partition _n_ observations into _K_ clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.\n", "\n", "> **Demo Project:** In this example we will try to cluster Iris flowers into tree categories that we don't know in advance based on `petal_length` and `petal_width` parameters using K-Means unsupervised learning algorithm." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# To make debugging of logistic_regression module easier we enable imported modules autoreloading feature.\n", "# By doing this you may change the code of logistic_regression library and all these changes will be available here.\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "# Add project root folder to module loading paths.\n", "import sys\n", "sys.path.append('..')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Dependencies\n", "\n", "- [pandas](https://pandas.pydata.org/) - library that we will use for loading and displaying the data in a table\n", "- [numpy](http://www.numpy.org/) - library that we will use for linear algebra operations\n", "- [matplotlib](https://matplotlib.org/) - library that we will use for plotting the data\n", "- [k_means](https://github.com/trekhleb/homemade-machine-learning/blob/master/homemade/k_means/k_means.py) - custom implementation of K-Means algorithm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import 3rd party dependencies.\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# Import custom k-means implementation.\n", "from utils.k_means import KMeans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the Data\n", "\n", "In this demo we will use [Iris data set](http://archive.ics.uci.edu/ml/datasets/Iris).\n", "\n", "The data set consists of several samples from each of three species of Iris (`Iris setosa`, `Iris virginica` and `Iris versicolor`). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, [Ronald Fisher](https://en.wikipedia.org/wiki/Iris_flower_data_set) developed a linear discriminant model to distinguish the species from each other." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | sepal_length | \n", "sepal_width | \n", "petal_length | \n", "petal_width | \n", "class | \n", "
---|---|---|---|---|---|
0 | \n", "5.1 | \n", "3.5 | \n", "1.4 | \n", "0.2 | \n", "SETOSA | \n", "
1 | \n", "4.9 | \n", "3.0 | \n", "1.4 | \n", "0.2 | \n", "SETOSA | \n", "
2 | \n", "4.7 | \n", "3.2 | \n", "1.3 | \n", "0.2 | \n", "SETOSA | \n", "
3 | \n", "4.6 | \n", "3.1 | \n", "1.5 | \n", "0.2 | \n", "SETOSA | \n", "
4 | \n", "5.0 | \n", "3.6 | \n", "1.4 | \n", "0.2 | \n", "SETOSA | \n", "
5 | \n", "5.4 | \n", "3.9 | \n", "1.7 | \n", "0.4 | \n", "SETOSA | \n", "
6 | \n", "4.6 | \n", "3.4 | \n", "1.4 | \n", "0.3 | \n", "SETOSA | \n", "
7 | \n", "5.0 | \n", "3.4 | \n", "1.5 | \n", "0.2 | \n", "SETOSA | \n", "
8 | \n", "4.4 | \n", "2.9 | \n", "1.4 | \n", "0.2 | \n", "SETOSA | \n", "
9 | \n", "4.9 | \n", "3.1 | \n", "1.5 | \n", "0.1 | \n", "SETOSA | \n", "