{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is linear classification?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Linear classification** is the task of finding a linear function that best separates a series of differently classified points in euclidean space. The linear function is called a **linear separator**. Each point can be interpreted as an **example**, and each dimension can be interpreted as a **feature**. If the space has 2 dimensions, the linear regression is **univariate** and the linear separator is a **straight line**. If the space has more than 2 dimensions, the linear regression is **multivariate** and the linear separator is a **hyperplane**. If the linear classification classifies examples into two different classes, the classification is **binary**. "
]
},
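{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the definition concrete, here is a small worked example. The notation ($w$ for the weights, $b$ for an optional constant term, $x$ for a point) and the numbers are illustrative assumptions, not taken from the dataset used below. A linear separator in 2 dimensions can be written as\n",
"$$w_1 x_1 + w_2 x_2 + b = 0.$$\n",
"A point is assigned to one class if the left-hand side is positive and to the other class if it is negative. With $w = (1, -1)$ and $b = 0$, the separator is the line $x_1 = x_2$: the point $(3, 1)$ gives $3 - 1 = 2 > 0$ and the point $(1, 3)$ gives $1 - 3 = -2 < 0$, so the two points lie on opposite sides of the line and receive different classes."
]
},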
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear classification vs. linear regression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear classification and linear regression are similar in their approach and data\n",
"representation. However, they solve two different problems. **Linear regression** is the task of finding a linear function that *best approximates* a series of points. The example classification is nothing more than another dimension to a linear regressor. In contrast, a linear classifier treats the example classification not as a dimension, but in a special way that the following code demonstrates."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing and using linear classification."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classifying the survival chances of Titanic passengers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code uses multivariate linear binary classification to classify the survival of passengers of the ship Titanic. The input data is taken from the [Kaggle Titanic](https://www.kaggle.com/c/titanic) competition."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first step, we import `os.path` to locate our dataset, `pandas` to manipulate the dataset as tabular data, `numpy` to efficiently process our data arrays and `matplotlib.pyplot` to display the results of the linear classification in a graph. We disable warnings to keep the output tidy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"hide_input": false
},
"outputs": [],
"source": [
"from os import path\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reading the input data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"datadir = path.abspath(path.expanduser('~/datasets/titanic'))\n",
"rawexamples = pd.read_csv(path.join(datadir, 'train.csv'))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rawexamples.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"891"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(rawexamples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our input data `rawexamples` consists of 891 rows. Each row consists of 11 columns that contain information about a passenger. We interpret each row as an *example* and each column as a *feature*. The feature *Survived* classifies each example as either *survived* (1) or *died* (0) and is thus a binary classification. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cleaning the input data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our input data has many features of presumably low importance. To simplify and speed up the binary classifier, we limit our examples to the three presumably most important features: *Pclass* (the passenger's class), *Sex* and *Age*."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Pclass | \n",
" Sex | \n",
" Age | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Pclass Sex Age\n",
"0 3 male 22.0\n",
"1 1 female 38.0\n",
"2 3 female 26.0\n",
"3 1 female 35.0\n",
"4 3 male 35.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"examples = rawexamples[['Pclass', 'Sex', 'Age']]\n",
"examples.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we assign a neutral value to all missing values in the input data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"examples = examples.fillna(0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, we convert the `Sex` feature to a numerical scale, because our linear classifier takes only numerical input."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"examples['Sex'] = examples['Sex'].map({'male': 0, 'female': 1})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Then, we obtain the classifications."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"classifications = rawexamples[['Survived']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we split our examples into a training set and a validation set and convert them to arrays."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"training_examples = examples.values[:800]\n",
"training_classifications = classifications.values[:800]\n",
"\n",
"validation_examples = examples.values[801:]\n",
"validation_classifications = classifications.values[801:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Implementing the linear classifier."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"class LinearBinaryClassifier:\n",
" def __init__(self, features):\n",
" self.weights = np.zeros(features)\n",
" self.time = 0\n",
" def train(self, examples, classifications):\n",
" for example, classification in zip(examples, classifications):\n",
" self.time += 1\n",
" prediction = self.classify(example)\n",
" for idx, value in enumerate(example):\n",
" weight = self.weights[idx]\n",
" annealing = 1000/(1000 + self.time*10) # Values chosen by experience.\n",
" self.weights[idx] += annealing * (classification - prediction) * value\n",
" def classify(self, example):\n",
" if np.dot(example, self.weights) > 0:\n",
" return 1\n",
" else:\n",
" return 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above linear binary classifier supports two actions, `train()` and `classify()`. `train()` takes a list of examples and their classifications. It then approximates a linear function $$classify(example) = w_1 example_1 + w_2 example_2 + ... + w_n example_n$$ where $example_n$ refers to the value of $example$ for the n-th feature. $\\textbf{w}$ is a vector of weights that defines the function $classify$. The output of $classify$ is interpreted in this way:\n",
"* value less than 0: classify as 0.\n",
"* value greater than 0 : classify as 1.\n",
"* value equal to 0: undefined (the algorithm is \"uncertain\").\n",
"\n",
"The approximation is done by iterating over all examples and updating the weights if an example is misclassified. Specifically, each weight is increased or decreased so that the numerical output of `classify` gets closer to the the correct output range. \n",
"\n",
"To ensure that the algorithm converges at an approximately optimal solution, we use [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing) to gradually lower the learning rate."
]
},
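{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what one training step does, the update performed inside `train()` can be written in formula form. The symbols $x$ (example), $y$ (correct classification), $\\hat{y}$ (prediction), $t$ (value of `self.time`) and $\\alpha_t$ (learning rate) are notation introduced here for illustration:\n",
"$$\\alpha_t = \\frac{1000}{1000 + 10\\,t}, \\qquad w_i \\leftarrow w_i + \\alpha_t \\, (y - \\hat{y}) \\, x_i$$\n",
"Since $y$ and $\\hat{y}$ are either 0 or 1, the factor $y - \\hat{y}$ is 0 for a correct prediction, $+1$ when a survivor was classified as 0, and $-1$ when a non-survivor was classified as 1.\n",
"\n",
"For the first two rows of the dataset shown above, starting from $w = (0, 0, 0)$:\n",
"* Row 0 has $x = (3, 0, 22)$ and $y = 0$. The weighted sum is 0, so $\\hat{y} = 0$ and the weights do not change.\n",
"* Row 1 has $x = (1, 1, 38)$ and $y = 1$. The weighted sum is still 0, so $\\hat{y} = 0$ and $y - \\hat{y} = 1$. With $t = 2$, $\\alpha_2 = 1000/1020 \\approx 0.980$, and the weights become $w \\approx (0.980, 0.980, 37.255)$."
]
},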
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using the linear classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To test and use our linear classifier, we need to train it on the example set and than validate it with the validation set. The validation step tells us the accuracy of the linear classifier.\n",
"\n",
"Since our example set has only 800 examples, it is small for machine learning standards. Therefore, we will train our linear classifier multiple times with the same examples. After each training iteration, we test our classifier by classifying all examples from the validation set and comparing the output with the expected result. It is important for the validation step that *no* data of the validation data is used for improving the classifier, since this would invalidate subsequent accuracy reports. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"classifier = LinearBinaryClassifier(features = len(training_examples[0]))\n",
"training_iterations = 100\n",
"accuracy = np.zeros(training_iterations)\n",
"for i in range(0,training_iterations):\n",
" classifier.train(training_examples, training_classifications)\n",
" correct = 0\n",
" for example, classification in zip(validation_examples, validation_classifications):\n",
" prediction = classifier.classify(example)\n",
" correct += 1 if prediction == classification else 0\n",
" accuracy[i] = correct/len(validation_examples)"
]
},
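{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feeling for how far the learning rate decays over this training run, the arithmetic below applies the schedule from `train()` to the loop above. The symbol $\\alpha_t$ for the learning rate after $t$ processed examples is notation introduced here; the numbers follow from 100 iterations over 800 training examples:\n",
"$$t_{max} = 100 \\cdot 800 = 80000, \\qquad \\alpha_1 = \\frac{1000}{1010} \\approx 0.990, \\qquad \\alpha_{80000} = \\frac{1000}{801000} \\approx 0.00125$$\n",
"The last updates are therefore roughly 800 times smaller than the first ones, which is why the accuracy is expected to fluctuate less in later iterations."
]
},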
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we plot the reported accuracy of our linear classifier after each iteration."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,0,'Iterations')"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.plot(accuracy)\n",
"plt.ylabel(\"Accuracy\")\n",
"plt.xlabel(\"Iterations\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The graph shows that our linear classifier has an accuracy of $≈78\\%$. This means that when asked to predict the odds of survival of a Titanic passenger, it is correct $≈78\\%$ of the time."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"nikola": {
"category": "",
"date": "2018-03-28 07:42:40 UTC+02:00",
"description": "",
"link": "",
"slug": "multivariate_linear_binary_classification",
"tags": "",
"title": "Multivariate linear binary classification.",
"type": "text"
}
},
"nbformat": 4,
"nbformat_minor": 2
}