A big part of machine learning consists of detecting patterns in data. A dataset is represented as a collection of features. For example, a dataset about fruits could contain features such as color, volume, weight and so on. By detecting the patterns in these features, objects in the dataset can be classified as belonging to a certain class. For example, in the fruits dataset each fruit object can be classified as an apple, banana, pear and so on.
Object classification is usually done by recognising the patterns in a labeled dataset. By assuming that objects within the same class share a similar pattern in their features, it is possible to generalise to new, unseen objects. This is also called supervised learning, which this blog will mostly focus on.
Iris classification example
Practical examples are really helpful to go from a theoretical understanding to actually being able to implement some of the material you have learned. Therefore I will be going through most of the theory with a small example dataset. The dataset I will be using is Fisher's Iris data set, which was introduced in Fisher's 1936 paper and is one of the earliest and most widely used example datasets in pattern recognition. It is a rather small dataset consisting of 3 different flower types, with 50 examples of each flower type. The 3 classes are Iris Setosa, Iris Versicolour and Iris Virginica, and the 4 features are sepal length, sepal width, petal length and petal width, all in cm.
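As a quick sanity check of the numbers above, the dataset can be loaded in a few lines. This is a minimal sketch assuming scikit-learn is installed, which bundles the Iris data:

```python
# Load Fisher's Iris data set, which ships with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
print(iris.target_names)   # the 3 classes: setosa, versicolor, virginica
print(iris.feature_names)  # sepal/petal length and width, in cm
```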
Generative learning algorithms
In order to classify an unknown clothing piece $x$ as class $y_i$ (T-shirt/top, trouser, pullover, sneaker, …) we need to know the probability that $x$ is of class $y_i$. This means we need to know all the posterior probabilities $p(y_i \mid x)$.
Once we have all the posterior probabilities we can predict (classify) the class of a given clothing piece $x$ by maximizing the posterior probability. The same principle holds for a regression problem, where $y$ will be continuous.
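This maximum a posteriori decision rule can be written as follows, where $\hat{y}$ denotes the predicted class:

```latex
\hat{y} = \arg\max_{y_i} \; p(y_i \mid x)
```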
Generative learning algorithms try to model the underlying class conditional distribution of each class. This distribution can intuitively be thought of as the distribution from which samples of the corresponding class are generated.
To make a prediction, however, you need the posterior probabilities; therefore Bayes' theorem is used, which is further explained below.
Bayes' theorem
Class (conditional) distribution p(x|y): To find the distribution of objects within a class based on examples, we have to estimate a PDF (probability density function). One way of doing this is via parametric density estimation.
Class prior p(y): The class prior represents the probability that an object is of class $y_i$, regardless of the value of the object. Given a large enough and unbiased sample, this can be estimated directly from the class frequencies in the training set.
Data distribution p(x): The data distribution tells you the probability that an object has value $x$, regardless of its class. Once both the class priors and the class conditional distributions of a dataset have been found, the data distribution can be calculated by summing, over all classes, the product of the class prior and the class conditional distribution.
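The three quantities above are tied together by Bayes' theorem, with the data distribution obtained by marginalizing over the classes:

```latex
p(y_i \mid x) = \frac{p(x \mid y_i)\, p(y_i)}{p(x)},
\qquad
p(x) = \sum_{i} p(x \mid y_i)\, p(y_i)
```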
Discriminative learning algorithms
Up until now we have seen a number of different algorithms which estimate the class conditional distribution and use Bayes' rule to compute the posterior probabilities needed for prediction.
These types of algorithms are called generative algorithms because they try to learn the model that generates the data behind the scenes.
Discriminative algorithms model the posterior probabilities directly from training examples. They make fewer assumptions about the distributions, but depend heavily on the quality of the data.
Introduction
Softmax
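The softmax function turns a vector of real-valued class scores into a probability distribution over the classes. A minimal NumPy sketch (subtracting the maximum score is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores to class probabilities."""
    z = z - np.max(z)  # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)  # non-negative, sums to 1, largest entry for class 0
```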
Support Vector Machines
Introduction
Functional and geometric margins
To gain some intuition about how SVMs work and how they differ from the previously seen generalized linear models, we will first take a look at functional and geometric margins. The main idea behind the margin is to choose a decision boundary that maximizes the distance between the classes.
For example, given that we are using a linear (hyper)plane as our decision boundary, we want to maximize the distance from the decision plane to the closest training example:
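The geometric margin of a labelled point is its signed distance to the hyperplane, and the margin of the whole dataset is the smallest of these distances. A toy sketch, assuming labels in $\{-1, +1\}$ and a hyperplane parameterized by $w$ and $b$:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance of the labelled points (X, y) to the
    hyperplane w.x + b = 0. Labels y must be in {-1, +1}; a positive
    result means every point lies on the correct side."""
    margins = y * (X @ w + b) / np.linalg.norm(w)
    return margins.min()

# toy example: the hyperplane x_1 = 0, with one point on each side
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0], [-3.0, 0.0]])
y = np.array([1, -1])
margin = geometric_margin(w, b, X, y)  # 2.0: the closest point is 2 away
```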