
Introduction to Classification

| Category: Statistics | Tags: e-ITEC, ISEC |

Classification is one of the most basic problems in Statistics and Machine Learning. It is an example of a supervised learning problem where one deals with labeled data. In this note, we will learn about basic classification algorithms such as

  1. Logistic regression
  2. Linear and quadratic discriminant analysis
  3. Naive Bayes
  4. Multinomial logistic regression
  5. \(k\)-nearest neighbours

An example

[Images: front and back of the note]

Image source: Wikipedia

These are pictures of the second series of Swiss 1000-franc notes (216 mm \(\times\) 131 mm) issued between 1911 and 1914. You are presented with a collection of these notes, some of which are counterfeit. The potential giveaways are the various dimensions of the banknotes:

| \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) | \(X_5\) | \(X_6\) |
|---------|---------|---------|---------|---------|---------|
| Length  | Left    | Right   | Bottom  | Top     | Diagonal |

[Figure: the six measured dimensions of a note]

Image taken from Flury and Riedwyl (1988)

Thankfully, for the given collection, we know which of the notes are genuine and which are counterfeit. Thus we have data that look like this:

|     | Status      | Length | Left  | Right | Bottom | Top  | Diagonal |
|-----|-------------|--------|-------|-------|--------|------|----------|
| 1   | genuine     | 214.8  | 131.0 | 131.1 | 9.0    | 9.7  | 141.0    |
| 2   | genuine     | 214.6  | 129.7 | 129.7 | 8.1    | 9.5  | 141.7    |
| 3   | genuine     | 214.8  | 129.7 | 129.7 | 8.7    | 9.6  | 142.2    |
| ⋮   | ⋮           | ⋮      | ⋮     | ⋮     | ⋮      | ⋮    | ⋮        |
| 101 | counterfeit | 214.4  | 130.1 | 130.3 | 9.7    | 11.7 | 139.8    |
| 102 | counterfeit | 214.9  | 130.5 | 130.2 | 11.0   | 11.5 | 139.5    |
| 103 | counterfeit | 214.9  | 130.3 | 130.1 | 8.7    | 11.7 | 140.2    |

Question

Can one devise some procedure that would look at the dimensions of a new note and predict whether it is genuine or counterfeit?

Let us look at a picture.

[Figure: pairwise scatterplot of the six measurements]

What do you think?
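As a quick numeric companion to the scatterplot, one can compare the average measurements of the two groups. Here is a sketch in Python with pandas, using only the six rows displayed above (a tiny sample, not the full dataset):

```python
import pandas as pd

# The six rows displayed in the table above (a small sample for illustration)
notes = pd.DataFrame(
    {
        "Status":   ["genuine", "genuine", "genuine",
                     "counterfeit", "counterfeit", "counterfeit"],
        "Length":   [214.8, 214.6, 214.8, 214.4, 214.9, 214.9],
        "Left":     [131.0, 129.7, 129.7, 130.1, 130.5, 130.3],
        "Right":    [131.1, 129.7, 129.7, 130.3, 130.2, 130.1],
        "Bottom":   [9.0, 8.1, 8.7, 9.7, 11.0, 8.7],
        "Top":      [9.7, 9.5, 9.6, 11.7, 11.5, 11.7],
        "Diagonal": [141.0, 141.7, 142.2, 139.8, 139.5, 140.2],
    },
    index=[1, 2, 3, 101, 102, 103],
)

# Average of each measurement within each group
print(notes.groupby("Status").mean())
```

Even in this tiny sample, Top and Diagonal separate the two groups much more visibly than Length does.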

Classification: the basic set-up

Classification deals with labeled data: \[ (\underbrace{Y}_{\text{label}}, \underbrace{X_1, \ldots, X_p}_{\text{features}}). \]

The goal is to learn/infer possible relationships between \(Y\) and the \(X_i\)’s.

In the banknotes example, \[ \begin{aligned} Y &\in \{\text{genuine}, \text{counterfeit}\}, \\ X_1 &= \text{Length}, \\ &\vdots \\ X_6 &= \text{Diagonal}. \end{aligned} \]

Note: When there are only two labels, we call the problem binary classification. In this case, we typically encode the labels by \(0\) and \(1\).

In the banknotes example, let us use the encoding \[ \begin{aligned} 1 &= \text{genuine}, \\ 0 &= \text{counterfeit}. \end{aligned} \]

We denote the label of the \(i\)-th observation by \(Y_i\), and the features by \(X_{i1}, \ldots, X_{ip}\).

Logistic regression

This posits the following model for data generation: \[ \mathrm{Pr}(Y = 1 | X_1, \ldots, X_p) = p(X_1, \ldots, X_p) := \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}. \] Assuming this model, we would like to find values of the parameters \(\beta_i\) from the given data.
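Concretely, the model passes the linear combination \(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\) through the logistic (sigmoid) function, which squashes any real number into \((0, 1)\). A minimal sketch in Python, with made-up coefficient values purely for illustration:

```python
import math

def logistic_prob(x, beta0, beta):
    """Pr(Y = 1 | X = x) under the logistic regression model."""
    z = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(z) / (1 + math.exp(z))  # equivalently 1 / (1 + exp(-z))

# Illustrative (made-up) coefficients for two features
p = logistic_prob([1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
print(p)  # a probability strictly between 0 and 1
```

Note that when the linear combination is \(0\) the model returns probability \(1/2\), and the probability increases monotonically in the linear combination.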

One does this using the method of maximum likelihood. The likelihood of the observed labels is \[ L(\beta_0, \ldots, \beta_p) = \prod_{i = 1}^n p(X_{i1}, \ldots, X_{ip})^{Y_i} \, (1 - p(X_{i1}, \ldots, X_{ip}))^{1 - Y_i}. \]

One maximizes this likelihood function with respect to the parameters \(\beta_0, \ldots, \beta_p\).

Unfortunately, unlike linear regression, there is no closed-form solution. However, we can use iterative methods (e.g., the Fisher scoring method) to obtain one.
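To make the iterative idea concrete, here is a bare-bones sketch — plain gradient ascent on the log-likelihood rather than Fisher scoring — fitting a one-feature logistic model to a small synthetic dataset. The data, step size, and iteration count are all made up for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the likelihood L(beta_0, beta_1) for a single feature."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Tiny synthetic dataset: larger x tends to go with label 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
lr = 0.1  # step size
for _ in range(200):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) * (1, x_i)
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0, b1 = b0 + lr * g0, b1 + lr * g1

print(b1)  # positive: larger x pushes Pr(Y = 1) up
```

Real implementations use Fisher scoring (equivalently, iteratively reweighted least squares), which converges much faster. One caveat worth knowing: if the classes are perfectly separated, as in this toy example, the maximum-likelihood estimates diverge to infinity as the iterations continue.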

Fortunately, someone has already done this for us in R.

(To be continued…)
