
Introduction to Classification

| Category: Statistics | Tags: e-ITEC, ISEC |

Classification is one of the most basic problems in Statistics and Machine Learning. It is an example of a supervised learning problem where one deals with labeled data. In this note, we will learn about basic classification algorithms such as

  1. Logistic regression
  2. Linear and quadratic discriminant analysis
  3. Naive Bayes
  4. Multinomial logistic regression
  5. \(k\)-nearest neighbours

An example

[Images: front and back of the note]

Image source: Wikipedia

These are pictures of the second series of Swiss 1000-franc notes (216 mm \(\times\) 131 mm) issued between 1911 and 1914. You are presented with a collection of these notes, some of which are counterfeit. The potential giveaways are the various dimensions of the banknotes:

| \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) | \(X_5\) | \(X_6\) |
|---------|---------|---------|---------|---------|---------|
| Length  | Left    | Right   | Bottom  | Top     | Diagonal |

[Figure: the six measured dimensions of a note]

Image taken from Flury and Riedwyl (1988)

Thankfully, for the given collection, we know which of the notes are genuine and which are counterfeit. Thus we have data that look like this:

|     | Status      | Length | Left  | Right | Bottom | Top  | Diagonal |
|-----|-------------|--------|-------|-------|--------|------|----------|
| 1   | genuine     | 214.8  | 131.0 | 131.1 | 9.0    | 9.7  | 141.0    |
| 2   | genuine     | 214.6  | 129.7 | 129.7 | 8.1    | 9.5  | 141.7    |
| 3   | genuine     | 214.8  | 129.7 | 129.7 | 8.7    | 9.6  | 142.2    |
| ⋮   | ⋮           | ⋮      | ⋮     | ⋮     | ⋮      | ⋮    | ⋮        |
| 101 | counterfeit | 214.4  | 130.1 | 130.3 | 9.7    | 11.7 | 139.8    |
| 102 | counterfeit | 214.9  | 130.5 | 130.2 | 11.0   | 11.5 | 139.5    |
| 103 | counterfeit | 214.9  | 130.3 | 130.1 | 8.7    | 11.7 | 140.2    |

Question

Can one devise some procedure that would look at the dimensions of a new note and predict whether it is genuine or counterfeit?

Let us look at a picture.

[Figure: pairwise scatterplot of the six measurements]

What do you think?
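As a quick numeric companion to the scatterplot, one can compare the average measurements of the two groups. Here is a sketch in Python with pandas, using only the six rows displayed above (a tiny sample, not the full dataset):

```python
import pandas as pd

# The six rows displayed in the table above (a small sample for illustration)
notes = pd.DataFrame(
    {
        "Status":   ["genuine", "genuine", "genuine",
                     "counterfeit", "counterfeit", "counterfeit"],
        "Length":   [214.8, 214.6, 214.8, 214.4, 214.9, 214.9],
        "Left":     [131.0, 129.7, 129.7, 130.1, 130.5, 130.3],
        "Right":    [131.1, 129.7, 129.7, 130.3, 130.2, 130.1],
        "Bottom":   [9.0, 8.1, 8.7, 9.7, 11.0, 8.7],
        "Top":      [9.7, 9.5, 9.6, 11.7, 11.5, 11.7],
        "Diagonal": [141.0, 141.7, 142.2, 139.8, 139.5, 140.2],
    },
    index=[1, 2, 3, 101, 102, 103],
)

# Average of each measurement within each group
print(notes.groupby("Status").mean())
```

Even in this tiny sample, Top and Diagonal separate the two groups much more visibly than Length does.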

Classification: the basic set-up

Classification deals with labeled data: \[ (\underbrace{Y}_{\text{label}}, \underbrace{X_1, \ldots, X_p}_{\text{features}}). \]

The goal is to learn/infer possible relationships between \(Y\) and the \(X_i\)’s.

In the banknotes example, \[ \begin{aligned} Y &\in \{\text{genuine}, \text{counterfeit}\}, \\ X_1 &= \text{Length}, \\ &\vdots \\ X_6 &= \text{Diagonal}. \end{aligned} \]

Note: When there are only two labels, we call the problem binary classification. In this case, we typically encode the labels by \(0\) and \(1\).

In the banknotes example, let us use the encoding \[ \begin{aligned} 1 &= \text{genuine}, \\ 0 &= \text{counterfeit}. \end{aligned} \]

We denote the label of the \(i\)-th observation by \(Y_i\), and the features by \(X_{i1}, \ldots, X_{ip}\).

Logistic regression

This posits the following model for data generation: \[ \mathrm{Pr}(Y = 1 | X_1, \ldots, X_p) = p(X_1, \ldots, X_p) := \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}. \] Assuming this model, we would like to find values of the parameters \(\beta_i\) from the given data.
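Concretely, the model passes the linear combination \(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p\) through the logistic (sigmoid) function, which squashes any real number into \((0, 1)\). A minimal sketch in Python, with made-up coefficient values purely for illustration:

```python
import math

def logistic_prob(x, beta0, beta):
    """Pr(Y = 1 | X = x) under the logistic regression model."""
    z = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(z) / (1 + math.exp(z))  # equivalently 1 / (1 + exp(-z))

# Illustrative (made-up) coefficients for two features
p = logistic_prob([1.0, 2.0], beta0=-1.0, beta=[0.5, 0.25])
print(p)  # a probability strictly between 0 and 1
```

Note that when the linear combination is \(0\) the model returns probability \(1/2\), and the probability increases monotonically in the linear combination.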

One does this using the method of maximum likelihood. The likelihood of the observed labels is \[ L(\beta_0, \ldots, \beta_p) = \prod_{i = 1}^n p(X_{i1}, \ldots, X_{ip})^{Y_i} \, (1 - p(X_{i1}, \ldots, X_{ip}))^{1 - Y_i}. \]

One maximizes this likelihood function with respect to the parameters \(\beta_0, \ldots, \beta_p\).

Unfortunately, unlike linear regression, there is no closed-form solution. However, we can use iterative methods (e.g., the Fisher scoring method) to obtain one.
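To make the iterative idea concrete, here is a bare-bones sketch — plain gradient ascent on the log-likelihood rather than Fisher scoring — fitting a one-feature logistic model to a small synthetic dataset. The data, step size, and iteration count are all made up for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the likelihood L(beta_0, beta_1) for a single feature."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

# Tiny synthetic dataset: larger x tends to go with label 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
lr = 0.1  # step size
for _ in range(200):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) * (1, x_i)
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    b0, b1 = b0 + lr * g0, b1 + lr * g1

print(b1)  # positive: larger x pushes Pr(Y = 1) up
```

Real implementations use Fisher scoring (equivalently, iteratively reweighted least squares), which converges much faster. One caveat worth knowing: if the classes are perfectly separated, as in this toy example, the maximum-likelihood estimates diverge to infinity as the iterations continue.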

Fortunately, someone has already done this for us in R.

(To be continued…)
