Classification

Classification is a statistical method of developing predictive models to separate and classify new observations. In machine learning, classification algorithms follow supervised learning, where models learn from training data; the models are subsequently used to categorize test data. Models usually calculate the probability of obtaining an outcome. The probability is then translated into a category either by specifying a cutoff, say >= 0.5, or by selecting the category with the maximum probability. In machine learning there are many popular examples of classification problems, such as “spam – not spam”, “malignant – not malignant”, “purchased – not purchased”, “defective – not defective”, “survived – not survived”, “win – lose” etc. In this section, we’ll discuss the classification algorithms Logistic Regression, Support Vector Machine, Naive Bayes, Decision Tree and Random Forest.

To develop classification models and export the output data, open the mlearn app using the attached link.

Logistic Regression 

There are three types of logistic regression: binary, ordinal and multinomial.

Binary logistic regression models a binary dependent variable using the logistic function, which produces an S-shaped curve. Its output lies between 0 and 1 for all values of the predictors. The function is as follows.

f(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

This can be expressed in terms of the odds (often loosely called the odds ratio), the ratio of the probability of an event occurring to the probability of the event not occurring.

f(x) / (1 − f(x)) = e^(β0 + β1x)

This can further be expressed in terms of the logit link function by taking the natural log of the odds.

ln( f(x) / (1 − f(x)) ) = β0 + β1x
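
As a quick numerical check of how these three forms relate, here is a minimal Python sketch with arbitrary (not fitted) coefficient values:

    import numpy as np

    # Arbitrary illustrative coefficients, not fitted to any data
    b0, b1 = -1.0, 0.5
    x = 2.0

    log_odds = b0 + b1 * x            # logit scale: β0 + β1x
    odds = np.exp(log_odds)           # odds: f(x) / (1 - f(x))
    prob = odds / (1 + odds)          # probability: the logistic function f(x)

    print(log_odds, odds, prob)       # 0.0 1.0 0.5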

The linear and logit functions can be represented graphically, as follows.

This is the concept behind binary logistic regression, which applies to a dichotomous response variable.

Multinomial logistic regression applies to a response variable with more than two nominal outcomes. In this method the logit, or the natural logarithm of the odds, is modeled as a linear combination of the predictor variables. If the response has n categories, there are n(n − 1)/2 possible pairwise logits instead of just one as in binary regression; among these, only (n − 1) are non-redundant.

Ordinal logistic regression, or the proportional odds model, is used when there is a natural order among the levels of the response variable. It is an extension of binary logistic regression that relies on the proportional odds (parallel regression) assumption.
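
As a minimal sketch of the multinomial case in Python with statsmodels (the data here are synthetic, since the worked example below is binary):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    # Synthetic illustrative data: one predictor and a nominal response with 3 levels
    x = rng.normal(size=200)
    y = rng.integers(0, 3, size=200)

    # Design matrix with an intercept column
    X = sm.add_constant(x)

    # Multinomial logit: one logit per non-baseline category, so 3 - 1 = 2 here
    res = sm.MNLogit(y, X).fit(disp=False)
    print(res.params)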

The current example is provided by the UCLA Institute for Digital Research and Education. The data can be downloaded from the link below:

https://stats.idre.ucla.edu/stat/data/binary.csv

We are interested in determining how GRE score, GPA and the rank of the undergraduate institution influence admission into graduate school. The response variable admit takes the values 0 (no admission) and 1 (admission).

To develop the model, open the mlearn app. Load the data by clicking the ‘Data’ tab, select ‘url’ under ‘Data Source’ and copy/paste the above url. Select ‘.csv .txt..’ under ‘File Ext’ and click ‘Submit Change’. This will load the data into the app. The first 10 rows and 10 columns will be shown in ‘Data Input’, with summary statistics in ‘Summary’. The complete dataset can be viewed and edited in the ‘Edit’ tab.

In the ‘Edit’ tab, select ‘admit’ and ‘rank’ in ‘change to discrete’ under ‘Change Data’. In the ‘Summary’ section of this tab, make sure the ‘data_type’ entries are integer or numeric. In this example, set ‘train/test split (%)’ to ‘100/0’, meaning the whole dataset is used as the training set. Click ‘Submit Change’ to apply these changes.
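
For readers who prefer to follow along outside the app, a minimal pandas sketch (the column names come from the UCLA file) performs the equivalent steps:

    import pandas as pd

    # Load the UCLA admissions data directly from the url above
    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

    # Treat the response and the institution rank as discrete (categorical) variables
    df["admit"] = df["admit"].astype("category")
    df["rank"] = df["rank"].astype("category")

    print(df.head(10))       # first rows, similar to ‘Data Input’
    print(df.describe())     # summary statistics, similar to ‘Summary’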

The ‘Distribution’ tab can be used to view the distribution of the data. The following box plots show that higher average ‘gre’ and ‘gpa’ are associated with admission (admit=1).

This can be obtained by selecting ‘Box’ in ‘plot type’, ‘admit’ as ‘x variable’, ‘gre’ and ‘gpa’ as ‘y variable’, ‘y_variables’ as ‘facet variable’ and ‘admit’ as ‘color variable’.
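
A comparable plot can be sketched with pandas’ built-in plotting, assuming matplotlib is installed:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

    # Box plots of gre and gpa, grouped (faceted) by admit
    df.boxplot(column=["gre", "gpa"], by="admit")
    plt.show()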

 

To develop the classification model, click the ‘Classification’ tab, then under ‘Variables’ select ‘admit’ as the response, ‘gre’, ‘gpa’ and ‘rank’ as ‘predictors’ and ‘logistic regression’ under ‘select model’. Check off ‘Scale Data’ and click ‘Submit Change’.

  

A summary of the model will be displayed in the ‘Summary’ section:

In the above output, ‘Call’ shows the formula used to develop this model. The following is a description of the Residual Deviance, Null Deviance and Deviance Residuals, where LL represents the log likelihood.

Residual Deviance = 2 × (LL(Saturated Model) − LL(Proposed Model))

Null Deviance = 2 × (LL(Saturated Model) − LL(Null Model))

Multiplying by 2 makes the difference in LL follow a Chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters. In this example, the number of parameters in the saturated model is 400 (one per observation) and the number of parameters in the null model is 1, so the degrees of freedom for the null deviance is 400 − 1 = 399. Similarly, the proposed model has 6 parameters (intercept, gre, gpa and three dummy variables for rank), so the residual deviance has 400 − 6 = 394 degrees of freedom.

Deviance residuals are analogous to the residuals in an ordinary least squares (OLS) model. They are the signed square roots of the individual terms in the residual deviance, so squaring and summing all deviance residuals gives the residual deviance. The goal is a distribution that is centered around zero and roughly symmetric on both sides.

The following portion of the summary output shows the model coefficients, standard errors, z-values and p-values. As shown by the p-values, all model parameters ‘gre’, ‘gpa’ and ‘rank’ are statistically significant. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable. For example, for a one-unit change in ‘gre’ the log odds of admission increase by 0.002264, and for ‘gpa’ they increase by 0.804038. Attending an undergraduate institution with rank 2, versus an institution with rank 1, changes the log odds of admission by -0.675443. The admission probability gradually decreases as the rank number increases (rank 1 being the best, and so on).

AIC, or the Akaike Information Criterion, can be used to compare the goodness of fit between models. It is based on the maximized likelihood but adds a penalty for the number of parameters to discourage overfitting, similar to adjusted R2 in multiple linear regression. The goal is to avoid using irrelevant predictor variables, so a model with a lower AIC is preferred over one with a higher AIC.

The end of the summary shows the Number of Fisher Scoring Iterations. An iterative approach, using an algorithm such as Newton-Raphson, is required to fit the model; this last line tells how many iterations were completed before the process converged and the results were output.
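
To reproduce this summary outside the app, a minimal Python sketch with statsmodels (fitting the same binomial GLM on the UCLA file) is:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Load the UCLA admissions data and treat rank as a categorical predictor
    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    df["rank"] = df["rank"].astype("category")

    # Binomial GLM with a logit link: the same model as binary logistic regression
    fit = smf.glm("admit ~ gre + gpa + rank", data=df,
                  family=sm.families.Binomial()).fit()

    print(fit.summary())        # coefficients, standard errors, z- and p-values
    print(fit.null_deviance)    # null deviance
    print(fit.deviance)         # residual deviance
    print(fit.aic)              # Akaike Information Criterion
    print(np.exp(fit.params))   # coefficients converted from log odds to odds ratios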

To visualize the data, select ‘gre’ in ‘x variable’, ‘probability_1’ (admit=1) in ‘y variable’ and ‘rank’ in ‘color variable’. Click ‘Submit Change’.

This shows the probability of getting admitted: as the GRE score increases or the rank number decreases, the probability of admission increases.
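
An equivalent plot can be sketched in Python by refitting the same model and plotting its fitted probabilities (matplotlib is assumed):

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    df["rank"] = df["rank"].astype("category")
    fit = smf.glm("admit ~ gre + gpa + rank", data=df,
                  family=sm.families.Binomial()).fit()

    # Fitted probability of admission (admit = 1) for every observation
    df["probability_1"] = fit.fittedvalues

    # Probability against gre, coloured by institution rank
    for r, grp in df.groupby("rank", observed=True):
        plt.scatter(grp["gre"], grp["probability_1"], s=12, label=f"rank {r}")
    plt.xlabel("gre")
    plt.ylabel("probability_1")
    plt.legend()
    plt.show()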

Support Vector Machine

Based on the principles of statistics, optimization and machine learning, the Support Vector Machine (SVM) was proposed by Boser, Guyon and Vapnik in 1992. SVM is a powerful machine learning technique that classifies data using an optimal hyperplane.

SVM formulates the binary classification problem as a convex optimization problem to find the maximum-margin separating hyperplane. The optimal hyperplane is determined by the support vectors, as shown below:

For data that are not linearly separable, SVM maps the data into a higher dimensional space where they become linearly separable. The algorithm then determines the hyperplane in that space and finally projects the result back to the original lower dimensional space.

Mapping a large amount of data into a higher dimensional space can be compute-intensive. The ‘kernel trick’ is used to reduce this computational complexity and time: a kernel function is used in place of the explicit mapping function. The kernel functions available in this app are Linear, Polynomial, Radial Basis Function (RBF) and Sigmoid. To apply this model, simply select ‘Support Vector Machine’ under ‘select model’, then choose the classification type (default C-classification) and the kernel. The parameter C ranges from 0 to infinity, while nu is always between 0 and 1; nu is related to the ratio of support vectors and the ratio of training errors.
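
As a minimal scikit-learn sketch of the same task (treating rank as a numeric predictor for simplicity; the RBF kernel and C value are illustrative, not necessarily the app’s defaults):

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    X, y = df[["gre", "gpa", "rank"]], df["admit"]

    # C-classification with a radial basis function (RBF) kernel;
    # predictors are standardized before fitting
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    svm.fit(X, y)
    print(svm.score(X, y))   # accuracy on the training data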

Naive Bayes 

Bayes’ theorem is used to determine the probability of an event based on prior knowledge of relevant features. A Naive Bayes classifier calculates the probability that a data point belongs to a class given its features or predictor variables. The word ‘Naive’ reflects the assumption of independence between features. For a single feature, the theorem can be written as:

P(class | feature) = P(feature | class) × P(class) / P(feature)

In the formula above, P(class) is calculated from the number of observations of a certain class and the total number of observations. P(feature) is calculated by counting observations in a hypothetical space defined by a single feature. Similarly, P(feature|class) is determined by counting the number of observations of a certain class within that hypothetical space and the total number of observations belonging to that class. To apply this model, select ‘Naive Bayes’ under ‘select model’ and click ‘Submit Change’.
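
A minimal scikit-learn sketch of the same idea, using Gaussian Naive Bayes (which assumes each numeric feature is normally distributed within a class), might look like this:

    import pandas as pd
    from sklearn.naive_bayes import GaussianNB

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    X, y = df[["gre", "gpa", "rank"]], df["admit"]

    nb = GaussianNB()
    nb.fit(X, y)
    # Class probabilities P(class | features) for the first five observations
    print(nb.predict_proba(X)[:5])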

Decision Tree 

Decision tree classification is performed by recursively splitting the dataset into multiple sections, as shown in the figure below. The algorithm uses entropy and information gain to construct the decision tree. In this figure, X and Y are the predictor variables. Predicted classes are determined by the class of the terminal leaf, or leaf node. For example, if X is less than X1 and Y is less than Y2, the terminal leaf is S-3, and the predicted class of any data point within this terminal leaf will be C2.

To apply this model, select ‘Decision Tree’ under ‘select model’ and click ‘Submit Change’.
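
A minimal scikit-learn sketch of a decision tree classifier on the admissions data (the entropy criterion matches the description above; the depth cap is only for readability):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    X, y = df[["gre", "gpa", "rank"]], df["admit"]

    # Splits are chosen by entropy / information gain
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X, y)
    print(export_text(tree, feature_names=["gre", "gpa", "rank"]))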

Random Forest 

Random forest, proposed by Breiman in 2001, is an ensemble learning approach for classification and regression. It uses a large collection of de-correlated decision trees, called the random forest. Instead of developing a solution based on the output of a single decision tree, random forest aggregates the outputs from a number of trees. This is mainly to reduce the high variance of a single decision tree.

The algorithm first selects a random sample (with replacement) of data points from the training set. It then builds a decision tree with the sampled data. Based on the number of trees (ntree) in the model, the algorithm repeats these two steps to build the remaining decision trees. Finally, to predict the class of a test observation, it obtains a prediction from each tree and assigns the class predicted by the majority of the trees.

To apply this model, select ‘Random Forest’ under ‘select model’, select the number of trees in ‘ntree’ and click ‘Submit Change’.
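
A minimal scikit-learn sketch, where n_estimators plays the role of ‘ntree’:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    X, y = df[["gre", "gpa", "rank"]], df["admit"]

    # 500 trees, each grown on a bootstrap sample; predictions are majority votes
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(X, y)
    print(rf.predict(X[:5]))          # majority-vote class predictions
    print(rf.predict_proba(X[:5]))    # fraction of trees voting for each class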

Download Data

To download the output data, click ‘Download Data’ under ‘Data Classification’. This will save the output to the local drive. It includes the probability of non-admission as probability_0, the probability of admission as probability_1 and the final admission prediction as predicted_admit.
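
Once the file is saved, a minimal pandas sketch (the file name here is hypothetical; use whatever name the app saved the output under) can check the predictions against the actual outcomes:

    import pandas as pd

    # Hypothetical file name for the downloaded output
    out = pd.read_csv("classification_output.csv")

    # Confusion matrix: actual admit vs predicted_admit
    print(pd.crosstab(out["admit"], out["predicted_admit"]))

    # Overall accuracy on the training data
    print((out["admit"] == out["predicted_admit"]).mean())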