Regression

Regression is used to predict the value of a response variable (Y-variable) based on one or more predictor variables (X-variables). In this example, we’ll discuss three regression models – Polynomial Regression, Support Vector Regression and Random Forest Regression.

Polynomial Regression 

In polynomial regression, the relationship between the response and the predictors is modeled using an nth-degree polynomial of the predictor variables. For a single predictor x and response variable y, the general polynomial regression model is:

 

y = β₀ + β₁x + β₂x² + β₃x³ + … + βₙxⁿ + ε

 

where β₀ is the intercept, β₁ … βₙ are the regression coefficients and ε is the random error.

Here n is the degree of the polynomial. The model is called linear, quadratic, cubic or quartic for n = 1, 2, 3 and 4, respectively.
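The model summary described later in this section resembles R's lm() output, so a minimal sketch of such a fit in R, on made-up data and not MLearn's actual implementation, looks like this:

    # Illustrative nth-degree polynomial fit (here n = 2) on synthetic data
    set.seed(1)
    mydata <- data.frame(x = runif(100, 0, 10))
    mydata$y <- 2 + 1.5 * mydata$x - 0.3 * mydata$x^2 + rnorm(100)

    # poly(x, degree, raw = TRUE) expands x into x, x^2, ..., x^n
    fit_poly <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = mydata)
    summary(fit_poly)   # coefficient estimates, residuals, R-square, F-statistic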

In this example, the built-in ‘iris’ dataset will be used. It consists of four measurements – the length and width of the petals and sepals – for one hundred and fifty Iris flowers from the three species setosa, versicolor and virginica.

The numerical portion of the data can be summarized using a correlation matrix. To do this, open MLearn, load the built-in ‘iris’ dataset, select the ‘Distribution’ tab and choose the input variables. This will display the correlation matrix.
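For reference, the same kind of correlation matrix can be computed directly from the four numeric iris columns, for example in R:

    # Correlation matrix of the four numeric iris measurements
    data(iris)
    round(cor(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]), 2)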

To develop a regression model, select the ‘Regression’ tab. In this example, we’ll develop a regression model to predict sepal length using petal length and petal width. Select ‘sepal_length’ as the ‘response’ under ‘Variables’. Enter ‘petal_length’ and ‘petal_width’ under ‘predictors’. Select ‘Polynomial’ under ‘select model’ and ‘1’ under ‘degree polynomial’. The train/test split has already been set in the ‘Data’ tab. Click ‘Submit Change’.
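As a rough equivalent of this step, the same degree-1 model can be sketched in R, assuming an 80/20 train/test split (consistent with the 117 residual degrees of freedom reported below); the split, seed and object names are illustrative, not MLearn's exact settings:

    # Illustrative degree-1 fit of sepal length on petal length and width
    set.seed(42)
    train_idx <- sample(nrow(iris), size = round(0.8 * nrow(iris)))   # 120 training rows
    train <- iris[train_idx, ]
    test  <- iris[-train_idx, ]

    fit  <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = train)
    summary(fit)                          # coefficients, residuals, R-square, F-statistic
    pred <- predict(fit, newdata = test)  # predictions on the test data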

In the first row, ‘Model Accuracy’ is displayed as R-square. The model summary is displayed in the third row.

The first item in ‘Model Summary’ is the formula used to fit the data. The second item is the residuals, i.e. the differences between the actual and predicted values. It shows the minimum, first quartile, median, third quartile and maximum of the residuals. Ideally, the residuals should be symmetrically distributed around a mean of zero.

In ‘Coefficients’ there are three rows: first ‘Intercept’, then ‘petal_length’ and finally ‘petal_width’. The numbers after the closing brackets indicate the power of the predictor in each polynomial term. The ‘Estimate’ column shows the estimated values of the coefficients. ‘Std Error’ shows the variability of the coefficient estimates. ‘t value’ is the ratio of the estimated coefficient to its standard error; it indicates how large the coefficient is, in absolute value, compared to its standard error.

Pr(>|t|), the p-value for each coefficient, tests the null hypothesis that the coefficient is equal to zero (i.e. has no effect). A low p-value (< 0.05, or 5%) implies that the null hypothesis can be rejected with 95% confidence. This means a predictor with a low p-value is likely to be a meaningful addition to the model, because changes in its value are related to changes in the response variable. Conversely, a larger p-value suggests that changes in the predictor are not related to changes in the response.
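In the illustrative R sketch above, the whole coefficient table, including the Pr(>|t|) column, can be extracted as a matrix:

    # Coefficient estimates, standard errors, t values and Pr(>|t|)
    coef(summary(fit))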

The ‘Residual standard error’, or root mean square error (RMSE), can be thought of as the typical size of the model’s prediction error. It is calculated using the formula:

 

RMSE = √(Sum of squared errors / Degrees of freedom)

 

In this example, the degrees of freedom is 117, calculated as n − k − 1, where n is the training data size (here 120) and k is the number of predictors (here 2). R-square is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. It is the percentage of the variation in the response variable that is explained by the model. It is calculated by the formula:

 

R² = 1 − Sum of squared errors / Total sum of squares

 

In multiple regression, R-square always increases as more predictor variables are added to the model. Adjusted R-square penalizes adding predictors to the model by adjusting the R-square value:

 

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
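Continuing the illustrative R sketch, these three quantities can be computed directly from the fitted model:

    # RMSE, R-square and adjusted R-square from the fit above
    res <- residuals(fit)
    n   <- nrow(train)                        # training data size
    k   <- 2                                  # number of predictors
    sse <- sum(res^2)                         # sum of squared errors
    sst <- sum((train$Sepal.Length - mean(train$Sepal.Length))^2)   # total sum of squares
    rmse   <- sqrt(sse / (n - k - 1))         # residual standard error
    r2     <- 1 - sse / sst
    adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)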

 

The ‘F-statistic’ represents the significance, or goodness of fit, of the overall model rather than of a single predictor. It is calculated using the formula:

 

F-statistic = Mean square regression / Mean squared error

 

Next, an F-test is performed to determine the overall model significance. The null hypothesis is that the fit of the intercept-only model (no predictors) and of the developed model are equal. The alternative hypothesis is that the fit of the current model is significantly better than that of the intercept-only model. The p-value corresponding to the F-test, 2.2E-16 in this case, shows that the null hypothesis can be rejected with more than 95% confidence.
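Using the same illustrative quantities as above, the F-statistic and its p-value can be reproduced as follows:

    # F-statistic and p-value of the overall F-test
    msr    <- (sst - sse) / k                 # mean square regression
    mse    <- sse / (n - k - 1)               # mean squared error
    f_stat <- msr / mse
    pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)   # p-value of the F-test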

To visualize the predicted response against the actual data, select the variables under ‘Plot Data’ and click ‘Submit Change’.

This will plot the data.
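In the illustrative R sketch, an equivalent plot of the predicted against the actual values on the test data would be:

    # Predicted vs. actual sepal length on the test data
    plot(test$Sepal.Length, pred,
         xlab = "Actual sepal length", ylab = "Predicted sepal length")
    abline(0, 1, lty = 2)   # points on the dashed line would be perfect predictions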

Support Vector Regression

Based on the principles of statistics, optimization and machine learning, the Support Vector Machine (SVM) was proposed by Boser, Guyon and Vapnik in 1992. SVM is a powerful machine learning technique that classifies data using an optimal hyperplane. Support Vector Regression (SVR) was developed from SVM concepts to solve regression problems.

SVM formulates the binary classification problem as a convex optimization problem to find the maximum-margin separating hyperplane. The optimal hyperplane is defined by the support vectors. The generalization of SVM to SVR is accomplished by introducing an ε-insensitive region, or ε-tube. SVR optimization first defines a convex ε-insensitive loss function to be minimized and then finds the flattest tube that contains most of the training data. In this way, an objective function is constructed from the loss function and the geometrical properties of the tube.

Using the soft-margin approach of SVM, slack variables ξ and ξ* can be added for outliers to determine how many points are tolerated outside the tube. To apply SVR, select ‘Support Vector’ in ‘select model’ under ‘Model’.

Under model ‘type’, select either ‘eps-regression’ or ‘nu-regression’. In ‘eps-regression’, the number of support vectors is controlled by ε. In ‘nu-regression’, the parameter ν determines the number of support vectors and ε is estimated automatically. In this application, default values are used for both parameters.

So far, we’ve discussed the data in the original feature space, assuming a linear function. For non-linear functions, the data is mapped to a higher-dimensional space, called the kernel space, using kernels that satisfy Mercer’s condition. In this application, linear, polynomial, radial and sigmoid kernel functions can be used by selecting ‘kernel’ under ‘Model’.
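The type and kernel names above match those of the svm() function in the e1071 R package, so an illustrative equivalent fit (not necessarily MLearn's exact call) would be:

    # Illustrative SVR fit with the e1071 package
    library(e1071)
    svr_fit <- svm(Sepal.Length ~ Petal.Length + Petal.Width, data = train,
                   type = "eps-regression",   # or "nu-regression"
                   kernel = "radial")         # "linear", "polynomial" and "sigmoid" also available
    svr_pred <- predict(svr_fit, newdata = test)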

Random Forest Regression 

Random forest, proposed by Breiman in 2001, is an ensemble learning approach for classification and regression. It uses a large collection of uncorrelated decision trees, called the random forest. Instead of developing a solution based on the output of a single decision tree, random forest aggregates the outputs from a number of trees. This reduces the high variance and high bias that a single decision tree can exhibit.

Decision tree regression is performed by splitting the dataset into multiple sections, as shown on the left of the figure above. The algorithm constructs the tree by choosing splits that maximize the reduction in impurity (information gain based on entropy for classification; variance reduction for regression). In this figure, X and Y are the predictor variables and Z is the response variable. The data points are presented in the X-Y plane. Predicted values are determined by taking the average of the response values within the terminal leaf, or leaf node. For example, if X is less than X1 and Y is less than Y2, the terminal leaf is S-3. The predicted value Z1 is calculated by taking the average of the corresponding Z values within S-3.

Random forest takes a random sample from the dataset, builds a decision tree on that sample and repeats the process for a specified number of trees (ntree). Instead of providing a solution based on a single decision tree, it aggregates the output of all the trees. This adds a layer on top of bagging that constructs multiple predictions; the final solution is obtained by averaging them.

To apply this model, select ‘Random Forest’ under ‘select model’, enter ‘10’ under ‘ntree’ and click ‘Submit Change’. Under ‘Model Accuracy’, the train data R-square is shown as ~89%. This is a significant increase compared to the Polynomial model shown above.
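The ntree parameter matches the randomForest package in R; an illustrative equivalent fit, again not necessarily MLearn's exact call, would be:

    # Illustrative random forest fit with the randomForest package
    library(randomForest)
    set.seed(42)
    rf_fit  <- randomForest(Sepal.Length ~ Petal.Length + Petal.Width,
                            data = train, ntree = 10)
    rf_pred <- predict(rf_fit, newdata = test)
    # Train-data R-square, comparable to the 'Model Accuracy' value
    1 - sum((train$Sepal.Length - predict(rf_fit, train))^2) /
        sum((train$Sepal.Length - mean(train$Sepal.Length))^2)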

Download Data

Export the predicted data to a local drive by clicking ‘Download Data’ on the right.

Open the dataset from the local drive to perform further analysis.
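Assuming the download is a CSV file (the actual file name and format may differ), it can be reloaded for further analysis, for example:

    # Reload the exported predictions for further analysis
    # (the file name "predicted_data.csv" is an assumed example)
    downloaded <- read.csv("predicted_data.csv")
    head(downloaded)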