The following example illustrates XLMiner's Multiple Linear Regression method using the Boston Housing data set to predict the median house prices in housing tracts. This data set has 14 variables. A description of each variable is given in the following table. In addition to these variables, the data set also contains an additional variable, CAT. MEDV, which has been created by categorizing median value (MEDV) into two categories: high (MEDV > 30) and low (MEDV < 30). This variable will not be used in this example.

Boston Housing Variable Explanations

On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples to open the Boston_Housing.xlsx from the data sets folder. A portion of the data set is shown below.

 Boston_Housing.xlsx Example Dataset

To partition the data into Training and Validation Sets, use the Standard Data Partition defaults with percentages of 60% of the data randomly allocated to the Training Set, and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a data set, see the Data Mining Partition section.

On the XLMiner ribbon, from the Data Mining tab, select Partition - Standard Partition to open the Standard Data Partition dialog.

Standard Data Partition Dialog 

Select a cell on the Data_Partition worksheet. On the XLMiner ribbon, from the Data Mining tab, select Predict - Multiple Linear Regression to open the Multiple Linear Regression - Step 1 of 2 dialog.

At Output Variable, select MEDV, then move all remaining variables (except CAT. MEDV) to the Selected Variables list.

Multiple Linear Regression - Step 1 of 2 

Click Next to advance to the Step 2 of 2 dialog.

If the number of rows in the data is less than the number of variables selected as Input variables, XLMiner displays the following prompt. Select OK to advance to the Variable Selection dialog.

Multiple Linear Regression Prediction Error 

If Force constant term to zero is selected, there is no constant term in the equation. Leave this option unchecked for this example.

Select Fitted values. When this option is selected, the fitted values are displayed in the output.

Select ANOVA table. When this option is selected, the ANOVA table is displayed in the output.

Under Residuals, select Standardized to display the Standardized Residuals in the output. Standardized residuals are obtained by dividing the unstandardized residuals by the respective standard deviations.

Under Residuals, select Unstandardized to display the Unstandardized Residuals in the output, which are computed by the formula: Unstandardized residual = Actual response - Predicted response.

Select Variance-covariance matrix. When this option is selected, the variance-covariance matrix of the estimated regression coefficients is displayed in the output.

Under Score Training Data and Score Validation Data, select all options to produce all four reports in the output. Since we did not create a Test Partition, the options under Score Test Data are disabled.
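The two residual types described above (Standardized and Unstandardized) follow directly from their standard definitions. The short Python sketch below is only an illustration on toy data; the variable names, the toy data, and the scaling by the residual standard error are assumptions for the sketch, not XLMiner's internal code. For the Boston Housing example, X would hold the 13 selected predictors and y the MEDV column.

    import numpy as np
    import statsmodels.api as sm

    # Toy stand-ins for the training partition: X holds predictor columns, y the output variable.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 13))
    y = X @ rng.normal(size=13) + rng.normal(size=100)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    unstandardized = y - fit.fittedvalues                    # actual - predicted
    standardized = unstandardized / np.sqrt(fit.mse_resid)   # scaled by the residual standard error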

XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partitioning Options on the Step 2 of 2 dialog. If this option is selected, XLMiner partitions the data set before running the prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, please see the Data Mining Partition section.

Multiple Linear Regression - Step 2 of 2 

Click Advanced to display the Multiple Linear Regression - Advanced Options dialog.

Multiple Linear Regression - Advanced 

Select Studentized. When this option is selected, the Studentized Residuals are displayed in the output. Studentized residuals are computed by dividing the unstandardized residuals by quantities related to the diagonal elements of the hat matrix, using a common scale estimate computed without the ith case in the model. These residuals have t-distributions with (n - k - 1) degrees of freedom. As a result, any residual with absolute value exceeding 3 usually requires attention.

Select Deleted. When this option is selected, the Deleted Residuals are displayed in the output. This residual is computed for the ith observation by first fitting a model without the ith observation, then using this model to predict the ith observation. The deleted residual is the difference between the actual observation and this prediction.

Select Cook's Distance to display the distance for each observation in the output. This is an overall measure of the impact of the ith data point on the estimated regression coefficients. In linear models, Cook's Distance has, approximately, an F distribution with k and (n - k) degrees of freedom.

Select DF fits. When this checkbox is selected, the DF fits for each observation are displayed in the output. DFFits provides information on how much the fitted model would change if a point were not included in the model.

XLMiner computes DFFits using the following formula:

DF Fits Formula 

where,

y_hat_i = i-th fitted value from full model

y_hat_i(-i) = i-th fitted value from model not including i-th observation

sigma(-i) = estimated error variance of model not including i-th observation

h_i = leverage of i-th point (i.e. {i,i}-th element of Hat Matrix)

e_i = i-th residual from full model

e_i^stud = i-th Studentized residual
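Putting these quantities together, the standard way to write the statistic (a common textbook formulation consistent with the definitions above, not necessarily XLMiner's exact notation) is:

    DFFits_i = (y_hat_i - y_hat_i(-i)) / (sigma(-i) * sqrt(h_i)) = e_i^stud * sqrt( h_i / (1 - h_i) )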

Select Covariance Ratios. When this is selected, the covariance ratios are displayed in the output. This measure reflects the change in the variance-covariance matrix of the estimated coefficients when the ith observation is deleted.

Select Hat Matrix Diagonals. When this checkbox is selected, the diagonal elements of the hat matrix are displayed in the output. This measure is also known as the leverage of the ith observation.

Select Perform Collinearity Diagnostics. When this checkbox is selected, the collinearity diagnostics are displayed in the output.
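To make the relationships among these diagnostics concrete, here is a minimal numpy sketch of the textbook formulas for the leverages (hat matrix diagonals), studentized and deleted residuals, Cook's distance, and DFFits. It is an illustration under standard definitions, not XLMiner's implementation:

    import numpy as np

    def regression_diagnostics(X, y):
        """Textbook influence diagnostics for an OLS fit; X must include the constant column."""
        n, k = X.shape
        H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
        h = np.diag(H)                                 # leverages (hat matrix diagonals)
        e = y - H @ y                                  # ordinary (unstandardized) residuals
        s2 = e @ e / (n - k)                           # full-model residual variance
        s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)   # scale estimate without case i
        studentized = e / np.sqrt(s2_i * (1 - h))      # externally studentized residuals
        deleted = e / (1 - h)                          # deleted (leave-one-out) residuals
        cooks = (e**2 / (k * s2)) * h / (1 - h) ** 2   # Cook's distance
        dffits = studentized * np.sqrt(h / (1 - h))    # DFFits
        return h, studentized, deleted, cooks, dffits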

Click OK to return to the Step 2 of 2 dialog, then click Variable Selection (on the Step 2 of 2 dialog) to open the Variable Selection dialog.

Multiple Linear Regression - Variable Selection 

When you have a large number of predictors and you would like to limit the model to only the significant variables, select Perform Variable selection to select the best subset of variables. From the drop-down arrows, specify 13 for the size of best subset. This option can take on values of 1 up to N, where N is the number of input variables. The default setting is N, the number of input variables selected in the Step 1 of 2 dialog.

XLMiner offers the following five selection procedures for selecting the best subset of variables.

Backward Elimination in which variables are eliminated one at a time, starting with the least significant. If this procedure is selected, FOUT is enabled. A statistic is calculated when variables are eliminated. For a variable to leave the regression, the statistic's value must be less than the value of FOUT (default = 2.71).

Forward Selection in which variables are added one at a time, starting with the most significant. If this procedure is selected, FIN is enabled. A statistic is calculated when variables are added. For a variable to come into the regression, the statistic's value must be greater than the value for FIN (default = 3.84).

Sequential Replacement in which variables are sequentially replaced and replacements that improve performance are retained.

Best Subsets, in which all combinations of variables are searched to find the combination with the best fit. This option can become quite time consuming depending upon the number of input variables. If this procedure is selected, Number of best subsets is enabled.

Stepwise selection is similar to Forward selection except that at each stage, XLMiner considers dropping variables that are not statistically significant. When this procedure is selected, the Stepwise selection options FIN and FOUT are enabled. In the stepwise selection procedure a statistic is calculated when variables are added or eliminated. For a variable to come into the regression, the statistic's value must be greater than the value for FIN (default = 3.84). For a variable to leave the regression, the statistic's value must be less than the value of FOUT (default = 2.71). The value for FIN must be greater than the value for FOUT.
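For intuition about how FIN is used, the sketch below mimics a simple forward selection pass in Python with statsmodels: at each step the candidate with the largest partial F statistic is added, and selection stops when no candidate exceeds the F-to-enter threshold. This is a simplified illustration of the general technique, not XLMiner's algorithm:

    import numpy as np
    import statsmodels.api as sm

    def forward_select(X, y, f_in=3.84):
        """Greedy forward selection: repeatedly add the predictor with the largest
        partial F statistic until no candidate's F exceeds the F-to-enter value."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining:
            base = sm.OLS(y, sm.add_constant(X[:, selected])).fit() if selected else None
            rss_old = base.ssr if base else float(np.sum((y - y.mean()) ** 2))
            best_f, best_j = 0.0, None
            for j in remaining:
                fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
                f_stat = (rss_old - fit.ssr) / (fit.ssr / fit.df_resid)  # partial F, 1 numerator df
                if f_stat > best_f:
                    best_f, best_j = f_stat, j
            if best_j is None or best_f < f_in:
                break
            selected.append(best_j)
            remaining.remove(best_j)
        return selected

Calling forward_select(X, y) on the training data returns the indices of the retained predictors; the default of 3.84 matches the FIN default shown in the dialog.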

Click OK to return to the Step 2 of 2 dialog, then click Finish. Click the MLR_Output worksheet to find the Output Navigator. Click any link here to display the selected output or to view any of the selections made on the three dialogs.

Multiple Linear Regression Output:  Output Navigator 

Multiple Linear Regression Output:  Inputs

On the Output Navigator, click the Predictors hyperlink to display the Model Predictors table. In Analytic Solver Platform, Analytic Solver Pro, XLMiner Platform, and XLMiner Pro V2015, a new pre-processing feature selection step has been added to prevent predictors that cause rank deficiency of the design matrix from becoming part of the model. Included and excluded predictors are shown in the Model Predictors table. In this model, there were no excluded predictors; all predictors passed the tolerance threshold of 5.23E-10 and were eligible to enter the model. This threshold is the tolerance beyond which the variance-covariance matrix is not considered exactly singular to within machine precision. The test is based on the diagonal elements of the triangular factor R resulting from a Rank-Revealing QR Decomposition. Predictors that do not pass the test are excluded.

If a predictor is excluded, the corresponding coefficient estimate will be 0 in the regression model, and the variance-covariance matrix will contain all zeros in the rows and columns that correspond to the excluded predictor. Multicollinearity diagnostics, variable selection, and all other remaining output are calculated for the reduced model.

The design matrix may be rank-deficient for several reasons. The most common cause of an ill-conditioned regression problem is the presence of features that can be exactly or approximately represented by a linear combination of other features. For example, assume that among the predictors you have three input variables X, Y, and Z, where Z = a * X + b * Y for some constants a and b. This dependence causes the design matrix not to have full rank, so one of these three variables will not pass the threshold for entrance and will be excluded from the final regression model.
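The following numpy/scipy sketch (an illustration only, not XLMiner's code) shows how such an exact linear combination produces a near-zero diagonal element in the R factor of a column-pivoted QR decomposition, which is the kind of signal the tolerance test described earlier relies on:

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(0)
    XY = rng.normal(size=(100, 2))                   # two independent predictors, X and Y
    Z = 2.0 * XY[:, 0] + 3.0 * XY[:, 1]              # Z = a*X + b*Y with a = 2, b = 3
    design = np.column_stack([np.ones(100), XY, Z])  # constant term plus X, Y, Z

    # Column-pivoted QR pushes (nearly) dependent columns to the end; a tiny |R[i, i]|
    # flags a column that adds essentially no new information to the design matrix.
    Q, R, piv = qr(design, mode='economic', pivoting=True)
    print(np.abs(np.diag(R)))                        # the last value is ~0: rank deficiency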

Multiple Linear Regression Output:  Model Predictors Table 

 

See the following Model Predictors table example with three excluded predictors: Opening Theatre, Genre_Romantic, and Studio_IRS.

MLR Note

 

On the Output Navigator, click the Train. Score - Detailed Rep. link to open the Multiple Linear Regression - Prediction of Training Data table. Of primary interest in a data mining context are the predicted and actual values for each record, along with the residual (difference) and the Confidence and Prediction Intervals for each predicted value.

XLMiner produces 95% Confidence and Prediction Intervals for the predicted values. Typically, Prediction Intervals are more widely used, as they give a more realistic range for the predicted value. For a given record, the Confidence Interval is a range for the mean response: with 95% probability, the regression line passes through this interval. The Prediction Interval also takes into account the deviation of an individual response from the mean, so there is a 95% chance that a new observed value will lie within the Prediction Interval.
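For reference, in the usual linear-model notation the two 95% intervals for a record with predictor row x0 take the standard textbook forms (given here for orientation; they are not quoted from XLMiner's documentation):

    Confidence interval: y_hat0 +/- t(0.975, n-k) * s * sqrt( x0' (X'X)^-1 x0 )
    Prediction interval: y_hat0 +/- t(0.975, n-k) * s * sqrt( 1 + x0' (X'X)^-1 x0 )

The extra 1 under the square root in the Prediction Interval accounts for the variability of an individual response around the mean, which is why it is always wider than the Confidence Interval.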

Multiple Linear Regression Output:  Prediction of Training Data 

XLMiner displays the Total sum of squared errors summaries for both the Training and Validation Sets on the MLR_Output worksheet. The summary reports the total sum of squared errors (the sum of squared deviations between the predicted and actual values) and the root mean square error (the square root of the average squared error). The average error is typically very small, because positive prediction errors tend to be counterbalanced by negative ones.
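These summary quantities are easy to reproduce from the actual and predicted columns of the scoring reports; a minimal Python sketch (the array names are placeholders, not XLMiner output names):

    import numpy as np

    def score_summary(actual, predicted):
        errors = actual - predicted
        sse = np.sum(errors ** 2)              # total sum of squared errors
        rmse = np.sqrt(np.mean(errors ** 2))   # root mean square error
        avg_error = np.mean(errors)            # usually near zero: +/- errors offset each other
        return sse, rmse, avg_error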

Multiple Linear Regression Prediction Method Output:  Training and Validation Data Scoring - Summary Report 

Because these options were selected on the Multiple Linear Regression - Advanced Options dialog, a variety of residual and collinearity diagnostics output is available.

On the Output Navigator, click the Variable Selection link to display the Variable Selection table, which lists the models generated using the selections made on the Variable Selection dialog. When Backward Elimination is used, Multiple Linear Regression may stop early when there is no variable eligible for elimination, as evidenced in the table below (i.e., there are no subsets with fewer than 12 coefficients).

Multiple Linear Regression Prediction Method Output:  Variable Selection 

The error values calculated are

RSS: The residual sum of squares, or the sum of squared deviations between the predicted and actual values of the output variable.

Cp: Mallows Cp (Total squared error) is a measure of the error in the best subset model, relative to the error incorporating all variables. Adequate models are those for which Cp is roughly equal to the number of parameters in the model (including the constant), and/or Cp is at a minimum.

R-Squared: R-squared Goodness-of-fit

Adj. R-Squared: Adjusted R-Squared values

Probability is a quasi hypothesis test of the proposition that a given subset is acceptable; if Probability < .05 we can rule out that subset.
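For reference, Mallows Cp is usually written as

    Cp = RSS_p / s2_full + 2*p - n

where RSS_p is the residual sum of squares of the subset model with p coefficients (including the constant), s2_full is the mean squared error of the model containing all variables, and n is the number of training records. This is the standard textbook definition; values of Cp close to p indicate that the subset model introduces little additional bias.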

Compare the RSS value as the number of coefficients in the subset decreases from 13 to 12 (6784.366 to 6811.265). The RSS for 12 coefficients is just slightly higher than the RSS for 13 coefficients suggesting that a model with 12 coefficients may be sufficient to fit a regression.

On the Output Navigator, click the Regress. Model link to display the Regression Model table.

Multiple Linear Regression Prediction Method Output:  The Regression Model 

The Regression Model table contains the coefficient, the standard error of the coefficient, the p-value and the Sum of Squared Error for each variable included in the model. The Sum of Squared Errors is calculated as each variable is introduced in the model, beginning with the constant term and continuing with each variable as it appears in the data set.

Summary statistics (to the above right) show the residual degrees of freedom (#observations - #predictors), the R-squared value, a standard deviation-type measure for the model (whose scaled square follows a chi-square distribution), and the Residual Sum of Squares error.

The R-squared value shown here is defined in terms of deviance (the same form used for logistic regression models):

R2 = (D0 - D)/D0,

where D is the deviance based on the fitted model and D0 is the deviance based on the null model. The null model is defined as the model containing no predictor variables apart from the constant.
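For a linear model fit by least squares, the deviance D reduces to the residual sum of squares and D0 to the total sum of squares about the mean, so this definition agrees with the familiar coefficient of determination: R2 = (D0 - D)/D0 = 1 - RSS/TSS.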

If a variable has been eliminated by Rank-Revealing QR Decomposition, the variable appears in red in the Regression Model table with 0 for the Coefficient, Std. Error, CI Lower, CI Upper, and RSS Reduction, and N/A for the t-Statistic and P-Values. The following example Regression Model table displays the results when three predictors (Opening Theaters, Genre_Romantic Comedy, and Studio_IRS) are eliminated.

MLR Note

On the Output Navigator, click the Collinearity Diags link to display the Collinearity Diagnostics table. This table assesses whether two or more variables so closely track one another as to provide essentially the same information. As you can see, the NOX variable was ignored.

Multiple Linear Regression Prediction Method Output:  Collinearity Diagnostics

The columns represent the variance components (related to principal components in multivariate analysis), while the rows represent the variance proportion decomposition explained by each variable in the model. The eigenvalues are those associated with the singular value decomposition of the variance-covariance matrix of the coefficients, while the condition numbers are the square roots of the ratios of the largest eigenvalue to each of the other eigenvalues. In general, multicollinearity is likely to be a problem with a high condition number (more than 20 or 30), and high variance decomposition proportions (say, more than 0.5) for two or more variables.
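For readers who want to reproduce diagnostics of this kind outside XLMiner, the usual Belsley-Kuh-Welsch computation (column-scaled design matrix followed by a singular value decomposition) can be sketched as follows; this is a generic illustration and may differ in detail from XLMiner's table:

    import numpy as np

    def collinearity_diagnostics(X):
        """Condition indices and variance-decomposition proportions for a design matrix X
        (constant column included), in the Belsley-Kuh-Welsch style."""
        Xs = X / np.linalg.norm(X, axis=0)              # scale each column to unit length
        U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
        cond_index = d.max() / d                        # one condition index per component
        phi = (Vt.T ** 2) / d ** 2                      # phi[k, j]: coefficient k, component j
        proportions = phi / phi.sum(axis=1, keepdims=True)
        return cond_index, proportions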

Lift Charts and RROC Curves (on the MLR_TrainingLiftChart and MLR_ValidationLiftChart worksheets, respectively) are visual aids for measuring model performance. Lift Charts consist of a lift curve and a baseline; the greater the area between the lift curve and the baseline, the better the model. RROC (regression receiver operating characteristic) curves plot the performance of regressors by graphing over-estimations (predicted values that are too high) versus under-estimations (predicted values that are too low). The closer the curve is to the top-left corner of the graph (the smaller the area above the curve), the better the performance of the model.

After the model is built using the Training Set, the model is used to score the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated, and the lift curve is drawn as the number of cases versus the cumulated value. The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the MLR model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, taking the most expensive predicted housing prices in the data set, the predictive performance of the model is about 1.7 times better than simply assigning a random predicted value.

In an RROC curve, we can compare the performance of a regressor with that of a random guess (red line) for which over-estimations are equal to under-estimations. Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top left of the graph, at the intersection of the x and y axes. This point is sometimes referred to as the perfect classification. The Area Over the Curve (AOC) is the space in the graph that appears above the RROC curve and is calculated using the formula sigma^2 * n^2 / 2, where n is the number of records. The smaller the AOC, the better the performance of the model. In this example, we see that the area above the curve, or the AOC, is fairly small for both data sets, which indicates that this model is a good fit to the data.
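Assuming sigma^2 in the formula above denotes the variance of the prediction errors (as in the RROC literature), the AOC can be estimated directly from the scored data; a minimal Python sketch, not an XLMiner routine:

    import numpy as np

    def rroc_aoc(actual, predicted):
        errors = predicted - actual
        n = len(errors)
        return np.var(errors) * n ** 2 / 2    # area over the RROC curve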

  Multiple Linear Regression Prediction Method Output:  Training Data Lift Chart

Multiple Linear Regression Prediction Method Output:  Validation Data Lift Chart

For information on the MLR_Stored worksheet, see the Scoring New Data section.