This example illustrates the use of XLMiner's k-Nearest Neighbors Prediction method. On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples to open the Boston_Housing.xlsx example data set. This data set contains the 14 variables described in the table below. The dependent variable MEDV is the median value of owner-occupied homes, in thousands of dollars. The objective of this example is to predict the value of this variable.

CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centers
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      Median value of owner-occupied homes in $1000's

A portion of the data set is shown below. The last variable, CAT. MEDV, is a discrete classification of the MEDV variable, and will not be used in this example.

Boston_Housing.xlsx Example Dataset

Partition the data into Training and Validation Sets using the Standard Data Partition defaults, with percentages of 60% of the data randomly allocated to the Training Set, and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a data set, see the Data Mining Partition section.
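Outside of XLMiner, an equivalent 60/40 random split can be sketched in Python with scikit-learn. This is an illustration only; the file name Boston_Housing.csv is a hypothetical export of the workbook, and the rows selected will differ from XLMiner's own random partition.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the example data (assumes the workbook has been exported to a
# hypothetical "Boston_Housing.csv"; column names follow the table above).
data = pd.read_csv("Boston_Housing.csv")
X = data.drop(columns=["MEDV", "CAT. MEDV"])   # input variables
y = data["MEDV"]                               # output variable

# 60% of rows randomly allocated to the Training Set, 40% to the Validation Set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.60, random_state=12345)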

On the XLMiner ribbon, from the Data Mining tab, select Partition - Standard Partition to open the Standard Data Partition dialog.

Standard Data Partition

Select a cell on the Data_Partition worksheet.

On the XLMiner ribbon, from the Data Mining tab, select Predict - k-Nearest Neighbors to open the k-Nearest Neighbors Prediction - Step 1 of 2 dialog.

k-Nearest Neighbors Prediction - Step 1 of 2 Dialog

At Output Variable, select MEDV, then move the remaining variables (except CAT. MEDV) to the Selected Variables list.

Click Next to advance to the Step 2 of 2 dialog.

k-Nearest Neighbors Prediction - Step 2 of 2 Dialog

Select Normalize Input data. When this option is selected, the input data is normalized: each variable is expressed in terms of standard deviations from its mean. This option is available to ensure that the distance measure is not dominated by variables with a large scale.
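As a rough illustration of this standardization (not XLMiner's internal implementation), each input column can be centered and scaled by its standard deviation before distances are computed, continuing the partition sketch above:

from sklearn.preprocessing import StandardScaler

# Standardize each input column (subtract the mean, divide by the standard
# deviation) so that no variable dominates the distance purely by scale,
# e.g. TAX (in the hundreds) versus NOX (below 1).
scaler = StandardScaler().fit(X_train)     # statistics from the Training Set only
X_train_std = scaler.transform(X_train)
X_valid_std = scaler.transform(X_valid)    # reuse the training mean and std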

For Number of Nearest Neighbors (k), enter 5. This is the parameter k in the k-nearest neighbors algorithm. If the number of observations (rows) is less than 50, the value of k should be between 1 and the total number of observations; if the number of rows is greater than 50, the value of k should be between 1 and 50.

Under Scoring Option, select Score on best k between 1 and specified value. XLMiner displays the output for the best k between 1 and 5. If Score on specified value of k as above is selected, the output is displayed for the specified value of k.
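The sketch below mimics this scoring option by fitting a k-nearest neighbors regressor for each k from 1 to 5 and keeping the k with the lowest validation RMS error. It continues the earlier sketches; XLMiner's tie-breaking and distance handling may differ in detail.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Try k = 1..5 and keep the k with the smallest validation RMS error,
# mirroring "Score on best k between 1 and specified value".
best_k, best_rmse = None, np.inf
for k in range(1, 6):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train_std, y_train)
    rmse = mean_squared_error(y_valid, knn.predict(X_valid_std)) ** 0.5
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse
print(best_k, best_rmse)   # XLMiner reports k = 4 as the best k in this example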

Under both Score Training Data and Score Validation Data, Summary Report is selected by default. Select Detailed Report and Lift Charts under both groups to obtain a detailed assessment of the model's performance in predicting the Training and Validation Sets.

The options under Score Test Data group are enabled only when a test partition is available. Since we did not create a test data set when we partitioned the data, these options are disabled. See the Data Mining Partition section for more information on how to create a test data set.

XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partitioning Options on the Step 2 of 2 dialog. If this option is selected, XLMiner partitions the data set immediately before running the prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, see the Data Mining Partition section.

Click Finish. Worksheets containing the output of the method are inserted to the right of the Data_Partition worksheet. The KNNP_Output worksheet contains the Output Navigator, which allows easy access to all portions of the output.

k-Nearest Neighbors Prediction Method Output Navigator

Scroll down the KNNP_Output worksheet to the Validation error log for different k (shown below). As specified, XLMiner has calculated the RMS error for each value of k from 1 to 5 and has indicated the value of k with the smallest RMS error.

k-Nearest Neighbors Prediction Method Output:  Validation Error Log for Different k

Scroll down the page to view the Summary Report. This report summarizes the prediction error. The first number, the total sum of squared errors, is the sum of the squared deviations (residuals) between the predicted and actual values. The second, the RMS error, is the square root of the average of the squared residuals. The third is the average error, the mean of the residuals. All of these values are calculated for the best k (k = 4).

k-Nearest Neighbors Prediction Method Output:  Training/Validation Data Scoring - Summary Reports
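Continuing the earlier sketches, the three Summary Report statistics can be reproduced directly from the validation residuals of the best-k model; this is an illustration, not the XLMiner report itself.

# Refit with the best k and compute the three Summary Report statistics
# for the Validation Set.
knn = KNeighborsRegressor(n_neighbors=best_k).fit(X_train_std, y_train)
residuals = y_valid - knn.predict(X_valid_std)

sse       = (residuals ** 2).sum()            # total sum of squared errors
rms_error = (residuals ** 2).mean() ** 0.5    # root of the average squared residual
avg_error = residuals.mean()                  # average error (mean residual)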

On the Output Navigator, select the Valid. Score - Detailed Rpt. to display the Prediction of Validation Data table. For each record, this table displays the predicted value, the actual value, and the difference between them (the residuals).

k-Nearest Neighbors Prediction Method Output:  Prediction of Validation Data (for k = 4)
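A comparable record-by-record table can be assembled as follows (an illustrative sketch continuing the snippets above, not XLMiner's output worksheet):

# Record-by-record table: predicted value, actual value, and residual.
detailed = pd.DataFrame({
    "Predicted": knn.predict(X_valid_std),
    "Actual": y_valid.to_numpy(),
})
detailed["Residual"] = detailed["Actual"] - detailed["Predicted"]
print(detailed.head())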

On the Output Navigator, click the Training Lift Chart and Validation Lift Chart links to display the Training and Validation Data Lift Charts and RROC Curves, respectively.

Lift Charts and RROC Curves are visual aids for measuring model performance. A Lift Chart consists of a lift curve and a baseline; the greater the area between the lift curve and the baseline, the better the model. RROC (regression receiver operating characteristic) curves plot the performance of a regressor by graphing total over-estimation (predicted values that are too high) against total under-estimation (predicted values that are too low). The closer the curve is to the top-left corner of the graph (the smaller the area above the curve), the better the performance of the model.

After the model is built using the Training Set, it is used to score the Training Set and the Validation Set (if one exists). Each data set is then sorted by the predicted value of the output variable. After sorting, the actual values of the output variable are cumulated, and the lift curve is drawn as the number of cases versus the cumulated actual value. The baseline (the red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of the actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. The bars in this chart indicate the factor by which the model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, which contains the most expensive predicted housing prices in the data set, the predictive performance of the model is almost 2 times better than simply assigning a random predicted value.
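The decile-wise calculation can be sketched as follows, using the common construction of comparing each decile's mean actual value with the overall mean; XLMiner's exact calculation may differ in detail. The sketch continues from the detailed-report table above.

# Decile-wise lift: sort records by predicted value (highest first), split
# into ten deciles, and compare each decile's mean actual value with the
# overall mean actual value.
ranked = detailed.sort_values("Predicted", ascending=False).reset_index(drop=True)
ranked["Decile"] = pd.qcut(ranked.index, 10, labels=range(1, 11))
decile_lift = ranked.groupby("Decile")["Actual"].mean() / ranked["Actual"].mean()
# For the Validation Set in this example, the first decile's value is close
# to 2, matching the "almost 2 times better" statement above.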

In an RROC curve, we can compare the performance of a regressor with that of a random guess (the red line), for which under-estimations equal over-estimations, shifted to the minimum under-estimation. Anything to the left of this line signifies a better prediction, and anything to the right signifies a worse prediction. The best possible performance is denoted by a point at the top left of the graph, at the intersection of the x and y axes; this point is sometimes referred to as a perfect prediction. The Area Over the Curve (AOC) is the space in the graph that appears above the RROC curve and is calculated using the formula sigma^2 * n^2 / 2, where sigma^2 is the variance of the errors and n is the number of records. The smaller the AOC, the better the performance of the model. In this example, the area above the curve for the Training Set is 0, which indicates a perfect fit with zero prediction errors. The AOC for the Validation Set is a bit higher, which means the model may be overfit to the training data.
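For reference, the RROC curve points and the AOC can be computed from the residuals as sketched below, following the shift-based construction described above and continuing the earlier snippets; the scaling of XLMiner's chart may differ.

# RROC curve points and Area Over the Curve (AOC) from the residuals.
errors = (detailed["Predicted"] - detailed["Actual"]).to_numpy()
n = len(errors)

# Shifting all predictions by a constant s traces the curve: each shift gives
# total over-estimation (x-axis) and total under-estimation (y-axis, negative).
shifts = np.sort(-errors)
over  = np.array([np.maximum(errors + s, 0).sum() for s in shifts])
under = np.array([np.minimum(errors + s, 0).sum() for s in shifts])

# AOC per the formula quoted above: variance of the errors times n^2 / 2.
aoc = errors.var() * n ** 2 / 2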

k-Nearest Neighbors Prediction Method Output:  Training Data Lift Charts

k-Nearest Neighbors Prediction Method Output:  Validation Data Lift Charts

For information on the KNNP_Stored worksheet, see the Scoring New Data section.