This example illustrates the use of XLMiner's k-Nearest Neighbors Prediction method. On the XLMiner ribbon, from the **Applying Your Model** tab, select **Help - Examples**, then **Forecasting/Data Mining Examples** to open the** Boston_Housing.xlsx** example data set. This data set contains 14 variables described in the table below. The dependent variable MEDV is the median value of a dwelling. The objective of this example is to predict the value of this variable.

CRIM |
Per capita crime rate by town |

ZN |
Proportion of residential land zoned for lots over 25,000 sq.ft. |

INDUS |
Proportion of non-retail business acres per town |

CHAS |
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |

NOX |
Nitric oxides concentration (parts per 10 million) |

RM |
Average number of rooms per dwelling |

AGE |
Proportion of owner-occupied units built prior to 1940 |

DIS |
Weighted distances to five Boston employment centers |

RAD |
Index of accessibility to radial highways |

TAX |
Full-value property-tax rate per $10,000 |

PTRATIO |
Pupil-teacher ratio by town |

B |
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |

LSTAT |
% Lower status of the population |

MEDV |
Median value of owner-occupied homes in $1000's |

A portion of the data set is shown below. The last variable, CAT. MEDV, is a discrete classification of the MEDV variable, and will not be used in this example.

Partition the data into Training and Validation Sets using the Standard Data Partition defaults, with percentages of 60% of the data randomly allocated to the Training Set, and 40% of the data randomly allocated to the Validation Set. For more information on partitioning a data set, see the Data Mining Partition section.

On the XLMiner ribbon, from the **Data Mining** tab, select **Partition - Standard Partition** to open the *Standard Data Partition* dialog.

Select a cell on the **Data_Partition** worksheet.

On the XLMiner ribbon, from the **Data Mining** tab, select **Predict - k-Nearest Neighbors** to open the *k-Nearest Neighbors Prediction - Step 1 of 2* dialog.

At Output Variable, select MEDV, and from the **Selected Variables** list, select the remaining variables (except CAT. MEDV).

Click **Next** to advance to the *Step 2 of 2 *dialog.

Select Normalize Input data. When this option is selected, the input data is normalized, which means that all data is expressed in terms of standard deviations. This option is available to ensure that the distance measure is not dominated by variables with a large scale.

For Number of Nearest Neighbors (k), enter 5. This is the parameter k in the k-nearest neighbor algorithm. If the number of observations (rows) is less than 50 then the value of k should be between 1 and the total number of observations (rows). If the number of rows is greater than 50, then the value of k should be between 1 and 50.

Under Scoring Option, select** **Score on best k between 1 and specified value. XLMiner displays the output for the best k between 1 and 5. If Score on specified value of k as above is selected, the output is displayed for the specified value of k.

Under both Score Training Data and Score Validation Data, Summary Report is selected. Under both Score Training Data and Score Validataion Data, select Detailed Report, and Lift Charts show an assessment of the performance in predicting the Training Set.

The options under Score Test Data group are enabled only when a test partition is available. Since we did not create a test data set when we partitioned the data, these options are disabled. See the Data Mining Partition section for more information on how to create a test data set.

XLMiner V2015 provides the ability to partition a data set from within a classification or prediction method by selecting Partitioning Options on the *Step 2 of 3 *dialog. If this option is selected, XLMiner partitions data set immediately before running the prediction method. If partitioning has already occurred on the data set, this option is disabled. For more information on partitioning, see the Data Mining Partition section.

Click **Finish**. Worksheets containing the output of the method are inserted to the right of the **Data_Partition** worksheet. The **KNNP_Output **worksheet contains the Output Navigator, which allows easy access to all portions of the output.

Scroll down the **KNNP_Output **worksheet to the **Validation error log for different k **(shown below). As per our specifications, XLMiner has calculated the RMS error for all values of k, and denoted the value of k with the smallest RMS Error.

Scroll down the page to view the **Summary Report**. This report summarizes the prediction error. The first number, the total sum of squared errors, is the sum of the squared deviations (residuals) between the predicted and actual values. The second is the square root of the average of the squared residuals. The third is the average deviation. All these values are calculated for the best k (k=4).

On the Output Navigator, select the **Valid. Score - Detailed Rpt.** to display the **Prediction of Validation Data** table. For each record, this table displays the predicted value, the actual value, and the difference between them (the residuals).

On the Output Navigator, click the **Training Lift Chart** and **Validation Lift Chart** links to display the Training and Validation Data Lift Charts and ROC Curves, respectively.

Lift Charts and RROC Curves are visual aids for measuring model performance. Lift Charts consist of a lift curve and a baseline. The greater the area between the lift curve and the baseline, the better the model. RROC (regression receiver operating characteristic) curves plot the performance of regressors by graphing over-estimations (or predicted values that are too high) versus underestimations (or predicted values that are too low.) The closer the curve is to the top-left corner of the graph (the smaller the area above the curve), the better the performance of the model.

After the model is built using the Training Set, the model is used to score on the Training Set and the Validation Set (if one exists). Then the data set(s) are sorted using the predicted output variable value. After sorting, the actual outcome values of the output variable are cumulated and the lift curve is drawn as the number of cases versus the cumulated value. The baseline (red line connecting the origin to the end point of the blue line) is drawn as the number of cases versus the average of actual output variable values multiplied by the number of cases. The decile-wise lift curve is drawn as the decile number versus the cumulative actual output variable value divided by the decile's mean output variable value. This bars in this chart indicate the factor by which the MLR model outperforms a random assignment, one decile at a time. Refer to the validation graph below. In the first decile, taking the most expensive predicted housing prices in the dataset, the predictive performance of the model is almost 2 times better as simply assigning a random predicted value.

In an RROC curve, we can compare the performance of a regressor with that of a random guess (red line) for which under-estimations are equal to over-estimations shifted to the minimum under estimate. Anything to the left of this line signifies a better prediction and anything to the right signifies a worse prediction. The best possible prediction performance would be denoted by a point at the top left of the graph at the intersection of the x and y axis. This point is sometimes referred to as the perfect classification. Area Over the Curve (AOC) is the space in the graph that appears above the ROC curve and is calculated using the formula: sigma2 * n2/2 where n is the number of records The smaller the AOC, the better the performance of the model. In this example we see that the area above the curve in the Training Set is 0, which indicates a perfect classification or zero misclassification errors. The AOC for the Validation Set is a bit high, which means the model may be overfit to the training data.

For information on the **KNNP_Stored** worksheet, see the Scoring New Data section.