k-Nearest Neighbors (k-NN) Prediction

Introduction

In the k-Nearest Neighbors prediction (kNNP) method, the Training Set is used to predict the value of a variable of interest for each member of a target data set. The data generally consist of a variable of interest (e.g., amount purchased) and a number of additional predictor variables (e.g., age, income, location). The procedure is as follows.

1. For each row (case) in the target data set (the set to be predicted), locate the k closest members (the k nearest neighbors) of the Training Set. A Euclidean Distance measure is used to calculate how close each member of the Training Set is to the target row that is being examined.

2. Compute the prediction as the weighted sum of the variable of interest over the k nearest neighbors, where the weights are the inverse of the distances, normalized to sum to 1.

3. Repeat this procedure for the remaining rows (cases) in the target set.

4. Analytic Solver Data Science also allows the user to select a maximum value for k. It then builds models in parallel for all values of k up to the specified maximum and performs scoring with the best of these models.
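
Steps 1 through 3 can be illustrated with a minimal Python sketch. It is not the Analytic Solver Data Science implementation: the function name `knn_predict`, the `eps` tolerance, and the handling of exact matches (distance zero) are assumptions made for illustration. The weights are normalized to sum to 1 so that the prediction stays on the scale of the variable of interest, which is also an assumption here.

```python
import math

def knn_predict(train_X, train_y, target_row, k, eps=1e-12):
    """Predict one value from the k nearest training rows.

    Hypothetical helper: inverse-distance-weighted average of the
    variable of interest over the k closest Training Set rows.
    """
    # Step 1: Euclidean distance from the target row to every training row
    dists = [
        (math.dist(row, target_row), y)
        for row, y in zip(train_X, train_y)
    ]
    # Keep the k closest training rows
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]

    # Assumption: if any neighbor coincides with the target row, return the
    # mean of the coinciding rows' values instead of dividing by zero
    exact = [y for d, y in nearest if d <= eps]
    if exact:
        return sum(exact) / len(exact)

    # Step 2: weights are the inverse of the distances, normalized to sum to 1
    weights = [1.0 / d for d, _ in nearest]
    weighted = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted / sum(weights)

def knn_predict_all(train_X, train_y, target_X, k):
    # Step 3: repeat the procedure for every row in the target set
    return [knn_predict(train_X, train_y, row, k) for row in target_X]
```

For example, with training rows `[[0], [3]]` and values `[0, 30]`, a target row `[1]` with k = 2 has distances 1 and 2, weights 1 and 0.5, and a prediction of (1 × 0 + 0.5 × 30) / 1.5 = 10.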

Computing time increases as k increases, but higher values of k provide smoothing that reduces vulnerability to noise in the Training Set. Typically, k is in the tens rather than in the hundreds or thousands.
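
The search over k described in step 4 might be sketched as follows. The source does not state how the "best" model is chosen, so this sketch assumes the lowest root-mean-square error on a separate validation set. The names `best_k` and `knn_predict` are hypothetical, and the loop runs sequentially rather than in parallel.

```python
import math

def knn_predict(train_X, train_y, row, k):
    # Inverse-distance-weighted average over the k nearest training rows
    nearest = sorted(
        (math.dist(x, row), y) for x, y in zip(train_X, train_y)
    )[:k]
    # Assumption: clamp tiny distances so exact matches do not divide by zero
    weights = [1.0 / max(d, 1e-12) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def best_k(train_X, train_y, valid_X, valid_y, max_k):
    """Return (best k, {k: validation RMSE}) for k = 1..max_k."""
    scores = {}
    for k in range(1, max_k + 1):
        preds = [knn_predict(train_X, train_y, row, k) for row in valid_X]
        mse = sum((p - y) ** 2 for p, y in zip(preds, valid_y)) / len(valid_y)
        scores[k] = math.sqrt(mse)
    # Assumption: "best" means lowest validation RMSE; ties go to the smaller k
    return min(scores, key=scores.get), scores

# Illustrative data only: one outlying training row at x = 10
train_X = [[0], [1], [2], [3], [10]]
train_y = [0, 1, 2, 3, 50]
k_star, rmse_by_k = best_k(train_X, train_y, [[1.5]], [1.5], max_k=5)
```

Comparing the entries of `rmse_by_k` shows the trade-off between noise sensitivity at small k and over-smoothing (or contamination by distant rows) at large k.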