k-Nearest Neighbors Classification Method

Introduction

In the k-Nearest Neighbors classification method, the Training Set is used to classify each member of a target data set. The data consist of a categorical classification variable (e.g., buyer or non-buyer) and a number of predictor variables (e.g., age, income, location).

1. For each row (case) in the target data set (the set to be classified), locate its k nearest neighbors in the Training Set. Euclidean distance is used to measure how close each member of the Training Set is to the target row being examined.

2. Examine the k nearest neighbors: to which classification (category) do most of them belong? Assign this category to the row being examined.

3. Repeat this procedure for the remaining rows (cases) in the target set.

4. Analytic Solver Data Science allows the user to select a maximum value for k, and builds models in parallel on all values of k up to the maximum specified value. Additional scoring can be performed on the best of these models.
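
Steps 1 through 3, and the idea in step 4 of fitting every k up to a maximum, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Analytic Solver Data Science implementation: the function names (euclidean, knn_classify, classify_for_all_k), the tie-breaking rule, and the toy buyer/non-buyer data are all hypothetical, and the loop over k runs sequentially rather than in parallel.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two rows of numeric predictors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_X, train_y, row, k):
    """Classify one target row by majority vote among its k nearest neighbors."""
    # Step 1: rank Training Set rows by distance to the target row.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], row))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Step 2: majority vote. Assumption: a tie goes to the tied class
    # whose member is closest to the target row.
    counts = Counter(nearest_labels)
    top = max(counts.values())
    for label in nearest_labels:
        if counts[label] == top:
            return label


def classify_for_all_k(train_X, train_y, target_X, max_k):
    """Step 3 for every k from 1 to max_k: classify all target rows."""
    return {
        k: [knn_classify(train_X, train_y, row, k) for row in target_X]
        for k in range(1, max_k + 1)
    }


# Hypothetical toy data: (age, income in thousands).
train_X = [(25, 40), (30, 45), (28, 42), (50, 90), (55, 95), (48, 85)]
train_y = ["non-buyer", "non-buyer", "non-buyer", "buyer", "buyer", "buyer"]
target_X = [(27, 43), (52, 92)]

results = classify_for_all_k(train_X, train_y, target_X, max_k=3)
print(results[3])  # ['non-buyer', 'buyer']
```

Keeping the results for every k in one dictionary makes it easy to compare the candidate models afterward and pick one for further scoring.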

As k increases, computing time increases. However, a larger k makes the classification less sensitive to noise in individual Training Set rows, which may offset the increased time requirement. In most applications, k is in the tens rather than in the hundreds or thousands.
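
The effect of k on sensitivity to noise can be seen in a small sketch. The data below are hypothetical: a single predictor, two classes, and one Training Set row near 1.5 that is deliberately given the "wrong" label B. The function name predict is likewise an assumption for illustration, and odd values of k are used so that no ties arise between the two classes.

```python
import math
from collections import Counter


def predict(train, row, k):
    """Majority vote among the k Training Set rows closest to `row`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], row))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical Training Set: class A clusters near 1, class B near 5.
# The row at 1.5 labeled "B" plays the role of a noisy observation.
train = [
    ((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"), ((1.6,), "A"),
    ((1.5,), "B"),
    ((5.0,), "B"), ((5.2,), "B"), ((5.4,), "B"),
]

row = (1.52,)
by_k = {k: predict(train, row, k) for k in (1, 3, 5)}
print(by_k)  # {1: 'B', 3: 'A', 5: 'A'}
```

With k = 1 the noisy row alone decides the outcome; with k = 3 or k = 5 it is outvoted by its neighbors from class A.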