This example illustrates partitioning with oversampling in XLMiner. On the XLMiner ribbon, from the Applying Your Model tab, select Help - Examples, then Forecasting/Data Mining Examples to open the example workbook, Catalog_multi.xlsx. Click any cell within the data set. This sample data set, published by the Direct Marketing Educational Foundation, records responses to a direct mail offer. The output variable is Target dependent variable:buyer(yes=1). Because the success rate for this variable is less than 1%, XLMiner's oversampling utility will be used to build a Training Set with a 50% success rate.
On the XLMiner ribbon, from the Data Mining tab, select Partition - Partition with Oversampling to open the Partition With Oversampling dialog.
Confirm that Data range displays $A$1:$V$58206. If not, click in the Data range field and enter the correct range.
Select all variables in the Variables list, then click > to move all variables to the Variables in the Partition Data list. Highlight Target dependent variable: buyer(yes = 1) in the Variables in the Partition Data list, then next to Output variable, click < to designate this variable as the Output variable. Reminder: this output variable is limited to two classes (i.e., 0/1 or yes/no).
At Specify % validation data to be taken away as test data, enter 50. Click OK to partition the data.
The Data_Partition worksheet is inserted to the right of the Data worksheet.
The Output variable (Target dependent variable: buyer (yes = 1)) contains 576 success records, or 1s. The percentage of success records in the original data set is 0.9896%, or 576/58,204 (number of successes/number of total rows in the original data set). In the Partition with Oversampling dialog, 50% was specified for both Specify % success in Training Set and Specify % validation data to be taken away as test data. As a result, XLMiner has randomly allocated 50% of the successes (the 1s) to the Training Set and the remaining 50% to the Validation Set (i.e., there are 288 successes in the Training Set and 288 successes in the Validation Set). To complete the Training Set at a 50% success rate, XLMiner randomly selected 288 non-successes (0s). The Training Set therefore has 576 rows (288 1s + 288 0s).
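The Training Set construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not XLMiner's implementation: the function name oversample_training_rows and its parameters success_share and successes_to_training are hypothetical, and the labels list is synthetic, built only to match the row counts reported in this example.

```python
import random

def oversample_training_rows(labels, success_share=0.5,
                             successes_to_training=0.5, seed=0):
    """Return sorted row indices for a training set in which 1s make up
    `success_share` of the rows (hypothetical helper, not an XLMiner API)."""
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]

    # Send the requested share of all successes to the training set.
    n_ones = round(len(ones) * successes_to_training)
    train_ones = rng.sample(ones, n_ones)

    # Add just enough non-successes to reach the requested success share.
    n_zeros = round(n_ones * (1 - success_share) / success_share)
    train_zeros = rng.sample(zeros, n_zeros)

    return sorted(train_ones + train_zeros)

# Synthetic labels: 576 successes among 58,204 rows, as in this example.
labels = [1] * 576 + [0] * (58204 - 576)
train = oversample_training_rows(labels)
print(len(train), sum(labels[i] for i in train))
```

With the defaults, the sketch selects 288 of the 576 successes and matches them with 288 randomly chosen non-successes, giving 576 training rows.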
The output above shows that the % Success in original data set is 0.9896%. XLMiner maintains this percentage in the Validation Set by allocating as many 0s as needed. Since 288 successes (1s) have already been allocated to the Validation Set, 28,814 non-successes (0s) must be added, bringing the Validation Set to 29,102 rows and maintaining the 0.9896% ratio (288/29,102).
Because 50% of the validation data was specified to be taken away as test data, XLMiner has allocated half of the validation records to the Test Set. The Validation Set and the Test Set each contain 14,551 rows.
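The row counts in this example can be checked with a few lines of arithmetic. The sketch below assumes only the figures stated in the text (576 successes, 58,204 rows, 50% for both settings); the variable names are illustrative. The non-success count it computes is for the Validation Set before half of its records are moved to the Test Set.

```python
total_rows = 58204   # rows in the original data set
successes = 576      # records with buyer = 1

# Success rate in the original data, as a percentage.
original_rate_pct = round(100 * successes / total_rows, 4)

# Half of the successes go to the Validation Set.
validation_successes = successes // 2

# Rows needed so that the Validation Set keeps the original success rate.
validation_rows_before_split = round(validation_successes * total_rows / successes)
validation_zeros_before_split = validation_rows_before_split - validation_successes

# Half of the validation records are then taken away as test data.
test_rows = validation_rows_before_split // 2
validation_rows = validation_rows_before_split - test_rows

print(original_rate_pct)              # 0.9896
print(validation_rows_before_split)   # 29102
print(validation_zeros_before_split)  # 28814
print(validation_rows, test_rows)     # 14551 14551
```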