Ensemble Method Options

The following options appear on the Bagging, Boosting, and Random Trees Data dialogs.

Ensemble Method Classification Dialog, Data tab

Please see below for options appearing on the Ensemble Methods- Data tab.

Boosting Classification dialog, Data tab

Variables In Input Data

The variables included in the dataset appear here.

Selected Variables

Variables selected to be included in the output appear here.

Categorical Variables

Place categorical variables from the Variables listbox to be included in the model by clicking the > command button. Ensemble Methods will accept non-numeric categorical variables.

Output Variable

The dependent variable or the variable to be classified appears here.

Number of Classes

Displays the number of classes in the Output variable.

Success Class

This option is selected by default. Click the drop down arrow to select the value to specify a “success”. This option is only enabled when the # of classes is equal to 2.

Success Probability Cutoff

Enter a value between 0 and 1 here to denote the cutoff probability for success. If the calculated probability for success for an observation is greater than or equal to this value, than a “success” (or a 1) will be predicted for that observation. If the calculated probability for success for an observation is less than this value, then a “non-success” (or a 0) will be predicted for that observation. The default value is 0.5. This option is only enabled when the # of classes is equal to 2.

Boosting Classification Dialog, Parameters tab

Please see below for options appearing on the Boosting – Parameters tab.

Partition Data

Analytic Solver Data Science includes the ability to partition a dataset from within a classification or prediction method by clicking Partition Data on the Parameters tab. Click Partition Data to open the Partitioning dialog. Analytic Solver Data Science will partition your dataset (according to the partition options you set) immediately before running the classification method. If partitioning has already occurred on the dataset, this option will be disabled. For more information on partitioning, please see the Data Science Partitioning chapter.

On-the-fly Partitioning Dialog

Rescale Data

Use Rescaling to normalize one or more features in your data during the data preprocessing stage. Analytic Solver Data Science provides the following methods for feature scaling: Standardization, Normalization, Adjusted Normalization and Unit Norm. For more information on this new feature, see the Rescale Continuous Data section within the Transform Continuous Data chapter that occurs earlier in this guide.

On-the-fly Rescaling Dialog

Analytic Solver Data Mining: Notes on Rescaling and Simulation functionality

Number of Weak Learners

This option controls the number of "weak" classification models that will be created. The ensemble method will stop when the number or classification models created reaches the value set for this option. The algorithm will then compute the weighted sum of votes for each class and assign the "winning" classification to each record.

Weak Learner

Under Ensemble: Classification click the down arrow beneath Weak Leaner to select one of the six featured classifiers: Discriminant Analysis, Logistic Regression, k-NN, Naïve Bayes, Neural Networks, or Decision Trees. After a weak learner is chosen, the command button to the right will be enabled. Click this command button to control various option settings for the weak leaner.

AdaBoost Variant

The difference in the algorithms is the way in which the weights assigned to each observation or record are updated. (Please refer to the section Ensemble Methods in the Introduction to the chapter.)

In AdaBoost.M1 (Freund), the constant is calculated as:

In AdaBoost.M1 (Breiman), the constant is calculated as:

In SAMME, the constant is calculated as:

(When the number of categories is equal to 2, SAMME behaves the same as AdaBoost Breiman.)

Random Seed for Resampling

If an integer value appears for Random Seed for Resampling, Analytic Solver Data Science will use this value to specify the seed for random resampling of the training data for each weak learner. Setting the random number seed to a nonzero value (any number of your choice is OK) ensures that the same sequence of random numbers is used each time the dataset is chosen for the classifier. The default value is “12345”. If left blank, the random number generator is initialized from the system clock, so the sequence of random numbers will be different in each calculation. If you need the results from successive runs of the algorithm to another to be strictly comparable, you should set the seed. To do this, type the desired number you want into the box. This option accepts both positive and negative integers with up to 9 digits.

Show Weak Learner

To display the weak learner models in the output, select Show Weak Learner Models.

Bagging Classification Dialog, Parameters Dialog

Please see below for options that are unique to the Bagging -- Parameters tab. For remaining option explanations, please see above.

Random Seed for Bootstrapping

If an integer value appears for Bootstrapping Random seed, Analytic Solver Data Science will use this value to set the bootstrapping random number seed. Setting the random number seed to a nonzero value (any number of your choice is OK) ensures that the same sequence of random numbers is used each time the dataset is chosen for the classifier. The default value is "12345". If left blank, the random number generator is initialized from the system clock, so the sequence of random numbers will be different in each calculation. If you need the results from successive runs of the algorithm to another to be strictly comparable, you should set the seed. To do this, type the desired number you want into the box. This option accepts both positive and negative integers with up to 9 digits.

Random Trees Classification Dialog, Parameters Dialog

Please see below for options that are unique to the Random Trees -- Parameters dialog. For remaining option explanations, please see above.

Number of Randomly Selected Features

The Random Trees ensemble method works by training multiple "weak" classification trees using a fixed number of randomly selected features then taking the mode of each class to create a "strong" classifier. The option Number of randomly selected features controls the fixed number of randomly selected features in the algorithm. The default setting is 3.

Random Seed for Featured Selection

If an integer value appears for Feature Selection Random seed, Analytic Solver Data Science will use this value to set the feature selection random number seed. Setting the random number seed to a nonzero value (any number of your choice is OK) ensures that the same sequence of random numbers is used each time the dataset is chosen for the classifier. The default value is "12345". If left blank, the random number generator is initialized from the system clock, so the sequence of random numbers will be different in each calculation. If you need the results from successive runs of the algorithm to another to be strictly comparable, you should set the seed. To do this, type the desired number you want into the box. This option accepts both positive and negative integers with up to 9 digits.

Random Trees Classification Dialog, Scoring Dialog

Please see below for options that are unique to the Ensemble Methods -- Scoring dialog.

Score Training Data

Select these options to show an assessment of the performance of the Ensemble Methods in classifying the training data. The report is displayed according to your specifications - Detailed, Summary, Frequency and Lift charts. Lift charts are only available when the Output Variable contains 2 categories.

When Frequency Chart is selected, a frequency chart will be displayed when the <EnsembleMethod>_TrainingScore worksheet is selected. This chart will display an interactive application similar to the Analyze Data feature, explained in detail in the Analyze Data chapter that appears earlier in this guide. This chart will include frequency distributions of the actual and predicted responses individually, or side-by-side, depending on the user’s preference, as well as basic and advanced statistics for variables, percentiles, six sigma indices.

Score Validation Data

These options are enabled when a validation data set is present. Select these options to show an assessment of the performance of the Ensemble Method algorithm in classifying the validation data. The report is displayed according to your specifications - Detailed, Summary, Frequency or Lift charts. Lift charts are only available when the Output Variable contains 2 categories. When Frequency Chart is selected, a frequency chart (described above) will be displayed when the <EnsembleMethod>_ValidationScore worksheet is selected.

Score Test Data

These options are enabled when a test set is present. Select these options to show an assessment of the performance of the Ensemble Method in classifying the test data. The report is displayed according to your specifications - Detailed, Summary, Frequency and Lift charts Lift charts are only available when the Output Variable contains 2 categories. When Frequency Chart is selected, a frequency chart (described above) will be displayed when the <EnsembleMethod>_TestScore worksheet is selected.

Score New Data

See the Scoring chapter within the Analytic Solver Data Science User Guide for more information on the options located in the Score Test Data and Score New Data groups.

Ensemble Method Classification Dialog, Simulation Dialog

All supervised algorithms include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described earlier in this guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, <EnsembleMethod>_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present. In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type.

Evaluation: Select Calculate Expression to amend an Expression column onto the frequency chart displayed on the <EnsembleMethod>_Simulation output tab. Expression can be any valid Excel formula that references a variable and the response as [@COLUMN_NAME]. Click the Expression Hints button for more information on entering an expression.

Options for Ensemble Methods