Two examples are used in this section to illustrate how to use Hierarchical Clustering.  The first example uses Raw Data and the second example uses a distance matrix. 

Hierarchical Cluster Using Raw Data

The utilities.xlsx example dataset (shown below) holds corporate data on 22 US public utilities.  This example illustrates how to use Analytic Solver Data Science to perform a cluster analysis with hierarchical clustering. 

Open this example by clicking Help – Example Models -- Forecasting/Data Science Examples – Utilities.

Each record includes eight measurements (variables X1 through X8).  Before hierarchical clustering is applied, the data will be "normalized", or "standardized".  A popular method for normalizing continuous variables is to divide each variable by its standard deviation.  After the variables are standardized, the distance between clusters can be computed using the Euclidean metric.
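The standardization and distance computation described above can be sketched in Python. This is a conceptual illustration only: the two records below are made up (not taken from the Utilities.xlsx dataset), and Analytic Solver Data Science performs these steps internally.

```python
import numpy as np

# Two hypothetical utility records, each with eight measurements
# (values are illustrative, not taken from the Utilities.xlsx dataset).
data = np.array([
    [1.06,  9.2, 151.0, 54.4, 1.6, 9077.0,  0.0, 0.628],
    [0.89, 10.3, 202.0, 57.9, 2.2, 5088.0, 25.3, 1.555],
])

# Standardize each variable by dividing by its standard deviation so
# that large-scale variables (e.g. annual sales in KWH) do not dominate.
scaled = data / data.std(axis=0)

# Euclidean distance between the two standardized records.
dist = np.sqrt(((scaled[0] - scaled[1]) ** 2).sum())
```

After scaling, every variable contributes on a comparable scale, which is why the guide recommends standardizing before computing Euclidean distances.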

Utilities.xlsx Dataset

An explanation of the variables is contained in the table below. 

  • X1: Fixed-charge covering ratio (income/debt)
  • X2: Rate of return on capital
  • X3: Cost per KW capacity in place
  • X4: Annual Load Factor
  • X5: Peak KWH demand growth from 1974 to 1975
  • X6: Sales (KWH use per year)
  • X7: Percent Nuclear
  • X8: Total fuel costs (cents per KWH)

An economist analyzing this data might begin her analysis by building a detailed cost model of the various utilities. However, to save a considerable amount of time and effort, she could instead cluster similar types of utilities, build a detailed cost model for just one "typical" utility in each cluster, and then scale up from these models to estimate results for all utilities. This example will do just that. 

Click Cluster -- Hierarchical Clustering to bring up the Hierarchical Clustering dialog. 

On the Data tab, select variables x1 through x8 in the Variables In Input Data field, then click > to move the selected variables to the Selected Variables field.

At the bottom of the dialog, leave Data Type set to Raw Data.

Hierarchical Clustering, Data tab

Then click Next to advance to the Parameters tab of the Hierarchical Clustering dialog. 

At the top of the dialog, select Rescale Data.  Use this dialog to normalize one or more features in your data during the data preprocessing stage. Analytic Solver Data Science provides the following methods for feature scaling:  Standardization, Normalization, Adjusted Normalization, and Unit Norm.  For more information on this feature, see the Rescale Continuous Data section within the Transform Continuous Data chapter that occurs earlier in this guide. For this example, keep the default setting of Standardization, then click Done to close the dialog. 

Under Dissimilarity, Euclidean distance is selected by default.  For raw numeric data, the hierarchical clustering method uses Euclidean distance as the dissimilarity measure.

Note: When the data is binary, the remaining two options, Jaccard's coefficients and Matching coefficients, are enabled.
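For intuition, the two binary measures can be sketched as follows. The 0/1 vectors are hypothetical, and the formulas are the standard textbook definitions, not the product's internal code.

```python
# Two hypothetical binary records.
a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 1, 0, 0, 0]

both   = sum(1 for i, j in zip(a, b) if i == 1 and j == 1)  # joint presences
either = sum(1 for i, j in zip(a, b) if i == 1 or j == 1)   # at least one 1
same   = sum(1 for i, j in zip(a, b) if i == j)             # all agreements

jaccard  = both / either   # ignores joint absences (0-0 pairs)
matching = same / len(a)   # counts 0-0 agreements as matches
```

The key difference is that Jaccard's coefficient ignores attributes absent from both records, while the matching coefficient counts those 0-0 agreements as similarity.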

Under Linkage Method, select Group average linkage.  Recall from the introduction to this chapter that the group average linkage method computes the distance between two clusters as the average of all pairwise distances between the records in one cluster and the records in the other. 
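As a sketch of that calculation (with made-up 2-D points, purely for illustration), the group average distance between two clusters is:

```python
import numpy as np

# Two small hypothetical clusters of 2-D points.
cluster_a = np.array([[1.0, 2.0], [2.0, 2.0]])
cluster_b = np.array([[5.0, 6.0], [6.0, 5.0], [6.0, 6.0]])

# Group average linkage: average over all member-to-member distances.
pairwise = [np.linalg.norm(p - q) for p in cluster_a for q in cluster_b]
group_average = sum(pairwise) / len(pairwise)   # 2 x 3 = 6 distances
```

Because it averages over every pair, group average linkage is less sensitive to a single outlying record than single (nearest-neighbor) or complete (farthest-neighbor) linkage.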

For purposes of assigning cases to clusters, the number of clusters must be specified in advance. Under Hierarchical:  Display, increment Number of Clusters to 4.  Keep the remaining options at their defaults as shown in the screenshot below, then click Finish.

Hierarchical Clustering Dialog, Parameters tab

Analytic Solver Data Science will create four clusters using the group average linkage method.  The output worksheets HC_Output, HC_Clusters, and HC_Dendrogram are inserted to the right of the Data worksheet. 

HC_Output Worksheet

The top portion of the output simply displays the options selected on the Hierarchical Clustering dialog tabs. 


Hierarchical Clustering Output, Inputs

Further down the HC_Output sheet is the Clustering Stages table.  This table details the history of the cluster formation. Initially, each individual case is considered its own cluster (a single member in each cluster); Analytic Solver Data Science begins the method with # clusters = # cases.  At stage 1, below, clusters (i.e. cases) 12 and 21 were found to be closer together than any other two clusters (i.e. cases), so they are joined together into cluster 12. At this point there is one cluster with two cases (cases 12 and 21) and 20 additional clusters that still have just one case each. At stage 2, clusters 10 and 13 are found to be closer together than any other two clusters, so they are joined together into cluster 10.

This process continues until there is just one cluster. At various stages of the clustering process, there are different numbers of clusters. A graph called a dendrogram illustrates these steps.   
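The merge history recorded in the Clustering Stages table can be sketched with a minimal agglomerative loop. The records below are randomly generated stand-ins, and the cluster IDs are illustrative; the loop simply repeats the merge step described above until one cluster remains.

```python
import numpy as np

# A minimal sketch of the agglomerative process behind the Clustering
# Stages table: start with one cluster per record, then repeatedly
# merge the two clusters with the smallest group average distance
# until a single cluster remains.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))            # 6 hypothetical records
clusters = {i: [i] for i in range(6)}       # cluster id -> member rows

def group_average(ca, cb):
    dists = [np.linalg.norm(points[p] - points[q]) for p in ca for q in cb]
    return sum(dists) / len(dists)

stages = []                                 # analogue of the stages table
while len(clusters) > 1:
    ids = list(clusters)
    a, b = min(
        ((x, y) for i, x in enumerate(ids) for y in ids[i + 1:]),
        key=lambda pair: group_average(clusters[pair[0]], clusters[pair[1]]),
    )
    clusters[a] += clusters.pop(b)          # merge, keeping the lower id
    stages.append((a, b))
```

With n records the loop always runs n - 1 times, which is why the Clustering Stages table for the 22 utilities has 21 rows.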

Hierarchical Clustering Output, Clustering Stages

HC_Dendrogram Output

Click the HC_Dendrogram worksheet tab to view the clustering dendrogram.  A dendrogram is a diagram that illustrates the hierarchical association between the clusters. 

The Sub Cluster IDs are listed along the x-axis (in an order convenient for showing the cluster structure). The y-axis measures inter-cluster distance. Consider Cluster IDs 3 and 8: they have an inter-cluster distance of 2.753.  (Hover over the horizontal connecting line to see the Between-Cluster Distance.)  No other cases have a smaller inter-cluster distance, so 3 and 8 are joined into one cluster, indicated by the horizontal line linking them.

Hierarchical Clustering, Dendrogram

Next, we see that cases 1 and 5 have the next smallest inter-cluster distance, so they are joined into a second cluster. 

Hierarchical Clustering Output, Dendrogram

The next smallest inter-cluster distance is between the newly formed 3/8 and 1/5 clusters.  This process repeats until all subclusters have been merged into a single cluster. 

Hierarchical Clustering Output, Dendrogram

If we draw a horizontal line through the diagram at any level on the y-axis (the distance measure), the vertical cluster lines that intersect the horizontal line indicate clusters whose members are at least that close to each other. If we draw a horizontal line at distance = 3.8, for example, we see that there are 4 clusters with an inter-cluster distance of at least 3.8. In addition, we can see that a given sub-cluster can belong to different clusters, depending on where we draw the line.   
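The effect of drawing such a horizontal line can be sketched numerically. The merge heights below are hypothetical (for an eight-record toy example), not taken from the Utilities output.

```python
# With n records there are n - 1 merges; cutting the dendrogram at a
# given height leaves n minus (number of merges completed below that
# height) clusters. Heights here are made up for an 8-record example.
n = 8
merge_heights = [0.5, 0.7, 1.0, 1.3, 2.1, 2.9, 3.6]   # n - 1 merges

def clusters_at(cut):
    return n - sum(1 for h in merge_heights if h < cut)
```

In this toy example, cutting at height 1.5 leaves four clusters, analogous to drawing the line at distance = 3.8 in the dendrogram above; cutting below the first merge leaves every record in its own cluster, and cutting above the last merge leaves a single cluster.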

Click the ‘X’ in the upper right-hand corner to close the dendrogram and view the Cluster Legend.  This table shows the records assigned to each sub-cluster.