Cluster analysis diagrams

Cluster analysis is an exploratory technique that helps you discover patterns in your data by grouping files, codes or cases that share words, attribute values or coding. It produces diagrams that graphically represent the similarity or dissimilarity of the items you are comparing, using color (to identify 'clusters') and the position of the items relative to each other. Similar items are placed close together and dissimilar items further apart.

IMPORTANT: Cluster analysis is a useful way to explore your data, but any apparent associations it reveals should be investigated further using methods such as coding queries or matrix coding queries. Cluster analysis should not be seen as producing research results in its own right.

You can use cluster analysis diagrams to visualize:

The similarities and differences across your files—for example, how similar are the submissions from the various community members?
The similarities and differences across your codes—for example, how similar is the coding to rising sea levels, flood control, soil erosion, and land reclamation?
The demographic spread of your survey respondents based on attribute value.

The Diagram tab displays the visual representation of your data.

Example of a cluster analysis diagram.

The Summary tab displays the similarity index values used to generate the diagram.

Summary tab for a cluster analysis diagram.

1 Items compared: each possible pair of selected items is listed as a row in the table.

2 Similarity Index: a value that indicates the degree of similarity for each pair of items, based on the similarity metric selected. A high similarity index (maximum = 1) indicates strong similarity, and such pairs are displayed closer together on the cluster analysis diagram.
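The layout of the Summary tab (one row per pair, one index value per row) can be sketched in a few lines of Python. This is an illustration only, not NVivo's implementation: the file names, the feature sets and the choice of Jaccard's coefficient as the metric are all assumptions made for the example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two sets: |A & B| / |A | B| (1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_table(items):
    """Return one (item_a, item_b, index) row per unordered pair, most similar first."""
    rows = [
        (x, y, round(jaccard(items[x], items[y]), 3))
        for x, y in combinations(sorted(items), 2)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical files, each reduced to a set of features (for example, words).
files = {
    "Interview A": {"flood", "erosion", "sea", "levee"},
    "Interview B": {"flood", "erosion", "sea", "tourism"},
    "Interview C": {"tourism", "fishing", "harbor"},
}

table = similarity_table(files)
for row in table:
    print(row)
```

With three items the table has three rows. The pair with the highest index (here Interview A and Interview B) is the pair that would sit closest together on a diagram.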

Create a cluster analysis diagram

On the Explore tab in the Diagrams group, click Cluster Analysis.
Follow the steps in the Cluster Analysis Wizard, and then click Finish.

You can also view Word Frequency query results as a cluster analysis diagram. This type of cluster analysis diagram displays the most frequently occurring words in the selected files or codes.

Cluster by word, coding or attribute value similarity

The files or codes in a cluster analysis diagram can be clustered by word similarity, coding similarity or attribute value similarity.

Cluster by	Description
Word similarity	The words contained in the selected files or codes are compared. Files or codes that have a higher degree of similarity based on the occurrence and frequency of words are shown clustered together. Files or codes that have a lower degree of similarity based on the occurrence and frequency of words are displayed further apart. Stop words are excluded when using this measure of similarity. See Text content language & stop words.
Coding similarity	The coding to the selected files or codes is compared. Files or codes that have been coded similarly are clustered together on the cluster analysis diagram. Files or codes that have been coded differently are displayed further apart on the cluster analysis diagram.
Attribute value similarity	The attribute values of the selected files or codes are compared. Files or codes that have similar attribute values are clustered together on the cluster analysis diagram. Files or codes that have different attribute values are displayed further apart on the cluster analysis diagram.
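The three options differ in what is compared for each item. The sketch below shows one plausible way to turn an item into comparable features under each option. The function names, the tiny stop word list and the sample data are hypothetical, and the encoding is an assumption made for illustration rather than a description of how NVivo stores or compares items.

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "and", "of"}  # hypothetical, deliberately tiny

def word_features(text):
    """Word similarity: word frequencies with stop words excluded."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def coding_features(coded_at, all_codes):
    """Coding similarity: 1 where the item is coded at the code, else 0."""
    return [int(code in coded_at) for code in all_codes]

def attribute_features(attributes, all_pairs):
    """Attribute value similarity: 1 where the item has that attribute value, else 0."""
    return [int(attributes.get(name) == value) for name, value in all_pairs]

words = word_features("The flood is the risk and the flood returns")
coding = coding_features({"flood control"}, ["flood control", "soil erosion"])
attrs = attribute_features(
    {"age": "30-39", "town": "Harbor"},
    [("age", "30-39"), ("age", "40-49"), ("town", "Harbor")],
)
print(words, coding, attrs)
```

Whichever option is chosen, each item ends up as a profile that can be compared pairwise with a similarity metric.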

Selecting a similarity metric

A similarity metric is a statistical method used to calculate correlation between items. When you create a cluster analysis diagram using the Cluster Analysis Wizard, you choose which similarity metric to apply.
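As an illustration of why the choice of metric matters, the sketch below applies two widely used measures, Jaccard's coefficient and the Pearson correlation coefficient, to the same pair of binary profiles. Both are picked here as assumptions for the example and are not taken from the wizard's list. The profiles `x` and `y` are hypothetical.

```python
from math import sqrt

def jaccard(u, v):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

def pearson(u, v):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    dev_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    dev_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (dev_u * dev_v) if dev_u and dev_v else 0.0

# Hypothetical profiles: 1 = feature present, 0 = feature absent.
x = [1, 1, 0, 0]
y = [1, 0, 0, 0]

print("Jaccard:", jaccard(x, y))            # 0.5
print("Pearson:", round(pearson(x, y), 3))  # 0.577
```

The two values differ because Jaccard's coefficient ignores positions where both profiles are 0, while Pearson correlation takes them into account. The same pair of items can therefore appear closer or further apart depending on the metric selected.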

How are cluster analysis diagrams generated?

Working with data in other languages

The language used in your data has no impact on the results for cluster analysis by coding or attribute value similarity.

For cluster analysis by word similarity, NVivo excludes any defined ‘stop words’ from the similarity calculation. When you are working with content in other languages, defining stop words improves the outcome of your cluster analysis, because similarity based on words that convey little meaning is excluded. This reduces the chance that documents receive a high similarity coefficient based predominantly on these words. To check which stop words apply to your content, you can view the Stop Words list.

For example, if you are working with content in Turkish, you might like to:

Set the text content language to ‘Other’.
Add appropriate Turkish words to the Stop Words list. For examples of what words might be appropriate, take a look at the existing stop words provided in other languages.
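The effect of adding stop words can be seen in a small sketch. Two short Turkish sentences on unrelated topics share only function words. The sentences, the three-word stop list and the use of Jaccard's coefficient are assumptions made for illustration and do not reflect NVivo's stop word lists or its calculation.

```python
def word_set(text, stop_words=frozenset()):
    """Lower-cased set of words in the text, minus any stop words."""
    return {w for w in text.lower().split() if w not in stop_words}

def jaccard(a, b):
    """Jaccard coefficient of two sets (1.0 = identical, 0.0 = nothing shared)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical content: a flood/erosion report and a tourism/fishing plan.
doc1 = "bu bir sel ve erozyon raporu"
doc2 = "bu bir turizm ve balıkçılık planı"

# Hypothetical stop words: "bu" (this), "bir" (a), "ve" (and).
turkish_stop_words = frozenset({"bu", "bir", "ve"})

before = jaccard(word_set(doc1), word_set(doc2))
after = jaccard(word_set(doc1, turkish_stop_words),
                word_set(doc2, turkish_stop_words))

print("Without stop words defined:", round(before, 3))  # 0.333
print("With stop words defined:", after)                # 0.0
```

Without the stop list, the two documents look about one third similar purely because of "bu", "bir" and "ve". With the stop list in place, the spurious similarity disappears.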

Text content language & stop words

Visualize patterns in social media datasets

Cluster analysis enables you to compare the similarity of words in social media datasets. For example, you can visualize the similarities and differences across users in a:

Facebook dataset: You may discover new insights, for example, how similar the posts or comments from various users are.
Twitter dataset: You may find other Twitter users who share views similar to those of a Twitter account you are researching.