Module 3 case review activity: Clustering validation
Author
Elizabeth Cloude
Published
June 5, 2025
An important task when using clustering methods is validation. Most existing approaches treat it as a model selection problem: the clustering algorithm is run with different values of K, and the best K is the one that maximizes (or minimizes) a chosen criterion. Validation can also be used to choose between different types of clustering, e.g., spectral clustering vs. Gaussian mixture models (GMM) vs. k-means.
In this case review activity, we will apply k-means clustering, similar to the approach used in (Khosravi and Cooper 2017). Since their dataset and code were not made available, we will create our own data from scratch. The authors described their sample as coming from a large, flipped introductory programming course. We will generate data with similar variables, conduct k-means clustering and validation, and then interpret the results.
The authors studied the following variables:
Performance: summative (S) and formative (F)
S features (S1-S3) use a total of 7 scores from summative assessments:
S1: (labs) the average lab grade of students for the first 5 labs;
S2: (midterm 1) the first midterm grade; and,
S3: (midterm 2) the second midterm grade.
F features (F1 and F2) use a total of 30 scores from formative assessments:
F1: (clickers) the average clicker grade over 15 lectures; and,
F2: (worksheets) the average grade of students for the in-class exercises over 15 lectures.
Engagement: behavioral (B)
B features (B1-B4) use a total of 41 scores representing students' interactions with the online course materials:
B1: (screen cast views) the total number of views of screen casts for the 15 lectures;
B2: (worksheet solution views) the total number of worksheet solutions viewed (out of 15);
B3: (pre-lab exercise views) the total number of views of the 5 pre-lab exercises; and
B4: (examination/solution views) the total number of examination-related files viewed, out of the four practice questions with solutions and the two midterm exam solutions available to students (6 files).
We have many features to work with, each falling into one of three dimensions (S, F, B), which in turn roll up into the two top-level categories of performance and engagement. These data represent multiple, hierarchical dimensions.
Before we start, let’s ensure we have installed the right packages.
chooseCRANmirror(graphics = FALSE, ind = 1)
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("kableExtra")) install.packages("kableExtra", dependencies = TRUE)
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: kableExtra
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library(kableExtra)
Let’s generate random data based on their description of the data structure and variables.
We will normalize the scores, putting them on a similar scale before conducting the clustering analysis. The authors scaled the dimensions by transforming them into normalized values with a mean of 0 and standard deviation of 1.
By transforming each variable to have a mean of 0 and standard deviation of 1, every dimension contributes equally to the distance calculation. Without proper scaling, the variable(s) with larger numerical ranges could dominate the clustering, potentially obscuring patterns in other dimensions.
# Set seed for reproducibility
set.seed(123)

# Define number of students
n_students <- 78

# Generate random normalized data for each feature
generate_scaled_data <- function(n) {
  scale(rnorm(n, mean = 0, sd = 1))
}

# Generate Final Exam scores in range 0-100
generate_final_exam_score <- function(n) {
  runif(n, min = 0, max = 100)  # Uniform distribution between 0 and 100
}  # the authors did not provide average or SD information to estimate the distribution

# Create a data frame
student_data <- data.frame(
  Student_ID = 1:n_students,
  S1 = generate_scaled_data(n_students),  # Average lab grade (first 5 labs)
  S2 = generate_scaled_data(n_students),  # Midterm 1 grade
  S3 = generate_scaled_data(n_students),  # Midterm 2 grade
  F1 = generate_scaled_data(n_students),  # Clicker grade (15 lectures)
  F2 = generate_scaled_data(n_students),  # Worksheet grade (15 lectures)
  B1 = generate_scaled_data(n_students),  # Screencast views (15 lectures)
  B2 = generate_scaled_data(n_students),  # Worksheet solution views (out of 15)
  B3 = generate_scaled_data(n_students),  # Pre-lab exercise views (5 pre-labs)
  B4 = generate_scaled_data(n_students),  # Examination solution views (out of 6)
  Final_Exam = generate_final_exam_score(n_students)
)

head(student_data)
The paper by (Khosravi and Cooper 2017) applied two cluster validation methods. First, a gap statistic was used to evaluate clusters based on internal cohesion and external separation (Tibshirani, Walther, and Hastie 2001). The gap statistic helps identify an initial range for K by comparing the clustering structure in the data to that of randomly distributed points. The authors explored a minimum of 2 and a maximum of 14 clusters, following a guideline from a prior study (Thorndike 1953); we will assess whether that range is appropriate for our randomly generated data. The second method was the 'elbow' method, which evaluates the sum of squared errors (SSE) over a range of values of K.
To account for the random initialization of centroids in k-means, for each value in the range the authors ran 100 executions of the k-means algorithm and selected the solution with the highest likelihood. Although the authors did not discuss the gap statistic output, we will run it here for teaching purposes.
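The authors did not share their gap-statistic code, so the chunk below is a minimal sketch using clusGap() from the cluster package; the number of bootstrap reference sets (B = 50) and the maxSE() decision rule are our own choices.

library(cluster)  # provides clusGap() here and silhouette() later

# Feature matrix: drop the identifier and the outcome we hold out for interpretation
features <- student_data %>% select(-Student_ID, -Final_Exam)

set.seed(123)
gap_stat <- clusGap(as.matrix(features),
                    FUNcluster = kmeans,
                    K.max = 14,    # upper bound used in the paper
                    B = 50,        # bootstrap reference sets (our choice)
                    nstart = 100)  # passed on to kmeans(), matching the 100 restarts
gap_stat
plot(gap_stat, main = "Gap statistic for K = 1 to 14")

# Suggested K under the standard one-standard-error rule
maxSE(gap_stat$Tab[, "gap"], gap_stat$Tab[, "SE.sim"], method = "Tibs2001SEmax")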
Based on the gap statistic, it appears that one cluster is optimal, suggesting there may not be distinct groups in our dataset (possibly because the features were drawn from a single normal distribution and because the sample is small, with 78 students). Next, the authors used the elbow method to determine the optimal number of clusters (K) by running k-means across a range of K values. We will apply this with the range [2, 14] for teaching purposes, even though any grouping may be artificial given our randomly generated data.
Elbow method
We will run a loop to calculate k-means using 2-14 clusters.
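A sketch of that loop is below; the object names k_range and sse_values are ours, chosen to match the plotting code that follows, and the SSE is taken from the tot.withinss component of each k-means fit.

# Candidate numbers of clusters, following the paper's range
k_range <- 2:14

set.seed(123)
sse_values <- sapply(k_range, function(k) {
  km <- kmeans(student_data %>% select(-Student_ID, -Final_Exam),
               centers = k, nstart = 100)
  km$tot.withinss  # total within-cluster sum of squares (SSE) for this K
})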
Next, we evaluate the k range using an elbow plot.
# Create elbow plot data
elbow_plot <- data.frame(K = k_range, SSE = sse_values)

ggplot(elbow_plot, aes(x = K, y = SSE)) +
  geom_point() +
  geom_line() +
  ggtitle("Elbow Method for Optimal K") +
  xlab("Number of Clusters (K)") +
  ylab("Sum of Squared Errors (SSE)")
📌 Question: How many clusters should we indicate for k-means based on the elbow plot?
# Suppose we pick k = 4 from the elbow or gap; could also be 5
k <- 4
kmeans_result <- kmeans(student_data %>% select(-Student_ID, -Final_Exam),
                        centers = k, nstart = 100)
student_data$Cluster <- factor(kmeans_result$cluster)
kmeans_result
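As a pointer for the question below: the 'variance explained' reported in the kmeans() printout is the ratio of between-cluster to total sum of squares, which we can also compute directly from the fitted object.

# Proportion of total variance captured by the between-cluster separation
between_ss_ratio <- kmeans_result$betweenss / kmeans_result$totss
cat("Between_SS / Total_SS =", round(between_ss_ratio, 3), "\n")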
📌 Question: What can we gather from the variance explained?
Silhouette method
Another common validation method is Silhouette analysis, a method that the authors did not utilize in the paper. Silhouette analysis is a way to evaluate how well-separated the clusters are in a clustering solution. For each data point, it measures:
How close the point is to others in the same cluster
How far the point is from points in the nearest different cluster
The silhouette width for any data point falls in the range −1 to +1:
+1: indicates a perfectly separated point, i.e., it’s much closer to its own cluster than any other cluster.
0: indicates a boundary point equally close to its own cluster as to a neighboring cluster.
−1: suggests the point is assigned to the wrong cluster, as it’s closer to a different cluster than its own.
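Formally, for a point i, let a(i) be its average distance to the other points in its own cluster and b(i) its average distance to the points in the nearest other cluster; the silhouette width is then s(i) = (b(i) − a(i)) / max(a(i), b(i)).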
library(cluster)  # provides silhouette()

dist_matrix <- dist(student_data %>% select(-Student_ID, -Final_Exam, -Cluster),
                    method = "euclidean")
sil <- silhouette(kmeans_result$cluster, dist_matrix)
mean_sil_width <- summary(sil)$avg.width
cat("Mean Silhouette Width for k =", k, "is", mean_sil_width, "\n")
Mean Silhouette Width for k = 4 is 0.1128031
plot(sil, main = paste("Silhouette plot (k =", k, ")"))
📌 Question: What can we gather from the silhouette visualization? How close are the clusters to each other?
The average silhouette width is quite low (0.11), indicating that our clusters are not well separated. A value this close to 0 means most points are only slightly closer to their own cluster than to the nearest neighboring cluster.
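Although the authors did not use it this way, the average silhouette width can itself serve as a selection criterion: compute it for each K in our range and prefer the K that maximizes it. A sketch, reusing dist_matrix from above:

# Average silhouette width for each candidate K
set.seed(123)
avg_sil <- sapply(k_range, function(k) {
  km <- kmeans(student_data %>% select(-Student_ID, -Final_Exam, -Cluster),
               centers = k, nstart = 100)
  mean(silhouette(km$cluster, dist_matrix)[, "sil_width"])
})
data.frame(K = k_range, Avg_Silhouette = avg_sil)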
Next, (Khosravi and Cooper 2017) visualized the cluster groupings against final exam scores (a variable we excluded from the k-means analysis).
ggplot(student_data, aes(x = Cluster, y = Final_Exam, fill = Cluster)) +
  geom_boxplot() +
  ggtitle("Boxplot of Final Exam Grades by Cluster") +
  xlab("Cluster") +
  ylab("Final Exam Score") +
  theme_minimal()
Groups 1 and 3 appear to outperform groups 2 and 4 regarding final exam scores.
Cluster Interpretation
The authors reported the median; here we also include the mean and standard deviation.
# Compute summary statistics for Final Exam scores by Cluster
summary_table <- student_data %>%
  group_by(Cluster) %>%
  summarise(
    Mean = mean(Final_Exam),
    Std_Dev = sd(Final_Exam),
    Median = median(Final_Exam)
  )

summary_table %>%
  kable("html", caption = "Final Exam Scores by Cluster") %>%
  kable_styling(full_width = FALSE)
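The feature table shown below can be produced with an analogous summary. The sketch here reshapes the result into one row per cluster and statistic, which is our own presentation choice; the paper does not show this code.

# Mean, SD, and median of each standardized feature within each cluster
feature_summary <- student_data %>%
  group_by(Cluster) %>%
  summarise(across(S1:B4,
                   list(Mean = mean, Std = sd, Median = median),
                   .names = "{.col}_{.fn}")) %>%
  pivot_longer(-Cluster,
               names_to = c("Feature", "Statistic"),
               names_sep = "_",
               values_to = "Value") %>%
  pivot_wider(names_from = Feature, values_from = Value)

feature_summary %>%
  kable("html",
        caption = "Normalized Values of Student Features (S1-3, F1-2, B1-4) by Cluster") %>%
  kable_styling(full_width = FALSE)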
Normalized Values of Student Features (S1-3, F1-2, B1-4) by Cluster
Cluster | Statistic | S1 | S2 | S3 | F1 | F2 | B1 | B2 | B3 | B4
1 | Mean | -0.3603077 | 0.3117187 | -0.7996153 | -0.0879176 | -0.5085279 | 0.7041883 | -0.3118723 | -0.2607151 | 0.5184058
1 | Std | 0.8825635 | 1.0922239 | 0.4067517 | 0.9851686 | 0.9279212 | 0.6390962 | 0.9877097 | 0.6858903 | 0.9609842
1 | Median | -0.4512761 | 0.3148471 | -0.7909660 | -0.0754996 | -0.4541564 | 0.6904529 | -0.3620482 | -0.3176950 | 0.6188367
2 | Mean | 0.1783690 | 0.2629476 | 0.6788241 | -0.7262442 | -0.1658799 | -0.6840823 | 0.1026723 | -0.7216142 | 0.3951152
2 | Std | 0.9502891 | 0.9065340 | 0.8883907 | 0.7952441 | 1.0053031 | 0.7539766 | 0.9511016 | 0.9005715 | 0.7182412
2 | Median | 0.2103746 | 0.2542947 | 0.5355729 | -0.7067900 | -0.0297082 | -0.5012948 | 0.2456422 | -0.9317416 | 0.4496494
3 | Mean | 0.6212869 | -0.3781287 | -0.2872137 | 0.4220032 | 0.3519330 | -0.6785556 | -0.3967000 | 0.3064278 | -0.9058394
3 | Std | 0.8450670 | 0.8154797 | 0.6808003 | 0.7981226 | 0.9240171 | 0.8915587 | 0.7027482 | 0.9871054 | 0.7586214
3 | Median | 0.4182526 | -0.3550243 | -0.3397292 | 0.2373418 | 0.3534924 | -0.5684547 | -0.3584397 | 0.4596658 | -1.0197314
4 | Mean | -0.4704795 | -0.2876852 | 0.7023604 | 0.4612925 | 0.4880752 | 0.6482716 | 0.8222680 | 0.8486600 | -0.0821083
4 | Std | 1.0062580 | 1.0196416 | 1.0752894 | 0.9998655 | 0.8518282 | 0.6557132 | 0.9475838 | 0.7606923 | 0.8484031
4 | Median | -0.4277308 | -0.2851518 | 0.4590409 | 0.6349509 | 0.2901938 | 0.6787384 | 0.8944140 | 0.7307737 | -0.0411617
📌 Question: What conclusions can we draw about student groups based on k-means analysis?
Research Question 2: Overly engaged student subpopulations
In the paper, (Khosravi and Cooper 2017) looked further at extreme subpopulations defined by the behavioral engagement variables B1-B4.
Overly engaged participants are those with the highest number of interactions with the online materials. Students in the top 20% of the average behavioral (B) values were selected as the subpopulation of study. We do so in the code chunk below:
# Compute the average of B1-B4 for each student
student_data <- student_data %>%
  mutate(Average_B = (B1 + B2 + B3 + B4) / 4)

# Determine the top 20% of students
top_n <- ceiling(0.20 * nrow(student_data))

# Filter students who have the highest Average_B values
top_students <- student_data %>%
  arrange(desc(Average_B)) %>%
  slice_head(n = top_n)  # Select top 20% students

top_students
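Before re-plotting the elbow curve, the SSE values need to be recomputed on this subpopulation. The chunk below is a sketch of that step (not shown in the paper), reusing k_range and overwriting elbow_plot so that the plotting code that follows works unchanged. Note that with roughly 16 students in the top 20%, large values of K are of limited practical use.

# Recompute SSE over the same K range, now on the overly engaged subpopulation
set.seed(123)
sse_values <- sapply(k_range, function(k) {
  kmeans(top_students %>% select(-Student_ID, -Final_Exam, -Cluster, -Average_B),
         centers = k, nstart = 100)$tot.withinss
})
elbow_plot <- data.frame(K = k_range, SSE = sse_values)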
# Plot Elbow Method
ggplot(elbow_plot, aes(x = K, y = SSE)) +
  geom_point() +
  geom_line() +
  ggtitle("Elbow Method for Optimal K") +
  xlab("Number of Clusters (K) for overly engaged participants") +
  ylab("Sum of Squared Errors (SSE)")
📌 Question: How many clusters should we select based on the elbow plot?
optimal_k <- 5  # could also be 4

# Perform k-means clustering
kmeans_result <- kmeans(top_students %>% select(-Student_ID, -Final_Exam, -Cluster, -Average_B),
                        centers = optimal_k, nstart = 100)
top_students$Cluster <- factor(kmeans_result$cluster)
We will now display the Final Exam Scores by each cluster.
ggplot(top_students, aes(x = Cluster, y = Final_Exam, fill = Cluster)) +
  geom_boxplot() +
  ggtitle("Boxplot of Final Exam Grades by Overly Engaged Participant Clusters") +
  xlab("Cluster") +
  ylab("Final Exam Score") +
  theme_minimal()
📌 Question: What can we infer about overly engaged participants and their final grades?
To identify subgroups within clusters, there are more advanced clustering techniques we can use. We will learn more about this in the next module.
References
Khosravi, Hassan, and Kendra M. L. Cooper. 2017. "Using Learning Analytics to Investigate Patterns of Performance and Engagement in Large Classes." In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education, 309–14.
Thorndike, Robert L. 1953. "Who Belongs in the Family?" Psychometrika 18 (4): 267–76.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. "Estimating the Number of Clusters in a Data Set via the Gap Statistic." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411–23.