Module 2 case review activity: Clustering

Author

Elizabeth Cloude

Published

June 5, 2025

In our case paper review, Rodriguez et al. (2021) examined a fully online chemistry course with a sample of 312 students (70% of whom were first-generation college students). Lecture video clickstream data were gathered from the learning management system, from which the authors created two self-regulated learning (SRL) variables:

  1. lecture video completion: the proportion of assigned lecture videos a student visited across the 4 modules (before the due date), and
  2. time management: the proportion of late video visits (as the module due date neared) across the 4 modules.

The learning outcome was measured using the students’ final grade in the course. Their first research question,

Does clustering clickstream measures of SRL reveal meaningful learning patterns?

was answered using K-means clustering, which we will replicate in this case review by generating random data with the same structure and variables as Rodriguez et al. (2021). The authors used a combination of cluster validation techniques (which we have yet to cover extensively) to identify the number of centroids to use. Specifically, they used 1) the elbow plot and 2) the Bayesian Information Criterion (BIC).

The elbow plot shows the proportion of variance explained as a function of the number of clusters (K); the authors looked for the point at which adding another centroid explained little to no additional variance, and stopped increasing K there. In contrast, BIC is a statistical metric used to evaluate how well a model fits the data while penalizing complexity (i.e., the number of clusters). It helps determine the optimal number of clusters by balancing goodness-of-fit against model simplicity.
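As a reminder of the tradeoff BIC encodes, it is computed as −2 × log-likelihood plus a penalty of (number of estimated parameters) × log(n). The quick sketch below (a generic illustration on an intercept-only Gaussian model, not the authors' computation) verifies this by hand against R's built-in `BIC()`:

```r
# BIC = -2 * logLik + df * log(n): the penalty grows with model complexity,
# while a better fit (higher likelihood) lowers the score
set.seed(42)
x <- rnorm(100)
fit <- lm(x ~ 1)  # intercept-only Gaussian model

ll <- logLik(fit)
manual_bic <- -2 * as.numeric(ll) + attr(ll, "df") * log(length(x))

all.equal(manual_bic, BIC(fit))  # the two agree
```

Note that `mclust` reports BIC with the opposite sign convention (higher is better), which is why its values are negative later in this activity.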

First, we must ensure we have the correct packages properly installed and loaded.

if(!require("tidyverse")){install.packages("tidyverse")}
Loading required package: tidyverse
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
if(!require("cluster")){install.packages("cluster")}
Loading required package: cluster
library(cluster)
if(!require("factoextra")){install.packages("factoextra")}
Loading required package: factoextra
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(factoextra)
if(!require("mclust")){install.packages("mclust")}
Loading required package: mclust
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.

Attaching package: 'mclust'

The following object is masked from 'package:purrr':

    map
library(mclust)

Next, we will generate a random data set with a set seed. The authors' generation code for the clickstream variables and cluster labels is not available, so the seed, cluster means, and spreads below are illustrative stand-ins that preserve the data structure.

# Set a seed for reproducibility and define the sample size
set.seed(123)
N <- 312

# Simulate four latent groups and cluster-dependent SRL measures
# (illustrative parameters, not the authors' generation scheme)
cluster_labels <- sample(1:4, N, replace = TRUE)
visit_means <- c(0.25, 0.55, 0.75, 0.90)
late_means  <- c(0.12, 0.30, 0.45, 0.60)
weighted_proportion_visits <- pmin(pmax(rnorm(N, visit_means[cluster_labels], 0.05), 0), 1)
weighted_proportion_late_visits <- pmin(pmax(rnorm(N, late_means[cluster_labels], 0.05), 0), 1)

# Generate demographic and prior achievement data
gender <- sample(c("Man", "Woman"), N, replace = TRUE)
first_gen <- sample(c("Non-First-Gen", "First-Gen"), N, replace = TRUE)
low_income <- sample(c("Not Low Income", "Low Income"), N, replace = TRUE)
URM <- sample(c("Non-URM", "URM"), N, replace = TRUE)
SAT_scores <- pmax(pmin(round(rnorm(N, 1613.33, 132.94)), 2040), 1260)

# Generate final course grades (1 = F to 13 = A+)
final_grades <- pmax(pmin(round(rnorm(N, 9.34, 2.32)), 13), 1)


student_data <- data.frame(
  Student_ID = 1:N,
  Gender = gender,
  First_Gen = first_gen,
  Low_Income = low_income,
  URM = URM,
  SAT_Score = SAT_scores,
  Proportion_Video_Visits = weighted_proportion_visits,
  Proportion_Late_Video_Visits = weighted_proportion_late_visits,
  Final_Grade = final_grades,
  True_Cluster = cluster_labels  # store the actual cluster assignment
)

head(student_data)
  Student_ID Gender     First_Gen     Low_Income     URM SAT_Score
1          1    Man Non-First-Gen     Low Income     URM      1680
2          2    Man Non-First-Gen Not Low Income Non-URM      1742
3          3    Man Non-First-Gen     Low Income Non-URM      1497
4          4    Man     First-Gen     Low Income Non-URM      1329
5          5  Woman     First-Gen Not Low Income Non-URM      1411
6          6    Man     First-Gen     Low Income Non-URM      1465
  Proportion_Video_Visits Proportion_Late_Video_Visits Final_Grade True_Cluster
1               0.2829612                   0.18741508           9            1
2               0.1572279                   0.16608953           9            1
3               0.2283491                   0.10381919           6            1
4               0.2473177                   0.02693332           8            1
5               0.2313985                   0.14101296          13            1
6               0.1915484                   0.14382245           8            1

The authors reported that they scaled their data, and so we will do the same in the code chunk below.

Question: Similar to Case Study 1 in our K-means analysis, why is it good practice to scale data before clustering? What are the possible implications for the findings if we do not scale numerical data before clustering?
# Select and scale the two SRL clickstream measures
clickstream_data <- student_data %>%
  select(Proportion_Video_Visits, Proportion_Late_Video_Visits) %>%
  scale()

head(clickstream_data)
     Proportion_Video_Visits Proportion_Late_Video_Visits
[1,]               -0.864945                   -0.6795061
[2,]               -1.367889                   -0.7719144
[3,]               -1.083398                   -1.0417454
[4,]               -1.007522                   -1.3749086
[5,]               -1.071201                   -0.8805767
[6,]               -1.230605                   -0.8684025
psych::describe(clickstream_data)
                             vars   n mean sd median trimmed  mad   min  max
Proportion_Video_Visits         1 312    0  1  -0.03    0.00 1.45 -1.60 1.60
Proportion_Late_Video_Visits    2 312    0  1  -0.27   -0.05 1.11 -1.49 1.97
                             range skew kurtosis   se
Proportion_Video_Visits       3.19 0.04    -1.35 0.06
Proportion_Late_Video_Visits  3.46 0.39    -1.14 0.06

In this case, we center and scale the columns, so that each has a mean of 0 and an SD of 1.
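To make explicit what `scale()` just did, the same result can be computed by hand: subtract each column's mean and divide by its SD. This is only a sanity check against the `clickstream_data` object created above:

```r
# Manual z-scoring should match scale() column by column
manual_scaled <- student_data %>%
  select(Proportion_Video_Visits, Proportion_Late_Video_Visits) %>%
  mutate(across(everything(), ~ (.x - mean(.x)) / sd(.x)))

all.equal(manual_scaled$Proportion_Video_Visits,
          as.numeric(clickstream_data[, "Proportion_Video_Visits"]))
```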

Next, the authors reported that they chose K = 4 with 24 random starts (i.e., 24 different initial centroid configurations, keeping the best-fitting solution). Let’s think about this more before moving forward.

Question: Why do you think the authors selected K=4 instead of another value? How might choosing a different K affect the interpretation of student learning behaviors?
Question: What assumptions are being made by forcing the model to group students into exactly 4 clusters?

Now we will run the k-means algorithm in the code chunk below.

# Run k-means clustering with K = 4 and 24 random starts
kmeans_result <- kmeans(clickstream_data, centers = 4, nstart = 24)
kmeans_result
K-means clustering with 4 clusters of sizes 104, 54, 104, 50

Cluster means:
  Proportion_Video_Visits Proportion_Late_Video_Visits
1             -0.03464816                   -0.2180725
2              1.24360070                    0.9421678
3             -1.17366492                   -1.0400823
4              1.17020246                    1.5994208

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 2 4 4 2 2 4 4 2 2 2 4 4 4
[223] 4 4 2 2 4 2 4 4 2 2 4 2 4 2 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 4 4 2 2 4 2 2 2
[260] 4 4 2 4 4 2 2 4 2 4 2 4 2 4 2 2 2 2 2 2 4 2 2 2 2 4 2 2 4 2 4 4 4 4 4 4 4
[297] 4 4 2 4 2 2 4 2 4 4 2 4 2 2 4 4

Within cluster sum of squares by cluster:
[1] 12.502030  3.878039 12.178143  4.783991
 (between_SS / total_SS =  94.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Take note of the clustering output. The cluster sizes appear to be well distributed into 4 groups: 104, 54, 104, and 50.

Another important measure is the within-cluster sum of squares (WCSS), which indicates how tightly grouped the data points are within each cluster. It is the sum of squared distances between each data point and its cluster centroid. Clusters with higher WCSS values are more dispersed (spread out), while those with lower WCSS are more tightly packed (compact).

We should also refer to the between-cluster sum of squares (between_SS). ‘between_SS’ measures the variance between clusters – i.e., how far apart cluster centroids are. In contrast, the total sum of squares (total_SS) measures the total variance in the dataset before clustering. From these values, we can calculate the proportion of variance explained by the four identified clusters by dividing between_SS by total_SS – 94.6% in this case.

What does this tell us? About 94.6% of the total variance is explained by the clusters, meaning the clustering model does a great job of separating groups using the two SRL variables we have. As a rule of thumb, a higher percentage (e.g., 80-90% or above) indicates strong separation between clusters.
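Rather than reading these quantities off the printout, they can be pulled from the fitted object directly; `size`, `withinss`, `betweenss`, and `totss` are all standard components of the object returned by `kmeans()`:

```r
# Cluster sizes and per-cluster within-cluster sum of squares (compactness)
kmeans_result$size
kmeans_result$withinss

# Percent variance explained by the clustering: between_SS / total_SS
round(100 * kmeans_result$betweenss / kmeans_result$totss, 1)
```

A useful identity to keep in mind: total_SS = between_SS + total within_SS, so the ratio is always between 0 and 1.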

Question: What percent variance explained was reported by (Rodriguez et al. 2021)? Are the clusters well separated? Are some clusters too dispersed?

Elbow Method

Next, we implement the elbow method to determine the optimal number of centroids.

Unfortunately, the authors did not report the range of K they used to evaluate changes in variance explained across different K values. However, we can still apply the elbow method with a wider range of centroids: a minimum of 2 and a maximum of 14, a range proposed in prior studies as suitable for identifying the optimal K (Ferguson and Clow 2015; Kizilcec, Piech, and Schneider 2013).

# Elbow Method (evaluate K up to 14)
fviz_nbclust(clickstream_data, kmeans, method = "wss", k.max = 14) +
  ggtitle("Elbow Method for Optimal K")

An elbow plot (or scree plot) is a visual method used to determine the optimal number of clusters. It helps identify the point at which adding more clusters no longer significantly reduces the variance within clusters.
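`fviz_nbclust()` computes the total WCSS for each K internally; a hand-rolled version (a sketch over the same `clickstream_data` object) makes the mechanics explicit:

```r
# Total within-cluster SS for K = 2..14; look for where the curve flattens
set.seed(123)
k_range <- 2:14
wss <- sapply(k_range, function(k) {
  kmeans(clickstream_data, centers = k, nstart = 24)$tot.withinss
})

# Base-R version of the elbow plot
plot(k_range, wss, type = "b",
     xlab = "Number of clusters (K)", ylab = "Total within-cluster SS")
```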

Similar to Rodriguez et al. (2021), we find that the optimal K is 4.

Next, we will apply BIC using the ‘mclust’ library.

# Bayesian Information Criterion (BIC) for k-means
bic_results <- Mclust(clickstream_data, G = 1:6) # Test clusters from 1 to 6
print(bic_results$BIC)
Bayesian Information Criterion (BIC): 
         EII        VII        EEI        VEI        EVI        VVI        EEE
1 -1786.0611 -1786.0611 -1791.8041 -1791.8041 -1791.8041 -1791.8041 -1198.5125
2 -1298.2548 -1228.5278 -1299.1437 -1234.2153 -1284.6073 -1199.7417 -1101.6119
3  -862.8203  -857.4267  -840.4899  -841.9695  -838.0768  -840.5223  -844.7663
4  -817.6721  -830.6070  -820.8252  -834.9039  -835.1842  -846.6038  -821.9277
5  -822.4088  -828.2193  -827.7412  -837.5818  -844.4830  -835.1938  -829.1449
6  -827.2356  -847.5627  -816.9728  -838.3742  -833.8749  -839.7212  -821.5866
         VEE        EVE        VVE        EEV        VEV        EVV        VVV
1 -1198.5125 -1198.5125 -1198.5125 -1198.5125 -1198.5125 -1198.5125 -1198.5125
2 -1105.1367 -1025.5340 -1005.5040 -1028.6437 -1007.0335 -1015.2878  -985.3043
3  -845.6737  -843.7303  -846.1058  -849.9138  -851.2723  -852.1722  -854.2130
4  -835.0622  -837.6080  -850.6993  -836.1665  -846.7464  -852.1867  -862.9745
5  -838.5478  -850.3554  -840.8517  -847.5258  -851.9133  -866.4728  -857.7772
6  -852.0612  -840.7092  -880.9747  -839.4580  -859.8936  -857.8308  -866.8947

Top 3 models based on the BIC criterion: 
    EEI,6     EII,4     EEI,4 
-816.9728 -817.6721 -820.8252 

BIC values are negative because they represent a log-likelihood function with penalties for complexity. Higher (less negative) BIC is better (i.e., -816 is better than -820).

The top-ranked model is EEI,6 (-816.97), followed closely by EII,4 (-817.67) and EEI,4 (-820.83) – the four-cluster models are nearly tied with the six-cluster model.
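Instead of scanning the full table, the model selected by BIC can be read off the fitted `Mclust` object: `modelName` gives the covariance structure and `G` the chosen number of clusters.

```r
# Covariance structure and number of clusters selected by BIC
bic_results$modelName
bic_results$G

# Top models ranked by BIC
summary(bic_results$BIC)
```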

Let’s reflect on our case review paper results.

“Our data showed this occurred between clusters K = 3 and K = 4. The BIC criterion generated 3 choice models, two of which K = 4 (VVI,4 = -1428.62, VVE,4 = -1433.36, VVI,5 = -1439.32).” (p. 317) (Rodriguez et al. 2021).

Our results suggest that 4 or 6 clusters may best fit our data.

What can we conclude?

Based on the elbow plot and BIC results, there are likely 4 clusters in our data.

# Visualize clustering results
fviz_cluster(kmeans_result, data = clickstream_data, geom = "point") +
  ggtitle("K-Means Clustering of Clickstream Measures (K = 4)")

# Print cluster means for interpretation
print(aggregate(clickstream_data, by = list(Cluster = kmeans_result$cluster), mean))
  Cluster Proportion_Video_Visits Proportion_Late_Video_Visits
1       1             -0.03464816                   -0.2180725
2       2              1.24360070                    0.9421678
3       3             -1.17366492                   -1.0400823
4       4              1.17020246                    1.5994208
Question: What can we see regarding the distinction of the clusters in the visualization?

Make note of data points within clusters that are close to data points from other clusters. Clusters 2 and 4 are nearly touching at some points.
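To put a number on that visual impression, a silhouette analysis (an additional check, not part of the original paper) with the already-loaded cluster package reports, for each point, how well it sits inside its assigned cluster: values near 1 indicate a point deep inside its cluster, while values near 0 flag points on a boundary between clusters.

```r
# Average silhouette width per cluster; low averages flag overlapping clusters
sil <- silhouette(kmeans_result$cluster, dist(clickstream_data))
aggregate(sil[, "sil_width"], by = list(Cluster = sil[, "cluster"]), FUN = mean)
```

If clusters 2 and 4 overlap as the plot suggests, we would expect their average silhouette widths to be noticeably lower than those of the well-separated clusters.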

Question: What kind of interpretation might we make about the groupings regarding students’ SRL?

References

Ferguson, Rebecca, and Doug Clow. 2015. “Examining Engagement: Analysing Learner Subpopulations in Massive Open Online Courses (MOOCs).” In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, 51–58.
Kizilcec, René F, Chris Piech, and Emily Schneider. 2013. “Deconstructing Disengagement: Analyzing Learner Subpopulations in Massive Open Online Courses.” In Proceedings of the Third International Conference on Learning Analytics and Knowledge, 170–79.
Rodriguez, Fernando, Hye Rin Lee, Teomara Rutherford, Christian Fischer, Eric Potma, and Mark Warschauer. 2021. “Using Clickstream Data Mining Techniques to Understand and Support First-Generation College Students in an Online Chemistry Course.” In LAK21: 11th International Learning Analytics and Knowledge Conference, 313–22.