Module 4 case review activity: Advanced Clustering
Author
Elizabeth Cloude
Published
June 5, 2025
In this case review activity, we will apply the same methods used in the case review paper (Scrucca et al. 2023). We will use the data set and code provided in the paper, which consist of data collected from primary school students in northern Spain. Data were gathered on students’ school engagement, self-regulation, and academic performance using several validated measures. The School Engagement Measure (SEM) was employed to assess students’ engagement, while their self-regulation was evaluated with the Self-Regulation Strategy Inventory-Self-Report. Academic achievement was based on the students’ self-reported grades in Spanish and mathematics, rated on a scale of 1 to 5.
We will use these data to identify “soft” clusters of students based on their engagement and self-regulation using an advanced clustering technique called Gaussian mixture modeling (GMM). Unlike k-means clustering, which groups observations based on distance, GMM is a model-based clustering approach that assumes the data come from a mixture of probability distributions. Each cluster corresponds to a different probability distribution, typically a multivariate Gaussian whose contours may be spherical or ellipsoidal, although other distributions can be used.
When a data point is assigned to a cluster, the assumption is that it fits well within that cluster’s probability distribution. As we have discussed, a key advantage of GMM over other clustering methods is the ability to apply model selection criteria and inferential procedures to evaluate and interpret the results (as opposed to relying on heuristics such as the silhouette or elbow methods).
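To make this concrete, the GMM density can be written (in standard notation, not taken verbatim from the paper) as a weighted sum of Gaussian components,

$$
f(\mathbf{x}) = \sum_{k=1}^{G} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{G} \pi_k = 1,
$$

where $\pi_k$ are the mixing proportions and $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$ are the mean vector and covariance matrix of cluster $k$. The “soft” assignment comes from Bayes’ rule: the posterior probability that observation $\mathbf{x}_i$ belongs to cluster $k$ is

$$
z_{ik} = \frac{\pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{G} \pi_j \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
$$

and these posterior probabilities are what mclust returns later as mod$z.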
To start, we should ensure that we have the appropriate packages installed.
if (!require("ggplot2")) install.packages("ggplot2", dependencies =TRUE)
We will select the variables we are interested in clustering. In this case, we will select behavioral, cognitive, and emotional engagement as the authors have done.
# select the variables to be analyzed
vars <- c("PRE_ENG_COND", "PRE_ENG_COGN", "PRE_ENG_EMOC")
x <- select(data, all_of(vars)) |>
  as_tibble() |>
  rename("BehvEngmnt" = "PRE_ENG_COND",  # Behavioral engagement
         "CognEngmnt" = "PRE_ENG_COGN",  # Cognitive engagement
         "EmotEngmnt" = "PRE_ENG_EMOC")  # Emotional engagement
Print a summary of the selected variables to ensure the correct ones are included and renamed.
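The paper may use its own summary helper; assuming dplyr and tidyr are loaded, one way to produce a summary like the one shown below is:

# reshape to long format, then compute summary statistics per variable
x |>
  pivot_longer(everything(), names_to = "name", values_to = "value") |>
  group_by(name) |>
  summarise(N = n(),
            Mean = mean(value),
            SD = sd(value),
            Min = min(value),
            Median = median(value),
            Max = max(value))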
# A tibble: 3 × 7
name N Mean SD Min Median Max
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BehvEngmnt 717 4.17 0.627 1 4.25 5
2 CognEngmnt 717 2.92 0.771 1 2.92 5
3 EmotEngmnt 717 3.61 0.911 1 3.61 5
To begin our analysis, similar to k-means, we must identify the optimal number of clusters. In this case, we can use an approach that selects the number of clusters with the Bayesian Information Criterion (BIC), as opposed to other methods we have used, such as the silhouette and elbow methods. In the code chunk below, we fit multiple GMMs to our data set with the number of clusters ranging from 1 to 9 (the default setting for the mclustBIC function). In this way, it tries different numbers of clusters and different covariance structures and calculates the BIC value for each model. This model selection criterion jointly takes into account the covariance decomposition and the number of mixture components, so that the model balances fit (likelihood) against complexity (number of parameters). Note that mclust defines BIC so that a higher value indicates a better model.
Given the characteristics of the data, which consist of a small number of unique values (i.e., 1-5) relative to the number of observations (i.e., n = 717), a ‘prior’ is used to regularize the GMM estimation process. This helps stabilize the clustering results, especially when working with small data sets. Regularization helps prevent overfitting by adding a prior belief about the cluster parameters. In this specific case, priorControl adds a default prior distribution to the GMM estimation.
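A minimal sketch of this model-selection step, using the mclustBIC() function with its default range of 1 to 9 clusters and the regularizing prior described above (the object name BIC is just illustrative):

# fit GMMs across covariance structures and 1:9 components,
# with a prior for regularization, and compare their BIC values
BIC <- mclustBIC(x, prior = priorControl())
summary(BIC)  # top-ranked models and their BIC differences
plot(BIC)     # BIC curves for each covariance structure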
The VVI and VVV labels correspond to the covariance structures used in the GMMs. Specifically, ‘VVI’ denotes diagonal covariance matrices (varying volume, varying shape, axis-aligned orientation), while ‘VVV’ denotes fully unconstrained covariance matrices. The BIC is highest for VVI,3, meaning this model provides the best balance of model fit and complexity. The selected model is a three-component GMM with diagonal covariance matrices of varying volume and shape and axis-aligned orientation; that is, the variables are treated as independent within each cluster.
VVI,4 (four clusters with the same diagonal structure) had a slightly worse BIC (a difference of about -6), so it was close to VVI,3 but not preferred.
VVV,3 (three clusters with a fully unconstrained covariance structure) had a much worse BIC (a difference of about -12), making it less favorable.
The fit of the optimal model is obtained using the code below, where we indicate the number of clusters using ‘G’.
mod <-Mclust(x, modelNames ="VVI", G =3, prior =priorControl())summary(mod, parameters =TRUE)
The output reports some basic information about the fit: the maximized log-likelihood (a measure of how well the model fits the data; higher is better), the number of observations (n), the number of estimated parameters (df), the BIC criterion (BIC), and the clustering table. Cluster 3 has the largest number of observations, followed by clusters 1 and 2.
We also see the mixing probabilities, which give the estimated proportion of observations belonging to each cluster. Cluster 3 is the most common, while cluster 2 has the smallest mixing probability. These probabilities sum to 1.
Finally, the component parameters are reported, first the means and then the variances. The Means table displays the mean value (centroid) of each engagement variable within each cluster.
Question: What can you gather from the GMM output regarding the three clusters and the engagement variables?
Now let’s generate a visualization of the fitted GMM model.
plot(mod, what ="classification")
The estimated model identifies three clusters of varying size. The third group (green triangles) accounts for more than 50% of the observations, while the first (blue points) and the second (red open squares) account for approximately 29% and 16%, respectively.
Question: Consider what you have learned about k-means up to this point. Based on this graph, how does GMM differ from k-means?
# collect estimated means
means <- data.frame(Profile = factor(1:mod$G),
                    t(mod$parameters$mean)) |>
  pivot_longer(cols = -1,
               names_to = "Variable",
               values_to = "Mean")

# convert variable names to factor
means$Variable <- factor(means$Variable,
                         levels = colnames(mod$data))

# add mixing probabilities corresponding to clusters
means <- means |>
  add_column(MixPro = mod$parameters$pro[means$Profile])
means
The different engagement behavior of the three identified clusters can be illustrated using a latent profiles plot of the estimated means with point sizes proportional to the estimated mixing probabilities.
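The plotting code is not shown here; one way to draw such a profile plot with ggplot2, using the means data frame built above, is sketched below:

# latent profile plot: estimated means by cluster,
# with point size proportional to the mixing probabilities
ggplot(means, aes(x = Variable, y = Mean,
                  group = Profile, shape = Profile, color = Profile)) +
  geom_point(aes(size = MixPro)) +
  geom_line(linewidth = 0.5) +
  scale_size(range = c(1, 3), guide = "none") +
  labs(x = NULL, y = "Latent profile means") +
  theme_bw() +
  theme(legend.position = "top")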
From the visualization, we can see that cluster 2 has the highest scores on all three engagement variables, despite being the smallest cluster. In contrast, the largest cluster (cluster 3) has lower scores on all three variables, and cluster 1 has the lowest scores of all.
All three clusters exhibit their lowest mean scores on the cognitive engagement variable. For cluster 2, the behavioral and emotional engagement scores are comparable, whereas for the other two clusters the mean emotional engagement score is lower than the behavioral engagement score. Taken together, we could characterize clusters 1, 3, and 2 as “low”, “medium”, and “high” engagement profiles, respectively.
Question: If we used a different clustering method (e.g., K-means instead of GMM), do you think we would see similar engagement profiles? Why or why not?
To evaluate the stability of the model parameters presented in the previous graph, it is beneficial to incorporate a measure of uncertainty for the estimated means of the GMM. This can be achieved by bootstrap resampling using the function MclustBootstrap(). Bootstrap resampling is a statistical technique for estimating the variability (uncertainty) of a sample statistic by generating multiple resampled data sets from the original data. This command performs bootstrap resampling on our fitted GMM to assess the stability of the cluster parameters (such as means, variances, and mixing probabilities): it resamples the data with replacement and refits the GMM many times, which lets us estimate how stable the parameters are across samples, typically summarized with 95% confidence intervals. If the mixing weights change substantially across bootstrap samples, the clustering model may not be stable.
In this case, we will generate 999 bootstrap resamples from our data.
boot <-MclustBootstrap(mod, type ="bs", nboot =999)
The bootstrap distribution of the mixing weights (the probability that each data point belongs to a specific cluster) can be visualized using histograms with the code below:
par(mfcol =c(1, 3), mar =c(4, 4, 1, 1), mgp =c(2, 0.5, 0))plot(boot, what ="pro", xlim =c(0, 1))
We can also plot the bootstrap distribution of the component means with the code below:
par(mfcol =c(3, 3), mar =c(4, 4, 1, 1), mgp =c(2, 0.5, 0))plot(boot, what ="mean", conf.level =0.95)
In all the graphs above, the GMM estimates are shown as dashed vertical lines, while the horizontal segments represent the percentile intervals at the 95% confidence level. If a histogram is narrow (less spread out), the corresponding estimate is stable across resamples; if it is wide (high variance), the estimate fluctuates across samples, suggesting instability in the clustering.
We can also obtain the numerical output of the resampling-based bootstrap distributions using:
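For example, mclust’s summary method for bootstrap objects can report resampling-based standard errors or percentile confidence intervals (shown here at the 95% level):

# resampling-based 95% percentile intervals for the GMM parameters
summary(boot, what = "ci", conf.level = 0.95)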
The intervals for the cognitive and emotional engagement means are visibly wider for the “low” engagement cluster (cluster 1) than for clusters 2 (high) and 3 (moderate), suggesting greater uncertainty in those estimates: they are less stable across resamples.
Finally, in the case review paper, we read that the authors utilized an entropy measure of the cluster classification to estimate uncertainty.
An entropy value close to 1 indicates that the model classifies data points into clusters with high certainty (posterior probabilities close to 0 or 1), whereas a value close to 0 indicates that the model is uncertain, meaning that points have similar probabilities of belonging to multiple clusters. Thus, high entropy values under this normalized definition typically indicate a model whose components are relatively distinct and well separated. A value close to 1 is ideal, and values above .6 are generally considered acceptable, although there is no agreed-upon optimal cutoff for entropy. We can calculate the classification entropy below.
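In symbols, this normalized entropy (consistent with the code that follows, where the posterior probabilities $z_{ik}$ come from mod$z) can be written as

$$
E = 1 + \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \log z_{ik}}{n \log K},
$$

which equals 1 when every observation is assigned to a single cluster with probability 1, and 0 when all cluster memberships are uniform.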
probs <- mod$z                     # posterior conditional probs
probs_map <- apply(probs, 1, max)  # maximum a posteriori probs
class <- mod$classification        # cluster classes for each obs
n <- mod$n                         # number of obs
K <- mod$G                         # number of clusters

# Entropy
E <- 1 + sum(probs * log(probs)) / (n * log(K))
E
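The per-cluster summary shown below can be obtained by first computing each observation’s entropy contribution and then summarizing by cluster; a sketch (the names Ei and df_entropy are illustrative):

# case-specific entropy contributions, summarized by cluster
Ei <- 1 + rowSums(probs * log(probs)) / log(K)
df_entropy <- data.frame(class = as.factor(class), entropy = Ei)
df_entropy |>
  group_by(class) |>
  summarise(count = n(),
            mean = mean(entropy),
            sd = sd(entropy),
            min = min(entropy),
            max = max(entropy))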
# A tibble: 3 × 6
class count mean sd min max
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 1 184 0.740 0.239 0.369 1.00
2 2 119 0.690 0.225 0.187 0.993
3 3 414 0.666 0.197 0.172 0.974
We can see that the average entropy contribution of each cluster is moderately high on a scale from 0 to 1, with some observations reaching maximum (or near-maximum) values. Cluster 1, the “low” engagement cluster, shows the highest average value, meaning its observations tend to be classified most confidently, while the low minimum values for clusters 2 and 3 indicate that some observations in those clusters are assigned with considerable uncertainty.
Again, when using GMM clustering, each observation is not assigned to a cluster with 100% certainty; instead, each observation has a probability of belonging to each cluster. Finally, we compute the average posterior membership probability (AvePP), a measure of how strongly observations belong to their assigned cluster: for each cluster, we average the maximum a posteriori probabilities of the observations assigned to it. This also tells us how confident the model is in its clustering assignments. For example, if a cluster has 10 students and their average probability of belonging to that cluster is .92, the model’s assignments for that cluster are highly confident.
# Average posterior probabilities by cluster:
df_AvePP <- data.frame(class = as.factor(class), pp = probs_map)
df_AvePP |>
  group_by(class) |>
  summarise(count = n(),
            mean = mean(pp),
            sd = sd(pp),
            min = min(pp),
            max = max(pp))
# A tibble: 3 × 6
class count mean sd min max
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 1 184 0.864 0.160 0.513 1.00
2 2 119 0.858 0.146 0.468 0.999
3 3 414 0.850 0.135 0.502 0.996
We plot the average membership probabilities for each cluster.
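One way to visualize these probabilities, using the df_AvePP data frame created above (the original figure may use a different geometry), is:

# distribution of MAP probabilities within each cluster
ggplot(df_AvePP, aes(x = class, y = pp)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  labs(x = "Cluster", y = "MAP probability") +
  theme_bw()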
Question: What can we infer about the model’s confidence for each cluster given the MAP probabilities?
Question: How does GMM improve our ability to identify different types of learners compared to traditional clustering (e.g., K-means)?
Question: What are the potential limitations of using GMM in education? How might missing data, sample size, or model assumptions impact results?
References
Scrucca, Luca, Mohammed Saqr, Sonsoles López-Pernas, and Keefe Murphy. 2023. “An Introduction and Tutorial to Model-Based Clustering in Education via Gaussian Mixture Modelling.” arXiv preprint arXiv:2306.06219.