Module 3: Sequential Pattern Mining

Case Study

Author

LASER Institute

Published

May 29, 2025

Paper Review

This case study is inspired by Zhou (2023). The paper’s method for sequential pattern mining is structured as a multi-step process that transforms raw submission log data into interpretable temporal patterns of group problem solving. Here’s a focused summary of their methodological approach:

1. Submission Labeling
Log data from an online assessment platform (PrairieLearn) is used to capture each “save and grade” event. The time interval between consecutive submissions is computed, and each submission is then classified into one of three categories:
• Quick (Q): Submission time below the 25th percentile.
• Medium (M): Submission time between the 25th and 75th percentiles.
• Slow (S): Submission time above the 75th percentile.
This classification is performed separately for each database query language and semester to adjust for varying conditions.

2. Sequence Compacting and Pattern Extraction
To reduce noise and capture meaningful problem-solving behaviors, the author introduces a sequence compacting algorithm with two rules:
• Repetitive Compression: Three or more consecutive identical labels are merged into a single event, under the assumption that such repetition is unlikely to occur by chance.
• Consecutive Q/M Compacting: Since mixed quick and medium attempts occur frequently, consecutive Q and M labels are compacted into a single representative pattern.
This process yields seven distinct pattern types (Q, Q.LNG, M, M.LNG, QM.LNG, S, S.LNG) that encapsulate key behavioral sequences.

3. Transition Analysis Using Sequential Pattern Mining
After compacting, the focus shifts to understanding the transitions between the identified patterns. Two main analytic techniques are employed:
• Clustering: Agglomerative hierarchical clustering (using Ward’s method) is applied to group similar sequences. Although effective for shorter sequences, the approach faces challenges when sequences become longer.
• Transition Metrics and Correlation: The L* metric is introduced to evaluate the probability of transitions between pattern types, adjusting for base rates by comparing observed transition frequencies to those expected by chance. This metric allows the researchers to identify which transitions (e.g., from S to M.LNG versus Q.LNG to S.LNG) are statistically significant indicators of either effective or struggling problem-solving behaviors.

Together, these methods form a robust pipeline that transforms raw log data into interpretable temporal patterns, ultimately providing insights into group-level collaborative problem-solving strategies.
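The paper's exact L* formula is not reproduced in this summary, so as a rough illustration of the underlying idea (a base-rate-adjusted transition statistic), here is a minimal sketch using a simple lift: each observed transition probability is divided by the probability expected from the labels' base rates alone. The function name `transition_lift` and the tiny example sequences are ours, not the paper's.

```python
from collections import Counter

def transition_lift(sequences):
    # Compare observed transition probabilities to those expected by chance.
    # Lift for a -> b is P(a -> b) / (P(a) * P(b)); values well above 1
    # suggest the transition occurs more often than base rates predict.
    pair_counts = Counter()
    label_counts = Counter()
    n_pairs = 0
    for seq in sequences:
        label_counts.update(seq)
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            n_pairs += 1
    n_labels = sum(label_counts.values())
    lifts = {}
    for (a, b), count in pair_counts.items():
        p_obs = count / n_pairs
        p_exp = (label_counts[a] / n_labels) * (label_counts[b] / n_labels)
        lifts[(a, b)] = p_obs / p_exp
    return lifts

example = [['Q', 'M', 'S', 'M'], ['Q', 'Q', 'M', 'S']]
for pair, lift in sorted(transition_lift(example).items()):
    print(pair, round(lift, 2))
```

A statistic of this family (e.g., D'Mello and Graesser's L) additionally rescales the difference between observed and expected probabilities; the lift above is only the simplest base-rate correction.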

Python Activity

STEP 1: Generate Simulated Data

You will simulate submission log data for several groups. Each entry includes a group ID and a submission timestamp (in seconds).

import pandas as pd
import numpy as np
np.random.seed(0)
data = []
for group in range(1, 6):  # simulate 5 groups
    # Generate 10 submission timestamps per group (in seconds) and sort them
    times = np.sort(np.random.randint(30, 300, size=10))
    for t in times:
        data.append({'group_id': group, 'submission_time': t})
df = pd.DataFrame(data)
print("Synthetic submission data:")
print(df)
Synthetic submission data:
    group_id  submission_time
0          1               39
1          1               77
2          1              117
3          1              147
4          1              202
5          1              222
6          1              225
7          1              241
8          1              272
9          1              281
10         2               55
11         2               69
12         2              100
13         2              102
14         2              117
15         2              118
16         2              118
17         2              195
18         2              204
19         2              223
20         3              129
21         3              145
22         3              177
23         3              177
24         3              207
25         3              227
26         3              273
27         3              273
28         3              295
29         3              295
30         4               58
31         4               61
32         4               62
33         4              157
34         4              181
35         4              193
36         4              213
37         4              215
38         4              232
39         4              274
40         5               61
41         5               68
42         5               72
43         5               83
44         5               87
45         5              135
46         5              158
47         5              158
48         5              274
49         5              287

STEP 2: Compute Time Intervals Between Submissions

df['time_diff'] = df.groupby('group_id')['submission_time'].diff().fillna(0)
print("\nSubmission data with time differences:")
print(df)

Submission data with time differences:
    group_id  submission_time  time_diff
0          1               39        0.0
1          1               77       38.0
2          1              117       40.0
3          1              147       30.0
4          1              202       55.0
5          1              222       20.0
6          1              225        3.0
7          1              241       16.0
8          1              272       31.0
9          1              281        9.0
10         2               55        0.0
11         2               69       14.0
12         2              100       31.0
13         2              102        2.0
14         2              117       15.0
15         2              118        1.0
16         2              118        0.0
17         2              195       77.0
18         2              204        9.0
19         2              223       19.0
20         3              129        0.0
21         3              145       16.0
22         3              177       32.0
23         3              177        0.0
24         3              207       30.0
25         3              227       20.0
26         3              273       46.0
27         3              273        0.0
28         3              295       22.0
29         3              295        0.0
30         4               58        0.0
31         4               61        3.0
32         4               62        1.0
33         4              157       95.0
34         4              181       24.0
35         4              193       12.0
36         4              213       20.0
37         4              215        2.0
38         4              232       17.0
39         4              274       42.0
40         5               61        0.0
41         5               68        7.0
42         5               72        4.0
43         5               83       11.0
44         5               87        4.0
45         5              135       48.0
46         5              158       23.0
47         5              158        0.0
48         5              274      116.0
49         5              287       13.0

STEP 3: Label Each Submission (Q, M, or S)

Exclude the first submission (time_diff = 0) when calculating percentiles.

time_diffs = df[df['time_diff'] > 0]['time_diff']
q25 = np.percentile(time_diffs, 25)
q75 = np.percentile(time_diffs, 75)
print("\n25th percentile:", q25, "75th percentile:", q75)

def label_submission(duration, q25, q75):
    # Label submissions based on duration thresholds.
    # Note: time_diff is also 0 when two submissions share a timestamp,
    # so those mid-sequence entries are marked 'Start' as well.
    if duration == 0:
        return 'Start'  # mark the first submission
    if duration < q25:
        return 'Q'      # Quick submission
    elif duration < q75:
        return 'M'      # Medium submission
    else:
        return 'S'      # Slow submission

df['label'] = df['time_diff'].apply(lambda x: label_submission(x, q25, q75))
print("\nSubmission data with labels:")
print(df)

25th percentile: 9.0 75th percentile: 31.25

Submission data with labels:
    group_id  submission_time  time_diff  label
0          1               39        0.0  Start
1          1               77       38.0      S
2          1              117       40.0      S
3          1              147       30.0      M
4          1              202       55.0      S
5          1              222       20.0      M
6          1              225        3.0      Q
7          1              241       16.0      M
8          1              272       31.0      M
9          1              281        9.0      M
10         2               55        0.0  Start
11         2               69       14.0      M
12         2              100       31.0      M
13         2              102        2.0      Q
14         2              117       15.0      M
15         2              118        1.0      Q
16         2              118        0.0  Start
17         2              195       77.0      S
18         2              204        9.0      M
19         2              223       19.0      M
20         3              129        0.0  Start
21         3              145       16.0      M
22         3              177       32.0      S
23         3              177        0.0  Start
24         3              207       30.0      M
25         3              227       20.0      M
26         3              273       46.0      S
27         3              273        0.0  Start
28         3              295       22.0      M
29         3              295        0.0  Start
30         4               58        0.0  Start
31         4               61        3.0      Q
32         4               62        1.0      Q
33         4              157       95.0      S
34         4              181       24.0      M
35         4              193       12.0      M
36         4              213       20.0      M
37         4              215        2.0      Q
38         4              232       17.0      M
39         4              274       42.0      S
40         5               61        0.0  Start
41         5               68        7.0      Q
42         5               72        4.0      Q
43         5               83       11.0      M
44         5               87        4.0      Q
45         5              135       48.0      S
46         5              158       23.0      M
47         5              158        0.0  Start
48         5              274      116.0      S
49         5              287       13.0      M

STEP 4: Extract Submission Sequences Per Group

Create a dictionary mapping each group to its sequence of labels (excluding the ‘Start’ label).

group_sequences = {}
for group, group_df in df.groupby('group_id'):
    seq = list(group_df[group_df['label'] != 'Start']['label'])
    group_sequences[group] = seq

print("\nOriginal sequences for each group:")
for group, seq in group_sequences.items():
    print(f"Group {group}: {seq}")

Original sequences for each group:
Group 1: ['S', 'S', 'M', 'S', 'M', 'Q', 'M', 'M', 'M']
Group 2: ['M', 'M', 'Q', 'M', 'Q', 'S', 'M', 'M']
Group 3: ['M', 'S', 'M', 'M', 'S', 'M']
Group 4: ['Q', 'Q', 'S', 'M', 'M', 'M', 'Q', 'M', 'S']
Group 5: ['Q', 'Q', 'M', 'Q', 'S', 'M', 'S', 'M']

STEP 5: Sequence Compacting Function

The function applies two rules:
• Compress consecutive identical labels: if the same label appears three or more times in a row, replace the run with label.LNG.
• Merge consecutive Q and M labels: if a mixed run of Q and M labels has length three or more, replace it with QM.LNG.

def compact_sequence(seq):
    # Rule 1: Compress consecutive identical labels
    if not seq:
        return []
    
    compacted = []
    i = 0
    while i < len(seq):
        count = 1
        # Count how many times the same label appears consecutively.
        while i + count < len(seq) and seq[i + count] == seq[i]:
            count += 1
        if count >= 3:
            compacted.append(seq[i] + ".LNG")
        else:
            compacted.extend([seq[i]] * count)
        i += count

    # Rule 2: Merge consecutive Q and M labels (that are not already marked as LNG)
    final_seq = []
    i = 0
    while i < len(compacted):
        # If the label is Q or M, start checking for a contiguous segment.
        if compacted[i] in ['Q', 'M']:
            j = i
            temp = []
            while j < len(compacted) and compacted[j] in ['Q', 'M']:
                temp.append(compacted[j])
                j += 1
            if len(temp) >= 3:
                final_seq.append("QM.LNG")
            else:
                final_seq.extend(temp)
            i = j
        else:
            final_seq.append(compacted[i])
            i += 1
    return final_seq

# Apply sequence compacting for each group.
compacted_sequences = {}
for group, seq in group_sequences.items():
    compacted_sequences[group] = compact_sequence(seq)

print("\nCompacted sequences for each group:")
for group, seq in compacted_sequences.items():
    print(f"Group {group}: {seq}")

Compacted sequences for each group:
Group 1: ['S', 'S', 'M', 'S', 'M', 'Q', 'M.LNG']
Group 2: ['QM.LNG', 'S', 'M', 'M']
Group 3: ['M', 'S', 'M', 'M', 'S', 'M']
Group 4: ['Q', 'Q', 'S', 'M.LNG', 'Q', 'M', 'S']
Group 5: ['QM.LNG', 'S', 'M', 'S', 'M']

STEP 6: Compute Transition Counts Between Patterns

For each compacted sequence, count transitions between consecutive patterns.

def compute_transitions(seq):
    transitions = {}
    for i in range(len(seq) - 1):
        transition = (seq[i], seq[i+1])
        transitions[transition] = transitions.get(transition, 0) + 1
    return transitions

all_transitions = {}
for group, seq in compacted_sequences.items():
    all_transitions[group] = compute_transitions(seq)

print("\nTransition counts for each group:")
for group, trans in all_transitions.items():
    print(f"Group {group}: {trans}")

Transition counts for each group:
Group 1: {('S', 'S'): 1, ('S', 'M'): 2, ('M', 'S'): 1, ('M', 'Q'): 1, ('Q', 'M.LNG'): 1}
Group 2: {('QM.LNG', 'S'): 1, ('S', 'M'): 1, ('M', 'M'): 1}
Group 3: {('M', 'S'): 2, ('S', 'M'): 2, ('M', 'M'): 1}
Group 4: {('Q', 'Q'): 1, ('Q', 'S'): 1, ('S', 'M.LNG'): 1, ('M.LNG', 'Q'): 1, ('Q', 'M'): 1, ('M', 'S'): 1}
Group 5: {('QM.LNG', 'S'): 1, ('S', 'M'): 2, ('M', 'S'): 1}

STEP 7: Aggregate Transitions and Compute Probabilities

aggregate_counts = {}
total_transitions = 0
for trans in all_transitions.values():
    for k, v in trans.items():
        aggregate_counts[k] = aggregate_counts.get(k, 0) + v
        total_transitions += v

transition_probabilities = {k: v/total_transitions for k, v in aggregate_counts.items()}
print("\nAggregate transition probabilities:")
for k, v in transition_probabilities.items():
    print(f"{k} : {v:.2f}")

Aggregate transition probabilities:
('S', 'S') : 0.04
('S', 'M') : 0.29
('M', 'S') : 0.21
('M', 'Q') : 0.04
('Q', 'M.LNG') : 0.04
('QM.LNG', 'S') : 0.08
('M', 'M') : 0.08
('Q', 'Q') : 0.04
('Q', 'S') : 0.04
('S', 'M.LNG') : 0.04
('M.LNG', 'Q') : 0.04
('Q', 'M') : 0.04

STEP 8: Cluster and Correlate Transition Patterns

Build a transition frequency matrix (rows = groups, columns = transitions), apply hierarchical clustering, and examine correlations among transition frequencies.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage, dendrogram

# ------------------------------
# Step A: Build Transition Frequency Matrix
# ------------------------------
# 1. Identify all unique transitions from all groups.
all_possible_transitions = set()
for group, trans_dict in all_transitions.items():
    for trans in trans_dict.keys():
        all_possible_transitions.add(trans)
all_possible_transitions = sorted(all_possible_transitions)  # Sort for consistency

# 2. Build a DataFrame: rows = groups, columns = transitions.
# We'll use the normalized frequency (count divided by total transitions for that group) 
# so that groups with different numbers of submissions are comparable.
group_ids = list(all_transitions.keys())
data_matrix = []
for group in group_ids:
    group_counts = []
    total = sum(all_transitions[group].values())
    for trans in all_possible_transitions:
        count = all_transitions[group].get(trans, 0)
        freq = count / total if total > 0 else 0
        group_counts.append(freq)
    data_matrix.append(group_counts)

# Use a more readable column naming format, e.g., "Q->M" for a transition from Q to M.
transition_columns = [f"{src}->{dst}" for src, dst in all_possible_transitions]
transition_df = pd.DataFrame(data_matrix, index=group_ids, columns=transition_columns)
print("Transition Frequency DataFrame:")
print(transition_df)

# ------------------------------
# Step B: Hierarchical Clustering
# ------------------------------
linked = linkage(transition_df, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linked, labels=transition_df.index.astype(str))
plt.title("Hierarchical Clustering Dendrogram of Groups")
plt.xlabel("Group ID")
plt.ylabel("Ward Distance")
plt.show()

# ------------------------------
# Step C: Correlation Analysis of Transitions
# ------------------------------
# Compute the Pearson correlation among transition frequencies across groups.
correlation_matrix = transition_df.corr(method='pearson')
print("\nCorrelation Matrix of Transitions:")
print(correlation_matrix)

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix of Transition Frequencies")
plt.show()
Transition Frequency DataFrame:
       M->M      M->Q      M->S  M.LNG->Q      Q->M  Q->M.LNG      Q->Q  \
1  0.000000  0.166667  0.166667  0.000000  0.000000  0.166667  0.000000   
2  0.333333  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3  0.200000  0.000000  0.400000  0.000000  0.000000  0.000000  0.000000   
4  0.000000  0.000000  0.166667  0.166667  0.166667  0.000000  0.166667   
5  0.000000  0.000000  0.250000  0.000000  0.000000  0.000000  0.000000   

       Q->S  QM.LNG->S      S->M  S->M.LNG      S->S  
1  0.000000   0.000000  0.333333  0.000000  0.166667  
2  0.000000   0.333333  0.333333  0.000000  0.000000  
3  0.000000   0.000000  0.400000  0.000000  0.000000  
4  0.166667   0.000000  0.000000  0.166667  0.000000  
5  0.000000   0.250000  0.500000  0.000000  0.000000  


Correlation Matrix of Transitions:
               M->M      M->Q      M->S  M.LNG->Q      Q->M  Q->M.LNG  \
M->M       1.000000 -0.388514 -0.278659 -0.388514 -0.388514 -0.388514   
M->Q      -0.388514  1.000000 -0.115271 -0.250000 -0.250000  1.000000   
M->S      -0.278659 -0.115271  1.000000 -0.115271 -0.115271 -0.115271   
M.LNG->Q  -0.388514 -0.250000 -0.115271  1.000000  1.000000 -0.250000   
Q->M      -0.388514 -0.250000 -0.115271  1.000000  1.000000 -0.250000   
Q->M.LNG  -0.388514  1.000000 -0.115271 -0.250000 -0.250000  1.000000   
Q->Q      -0.388514 -0.250000 -0.115271  1.000000  1.000000 -0.250000   
Q->S      -0.388514 -0.250000 -0.115271  1.000000  1.000000 -0.250000   
QM.LNG->S  0.490222 -0.401478 -0.552406 -0.401478 -0.401478 -0.401478   
S->M       0.207976  0.059479  0.296594 -0.931836 -0.931836  0.059479   
S->M.LNG  -0.388514 -0.250000 -0.115271  1.000000  1.000000 -0.250000   
S->S      -0.388514  1.000000 -0.115271 -0.250000 -0.250000  1.000000   

               Q->Q      Q->S  QM.LNG->S      S->M  S->M.LNG      S->S  
M->M      -0.388514 -0.388514   0.490222  0.207976 -0.388514 -0.388514  
M->Q      -0.250000 -0.250000  -0.401478  0.059479 -0.250000  1.000000  
M->S      -0.115271 -0.115271  -0.552406  0.296594 -0.115271 -0.115271  
M.LNG->Q   1.000000  1.000000  -0.401478 -0.931836  1.000000 -0.250000  
Q->M       1.000000  1.000000  -0.401478 -0.931836  1.000000 -0.250000  
Q->M.LNG  -0.250000 -0.250000  -0.401478  0.059479 -0.250000  1.000000  
Q->Q       1.000000  1.000000  -0.401478 -0.931836  1.000000 -0.250000  
Q->S       1.000000  1.000000  -0.401478 -0.931836  1.000000 -0.250000  
QM.LNG->S -0.401478 -0.401478   1.000000  0.436652 -0.401478 -0.401478  
S->M      -0.931836 -0.931836   0.436652  1.000000 -0.931836  0.059479  
S->M.LNG   1.000000  1.000000  -0.401478 -0.931836  1.000000 -0.250000  
S->S      -0.250000 -0.250000  -0.401478  0.059479 -0.250000  1.000000  

Notes

Although this approach is a form of sequential pattern mining, it is not the most common form used in the field. An approach more similar to association rule mining (but considering only patterns that span time) is used more often; we present this example as an alternative, to give you a sense of the range of methods used for relationship mining in the field.
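For a concrete sense of that more common, support-counting style, here is a minimal sketch (not Zhou's method): it counts the fraction of group label sequences that contain a given ordered pair of labels, with gaps allowed, in the spirit of classical algorithms such as GSP or PrefixSpan. The helper names and example sequences are invented for illustration.

```python
def contains_subsequence(seq, pattern):
    # True if pattern occurs in seq in order, with gaps allowed
    # (the containment notion used in classical sequential pattern mining).
    it = iter(seq)
    return all(label in it for label in pattern)

def frequent_pairs(sequences, min_support=0.5):
    # Support of an ordered pair (a, b) = fraction of sequences containing it;
    # keep only pairs whose support meets the threshold.
    labels = sorted({label for seq in sequences for label in seq})
    support = {}
    for a in labels:
        for b in labels:
            hits = sum(contains_subsequence(seq, (a, b)) for seq in sequences)
            frac = hits / len(sequences)
            if frac >= min_support:
                support[(a, b)] = frac
    return support

example = [['Q', 'M', 'S'], ['Q', 'S', 'M'], ['M', 'Q', 'S']]
print(frequent_pairs(example, min_support=2/3))
```

Full implementations extend this idea to patterns of arbitrary length by growing frequent prefixes, but the support-counting core is the same.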

Reference

Zhou, Y. (2023). Identifying collaborative problem-solving behaviors using sequential pattern mining. 2023 American Society for Engineering Education Annual Conference & Exposition. American Society for Engineering Education.