This case study is inspired by Zhou (2023). The paper’s method for sequential pattern mining is structured as a multi-step process that transforms raw submission log data into interpretable temporal patterns of group problem solving. Here’s a focused summary of their methodological approach:
1. Submission Labeling
Log data from an online assessment platform (PrairieLearn) is used to capture each “save and grade” event. The time interval between consecutive submissions is computed, and each submission is then classified into one of three categories:
• Quick (Q): Submission time below the 25th percentile.
• Medium (M): Submission time between the 25th and 75th percentiles.
• Slow (S): Submission time above the 75th percentile.
This classification is performed separately for each database query language and semester to adjust for varying conditions.
2. Sequence Compacting and Pattern Extraction
To reduce noise and capture meaningful problem-solving behaviors, the authors introduce a sequence compacting algorithm with two rules:
• Repetitive Compression: Three or more consecutive identical labels are merged into a single event, under the assumption that such repetition is unlikely to occur by chance.
• Consecutive Q/M Compacting: Since mixed quick and medium attempts occur frequently, consecutive Q and M labels are compacted into a single representative pattern.
This process yields seven distinct pattern types (Q, Q.LNG, M, M.LNG, QM.LNG, S, S.LNG) that encapsulate key behavioral sequences.
3. Transition Analysis Using Sequential Pattern Mining
After compacting, the focus shifts to understanding the transitions between the identified patterns. Two main analytic techniques are employed:
• Clustering: Agglomerative hierarchical clustering (using Ward’s method) is applied to group similar sequences. Although effective for shorter sequences, the approach faces challenges when sequences become longer.
• Transition Metrics and Correlation: The L* metric is introduced to evaluate the probability of transitions between pattern types, adjusting for base rates by comparing observed transition frequencies to those expected by chance. This metric allows the researchers to identify which transitions (e.g., from S to M.LNG versus Q.LNG to S.LNG) are statistically significant indicators of either effective or struggling problem-solving behaviors.
Together, these methods form a robust pipeline that transforms raw log data into interpretable temporal patterns, ultimately providing insights into group-level collaborative problem-solving strategies.
Python Activity
STEP 1: Generate Simulated Data
You will simulate submission log data for several groups. Each entry includes a group ID and a submission timestamp (in seconds).
import pandas as pd
import numpy as np

np.random.seed(0)
data = []
for group in range(1, 6):  # simulate 5 groups
    # Generate 10 submission timestamps per group (in seconds) and sort them
    times = np.sort(np.random.randint(30, 300, size=10))
    for t in times:
        data.append({'group_id': group, 'submission_time': t})
df = pd.DataFrame(data)
print("Synthetic submission data:")
print(df)
STEPS 2-3: Compute Time Differences and Label Submissions
First compute the time difference between consecutive submissions within each group; the first submission in each group gets time_diff = 0 and is labeled 'Start'. Then label the remaining submissions as Q, M, or S using the 25th and 75th percentiles of the positive time differences. Exclude the first submission (time_diff = 0) when calculating percentiles.
# Compute the time difference between consecutive submissions within each group;
# the first submission in each group gets a difference of 0.
df['time_diff'] = df.groupby('group_id')['submission_time'].diff().fillna(0)

time_diffs = df[df['time_diff'] > 0]['time_diff']
q25 = np.percentile(time_diffs, 25)
q75 = np.percentile(time_diffs, 75)
print("\n25th percentile:", q25, "75th percentile:", q75)

def label_submission(duration, q25, q75):
    # Label submissions based on duration thresholds.
    if duration == 0:
        return 'Start'  # mark the first submission
    if duration < q25:
        return 'Q'      # Quick submission
    elif duration < q75:
        return 'M'      # Medium submission
    else:
        return 'S'      # Slow submission

df['label'] = df['time_diff'].apply(lambda x: label_submission(x, q25, q75))
print("\nSubmission data with labels:")
print(df)
25th percentile: 9.0 75th percentile: 31.25
Submission data with labels:
group_id submission_time time_diff label
0 1 39 0.0 Start
1 1 77 38.0 S
2 1 117 40.0 S
3 1 147 30.0 M
4 1 202 55.0 S
5 1 222 20.0 M
6 1 225 3.0 Q
7 1 241 16.0 M
8 1 272 31.0 M
9 1 281 9.0 M
10 2 55 0.0 Start
11 2 69 14.0 M
12 2 100 31.0 M
13 2 102 2.0 Q
14 2 117 15.0 M
15 2 118 1.0 Q
16 2 118 0.0 Start
17 2 195 77.0 S
18 2 204 9.0 M
19 2 223 19.0 M
20 3 129 0.0 Start
21 3 145 16.0 M
22 3 177 32.0 S
23 3 177 0.0 Start
24 3 207 30.0 M
25 3 227 20.0 M
26 3 273 46.0 S
27 3 273 0.0 Start
28 3 295 22.0 M
29 3 295 0.0 Start
30 4 58 0.0 Start
31 4 61 3.0 Q
32 4 62 1.0 Q
33 4 157 95.0 S
34 4 181 24.0 M
35 4 193 12.0 M
36 4 213 20.0 M
37 4 215 2.0 Q
38 4 232 17.0 M
39 4 274 42.0 S
40 5 61 0.0 Start
41 5 68 7.0 Q
42 5 72 4.0 Q
43 5 83 11.0 M
44 5 87 4.0 Q
45 5 135 48.0 S
46 5 158 23.0 M
47 5 158 0.0 Start
48 5 274 116.0 S
49 5 287 13.0 M
STEP 4: Extract Submission Sequences Per Group
Create a dictionary mapping each group to its sequence of labels (excluding the ‘Start’ label).
group_sequences = {}
for group, group_df in df.groupby('group_id'):
    seq = list(group_df[group_df['label'] != 'Start']['label'])
    group_sequences[group] = seq
print("\nOriginal sequences for each group:")
for group, seq in group_sequences.items():
    print(f"Group {group}: {seq}")
Original sequences for each group:
Group 1: ['S', 'S', 'M', 'S', 'M', 'Q', 'M', 'M', 'M']
Group 2: ['M', 'M', 'Q', 'M', 'Q', 'S', 'M', 'M']
Group 3: ['M', 'S', 'M', 'M', 'S', 'M']
Group 4: ['Q', 'Q', 'S', 'M', 'M', 'M', 'Q', 'M', 'S']
Group 5: ['Q', 'Q', 'M', 'Q', 'S', 'M', 'S', 'M']
STEP 5: Sequence Compacting Function
The function applies two rules: (1) compress runs of three or more consecutive identical labels into a single label.LNG event; (2) merge runs of three or more consecutive Q and M labels into a single QM.LNG event.
def compact_sequence(seq):
    # Rule 1: Compress consecutive identical labels
    if not seq:
        return []
    compacted = []
    i = 0
    while i < len(seq):
        count = 1
        # Count how many times the same label appears consecutively.
        while i + count < len(seq) and seq[i + count] == seq[i]:
            count += 1
        if count >= 3:
            compacted.append(seq[i] + ".LNG")
        else:
            compacted.extend([seq[i]] * count)
        i += count

    # Rule 2: Merge consecutive Q and M labels (that are not already marked as LNG)
    final_seq = []
    i = 0
    while i < len(compacted):
        # If the label is Q or M, start checking for a contiguous segment.
        if compacted[i] in ['Q', 'M']:
            j = i
            temp = []
            while j < len(compacted) and compacted[j] in ['Q', 'M']:
                temp.append(compacted[j])
                j += 1
            if len(temp) >= 3:
                final_seq.append("QM.LNG")
            else:
                final_seq.extend(temp)
            i = j
        else:
            final_seq.append(compacted[i])
            i += 1
    return final_seq

# Apply sequence compacting for each group.
compacted_sequences = {}
for group, seq in group_sequences.items():
    compacted_sequences[group] = compact_sequence(seq)
print("\nCompacted sequences for each group:")
for group, seq in compacted_sequences.items():
    print(f"Group {group}: {seq}")
Compacted sequences for each group:
Group 1: ['S', 'S', 'M', 'S', 'M', 'Q', 'M.LNG']
Group 2: ['QM.LNG', 'S', 'M', 'M']
Group 3: ['M', 'S', 'M', 'M', 'S', 'M']
Group 4: ['Q', 'Q', 'S', 'M.LNG', 'Q', 'M', 'S']
Group 5: ['QM.LNG', 'S', 'M', 'S', 'M']
STEP 6: Compute Transition Counts Between Patterns
For each compacted sequence, count transitions between consecutive patterns.
def compute_transitions(seq):
    transitions = {}
    for i in range(len(seq) - 1):
        transition = (seq[i], seq[i + 1])
        transitions[transition] = transitions.get(transition, 0) + 1
    return transitions

all_transitions = {}
for group, seq in compacted_sequences.items():
    all_transitions[group] = compute_transitions(seq)
print("\nTransition counts for each group:")
for group, trans in all_transitions.items():
    print(f"Group {group}: {trans}")
STEP 7: Aggregate Transitions and Compute Probabilities
aggregate_counts = {}
total_transitions = 0
for trans in all_transitions.values():
    for k, v in trans.items():
        aggregate_counts[k] = aggregate_counts.get(k, 0) + v
        total_transitions += v
transition_probabilities = {k: v / total_transitions for k, v in aggregate_counts.items()}
print("\nAggregate transition probabilities:")
for k, v in transition_probabilities.items():
    print(f"{k} : {v:.2f}")
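The paper's L* metric (summarized above) adjusts transition probabilities for base rates, but the activity stops at raw probabilities. Below is a minimal lift-style sketch of that idea; the formula and the toy counts are illustrative assumptions, not the paper's exact definition.

```python
# Illustrative sketch (NOT the paper's exact L* formula): compare the observed
# conditional probability P(dst | src) against the overall base rate P(dst).
# A ratio above 1 means the transition occurs more often than chance.

# Toy aggregate transition counts (assumed for illustration).
aggregate_counts = {('Q', 'M'): 4, ('M', 'S'): 3, ('S', 'M'): 5, ('M', 'M'): 2}
total = sum(aggregate_counts.values())

# Totals per source pattern and per destination pattern.
src_counts, dst_counts = {}, {}
for (src, dst), c in aggregate_counts.items():
    src_counts[src] = src_counts.get(src, 0) + c
    dst_counts[dst] = dst_counts.get(dst, 0) + c

lift = {}
for (src, dst), c in aggregate_counts.items():
    observed = c / src_counts[src]       # conditional P(dst | src)
    expected = dst_counts[dst] / total   # base rate P(dst)
    lift[(src, dst)] = observed / expected

for k, v in sorted(lift.items()):
    print(k, round(v, 2))
```

With these toy counts, M->S has lift 2.8 (it happens far more often than S's base rate would predict), flagging it as a candidate pattern; the paper additionally tests such transitions for statistical significance against chance.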
STEP 8: Cluster Groups and Analyze Transition Correlations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import linkage, dendrogram

# ------------------------------
# Step A: Build Transition Frequency Matrix
# ------------------------------
# 1. Identify all unique transitions from all groups.
all_possible_transitions = set()
for group, trans_dict in all_transitions.items():
    for trans in trans_dict.keys():
        all_possible_transitions.add(trans)
all_possible_transitions = sorted(all_possible_transitions)  # Sort for consistency

# 2. Build a DataFrame: rows = groups, columns = transitions.
#    We use the normalized frequency (count divided by total transitions for
#    that group) so that groups with different numbers of submissions are comparable.
group_ids = list(all_transitions.keys())
data_matrix = []
for group in group_ids:
    group_counts = []
    total = sum(all_transitions[group].values())
    for trans in all_possible_transitions:
        count = all_transitions[group].get(trans, 0)
        freq = count / total if total > 0 else 0
        group_counts.append(freq)
    data_matrix.append(group_counts)

# Use a readable column naming format, e.g., "Q->M" for a transition from Q to M.
transition_columns = [f"{src}->{dst}" for src, dst in all_possible_transitions]
transition_df = pd.DataFrame(data_matrix, index=group_ids, columns=transition_columns)
print("Transition Frequency DataFrame:")
print(transition_df)

# ------------------------------
# Step B: Hierarchical Clustering
# ------------------------------
linked = linkage(transition_df, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked, labels=transition_df.index.astype(str))
plt.title("Hierarchical Clustering Dendrogram of Groups")
plt.xlabel("Group ID")
plt.ylabel("Ward Distance")
plt.show()

# ------------------------------
# Step C: Correlation Analysis of Transitions
# ------------------------------
# Compute the Pearson correlation among transition frequencies across groups.
correlation_matrix = transition_df.corr(method='pearson')
print("\nCorrelation Matrix of Transitions:")
print(correlation_matrix)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix of Transition Frequencies")
plt.show()
As a side note, although this approach is genuinely a form of sequential pattern mining, it is not the most common one used in the field. Approaches closer to association rule mining (but restricted to patterns that span time) are more widely used; we present this example as an alternative, to give you a sense of the range of methods being used for relationship mining in the field.
Reference
Zhou, Y. (2023). Identifying collaborative problem-solving behaviors using sequential pattern mining. 2023 American Society for Engineering Education Annual Conference & Exposition. American Society for Engineering Education.