Module 1: Bayesian Knowledge Tracing

A Little History

The classic approach for measuring tightly defined skills in online learning
First proposed by Richard Atkinson
Most thoroughly articulated and studied by Albert Corbett and John Anderson (Corbett & Anderson (1995))

Flexibility of BKT

Has been around a long time
Still the most widely used knowledge tracing algorithm at scale today
- Interpretable
- Predictable
- Decent performance

The Key Goal of BKT

Measuring how well a student knows a specific skill/knowledge component at a specific time
Based on their past history of performance with that skill/KC

What is the typical use of BKT?

Assess a student’s knowledge of skill/KC X

Based on a sequence of items that are scored between 0 and 1
- Classically 0 or 1, but there are variants that relax this

Where each item corresponds to a single skill

Where the student can learn on each item, due to help, feedback, scaffolding, etc.

Key assumptions of BKT

Each item must involve a single latent trait or skill
- Different from PFA, which we’ll talk about next lecture

Each skill has four parameters

Only the first attempt on each item matters
- i.e. is included in calculations

Help use is usually treated the same as an incorrect response
- Some exceptions I will discuss later

Key assumptions of BKT

Each skill has four parameters
From these parameters, and the pattern of successes and failures the student has had on each relevant skill so far

We can compute
- Latent knowledge P(L_n)
- The probability P(CORR) that the learner will get the item correct

Key assumptions of BKT

We assume that when P(Ln) reaches 0.95, a student has attained mastery of a skill.
This represents the point at which the probability of a student being in the mastery state exceeds 95%.
When we detect that this happens, it can be used as a condition in the learning system to move to new material.

Key assumptions of BKT

Two-state learning model

Each skill is either learned or unlearned

In problem-solving, the student can learn a skill at each opportunity to apply the skill
Each problem (opportunity) has the same chance of learning.

A student does not forget a skill, once he or she knows it

Model Performance Assumptions

If the student knows a skill, there is still some chance the student will slip and make a mistake.

If the student does not know a skill, there is still some chance the student will guess correctly.

Comments or Questions?

Learning Parameters

Two Learning Parameters

p(L₀). Probability the skill is already known before the first opportunity to use the skill in problem solving.

p(T). Probability the skill will be learned at each opportunity to use the skill.

Performance Parameters

Two Performance Parameters

p(G). Probability the student will guess correctly if the skill is not known.

p(S). Probability the student will slip (make a mistake) if the skill is known.

Comments? Questions?

Predicting Current Student Correctness

P(CORR) = P(L_n)P(~S) + P(~L_n)P(G)

P(L_n) Probability the student knows the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.
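
As a quick numeric check of the P(CORR) formula, here is a minimal Python sketch. The function name `p_correct` and the input values are illustrative assumptions, not part of any BKT package.

```python
def p_correct(p_known: float, p_slip: float, p_guess: float) -> float:
    """P(CORR) = P(L_n) * (1 - P(S)) + (1 - P(L_n)) * P(G)."""
    return p_known * (1.0 - p_slip) + (1.0 - p_known) * p_guess


# Illustrative values: P(L_n) = 0.4, P(S) = 0.3, P(G) = 0.2
# 0.4 * 0.7 + 0.6 * 0.2 = 0.28 + 0.12 = 0.40
print(round(p_correct(0.4, 0.3, 0.2), 2))
```

Note that P(T) does not appear here: it only matters when the knowledge estimate is updated after the response.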

Bayesian Knowledge Tracing

Whenever the student has an opportunity to use a skill
The probability that the student knows the skill is updated
Using formulas derived from Bayes’ Theorem

In Bayesian Knowledge Tracing, when the student has an opportunity to use a skill, the probability that the student knows the skill is updated using formulas derived from Bayes’ Theorem.

Formulas

\[ P(L_{n-1}|Correct_{n}) = \frac{P(L_{n-1}) \cdot (1-P(S))}{P(L_{n-1}) \cdot (1-P(S))+(1-P(L_{n-1})) \cdot P(G)} \]

\[ P(L_{n-1}|Incorrect_{n}) = \frac{P(L_{n-1}) \cdot P(S)}{P(L_{n-1}) \cdot P(S)+(1-P(L_{n-1})) \cdot (1-P(G))} \]

\[ P(L_{n}|Action_{n}) = P(L_{n-1}|Action_{n}) + (1- P(L_{n-1}|Action_{n})) \cdot P(T) \]

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

In words, the formulas read as follows:

The probability that you knew the skill beforehand, given that you got the item correct, is the probability that you knew it and didn’t slip, divided by the sum of two probabilities: that you knew it and didn’t slip, and that you didn’t know it and guessed.
Similarly, if you got the item wrong, there are two ways that could have happened: you knew it and slipped, or you didn’t know it and didn’t guess correctly. So the probability that you knew it beforehand is the probability that you knew it and slipped, divided by the sum of those two possibilities.
Finally, once we know the probability that they knew it beforehand, given their correctness now, we can look at whether they learned it. The probability that they know it at time n given action n, that is, after the action, is the probability they knew it before the action, plus the probability they didn’t know it before the action times the probability that they learned it. For example, say there’s a 30% chance you knew it after the previous action and a 10% chance you learned it, so P of T is 10%. In that case, the probability you know it afterward is 0.3 plus 0.7, the probability you didn’t know it, times 0.1, which gives 0.37.
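
As a reminder of where the first formula comes from, here is a short derivation sketch that uses only the symbols already defined in this section. Bayes’ Theorem gives

\[ P(L_{n-1}|Correct_{n}) = \frac{P(Correct_{n}|L_{n-1}) \cdot P(L_{n-1})}{P(Correct_{n})} \]

A student who knows the skill answers correctly unless they slip, so P(Correct_n | L_n-1) = 1 - P(S). A student who does not know it answers correctly only by guessing, so P(Correct_n | ~L_n-1) = P(G). Summing over the two knowledge states gives the denominator:

\[ P(Correct_{n}) = P(L_{n-1}) \cdot (1-P(S)) + (1-P(L_{n-1})) \cdot P(G) \]

The incorrect case follows the same pattern, with P(S) and 1 - P(G) taking the places of 1 - P(S) and P(G).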

Example

P(L₀) = 0.4, P(T) = 0.1, P(S) = 0.3, P(G) = 0.2

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

First action: Actual = 0 (incorrect), starting from P(L₀) = 0.4

\[ P(L_{n-1}|actual) = \frac{(0.4)(0.3)}{(0.4)(0.3)+(0.6)(0.8)} = \frac{(0.12)}{(0.12)+(0.48)} = 0.2 \]

\[ P(L_{n}) = 0.2+(0.8)(0.1) = 0.28 \]

Second action: Actual = 1 (correct), starting from P(L_n-1) = 0.28

\[ P(L_{n-1}|actual) = \frac{(0.28)(0.7)}{(0.28)(0.7)+(0.72)(0.2)} = \frac{(0.196)}{(0.196)+(0.144)} \approx 0.58 \]

\[ P(L_{n}) = (0.58) + (0.42)(0.1) \approx 0.62 \]

Actual	P(L_n-1)	P(L_n-1|actual)	P(L_n)
0	0.4	0.2	0.28
1	0.28	0.58	0.62
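
The worked example can also be traced in code. The following is a minimal Python sketch of the update formulas, assuming 0/1 first-attempt responses and parameter values that keep the denominators non-zero. The function name `bkt_trace` is an illustrative choice, not part of any of the tools discussed in this module.

```python
def bkt_trace(responses, p_l0, p_t, p_s, p_g):
    """Return one row (actual, prior, posterior, updated) per first attempt."""
    rows = []
    p_known = p_l0
    for actual in responses:
        if actual == 1:
            # Correct: knew it and did not slip, versus did not know it and guessed
            num = p_known * (1 - p_s)
            den = num + (1 - p_known) * p_g
        else:
            # Incorrect: knew it and slipped, versus did not know it and did not guess
            num = p_known * p_s
            den = num + (1 - p_known) * (1 - p_g)
        posterior = num / den
        # Learning step: the student may learn the skill at this opportunity
        updated = posterior + (1 - posterior) * p_t
        rows.append((actual, p_known, posterior, updated))
        p_known = updated
    return rows


rows = bkt_trace([0, 1], p_l0=0.4, p_t=0.1, p_s=0.3, p_g=0.2)
for actual, prior, posterior, updated in rows:
    # Prints "0 0.40 0.20 0.28" and then "1 0.28 0.58 0.62" (tab-separated)
    print(f"{actual}\t{prior:.2f}\t{posterior:.2f}\t{updated:.2f}")
```

The sketch carries full precision between steps, whereas the table rounds 0.576 to 0.58 before the learning step. Both routes round to 0.62.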

Your Turn

P(L₀) = 0.4, P(T) = 0.1, P(S) = 0.3, P(G) = 0.2

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

Actual	P(L_n-1)	P(L_n-1|actual)	P(L_n)
0	0.4	0.2	0.28
1	0.28	0.58	0.62
1

Comments? Questions?

Parameter Constraints

Typically, the potential values of BKT parameters are constrained
To avoid model degeneracy

Conceptual Idea Behind Knowledge Tracing

Knowing a skill generally leads to correct performance
Correct performance implies that a student knows the relevant skill
Hence, by looking at whether a student’s performance is correct, we can infer whether they know the skill

Essentially

A knowledge model is degenerate when it violates this idea

When knowing a skill leads to worse performance

When getting a skill wrong means you know it

Parameter Constraints Proposed

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

Beck
- P(G)+P(S)<1.0
R. S. d. Baker, Corbett, & Aleven (2008):
- P(G)<0.5, P(S)<0.5
Corbett & Anderson (1995)
- P(G)<0.3, P(S)<0.1

Joe Beck has proposed that the probability of guessing plus the probability of slip must be less than 1.
Baker, Corbett, and Aleven have proposed that guess and slip each have to be less than 0.5.
Corbett and Anderson originally proposed that P of G has to be less than 0.3 and P of S has to be less than 0.1. This was not entirely just for model degeneracy, but was also based on some theorizing about what was probable.
Baker would say that when either guess or slip gets above 0.5, you’re in a situation where the behavior doesn’t mean what it looks like.
Beck would say that for some cases where the modeling can get difficult, specifically like with automated speech responses where you’re making inferences, there might be enough error that you might get a guess or slip above 0.5. But as long as they’re under 1.0 as a sum, it’s still okay.
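
To make the three proposals concrete, here is a small Python sketch that checks a guess/slip pair against each of them and shows what degeneracy looks like numerically. The function names, dictionary keys, and numeric values are illustrative assumptions.

```python
def constraint_report(p_guess: float, p_slip: float) -> dict:
    """Check a guess/slip pair against the three proposed constraints."""
    return {
        "beck": p_guess + p_slip < 1.0,                          # P(G)+P(S)<1.0
        "baker_2008": p_guess < 0.5 and p_slip < 0.5,            # P(G)<0.5, P(S)<0.5
        "corbett_anderson_1995": p_guess < 0.3 and p_slip < 0.1, # P(G)<0.3, P(S)<0.1
    }


def posterior_after_correct(p_known, p_guess, p_slip):
    """P(L_{n-1} | Correct_n) from the BKT update formula."""
    num = p_known * (1 - p_slip)
    return num / (num + (1 - p_known) * p_guess)


print(constraint_report(p_guess=0.2, p_slip=0.3))

# Hypothetical degenerate values: P(G) + P(S) = 1.3, so a correct answer
# *lowers* the estimate that the student knows the skill (0.5 -> about 0.36).
print(posterior_after_correct(0.5, p_guess=0.7, p_slip=0.6) < 0.5)  # True
```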

Knowledge Tracing

How do we know if a knowledge tracing model is any good?
Our primary goal is to predict knowledge

But knowledge is a latent trait

So we instead check our knowledge predictions by checking how well the model predicts performance

Fitting a Knowledge-Tracing Model

In principle, any set of four parameters can be used by knowledge-tracing
But parameters that predict student performance better are preferred

Knowledge-Tracing

So, we pick the knowledge tracing parameters that best predict performance
Defined as whether a student’s action will be correct or wrong at a given time
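
One way to picture "picking the parameters that best predict performance" is a brute-force grid search. The sketch below is an illustration only, not the implementation used by BKT-BF or any other tool named in this module. It assumes one skill, 0/1 first-attempt responses per student, sum of squared residuals as the fit measure, and a coarse grid with guess and slip kept below 0.5. The toy data are made up.

```python
import itertools


def sum_squared_residuals(sequences, p_l0, p_t, p_g, p_s):
    """Sum of (actual - P(CORR))^2 over every first attempt of every student."""
    total = 0.0
    for responses in sequences:
        p_known = p_l0
        for actual in responses:
            p_corr = p_known * (1 - p_s) + (1 - p_known) * p_g
            total += (actual - p_corr) ** 2
            if actual == 1:
                posterior = p_known * (1 - p_s) / p_corr
            else:
                # 1 - P(CORR) = P(L)P(S) + P(~L)(1 - P(G))
                posterior = p_known * p_s / (1 - p_corr)
            p_known = posterior + (1 - posterior) * p_t
    return total


def fit_bkt_grid(sequences, step=0.1, max_guess_slip=0.5):
    """Try every parameter combination on the grid and keep the best one."""
    n = round(1 / step)
    grid = [round(i * step, 10) for i in range(1, n)]  # interior points only
    gs_grid = [v for v in grid if v < max_guess_slip]
    best = None
    for p_l0, p_t, p_g, p_s in itertools.product(grid, grid, gs_grid, gs_grid):
        ssr = sum_squared_residuals(sequences, p_l0, p_t, p_g, p_s)
        if best is None or ssr < best[0]:
            best = (ssr, {"L0": p_l0, "T": p_t, "G": p_g, "S": p_s})
    return best


# Made-up responses for four students on one skill
toy = [[0, 0, 1, 1, 1], [0, 1, 1, 1, 1], [1, 1, 1, 1, 1], [0, 0, 0, 1, 1]]
ssr, params = fit_bkt_grid(toy)
print(params, round(ssr, 3))
```

With so little data the fitted values are not meaningful; the point is only the shape of the procedure. A finer `step` gives a closer fit at the cost of many more combinations.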

Are these the same thing?

Predicting performance on next attempt
Inferring latent knowledge

What are some alternate ways to assess

Whether a model is successful at inferring latent knowledge

Why aren’t those approaches used more often?

Comments? Questions?

Fit Methods

I could spend an hour talking about the ways to fit Bayesian Knowledge Tracing models.

Five Public Tools

hmmsclbl
- http://www.yudelson.info/hmm-scalable/
BNT-SM: Bayes Net Toolkit – Student Modeling
- http://www.cs.cmu.edu/~listen/BNT-SM/
BKT-BF: BKT-Brute Force (Grid Search)
- https://learninganalytics.upenn.edu/ryanbaker/BKT-BruteForce.zip
Python Grid Search (slower than BKT-BF)
- https://github.com/ChNabil/BKT_python_gridsearch
pyBKT
- https://github.com/CAHLR/pyBK

Which one should you use?

They’re all fine – they work approximately equally well
My group uses BKT-BF to fit Classical BKT and BNT-SM to fit variant models
But some commercial colleagues use Fit BKT at Scale

Note…

The Equation Solver in Excel replicably does worse for this problem than these packages

How much data do you need? Slater & Baker (2018)

Depends on your goal

Predict student mastery, if ok to be off by 2-3 problems:
- As few as 25 students and 3 problems apiece, if P(T) values are low
Predict student mastery, if higher precision desired:
- 250 students and 3 problems apiece

Make inferences about model parameter values (for example, to identify skills that need to be fixed)
- 250 students and 6 problems apiece

One common practical question that people ask is how much data you need to fit BKT.
The answer depends on your goal. As Slater and Baker’s large-scale simulation study showed, if you intend to predict student mastery, and it’s OK for the system to judge that the student has reached mastery two or three problems too early or too late, you can get away with as few as 25 students and 3 problems per skill, as long as P(T) values are low. If you do this and see high P(T) values, you might need to get more data. If you want high precision on exactly when the student reached mastery, you might want to go as high as 250 students, but 3 problems per student is still generally OK. A harder task is making inferences about model parameter values. You might do this if, for example, you want to find skills with really high slip or guess rates, in order to fix them. In this case, you need 250 students and 6 problems per student.

BKT: Core Uses

Mastery learning
Reports to teachers on student skill

BKT: Extended Uses

Use in behavior detectors (such as gaming the system)
Use to identify problematic skills for re-design (with very high slip or guess or initial knowledge)
Use in discovery with models analyses (such as correlating student in-platform learning to test scores)

BKT: Extended Uses

Conditionalizing P(T)
- Does help help? (Beck, Chang, Mostow, & Corbett (2008))
- Which content is most effective? (Ryan S. Baker, Gowda, & Salamin (2018))
P(T) Probability the skill will be learned at each opportunity to use the skill.

BKT: Extended Uses

Moment-by-moment learning estimation
(calculating P(T) in specific step)
Which moment-by-moment learning curves are associated with more robust learning? (Ryan S. Baker, Hershkovitz, Rossi, Goldstein, & Gowda (2013))
What behaviors predict “eureka” moments? (Moore, Baker, & Gowda (2015))
Which types of content are associated with more learning? (Slater et al. (2016))

P(T) Probability the skill will be learned at each opportunity to use the skill.

BKT: Extended Uses

Detecting carelessness (contextual slip)
(calculating P(S) in specific step)
Predicts test score (Pardos, Baker, San Pedro, Gowda, & Gowda (2014)), college enrollment (M. O. Pedro, Baker, Bowers, & Heffernan (2013)), job several years later (Almeda & Baker (2020))

P(S) Probability the student will slip (make a mistake) if the skill is known.

BKT: Extended Uses

Transfer assessment
(adding P(T) from other skills)
Used to study relationship between skills (M. S. Pedro, Jiang, Paquette, Baker, & Gobert (2014))
Including in graduate students learning research skills across several years (Kang et al. (2022))

P(T) Probability the skill will be learned at each opportunity to use the skill.

Further Discussion

How can you apply these methods to your own research or practice?

What’s NEXT?

Complete the ASSISTments activity
Complete the badge requirement document

Thank you! Any questions?

More Detail on Advanced BKT

BKT has strong assumptions
One of the key assumptions is that parameters vary by skill, but are constant for all other factors
What happens if we remove this assumption?

BKT with modified assumptions

Conditionalizing Help or Learning
Contextual Guess and Slip
Moment by Moment Learning
Modeling Transfer Between Skills

Beck, Chang, Mostow, & Corbett 2008

Beck, J.E., Chang, K-m., Mostow, J., Corbett, A. (2008) Does Help Help? Introducing the Bayesian Evaluation and Assessment Methodology. Proceedings of the International Conference on Intelligent Tutoring Systems.

Notes

In this model, help use is not treated as direct evidence of not knowing the skill
Instead, it is used to choose between parameters
Makes two variants of each parameter
- One assuming help was requested
- One assuming that help was not requested

Beck, et al.’s (2008) Help Model

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

Beck, et al.’s (2008) Help Model

Parameters per skill: 8

Fit using Expectation Maximization

- Takes too long to fit using Grid Search
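
The eight parameters per skill can be pictured as two ordinary four-parameter sets, with help use selecting which set applies. The Python sketch below is a loose illustration of that idea and not Beck et al.’s implementation. In particular, choosing the P(L₀) variant from the first opportunity is an assumption of this sketch, and all names and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    p_l0: float  # P(L0)
    p_t: float   # P(T)
    p_g: float   # P(G)
    p_s: float   # P(S)


def trace_with_help(attempts, help_params, no_help_params):
    """attempts: list of (correct, help_requested) pairs for one skill.

    Returns the knowledge estimate after each attempt.
    """
    if not attempts:
        return []
    # Assumption: the first opportunity decides which P(L0) variant is used
    p_known = (help_params if attempts[0][1] else no_help_params).p_l0
    history = []
    for correct, help_requested in attempts:
        p = help_params if help_requested else no_help_params
        if correct:
            num = p_known * (1 - p.p_s)
            den = num + (1 - p_known) * p.p_g
        else:
            num = p_known * p.p_s
            den = num + (1 - p_known) * (1 - p.p_g)
        posterior = num / den
        p_known = posterior + (1 - posterior) * p.p_t
        history.append(p_known)
    return history


# Hypothetical values: 4 parameters x 2 variants = 8 parameters for this skill
with_help = BKTParams(p_l0=0.3, p_t=0.15, p_g=0.3, p_s=0.2)
without_help = BKTParams(p_l0=0.4, p_t=0.1, p_g=0.2, p_s=0.3)
print(trace_with_help([(False, True), (True, False)], with_help, without_help))
```

When both parameter sets are identical, the trace reduces to classic BKT.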

Beck, et al.’s (2008) Help Model

Notes

This model did not lead to better prediction of student performance
But useful for understanding effects of help

Contextual Guess-and-Slip

Baker, R.S.J.d., Corbett, A.T., Aleven, V. (2008) More Accurate Student Modeling Through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing. Proceedings of the 9th International Conference on Intelligent Tutoring Systems, 406-415.

Contextual Guess-and-Slip

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

Contextual Slip: The Big Idea

Why one parameter for slip
- For all situations
- For each skill

When we can have a different prediction for slip
- For each situation
- Across all skills

In other words

P(S) varies according to context

For example
- Perhaps very quick actions are more likely to be slips
- Perhaps errors on actions which you’ve gotten right several times in a row are more likely to be slips

Contextual Guess and Slip Model

Guess and slip fit using contextual models across all skills
Parameters per skill: 2 + (P(S) model size)/skills + (P(G) model size)/skills

How are these models developed?

1. Take an existing skill model
2. Label a set of actions with the probability that each action is a guess or slip, using data about the future
3. Use these labels to machine-learn models that can predict the probability that an action is a guess or slip, without using data about the future
4. Use these machine-learned models to compute the probability that an action is a guess or slip, in knowledge tracing

How are these models developed?

2. Label a set of actions with the probability that each action is a guess or slip, using data about the future

Predict whether action at time N is guess/slip
Using data about actions at time N+1, N+2
This is only for labeling data!
Not for use in the guess/slip models

How are these models developed?

2. Label a set of actions with the probability that each action is a guess or slip, using data about the future

The intuition:
If action N is right
And actions N+1, N+2 are also right
- It’s unlikely that action N was a guess
If actions N+1, N+2 were wrong
- It becomes more likely that action N was a guess
I’ll give an example of this math in a few minutes…

How are these models developed?

3. Use these labels to machine-learn models that can predict the probability that an action is a guess or slip

Features distilled from logs of student interactions with tutor software
Broadly capture behavior indicative of learning
- Selected from same initial set of features previously used in detectors of
  - gaming the system (R. S. d. Baker, Corbett, Roll, & Koedinger (2008))
  - off-task behavior (R. Sj. Baker (2007))

How are these models developed?

3. Use these labels to machine-learn models that can predict the probability that an action is a guess or slip

Linear regression
- Did better on cross-validation than fancier algorithms
One guess model
One slip model

How are these models developed?

4. Use these machine-learned models to compute the probability that an action is a guess or slip, in knowledge tracing

Within Bayesian Knowledge Tracing
Exact same formulas
Just substitute a contextual prediction about guessing and slipping for the prediction-for-each-skill
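
The substitution can be sketched in a few lines of Python. The update below is the standard BKT update with per-action guess and slip estimates passed in. The linear model, its feature names, its weights, and the clipping to [0.01, 0.99] are all hypothetical choices for this sketch and are not taken from the published models.

```python
def linear_probability(features, weights, intercept, lo=0.01, hi=0.99):
    """Linear-regression style estimate, clipped so it stays a usable probability."""
    raw = intercept + sum(weights[name] * value for name, value in features.items())
    return min(hi, max(lo, raw))


def contextual_update(p_known, correct, p_guess_ctx, p_slip_ctx, p_t):
    """Standard BKT update, but with per-action guess and slip estimates."""
    if correct:
        num = p_known * (1 - p_slip_ctx)
        den = num + (1 - p_known) * p_guess_ctx
    else:
        num = p_known * p_slip_ctx
        den = num + (1 - p_known) * (1 - p_guess_ctx)
    posterior = num / den
    return posterior + (1 - posterior) * p_t


# Hypothetical slip model: feature names and weights are made up for illustration
slip_weights = {"seconds_taken": -0.01, "prior_correct_streak": 0.05}
action_features = {"seconds_taken": 4.0, "prior_correct_streak": 3}
p_slip_ctx = linear_probability(action_features, slip_weights, intercept=0.1)

# An error on this action, with an illustrative contextual guess estimate of 0.2
p_next = contextual_update(0.6, correct=False, p_guess_ctx=0.2,
                           p_slip_ctx=p_slip_ctx, p_t=0.1)
print(round(p_slip_ctx, 2), round(p_next, 3))
```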

Moment-by-Moment Learning Model

Baker, R.S.J.d., Goldstein, A.B., Heffernan, N.T. (2011) Detecting Learning Moment-by-Moment. International Journal of Artificial Intelligence in Education, 21 (1-2), 5-25.

Moment-by-Moment Learning Model (Baker, Goldstein, & Heffernan, 2010)

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

P(J)

P(T) = chance you will learn if you didn’t know it
P(J) = probability you Just Learned
- P(J) = P(~L_n ^ T)

P(J) is distinct from P(T)

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(J) Probability you just learned

P(T) Probability the skill will be learned at each opportunity to use the skill.

Labeling P(J)

Based on this concept:
- “The probability a student did not know a skill but then learns it by doing the current problem, given their performance on the next two.”

P(J) = P(~L_n ^ T | A+1+2)

*For the full list of equations, see Ryan SJD Baker, Goldstein, & Heffernan (2011)

P(J) Probability you just learned

Breaking down P(~L_n ^ T | A+1+2 )

Breaking down P(A+1+2 | L_n) P(L_n): One example

P(A+1+2 = C, C | L_n) = P(~S)P(~S)

P(A+1+2 = C, ~C | L_n) = P(~S)P(S)

P(A+1+2 = ~C, C | L_n) = P(S)P(~S)

P(A+1+2 = ~C, ~C | L_n) = P(S)P(S)

Example:

What about the probabilities of the next two actions, given that they knew it, times the probability that they knew it? Well, there are only 4 possibilities for those two actions.
- Correct, correct.
- Correct, not correct.
- Not correct, correct.
- And not correct, not correct.

If you knew the skill, the probability of correct, correct is the probability you didn’t slip, times the probability you didn’t slip.

The probability that you got it right and then wrong, if you knew it, is the probability you didn’t slip, times the probability you slipped, and so on.

The case where you didn’t know it and didn’t learn it is more complicated, because there we’re trying to estimate the probability as we go forward, based on the possibility that you may or may not have learned it between the first and second attempts. So the equations get really complicated at this point.
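
The labeling step can be sketched in Python by splitting time n into three cases: the skill was already known (L_n), it was just learned (~L_n ^ T), or it was neither known nor learned (~L_n ^ ~T). This follows the breakdown described above but is a sketch under stated assumptions, not the published code. It assumes priors of P(L_n), (1 - P(L_n))P(T) and (1 - P(L_n))(1 - P(T)) for the three cases. See Baker, Goldstein, & Heffernan (2011) for the full equations.

```python
def outcome_likelihoods(c1, c2, p_t, p_g, p_s):
    """P(A+1+2 | case) for the cases L_n, ~L_n ^ T, and ~L_n ^ ~T."""

    def emit(correct, known):
        if known:
            return (1 - p_s) if correct else p_s
        return p_g if correct else (1 - p_g)

    # Skill known at n+1 and n+2 (already known, or just learned): only slips matter
    known_both = emit(c1, True) * emit(c2, True)
    # Not known at n+1, and possibly learned between n+1 and n+2
    unknown_then = emit(c1, False) * (
        (1 - p_t) * emit(c2, False) + p_t * emit(c2, True)
    )
    return known_both, known_both, unknown_then


def label_p_j(p_known, c1, c2, p_t, p_g, p_s):
    """P(~L_n ^ T | A+1+2) via Bayes' Theorem over the three cases."""
    like_l, like_learn, like_not = outcome_likelihoods(c1, c2, p_t, p_g, p_s)
    prior_l = p_known
    prior_learn = (1 - p_known) * p_t
    prior_not = (1 - p_known) * (1 - p_t)
    evidence = like_l * prior_l + like_learn * prior_learn + like_not * prior_not
    return like_learn * prior_learn / evidence


# Illustrative values only
cc = label_p_j(0.4, True, True, p_t=0.1, p_g=0.2, p_s=0.3)
ww = label_p_j(0.4, False, False, p_t=0.1, p_g=0.2, p_s=0.3)
print(cc > ww)  # True: two correct answers afterwards make "just learned" more plausible
```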

Features of P(J)

Distilled from logs of student interactions with tutor software
Broadly capture behavior indicative of learning
- Selected from same initial set of features previously used in detectors of
  - gaming the system (R. S. d. Baker, Corbett, Roll, et al. (2008))
  - off-task behavior (R. Sj. Baker (2007))
  - carelessness (R. S. d. Baker, Corbett, & Aleven (2008))

Features of P(J)

All features use only first response data

Later extension to include subsequent responses only increased model correlation very slightly – not significantly

Uses

Patterns in P(J) over time can be used to predict whether a student will be prepared for future learning (Hershkovitz, Baker, Gowda, & Corbett (2013), Ryan S. Baker et al. (2013)) and standardized exam scores (Jiang, Baker, Paquette, Pedro, & Heffernan (2015))
P(J) can be used as a proxy for Eureka moments in Cognitive Science research (Moore et al. (2015))

We then had a model that we could use for a few things.

We’re not using this model at any point to try to improve our prediction of student performance in the system. Instead, we’re looking at how we can use this in analysis. It turns out that patterns in P of J over time can be used to predict whether the student is prepared for future learning: when they encounter the first piece of curriculum material beyond this current system, can they actually learn from it and do well on a test? Patterns of P of J during the use of the system turn out to be predictive of this. Also, it turns out that P of J over time can be used to predict standardized exam scores.
In a third recent use, P of J can be used as a proxy for eureka moments in cognitive science research. We can look for moments where students had spectacularly high learning, say higher than 99 percent of all learning episodes, and ask what distinguishes the behavior that precedes them.

Alternate Method

Assume at most one moment of learning
Try to infer when that single moment occurred, across entire sequence of student behavior

De Sande (2013); Pardos & Yudelson (2013)
Some good theoretical arguments for this – more closely matches assumptions of BKT
Has not yet been studied whether this approach has the same predictive power as the P(~L_n ^ T | A+1+2) method

Modeling Transfer Between Skills

Sao Pedro, M., Jiang, Y., Paquette, L., Baker, R.S., Gobert, J. (2014) Identifying Transfer of Inquiry Skills across Physical Science Simulations using Educational Data Mining. Proceedings of the 11th International Conference of the Learning Sciences.

How this model works

Classic BKT: Separate BKT model for each skill

BKT-PST (Partial Skill Transfer; M. S. Pedro et al. (2014)): Each skill’s model can transfer in information from other skills
1. BKT-PST: One time (when switching skill)
2. BKT-PSTC (Kang et al. (2022)): At each time step

BKT-PST/PSTC Model

P(L₀) Probability the skill is already known before the first opportunity to use the skill

P(G) Probability the student will guess correctly if the skill is not known.

P(S) Probability the student will slip (make a mistake) if the skill is known.

P(T) Probability the skill will be learned at each opportunity to use the skill.

Uses

Used to study relationship between skills in science simulation (M. S. Pedro et al. (2014))
Used to study which research skills help graduate students learn other research skills, across several years (Kang et al. (2022))

Uses

Contextualization approaches do not appear to lead to overall improvement on predicting within-tutor performance
But they can be useful for other purposes
- Predicting robust learning
- Understanding learning better
- Understanding relationships between skills

References

Almeda, M. V., & Baker, R. S. (2020). Predicting student participation in STEM careers: The role of affect and engagement during middle school. Journal of Educational Data Mining, 12(2), 33–47.

Baker, R. S. d., Corbett, A. T., & Aleven, V. (2008). More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008 Proceedings 9, 406–415. Springer.

Baker, R. S. d., Corbett, A. T., Roll, I., & Koedinger, K. R. (2008). Developing a generalizable detector of when students game the system. User Modeling and User-Adapted Interaction, 18, 287–314.

Baker, Ryan SJD, Goldstein, A. B., & Heffernan, N. T. (2011). Detecting learning moment-by-moment. International Journal of Artificial Intelligence in Education, 21(1-2), 5–25.

Baker, Ryan S., Gowda, S. M., & Salamin, E. (2018). Modeling the learning that takes place between online assessments. Proceedings of the 26th International Conference on Computers in Education, 21–28.

Baker, Ryan S., Hershkovitz, A., Rossi, L. M., Goldstein, A. B., & Gowda, S. M. (2013). Predicting robust learning with the visual form of the moment-by-moment learning curve. Journal of the Learning Sciences, 22(4), 639–666.

Baker, R. Sj. (2007). Modeling and understanding students’ off-task behavior in intelligent tutoring systems. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1059–1068.

Beck, J. E., Chang, K., Mostow, J., & Corbett, A. (2008). Does help help? Introducing the Bayesian evaluation and assessment methodology. Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008 Proceedings 9, 383–394. Springer.

Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: Modelling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253–278. https://doi.org/10.1007/BF01099821

De Sande, B. van. (2013). Properties of the Bayesian knowledge tracing model. Journal of Educational Data Mining, 5(2), 1–10.

Hershkovitz, A., Baker, R., Gowda, S. M., & Corbett, A. T. (2013). Predicting future learning better using quantitative analysis of moment-by-moment learning. Educational Data Mining 2013.

Jiang, Y., Baker, R. S., Paquette, L., Pedro, M. S., & Heffernan, N. T. (2015). Learning, moment-by-moment and over the long term. Artificial Intelligence in Education: 17th International Conference, AIED 2015, Madrid, Spain, June 22-26, 2015. Proceedings 17, 654–657. Springer.

Kang, J., Baker, R., Feng, Z., Na, C., Granville, P., & Feldon, D. F. (2022). Detecting threshold concepts through Bayesian knowledge tracing: Examining research skill development in biological sciences at the doctoral level. Instructional Science, 50(3), 475–497.

Moore, G. R., Baker, R. S., & Gowda, S. M. (2015). The antecedents of moments of learning. CogSci.

Pardos, Z. A., Baker, R. S., San Pedro, M. O., Gowda, S. M., & Gowda, S. M. (2014). Affective states and state tests: Investigating how affect and engagement during the school year predict end-of-year learning outcomes. Journal of Learning Analytics, 1(1), 107–128.

Pardos, Z. A., & Yudelson, M. V. (2013). Towards moment of learning accuracy. AIED 2013 Workshops Proceedings Volume, 4, 3. Citeseer.

Pedro, M. O., Baker, R., Bowers, A., & Heffernan, N. (2013). Predicting college enrollment from student interaction with an intelligent tutoring system in middle school. Educational Data Mining 2013.

Pedro, M. S., Jiang, Y., Paquette, L., Baker, R. S., & Gobert, J. (2014). Identifying transfer of inquiry skills across physical science simulations using educational data mining. Boulder, CO: International Society of the Learning Sciences.

Slater, S., & Baker, R. S. (2018). Degree of error in Bayesian knowledge tracing estimates from differences in sample sizes. Behaviormetrika, 45(2), 475–493.

Slater, S., Baker, R., Ocumpaugh, J., Inventado, P., Scupelli, P., & Heffernan, N. (2016). Semantic features of math problems: Relationships to student learning and engagement. International Educational Data Mining Society.