KT Learning Lab 5: A Conceptual Overview
Up until this point we’ve been talking about predicting future correctness
Mostly considered in the context of memory for facts, rather than skills
How do you say banana in Spanish?
What is the capital of New York?
Where are the Islets of Langerhans?
Flashcard apps
Language learning apps
It has long been known that spaced practice (i.e. pausing between studying the same fact) is better than massed practice (i.e. cramming)
Early adaptive systems implemented this behavior in simple ways (e.g., Leitner, 1972)
It’s long been known that spaced practice, or in other words, pausing between studying the same fact, is better than massed practice (i.e. cramming for an exam).
Early adaptive systems like Leitner’s flashcards (1972) implemented this behavior in simple ways.
We start our discussion of algorithms for modeling memory with the ACT-R Memory Equations.
In Pavlik Jr & Anderson’s (2005) ACT-R memory equations, memory duration can be understood in terms of memory strength, which is sometimes referred to as activation.
\[ P(m) = \frac{1}{1+e^{\frac{\tau-m}{s}}} \]
Where m = activation strength of current fact
τ = threshold parameter for how hard it is to remember
s = noise parameter for how sensitive memory is to changes in activation
Note logistic function (like PFA)
The formula for the probability of remembering in ACT-R is based on three parameters:
m, the activation strength of the current fact
tau, the threshold parameter for how hard it is to remember
s, the noise parameter for how sensitive memory is to changes in activation. In other words, when you re-encounter a fact, how much better does your memory get?
*Note that this was not building off PFA; rather, PFA was building off of this.
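As a quick sketch of the recall equation above (the parameter values here are illustrative defaults, not estimates from the paper):

```python
import math

def p_recall(m, tau=-0.7, s=0.3):
    """Probability of recall under the ACT-R logistic equation.

    m:   activation strength of the current fact
    tau: threshold parameter for how hard it is to remember
    s:   noise parameter for sensitivity to changes in activation
    (tau and s values here are illustrative, not fitted.)
    """
    return 1.0 / (1.0 + math.exp((tau - m) / s))
```

When activation exactly equals the threshold (m = τ), recall probability is 0.5, and higher activation pushes it toward 1.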
\[ m_{n}(t_{1..n}) = \ln\left(\sum_{i=1}^{n} t_{i}^{-d}\right) \]
We have a sequence of n cases where the learner encountered the fact
Each 𝑡_𝑖 represents how long ago the learner encountered the fact for the i-th time
The decay parameter d represents the speed of forgetting under exponential decay
The activation is given by this formula, where we have a sequence of n cases in which the learner encountered the fact. Each t of i represents how long ago the learner encountered the fact for the i-th time, and the decay parameter d represents the speed of forgetting under exponential decay.
So in other words, based on the parameters of this model, we can infer how much your memory will decay over time, and how rapidly it will decay.
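The base-level activation equation above translates directly into code (ages are times since each encounter, in whatever time units you choose; d = 0.5 is the conventional ACT-R default):

```python
import math

def activation(ages, d=0.5):
    """ACT-R base-level activation: m = ln(sum_i t_i^(-d)).

    ages: list of times since each past encounter with the fact
          (each t_i in the equation), most units work as long as
          they are consistent.
    d:    decay parameter (speed of forgetting).
    """
    return math.log(sum(t ** -d for t in ages))
```

Note the implications baked into the sum: adding another encounter always raises activation, and more recent encounters (smaller t_i) contribute more than older ones.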
Implications
More practice = better memory
More time between practices = better memory
Most efficient learning comes from dense practice followed by expanding amounts of time in between practices (Pavlik & Anderson, 2008)
There are a couple of implications for the ACT-R memory equations:
First, more practice equals better memory. That’s an implication here, and it’s generally true in the real world: you’re more likely to remember something if you encounter it more often
Also, more time between practices equals better memory. That’s true of almost all the memory models. But one kind of interesting implication of Pavlik Jr & Anderson’s model (2005) is that the most efficient learning comes from dense practice followed by expanding amounts of time in between practices (Pavlik Jr & Anderson, 2008).
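The expanding-practice result comes from the part of Pavlik Jr & Anderson’s (2005) model where each trace’s decay depends on the activation at the moment of that practice (d_i = c·e^(m at previous practice) + a): practicing while memory is weak produces a slower-decaying trace. A sketch of that recurrence, with illustrative parameter values:

```python
import math

def activation_at(test_time, practice_times, c=0.25, a=0.04):
    """Activation at test_time given practices at practice_times
    (increasing, all before test_time), with activation-dependent
    decay per trace: d_i = c * exp(m_prev) + a, where m_prev is the
    activation just as the i-th practice happened.
    c and a values here are illustrative, not fitted."""
    decays = []
    for i, tp in enumerate(practice_times):
        if i == 0:
            m_prev = -math.inf  # no prior traces: exp(-inf) = 0, so d_1 = a
        else:
            m_prev = math.log(sum(
                (tp - q) ** -dq
                for q, dq in zip(practice_times[:i], decays)))
        decays.append(c * math.exp(m_prev) + a)
    return math.log(sum(
        (test_time - tp) ** -dj
        for tp, dj in zip(practice_times, decays)))
```

With the same number of practices, a spread-out schedule ends up with higher activation at a delayed test than a massed one, which is the spacing effect the model is built to capture.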
Postulates that decay speed drops the more times a fact is encountered
Functionally complex model where
Knowledge strength (and therefore probability of remembering) is a function of the sum of the traces’ actual contributions, divided by the product of their potential contributions
Power function is estimated as a combination of exponential functions
A more recent competitor to the ACT-R memory equations is MCM, by Mozer and his colleagues (2009). This model postulates that the decay speed drops the more times a fact is encountered. So in ACT-R, the decay speed is constant whether you’ve encountered something one time or a million times.
But in MCM, the more times you’ve encountered a fact, the slower it is to decay.
MCM is represented by a functionally complex model where knowledge strength, and therefore the probability of remembering, is a function of the sum of the traces’ actual contributions divided by the product of their potential contributions. A power function is estimated as a combination of exponential functions. Each encounter with the knowledge has an exponential function for decay, but it turns out to sum up to a power function.
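That last point, a power function emerging from a combination of exponential functions, can be seen in a toy sketch (this is not the full MCM, just an equal-weight mixture of exponentially decaying traces at different time scales):

```python
import math

def trace_mixture(t, rates=(1.0, 0.1, 0.01)):
    """Toy illustration of MCM's core idea: several memory traces,
    each decaying exponentially but at a different time scale.
    Their mixture forgets much more slowly than any single fast
    trace, approximating power-law forgetting."""
    return sum(math.exp(-r * t) for r in rates) / len(rates)
```

At short delays the fast traces dominate the loss; at long delays the slow traces keep strength from collapsing, which is what gives the heavy, power-like tail.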
DASH extends previous approaches to also include item difficulty and latent student ability
Can use either MCM or ACT-R as its internal representation of how memory decays over time
Building on that, Mozer & Lindsey (2016) introduced the DASH framework, which extends previous approaches to also include item difficulty and latent student ability
DASH has a neat feature: it can use MCM, ACT-R, or other frameworks as its internal representation of how memory decays over time
So whichever of ACT-R or MCM you like better, you can use DASH to also include item difficulty and latent student ability in your estimate of student forgetting and memory over time
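A rough sketch of how DASH combines those three ingredients (this is a simplification, not the exact DASH parameterization; `memory_term` stands in for whatever inner memory model, ACT-R-style or MCM-style, supplies the study-history/decay component):

```python
import math

def p_recall_dash(ability, difficulty, memory_term):
    """DASH-style combination (sketch): recall probability from
    latent student ability, item difficulty, and a decay term
    produced by a pluggable inner memory model. The additive
    logistic form mirrors DASH; the feature construction here
    is simplified for illustration."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty + memory_term)))
```

Higher ability or a stronger memory term pushes recall probability up; a harder item pushes it down, all on the same logistic scale.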
Fits regression model to predict both recall and estimated half-life of memory (based on lag time)
Based on estimate of exponential decay of memory
Also very recently, Duolingo fits a regression model to predict both the recall and the estimated half-life of memory based on the lag time. It’s based on an estimate of the exponential decay of memory.
Uses feature set including
Time since word last seen
Total number of times student has seen the word
Total number of times student has correctly recalled the word
Total number of times student has failed to recall the word
Word difficulty
But Duolingo does this calculation (Settles & Meeder, 2016):
Not based on the kind of complex algorithms that are recursive or iterative in nature, like those seen in Pavlik or Mozer, but instead uses a feature set including the time since the word was last seen, the total number of times the student has seen the word, the total number of times the student has correctly recalled the word or failed to recall the word, and the word difficulty
So it tries to capture some of the same ideas as DASH in a formulation that is quicker to implement, and quicker to run in real-time
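The half-life regression formulation (Settles & Meeder, 2016) is compact enough to sketch directly: the estimated half-life is h = 2^(θ·x) over the feature vector, and recall probability is p = 2^(−Δ/h). The feature names and weight values below are made up for illustration:

```python
def predicted_recall(delta_days, features, weights):
    """Half-life regression sketch (Settles & Meeder, 2016):
    estimated half-life h = 2^(theta . x); recall p = 2^(-delta/h).
    delta_days: lag time since the word was last seen, in the same
    time units as the half-life. Feature names/weights are
    hypothetical examples, not Duolingo's fitted model."""
    h = 2.0 ** sum(weights[k] * x for k, x in features.items())
    return 2.0 ** (-delta_days / h)
```

For example, with hypothetical weights `{'times_seen': 0.2, 'times_correct': 0.1, 'difficulty': -0.3}` and features `{'times_seen': 5, 'times_correct': 3, 'difficulty': 1.0}`, θ·x = 1.0, so the half-life is 2 days and predicted recall at a 2-day lag is exactly 0.5. No recursion over the full practice history is needed, which is what makes it quick to run in real time.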
Spreading Activation
Encountering or recalling something in memory also increases memory activation of related concepts/facts/ideas (Anderson, 1983)
Ma, Hettiarachchi, Fukui, & Ando (2023) build a DKT-family algorithm for memory that uses associations between items along these lines
You care about memory for specific items
Forgetting is a real concern – the student can do it today, not tomorrow
Relatively small amounts of data OK
Once you have a memory model, you can safely add new items to it and it will work