Wednesday, June 8, 2011

Spaced repetition learning systems (SRS) - how it might be done

I've been considering implementing a "Spaced Repetition System" in my quiz program. The idea is to target not just short-term but long-term memory. So, if you remember something after reviewing it - great. But unless you see it repeated, it's going to slip into forgetfulness. What SRS does is schedule repetitions, even if you got it right. The time between repetitions varies depending on how easily you got the answer before. If you got it wrong, well, you should see it pretty soon, like in the same quiz. If you got it right but took a long time, you should see it relatively soon. And if you got it quickly, maybe you can wait a while before reviewing it again.

So. How are we going to implement this? The current algorithm just tracks the answers you've gotten right / wrong in the current level. If you get them all right, you get to go to the next level. Otherwise, repeat the level. Nice and simple.

So, say you get one wrong or it times out. I'm saying, you should see that again - right away, as in, the next two or three questions. What if you take a while to answer? Say, more than 2-3 seconds? I'm still saying, you should see it again in the same run. What if you get it almost immediately? Say, 3 seconds or less? Then it's considered right for the purposes of the current run.

What about the SRS over days? Well, the normal system lets you judge how well you did, and reschedules based on your estimate. But what I'm proposing sort of drills you until you get them all to the same time in the current level. So, where does the SRS come in?

Well, I would say, since they are all at the same level coming out,

Well, that doesn't really work. I might as well just set the timer at two seconds and be done with it, otherwise, what's the point?
So, if they get it wrong, they are stuck in the same level. That works. The question is, do I make them repeat everything in that level? There is some value to that. I think maybe I'll stick with that.

Now - the long term repetition. Let's say they get it right in under 2 seconds. That's "by heart". That will have the longest repeat, and will be scheduled in say, seven days.

The next one is between 2 and 5 seconds. That's, we'll say, repeat in 2 days.

If it's 5-10 seconds, that's one day.
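A rough Python sketch of these speed tiers. The function name and the exact boundary handling are assumptions, and so is the choice to treat anything slower than 10 seconds like a timeout (the tiers above stop at 10):

```python
def first_interval_days(seconds: float, correct: bool = True) -> int:
    """Return how many days to wait before showing the item again."""
    if not correct:
        return 0   # wrong answers come back the same day
    if seconds < 2:
        return 7   # "by heart"
    if seconds < 5:
        return 2
    if seconds <= 10:
        return 1
    return 0       # assumption: slower than 10 s counts as a timeout
```

So an answer in 1.5 seconds would come back in a week, and one that took 7 seconds would come back tomorrow.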

So, how to implement that?

Well, say you have an input list. Right now it's a randomized list based on the start and ending numbers.

It becomes clear that at least the "review" portion should be based on a schedule for the day. That's simple. Anything scheduled for the day, shows up in the "review" portion of the quiz.

So, what if they advanced to a new level? We need to have the new level obviously based on the start and end numbers - as long as they haven't already been scheduled.

So, the selection criteria should be either between the start and end quiz number and unscheduled, or scheduled for today.
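As a sketch, the selection rule could be written as a small Python predicate. The names here are hypothetical, and treating overdue items (scheduled before today) as due is an assumption on top of "scheduled for today":

```python
from datetime import date
from typing import Optional

def is_selected(number: int, scheduled: Optional[date],
                start: int, end: int, today: date) -> bool:
    """True if a question belongs in today's quiz."""
    if scheduled is None:
        # Never scheduled: include it only if it is in the current level's range.
        return start <= number <= end
    # Already scheduled: include it once its date has arrived (assumption:
    # overdue items count as due too).
    return scheduled <= today
```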

So, if they get it wrong, it will be immediately rescheduled for today, and they will run into it once they move to the next level.

But, what about the rule that they have to get them all right on the current level before they move on? They could have 100 to review, miss just one, have to do them all over again, and then get a different one wrong.

This is infeasible. One option is to reduce the "re-do" list to the ones they got wrong. This could happen automatically by rescheduling, though. It will "reread" the database, and only the ones they've gotten wrong will come up.

There's one issue though. How to get rid of cards you consistently get right? Anki seems to have a good solution. In their FAQ, they say the interval times are doubled each time you get an answer right. This is important, because you need to get rid of those.

To simplify, say we base it on right/wrong and forget about the delays.

From the Anki FAQ:

"When you use Anki every day, each time a card is answered correctly, it gets a bigger interval. Let’s assume that good about doubles the interval. Thus you have a 5 day wait, then a 10 day wait, 20 days, 40 days, and so on.

When people return to their deck after weeks or months of no study, they’re often surprised by the length intervals have grown to. This is because Anki considers the actual time the card was unseen, not just the time it was scheduled for. Thus if the card was scheduled for 5 days but you didn’t study for a month, the next interval will be closer to 60 days than 10 days."

In other words, it doesn't look at the number of times you have gotten it correct. It looks at the number of days passed since you last got it correct. If it's been a lot of days, you're doing pretty well - it's basically made it into your long term memory - so it doesn't just double the last scheduled interval. It calculates the difference between the last time you answered it (correctly) and today, and bases the doubling on that interval.

So how to express this as a formula?

New interval for a good answer = 2 * (today - date last answered correctly)

So, if the difference is 5 days, it will become 10.

If it's 30 days, it will become 60.

There also needs to be an initial delay, i.e. the first time they got it right, you just assign say that 5-day interval.

So, good is 5 days.

Medium 3 days

slow 1 day.

If it's wrong, you don't get out of the level - it's scheduled for the same day.

And, once there is a previous answer on record, good means you double the interval.

Medium is x 1.5

slow is 1.25
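A minimal sketch of those numbers in Python. The names are hypothetical, and rounding up to a whole day is an assumption (without it, 1 day x 1.25 would never grow):

```python
import math
from typing import Optional

INITIAL_DAYS = {"good": 5, "medium": 3, "slow": 1}
MULTIPLIER = {"good": 2.0, "medium": 1.5, "slow": 1.25}

def next_interval_days(rating: str, days_since_last: Optional[int]) -> int:
    """Interval in days for a correct answer with the given speed rating."""
    if days_since_last is None:
        # First correct answer: use the fixed starting interval.
        return INITIAL_DAYS[rating]
    # Later answers: scale the time actually elapsed, rounded up (assumption).
    return math.ceil(MULTIPLIER[rating] * days_since_last)
```

With this, a good answer 5 days after the last one gives 10 days, and a good answer after a 30-day gap gives 60 days, matching the examples above.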

So. How would this work?

Ok. Say the next scheduled date is updated at the time of the answer.

if it's wrong, reschedule for today
if it's slow, reschedule for today + 1 day
if it's medium, reschedule for today + 3 days

if it's good,
    if there is no previous schedule date, schedule for today + 5
    else
        reschedule for today + (today - last date) * 2
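The rules above translate almost line for line into Python. This is only a sketch: the function name and the rating strings are made up for illustration, and "last date" is taken to mean the date of the previous review:

```python
from datetime import date, timedelta
from typing import Optional

def next_review_date(rating: str, today: date, last: Optional[date]) -> date:
    """Return the date on which the item should next be shown."""
    if rating == "wrong":
        return today
    if rating == "slow":
        return today + timedelta(days=1)
    if rating == "medium":
        return today + timedelta(days=3)
    if rating != "good":
        raise ValueError(f"unknown rating: {rating}")
    if last is None:
        # First good answer: start with the 5-day interval.
        return today + timedelta(days=5)
    # Later good answers: double the time since the last review.
    return today + (today - last) * 2
```

For example, a good answer on 2011-06-08 for an item last reviewed on 2011-06-03 would be rescheduled 10 days out, to 2011-06-18.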

So, the select would be


Select where level >= first quiz number and <= last quiz number and scheduled date = blank

or
scheduled date <= today




Ok, for now, we need to add a "reschedule" date to the database. It can be on the same table as "all_words". Maybe. It might be better to have a separate table, since it's dynamic data. It could do an outer join, where scheduled date is either null or today's date.

Let's add that table to the database.

CREATE TABLE kanji_review_schedule
("_id" INTEGER,
rating INTEGER DEFAULT 0,
last_review_date TEXT(25),
next_review_date TEXT(25))

Ok. Now, maybe we should add a view which will pick out the data we need.

Select * from kanji_review_schedule where (level >= first quiz number and level <= last quiz number and rating = 0)
or (next_review_date <= today)

We really don't need last review date; but it might be helpful at some point.

Ok, let's insert some data.

Here are some relevant time functions:


Function          Equivalent strftime()
date(...)         strftime('%Y-%m-%d', ...)
time(...)         strftime('%H:%M:%S', ...)
datetime(...)     strftime('%Y-%m-%d %H:%M:%S', ...)
julianday(...)    strftime('%J', ...)


Here are the relevant time string formats:

The functions all take and return a timestring format.

Time Strings

A time string can be in any of the following formats:

YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSS
YYYY-MM-DDTHH:MM
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS
now
DDDDDDDDDD


We're interested in the date going in and coming out. The date() function returns the date in this format: YYYY-MM-DD. Since only the date is kept, not the time, something you answer slowly just before midnight gets scheduled for the next day, and you could see it again a few minutes later.
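To see how these functions could do the scheduling arithmetic, here is a small sketch using Python's built-in sqlite3 module against an in-memory database. The helper names are hypothetical; date() with a '+N days' modifier and julianday() are standard SQLite functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def add_days(day: str, days: int) -> str:
    """Return day (YYYY-MM-DD) shifted by a number of days, via SQLite's date()."""
    modifier = f"{days:+d} days"  # e.g. "+5 days"
    return conn.execute("SELECT date(?, ?)", (day, modifier)).fetchone()[0]

def days_between(earlier: str, later: str) -> int:
    """Return the number of whole days between two YYYY-MM-DD dates."""
    row = conn.execute(
        "SELECT julianday(?) - julianday(?)", (later, earlier)
    ).fetchone()
    return int(round(row[0]))
```

The doubling rule for a good answer then becomes add_days(today, 2 * days_between(last, today)).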

Let's try an update:

update kanji_review_schedule set last_review_date = date('2011-01-01')
where _id = 1;

Okay that worked.

Now, we want to form our select statement.

So, let's try the join first.

Select a._id from all_words a, kanji_review_schedule b
where a._id = b._id

Let's create a couple of indexes:

CREATE INDEX "_id_index"
ON kanji_review_schedule ("_id", next_review_date)

Ok, here's the first join:

Select level, number, kanji, hiragana, english, freq, rating, last_review_date, next_review_date from
all_words a, kanji_review_schedule b
where a._id = b._id

Hmm...it just occurred to me that I should be inserting records into the database for the kanji review table instead of pre-creating them. So, I should do an outer join.

First delete all the rows. Change the name to word_review_schedule.

insert into word_review_schedule (_id, rating, last_review_date,
next_review_date) values (1, 0, '', '')
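Since rows are now created as answers come in, the write needs to be "update if it's there, insert if it's not". A sketch with Python's sqlite3 module follows. The table definition mirrors the one above without a primary key, so it uses UPDATE followed by INSERT. The function name and the meaning of the rating value are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE word_review_schedule (
    _id INTEGER,
    rating INTEGER DEFAULT 0,
    last_review_date TEXT,
    next_review_date TEXT)""")

def record_answer(word_id: int, rating: int, today: str, next_date: str) -> None:
    """Store the latest review result, creating the row on first answer."""
    cur = conn.execute(
        "UPDATE word_review_schedule "
        "SET rating = ?, last_review_date = ?, next_review_date = ? "
        "WHERE _id = ?",
        (rating, today, next_date, word_id),
    )
    if cur.rowcount == 0:
        # No existing row for this word: insert one.
        conn.execute(
            "INSERT INTO word_review_schedule "
            "(_id, rating, last_review_date, next_review_date) "
            "VALUES (?, ?, ?, ?)",
            (word_id, rating, today, next_date),
        )
    conn.commit()
```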

Ok. Try this:

Select all_words._id, level, number, kanji, hiragana, english, freq, rating, last_review_date,
next_review_date
from all_words
left join word_review_schedule
on all_words._id = word_review_schedule._id

Good. (SQLite does support table aliases, e.g. all_words a, but spelling out the full table names works fine here.)

Actually, didn't we want to select this in the frequency table?

This works.

Select v_all_words_by_level_freq._id, level, number, kanji, hiragana, english, freq, rating, last_review_date,
next_review_date
from v_all_words_by_level_freq
left join word_review_schedule
on v_all_words_by_level_freq._id = word_review_schedule._id

Ok, now let's add a where clause

Since the frequency is assigned in a kind of a strange way - it's actually assigned by the program as it iterates through the table in order by frequency rating - it's really not the most straightforward way to handle it.

So, the new ones, I just have to put a filter on it. The quiz numbers are put into an array and the questions doled out after being shuffled, then accessed sequentially.

So, step one is to filter those questions to include only those with a rating of zero - the new ones.

Then, the next step will be to add to those anything that has a review date on or before today. It doesn't have to be in order, because they will get shuffled anyway.
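Put together, the input side might look something like the sketch below, using Python's sqlite3 module with tiny stand-in tables. The all_words columns shown, the sample rows, the function name, and the use of number as the quiz number are all assumptions for illustration:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE all_words (_id INTEGER, number INTEGER)")
conn.execute("""CREATE TABLE word_review_schedule (
    _id INTEGER, rating INTEGER DEFAULT 0,
    last_review_date TEXT, next_review_date TEXT)""")
conn.executemany("INSERT INTO all_words VALUES (?, ?)",
                 [(1, 1), (2, 2), (3, 3), (4, 4)])
conn.executemany("INSERT INTO word_review_schedule VALUES (?, ?, ?, ?)",
                 [(1, 1, "2011-06-01", "2011-06-08"),   # due on 2011-06-08
                  (2, 1, "2011-06-07", "2011-06-12")])  # due on 2011-06-12

QUERY = """
SELECT w._id
FROM all_words w
LEFT JOIN word_review_schedule s ON w._id = s._id
WHERE (w.number BETWEEN ? AND ? AND (s._id IS NULL OR s.rating = 0))
   OR (s.next_review_date <> '' AND s.next_review_date <= ?)
"""

def quiz_ids(start: int, end: int, today: str) -> list:
    """New words in the level's range plus anything due, in shuffled order."""
    ids = [row[0] for row in conn.execute(QUERY, (start, end, today))]
    random.shuffle(ids)
    return ids
```

Comparing dates as text works here only because they are stored as YYYY-MM-DD, which sorts the same way as the dates themselves. The check for an empty string keeps placeholder rows from looking overdue.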

That's the input. We'll pick up tomorrow on the output, and the code.
