Friday, May 13, 2011

Bringing word frequency into the game

I'd like to somehow incorporate frequency of use of the japanese words into my program. I'm thinking along the lines of setting up increasingly difficult game levels, with the most frequently used first, going down to those really rare ones for the experts. This is good for the customer because they will learn the most highly used words first.

For example, here is a table discussing the frequency of word use in Japanese based on online Japanese textbooks (from http://tinyurl.com/3poj4hu)

The first 100 words on the list make up 57.2% of the text that was processed.

The first 500 - 70.3%.

The first 1000 - 76.2%

The first 3000 - 85.4%

The first 10,000 - 94.1%

Thus, the vocabulary three easiest levels of the JLPT (approximately 3100 words) will give coverage for around 85% of the language. Not bad!

But, the real challenge will be mapping the frequency of word list found at the above site with the all_words list. This is where the SQL database will come in handy. I'd like to see how many matches I can make out of the 8453 words.

First, lets explore the spreadsheet which has the frequency of usage. It has no less than 89,572 entries. The columns are sequence or id, number of occurrences, the word itself, and the type of word (noun, verb etc.) With such a large pool to choose from, it seems more likely that we'll get a lot of matches on our 8k or so words.

The first, now familiar step is to export the data to a csv table. The column heads will be frequency, occurrences, word part_of_speech. Done. Now let's do the import via RazerSQL.

That took a while, even with batch by 500 updates turned out. Brought milk, called the phone company, even mowed the lawn. Now that it's done, let's take a look. Seem's to be fine. Now, let's join them

Select a.level, a.number, a.kanji, a.hiragana, a._id, b.freq
from all_words a, freq_list b
where a.kanji = b.word

This is also taking a while...maybe I should add an index to word. I hope this isn't one of those cases where it does a full scan of one of the tables for each row on the other - an n squared situation. No, that was when checking for *not* equals. I remember there was a way to avoid that using a subselect. I think it must be something like having the subselect find all the records that aren't equal, which is only one pass, and then having the basic select have a where clause specifying the value is in the subselected data.

Anyway, the output has started appearing in chunks of 2000. It's up to 6000 already. Ok, the total matched is 6957 words, out of 8453 words. I have two choices - make this an outer join, and figure out a way to deal with the missing frequencies, or an inner join and avoid the problem entirely. Maybe some kind of randomization scheme, or just give it the same value as the it's nearest neighbor.

I added an index to both word columns; and it runs almost instantly. Nothing like an index to move things along in the right situation.


First, I need to create both of the possible tables. Something like this:


INSERT INTO all_word_freq_inner
(level, number, kanji, hiragana, id, freq)
Select a.level, a.number, a.kanji, a.hiragana, a._id, b.freq
from all_words a, freq_list b
where a.kanji = b.word

Or, perhaps a view:

CREATE VIEW all_words_freq_inner AS Select a.level, a.number, a.kanji, a.hiragana, a._id, b.freq
from all_words a, freq_list b
where a.kanji = b.word

This second one worked. It's easier, since I don't have to create a table and fuss with deleting the old table. Since sqlite supports view, I might as well use them. That will give me an option to use either the inner join or the outer join. I'm still planning to use the outer join, but I need to fill the missing frequency with high values or something so the customer will see the ones with frequency assigned first. Of course, that poses a problem because there will be no randomization if I do it *purely* by frequency. Actually, what I'll do is randomize within game levels, which will be delineated by say, 10 words grouped by frequency. Actually, I should assign the ones missing frequencies with *negative* numbers. That assures they will appear last. If I could do it sequentially, instead of the same value, I could use the same logic for all the words. But, now we're back to creating a physical file.

If I did that, I could use a stored procedure to assign the negative values. Something like a modified version of the following:

CREATE PROCEDURE inslist(i integer) AS
BEGIN
WHILE i>0 LOOP
INSERT INTO mytable VALUES (i);
i := i - 1;
END LOOP;
RETURN 'ok';
END;

Let's first create the physical table. I just realized, I had created the initial table using text for the integers. We'll fix that plus add the frequency with this copy. First the create:


CREATE TABLE all_words_freq_outer
(level INTEGER,
freq INTEGER,
number INTEGER,
kanji TEXT(25),
hiragana TEXT(25),
english TEXT(50),
"_id" INTEGER)

And test the outer join:

Select all_words.level, all_words.number, all_words.kanji, all_words.hiragana, all_words._id, freq_list.freq
from all_words left outer join freq_list
on all_words.kanji = freq_list.word

Ok, on viewing the output, a lot of the missing words were because there is no kanji in some cases - only the hiragana.

What happens if we join on the kana?

Select all_words.level, all_words.number, all_words.kanji, all_words.hiragana, all_words._id, freq_list.freq
from all_words left outer join freq_list
on all_words.hiragana = freq_list.word

It looks like I'll definitely be able to fill in at least some of the blanks. For example, 73 and 74, iro and iroiro, are the kanji an hiragana only for various and color. There may be some overlap. I definitely need to give preference to the kanji, because there's less ambiguity there.

Well, let's create the table using the kanji match first.

INSERT INTO all_words_freq_outer
(level, number, kanji, hiragana, english, _id, freq)
Select all_words.level, all_words.number, all_words.kanji, all_words.hiragana, all_words.english, all_words._id, freq_list.freq
from all_words left outer join freq_list
on all_words.kanji = freq_list.word

select count(*) from all_words_freq_outer
shows 8452 - we dropped one somewhere along the way.

Ok, now the big question is how to take the value of the second list and modify it - but only of the field is blank. I think that old substitute for id might help. Something like

UPDATE all_words_freq_outer SET allword_freq_outer.freq =
Select freq_list.freq
from all_words left outer join freq_list
on all_words.hiragana = freq_list.word
where freq is null

Here's something from google search which might help

update TableA set A2 =
(select B2 from TableB where A1=B1)
where A1 in (select B1 from TableB);

That might be a bridge to far - I should be happy the insert based on select worked.

Another idea is to create a separate view or table of only those in the outer table missing a frequency. Then join that with the the frequency table on hiragana.

Let's get back to simple. First I want to take any rows where frequency is null.

SELECT * from all_words_freq_outer WHERE freq is null;

Good. Now, to simplify the hoped for update, let's create a view, Ok, now we have a listing of any words is null.

Select * from v_all_words_freq_null

Now let's join that with the frequency table.

Select a.level, a.number, a.kanji, a.hiragana, a.english, a._id, b.freq
from v_all_words_freq_null a, freq_list b
where a.hiragana = b.word;

Wow - 1200 more rows.

Let's create a view out of that

Create view v_all_words_freq_hiragana_only_match as Select a.level, a.number, a.kanji, a.hiragana, a.english, a._id, b.freq
from v_all_words_freq_null a, freq_list b
where a.hiragana = b.word;

The following:

select * from v_all_words_freq_hiragana_only_match

shows 1247 additional matches.

Now we have a view of all the rows which found a match on the kana, and not on the kanji.

Let's make a view of all the rows that matching on the kanji:

Create view v_all_words_freq_kanji_match as Select a.level, a.number, a.kanji, a.hiragana, a.english, a._id, b.freq
from all_words a, freq_list b
where a.kanji = b.word;

Thats 6957 plus 1247 = 8157 + 47 = 8204? So we're down to 8452 - 8204 = 248 words only missing frequencies - tiny percentage. I might not have to go through the whole assigning negative numbers if it's that low. Maybe.

Ok, we have two out of our three sets. The third is of course the aforementioned tiny group with no match at all. What I need to do is append the two just created views together, and then do an outer join on the original all words file, and grab the nulls from that, and append that two the previous two views. Or I could just drop those without a frequency...

That's enough for now. We made some good progress. The next blog post will focus on unifying the views into a single table - with frequency included :)

No comments:

Post a Comment