Friday, May 13, 2011

Cleaning up the frequency....

In the last post, I managed to get ten more word frequencies assigned by jumping through some SQL hoops. In this post, we'll examine what other words remain with unassigned frequencies, and devise a strategy for handling them - or, perhaps, figuring out how best to punt.

First, how many remain?

select * from all_words_temp1 where freq is null

235 remaining.


The missing frequencies fall into several groups:

1. Double-words separated by a "/" such as "いい/よい" .

2. Words that are really two words, such as "すぐに".

3. Number words with counters like "ようか".

4. Idiomatic conjugations such as "ください"

5. Double words separated by a "," such as "より、ほう"

6. Double words separated by a raised point such as "あいさつ・する".

7. Words ending in "さん" such as "おくさん",

8. Words starting with the honorable "ご" such as "ごくろうさま".

9. Words that just don't match, although you would think they would.

Level 5 has 29 nulls out of 669, 4.3%.
Level 4 has 30 nulls out of 634, 4.73
Level 3 has 17 nulls out of 1835 .92%
Level 2 has 85 nulls out of 1835, 4.6%
Level 1 has 74 nulls out of 3476, 2.1%

Altogether, there are 235 nulls out of 8214 rows, so we have overall coverage of 97%. I'm going to punt and say that's adequate. The customer will still be learning those words - just at the highest game level for that jlpt test level. If a level has 10 words, that's a maximum of 9 out of 183 levels for level 3.

The question is, how to deal with that 3% programmatically? I think the simplest thing is just assign them negative numbers. That way, they'll appear last. What I can do hopefully is just use a negative of the id number and assign that to frequency. So, something like

UPDATE all_words_temp1 SET freq = (_id * -1) where freq is null

should do the trick. Let's backup all words first.


CREATE TABLE all_words_bak2
(level INTEGER,
freq INTEGER,
number INTEGER,
kanji TEXT(25),
hiragana TEXT(25),
english TEXT(50),
_id INTEGER)


Then backup:
INSERT INTO all_words_bak2 SELECT * FROM all_words_temp1

And run the update:
UPDATE all_words_temp1 SET freq = (_id * -1) where freq is null

Perfect. We're almost ready to move on to updating the database on the Android - after backing up and cleaning up.

No comments:

Post a Comment