Friday, May 13, 2011

Cleaning up loose ends

In the last posting, we explored whether it was worth it to chase down about 250 or so words without a frequency assigned to them, and decided, nope. Instead, we punted and assigned a unique negative number to each unmatched word. This is sufficient to be able to treat the problem without messy if statements - except perhaps a non-display of the frequency number in such a case.

This post, then, is simply about cleaning up our current sqlite databse, which is currently strewn with views and tables spawned in the nonetheless successful effort to assign a frequency of word use number to the japanese words in the database. The first order of the day, then, is to backup up the database. Although this can be done with a file copy, we'll use our trial version of RazerSQL to do it from a menu.

That competed, we want to move the data from all_words_temp1 back into all_words.

First, clear out the table:

delete from all_words

Now copy into it:

INSERT INTO all_words SELECT * FROM all_words_temp1

It looks like we have an extra column somewhere. Ok, all_words doesn't have freq. Let's drop the table and recreate it.

Dropped, now create:

CREATE TABLE all_words
(
_id INTEGER,
level INTEGER,
number INTEGER,
kanji TEXT(25),
hiragana TEXT(25),
english TEXT(50),
freq INTEGER
)


And try again:

INSERT INTO all_words_bak

SELECT
_id, level, number, kanji, hiragana, english, freq
FROM
all_words

Check the count:

SELECT
count(*)
FROM
all_words

8449.

That sounds right

Ok, it's time to start deleting - everything except all_words, and that android_metadata table.

Hmm...what about indexes, and views that will be needed on the android? I can do it iteratively, but I think I need to put one on freq, at least.

CREATE INDEX level_index
ON all_words (level)

CREATE INDEX number_index
ON all_words (number)

CREATE INDEX freq_index
ON all_words (freq)

And create a view that sorts on level and freq


CREATE VIEW v_all_words_by_level_freq_desc AS SELECT _id, level, number, kanji, hiragana, english, freq
FROM all_words
ORDER BY level, freq DESC

Whoah - I just realized that I have been assuming the frequency was the actual frequency - the most highly used words having the highest number - but actually it's an assigned rank. So, the most frequent number is the lowest, not the highest.

This poses a bit of a problem in terms of assignment of the null frequencies - they have to be higher than any frequency at their levels. What I could do, though, is figure out what the highest number is, and then, add the _id to that constant.

Let's backup the table...

CREATE TABLE all_words_bak
(
_id INTEGER,
level INTEGER,
number INTEGER,
kanji TEXT(25),
hiragana TEXT(25),
english TEXT(50),
freq INTEGER
)



INSERT INTO all_words_bak

SELECT
_id, level, number, kanji, hiragana, english, freq
FROM
all_words



Now:


What's the highest frequency in the table?

SELECT _id, level, number, kanji, hiragana, english, freq
FROM all_words
ORDER BY freq DESC

(I'm sure there's a better way to do the above, like a high function or something, but I don't care. )

89385. So, as long as the frequency assigned is greater than 89385 and unique, we'll get a consistent order.

Let's update using this:

UPDATE all_words SET freq = (_id + 100000) where freq < 0

Let's check it out:

AS SELECT _id, level, number, kanji, hiragana, english, freq
FROM all_words
ORDER BY level, freq


CREATE VIEW v_all_words_by_level_freq AS SELECT _id, level, number, kanji, hiragana, english, freq
FROM all_words
ORDER BY level, freq

Looks good. Ok. That wraps up this session. Next is working through how to update the database to a new version on the android.

No comments:

Post a Comment