Friday, June 3, 2011

Tracking down more audio files not found

Let's tackle the あびる problem from the last blog post. An mp3 for the word "あびる", which means "to bathe", wasn't found. The problem might be that the audio file name probably uses the kanji for "bathe" instead, i.e. "浴びる", meaning it won't match. First, let's look for it in the database:

Yup, there are a couple of entries, one with the kanji and one without. So, what's the number?

ID 133, frequency 12778.

The question is how many audio files are there which have double entries like this? There might be a bunch, there might be just a few. This would require some ugly programming gymnastics to solve. I would probably be better off just either copying the file, or just throwing the kanji into the database. I'm going to go with option 2. I've meddled with the data before.

Ok, now I just need to remember to copy the db over again. I'll handle that later. I won't do it yet because it might need a patch for some other word later. Plus, I forget how I deleted the file from the sdcard, but I know I did it.

Ok, how about this problem:

We also have a mismatch on それ and それから. This is again due to the comparison, which matches the beginning and the end of the file name. So, they both must have the English meaning of "that".

Let's look it up on the database:

Right, one ends in "after that", the other is just "that". Well, since there are a bunch of spaces before it, let's just say endsWith(" " + word). That should do it.

Let's order by frequency and find the number of this one.

それ is level 5, id 339. So I'll select level 5, order by frequency, then search for 339 in RazorSQL, which I'm going to have to buy in seven days at $60. It's that good.

It's actually frequency ranking 5. Let's take a look.

Yep, that's it. Now, make the patch, on the 3rd line below.

String endStringNFD = Normalizer.normalize(endString, Normalizer.Form.NFD);
String searchEndStringNFD = endStringNFD.replaceAll(" ", "");
searchEndStringNFD = " " + searchEndStringNFD;

Hmmm. I'm kind of unhappy recreating the string like that. Well, let's make sure the fix works, at least, before optimizing it.

Hmmm... かける, "to call by phone", isn't working. What number is that?

かける - number 153, frequency 57th.

Ok, it didn't work. Now, let's try it with the change backed out.

Still no good. This looks to be another situation like the one mentioned above for あびる. There may be more, especially at level 5, which is the easiest and has the fewest kanji.

Well, really the best thing to do would be to make a copy of the mp3. No, wrong. I should've thought of this earlier. It's a better idea to just search on the hiragana *or* the kanji. Give them both a shot. Yeah, probably that will work. Here's the current search:


if (nameNFD.endsWith(searchEndStringNFD)) {
    if (nameNFD.startsWith(kanjiNFD)) {
        return soundFile;
    }
}


Here's the new search:

if (nameNFD.endsWith(searchEndStringNFD)) {
    // Accept the file if it starts with the kanji or contains the kana.
    if (nameNFD.startsWith(kanjiNFD) || nameNFD.contains(kanaNFD)) {
        return soundFile;
    }
}
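
Here's a rough, self-contained sketch of that kanji-or-kana check. The class name SoundFileMatcher, the helper names, and the file-name layout (Japanese word, a space, the English meaning, then ".mp3") are assumptions for illustration; the real lookup code isn't shown in full here.

```java
import java.text.Normalizer;

class SoundFileMatcher {
    // Decompose to NFD so composed and decomposed kana compare equal.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // True when the file name ends with the expected ending and either
    // starts with the kanji spelling or contains the kana spelling.
    // (Hypothetical helper; parameter names are assumptions.)
    static boolean matches(String fileName, String kanji, String kana, String ending) {
        String nameNFD = nfd(fileName);
        if (!nameNFD.endsWith(nfd(ending))) {
            return false;
        }
        return nameNFD.startsWith(nfd(kanji)) || nameNFD.contains(nfd(kana));
    }
}
```

For example, matches("かける to call by phone.mp3", "掛ける", "かける", "to call by phone.mp3") would succeed through the kana branch even though the file name has no kanji in it (the kanji 掛ける is just sample data here).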


That's a bit complicated. Let's give it a try.

No - it didn't work. かける's definition is "to call by phone". Let's find that in the log.

Errr, right. The file's name ends with "to call by phone.mp3". So my last fix to get rid of blanks in the search term broke things by squashing all the words together into "tocallbyphone.mp3", which no longer matches the file. We'll have to find another solution. First let's put it back how it was.


//String searchEndStringNFD = endStringNFD.replaceAll(" ", "");
String searchEndStringNFD = endStringNFD;
searchEndStringNFD = " " + searchEndStringNFD;

Now, let's see if it finds it. Ok, good - it works. But we robbed Peter to pay Paul. We're back to the drawing board on the mismatch between "word,language" and "word, language".

Regular expressions are rearing their ugly head. I originally had them, but got messed up on the normalization stuff and threw them out. What I could do is split the ending of the file name on spaces and commas (can you do that?) and string the results together to get a pattern like word*language.
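
A minimal sketch of that idea, assuming the English meaning sits at the end of the file name right before ".mp3". The class and method names are made up for illustration. Instead of a bare word*language wildcard, it joins the quoted words with "one or more spaces or commas", so "word,language" and "word, language" both match while "to call by phone" still needs its separators:

```java
import java.util.regex.Pattern;

class EndingPattern {
    // Builds a pattern for the meaning at the end of a file name,
    // tolerating any mix of spaces and commas between the words.
    static Pattern endingPattern(String meaning, String extension) {
        // Yes, split() accepts a regex, so spaces and commas work together.
        String[] tokens = meaning.trim().split("[\\s,]+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                regex.append("[\\s,]+");
            }
            // Quote each word so characters like '.' or '(' stay literal.
            regex.append(Pattern.quote(tokens[i]));
        }
        regex.append(Pattern.quote(extension)).append("$");
        return Pattern.compile(regex.toString());
    }

    // Hypothetical convenience wrapper that assumes an ".mp3" extension.
    static boolean endsWithMeaning(String fileName, String meaning) {
        return endingPattern(meaning, ".mp3").matcher(fileName).find();
    }
}
```

With this, endsWithMeaning("x word, language.mp3", "word,language") matches, and so does a file ending in "to call by phone.mp3", but the squashed "tocallbyphone.mp3" does not.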

Well, first let's fix bathe. Ok, that's done. I feel like testing it. It would really be wise to set up a few Robotium tests for this. This is where automated testing can really save you.

Well, I dunno. How would Robotium capture that the files are being played?

Well, let's optimize this code.

String searchEndStringNFD = endStringNFD;
searchEndStringNFD = " " + searchEndStringNFD;

We'll use StringBuilder:

StringBuilder sb = new StringBuilder();
sb.append(" ");
sb.append(endStringNFD);
searchEndStringNFD = sb.toString();

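That seems faster, though for a one-off concatenation like this the compiler generates much the same StringBuilder code on its own. Where StringBuilder append really pays off over adding strings is when a string gets built up piece by piece, especially inside a loop.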
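
To make the comparison concrete, here's a small sketch with made-up names (PaddingDemo and its methods are not from the app). The first two methods produce identical strings; the third shows the loop case, where reusing one StringBuilder avoids creating a new String on every pass:

```java
class PaddingDemo {
    // Plain concatenation: javac has traditionally compiled this to StringBuilder calls.
    static String padWithConcat(String s) {
        return " " + s + " ";
    }

    // The same result written out by hand.
    static String padWithBuilder(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2);
        sb.append(' ').append(s).append(' ');
        return sb.toString();
    }

    // Loop case: one builder is reused for every word.
    static String joinWithSpaces(String[] words) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(words[i]);
        }
        return sb.toString();
    }
}
```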

Let's do the same for the kana code.

StringBuilder sb = new StringBuilder();
sb.append(" ");
sb.append(kanaNFD);
sb.append(" ");
kanaNFD = sb.toString();


Even faster. I thought it was the Log.d calls slowing it down, but it was these.

Ok, well, we took two steps forward and one step backward here. Let's call it a wrap for now.
