Friday, June 3, 2011

Refining the search with regex

Now we're at the point where the bugs are getting trickier to solve. A lot of them come down to the minutiae of text handling, which isn't all that fun or appealing, especially when you can't tell whether you're fixing one instance or 100!

But, we have isolated a problem - as I said in the last post, we're back to the drawing board on the mismatch between "word,language" and "word, language".

First, let's figure out what the frequency ranking is. This is really just derived by sorting by frequency of use within a level and taking wherever the word falls.

select * from all_words where level = 5 order by freq

Then use RazorSQL's slick result search to find the word.

Ok, it looks like it's number 55. Let's double-check that it fails. Yep - no audio.

Ok, what we want to do is split the end string on spaces, then rebuild it, turning "word, language" into the regex "word,*language"

So, now we have some code like this:

String[] temp;

/* delimiter */
String delimiter = " ";

/* given string will be split by the argument delimiter provided. */
temp = endStringNFD.split(delimiter);

sb = new StringBuilder();
sb.append(" ");

for (int i = 0; i < temp.length; i++) {
sb.append(temp[i]);
sb.append('*');
}


String searchEndString = sb.toString();
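One thing worth checking at this stage: in a regex, a bare "*" is a quantifier on the character before it, so "word,*language" means "word", zero or more commas, then "language", and it will not bridge the space. Here is a minimal sketch comparing a "*" separator with ".*". The class name, the build helper, and the sample phrase are illustrative, not part of the app.

```java
class StarVersusDotStar {
    // Appends the separator after every token, mirroring the loop above.
    static String build(String phrase, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String token : phrase.split(" ")) {
            sb.append(token).append(separator);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String phrase = "word, language";
        System.out.println(build(phrase, "*"));                   // word,*language*
        System.out.println(phrase.matches(build(phrase, "*")));   // false
        System.out.println(phrase.matches(build(phrase, ".*")));  // true
    }
}
```

With ".*" as the separator, the pattern accepts any run of characters (including none) between the tokens, which covers both "word,language" and "word, language".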

Now the question is how to match the end of the file name being searched to the regular expression.

Really, we just have to take this:


if (nameNFD.endsWith(searchEndString)) {

and make it a regex search.


if (nameNFD.matches(searchEndString)) {
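A caveat when swapping endsWith for matches: String.matches only succeeds when the regex covers the entire string, not just its tail. The sketch below shows the difference, using a made-up name and hypothetical helper methods.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
    // True only if the regex matches the whole name (what String.matches does).
    static boolean wholeMatch(String name, String regex) {
        return name.matches(regex);
    }

    // True if the regex matches somewhere inside the name.
    static boolean partialMatch(String name, String regex) {
        return Pattern.compile(regex).matcher(name).find();
    }

    public static void main(String[] args) {
        String name = "prefix word, language"; // made-up name for illustration
        System.out.println(name.endsWith("word, language"));           // true
        System.out.println(wholeMatch(name, "word,.*language"));       // false
        System.out.println(partialMatch(name, "word,.*language$"));    // true
        System.out.println(wholeMatch(name, ".*word,.*language"));     // true
    }
}
```

So either the pattern gets a leading ".*" to absorb the start of the name, or the search uses Matcher.find() with a "$" anchor.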

But to make it more accurate, we should anchor the match at the end of the string. Also, since matches has to cover the whole file name, the pattern needs a leading ".*" to absorb whatever comes before the blank:


sb = new StringBuilder();

sb.append(".*");
sb.append(" ");

for (int i = 0; i < temp.length; i++) {
sb.append(temp[i]);
sb.append(".*");
}

I really need to test this.


http://www.regexplanet.com/simple/index.html


This expression:

.*word.*

matches:

" word, language"

Ok, so, ultimately I come up with this:

^.* word,.*language\.mp3$

Or as a Java string:

"^.* word,.*language\\.mp3$"
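A quick way to sanity-check that string outside the app is a tiny standalone class. The file names below are hypothetical stand-ins ("xx" takes the place of the Japanese part), and the class name is invented for this sketch.

```java
class EscapedPatternCheck {
    // The pattern from above as a Java string literal; "\\." is a literal dot.
    static final String PATTERN = "^.* word,.*language\\.mp3$";

    public static void main(String[] args) {
        System.out.println("xx word, language.mp3".matches(PATTERN));      // true
        System.out.println("xx word,language.mp3".matches(PATTERN));       // true
        System.out.println("xx word, language.mp3.bak".matches(PATTERN));  // false
    }
}
```

Both spacing variants pass, and anything trailing after ".mp3" is rejected. Since matches already requires a full-string match, the "^" and "$" are harmless but redundant here.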

Now, what about the beginning? Well, first let's test this part.

So, here's our code now:

//String endString = question.english.trim().toLowerCase() + ".mp3";
String endString = question.english.trim().toLowerCase();

String endStringNFD = Normalizer.normalize(endString,
Normalizer.Form.NFD);

String[] temp;

/* delimiter */
String delimiter = " ";

/* given string will be split by the argument delimiter provided. */
temp = endStringNFD.split(delimiter);

sb = new StringBuilder();
sb.append(".* ");

for (int i = 0; i < temp.length; i++) {
sb.append(temp[i]);
sb.append(".*");
}

sb.append("\\.mp3$");
// sb.append(endStringNFD);

String searchEndString = sb.toString();

//Log.d(TAG, ">>>>>>>>>> searchEndString: " + searchEndString);

Log.d(TAG, ">>>>>> file name: " + soundFile.getName());
Log.d(TAG, ">>>>>> nameNFD: " + nameNFD + "x");
Log.d(TAG, ">>>>>> kanjiNFD: " + kanjiNFD);
Log.d(TAG, ">>>>>> searchEndString: " + searchEndString);

//if (nameNFD.endsWith(searchEndString)) {
if (nameNFD.matches(searchEndString)) {

Let's see what happens...it matched! Nice. Well, let's take it out for a test spin - 1 through 10.

Almost made it. But ichi played as hitotsuki?

Here's the search end string:

searchEndString: .* one.*\.mp3$


Here's the file that was found:

filefound - name is: 一月 ひとつき one month.mp3

There are two issues; the first is it should start with the kanji plus a space; the second is that there shouldn't be a .* after the final "one".

We can do them both, although the first is easier.

String kanjiNFD = Normalizer.normalize(question.kanji,
Normalizer.Form.NFD);

StringBuilder sb = new StringBuilder();

sb.append(kanjiNFD);
sb.append(" ");
kanjiNFD = sb.toString();
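How kanjiNFD gets compared against the file name isn't shown here, so the following is only a sketch that assumes a startsWith check on the normalized name. The helper name is invented, and the "ichi" file name is a guess at what the correct file looks like. The Japanese characters are written as Unicode escapes so the snippet compiles regardless of source encoding.

```java
import java.text.Normalizer;

class KanjiPrefixDemo {
    // Assumed check: the normalized file name must begin with the kanji plus a space.
    static boolean startsWithKanji(String fileName, String kanji) {
        String nameNFD = Normalizer.normalize(fileName, Normalizer.Form.NFD);
        String kanjiNFD = Normalizer.normalize(kanji, Normalizer.Form.NFD) + " ";
        return nameNFD.startsWith(kanjiNFD);
    }

    public static void main(String[] args) {
        String ichi = "\u4E00"; // the kanji for "one"
        // The "one month" file from the log, and a hypothetical "one" file.
        String oneMonth = "\u4E00\u6708 \u3072\u3068\u3064\u304D one month.mp3";
        String one = "\u4E00 \u3044\u3061 one.mp3";
        System.out.println(startsWithKanji(oneMonth, ichi)); // false
        System.out.println(startsWithKanji(one, ichi));      // true
    }
}
```

The trailing space is what separates the single kanji from a longer compound that merely begins with it.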

For the loop, we'll just check if we're on the last entry and skip the append of the ".*" if so.

for (int i = 0; i < temp.length; i++) {
sb.append(temp[i]);

// don't append the last because we want it to end on the last entry in the array
if (i < (temp.length - 1)) {
sb.append(".*");
}
}
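To see what dropping the final ".*" buys, here is a small standalone comparison of the two loop variants against the "one month" file name from the log. The class and the looseTail flag are illustrative only, and normalization is left out to keep it short.

```java
class TailPatternDemo {
    // Builds ".* tok1.*tok2" plus the mp3 suffix; with looseTail a ".*" also follows the last token.
    static String build(String english, boolean looseTail) {
        String[] tokens = english.split(" ");
        StringBuilder sb = new StringBuilder(".* ");
        for (int i = 0; i < tokens.length; i++) {
            sb.append(tokens[i]);
            if (looseTail || i < tokens.length - 1) {
                sb.append(".*");
            }
        }
        return sb.append("\\.mp3$").toString();
    }

    public static void main(String[] args) {
        // The "one month" file name from the log, written with Unicode escapes.
        String oneMonth = "\u4E00\u6708 \u3072\u3068\u3064\u304D one month.mp3";
        System.out.println(build("one", true));                     // .* one.*\.mp3$
        System.out.println(oneMonth.matches(build("one", true)));   // true
        System.out.println(build("one", false));                    // .* one\.mp3$
        System.out.println(oneMonth.matches(build("one", false)));  // false
    }
}
```

With the tight tail, "one" has to sit directly in front of ".mp3", so "one month" no longer sneaks through.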

Let's give it another try...

Rats. I've lost "aru". What did it? Let's back out the loop change.

Yes - now it works. What is the problem? Here's the file name:

ある to be,to have (used for inanimate objects).mp3

Here's the expression:

.* to.*be,to.*have.*(used.*for.*inanimate.*objects).*\.mp3$

When it didn't work:

.* to.*be,to.*have.*(used.*for.*inanimate.*objects)\.mp3$

Why not? Is it a special character? Yes, that must be it. Rats. It probably has to be escaped, or something. Hmmm...can you escape them all? I'm curious....

for (int i = 0; i < temp.length; i++) {
sb.append("\\");
sb.append(temp[i]);

// don't append the last because we want it to end on the last entry in the array
//if (i < (temp.length - 1)){
sb.append(".*");
//}
}

Nope. Ok. Let's leave it as it is for now. Things have slowed down a bit. We will need to speed them up in the next post.
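For what it's worth, a likely explanation: "(" and ")" are grouping characters in a regex, so they never match the literal parentheses in the file name. In the working pattern the ".*" on either side of the group happens to soak them up; without the trailing ".*" nothing consumes the ")". Putting a backslash in front of every token doesn't help either, because in java.util.regex "\t" means tab and "\b" means word boundary, and a backslash before a letter that has no defined meaning is an error. One possible way out, sketched below under the assumption that tokens should match literally, is Pattern.quote. The class and method names are made up, and normalization is omitted for brevity.

```java
import java.util.regex.Pattern;

class QuotedTokensDemo {
    // Quotes each token so regex metacharacters such as ( and ) match literally.
    static String buildQuoted(String english) {
        String[] tokens = english.split(" ");
        StringBuilder sb = new StringBuilder(".* ");
        for (int i = 0; i < tokens.length; i++) {
            sb.append(Pattern.quote(tokens[i]));
            // No ".*" after the last token, so it must sit right before the suffix.
            if (i < tokens.length - 1) {
                sb.append(".*");
            }
        }
        return sb.append("\\.mp3$").toString();
    }

    public static void main(String[] args) {
        // The "aru" file name from above, with the kana as Unicode escapes.
        String name = "\u3042\u308B to be,to have (used for inanimate objects).mp3";
        String english = "to be,to have (used for inanimate objects)";
        System.out.println(name.matches(buildQuoted(english)));  // true
        // The unquoted pattern that failed, for comparison.
        System.out.println(name.matches(
                ".* to.*be,to.*have.*(used.*for.*inanimate.*objects)\\.mp3$"));  // false
    }
}
```

Pattern.quote wraps each token in a literal-text section, so commas, parentheses, and anything else in the English text are taken at face value while the ".*" separators keep their regex meaning.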
