Monday, June 22, 2009

Start Hacking Montezuma

It is a pity that I suspended my Lisp studies for several weeks. I should concentrate on things that are meaningful and drop the meaningless ones, such as arguing in forums and reading entertainment stories.

Following Lesie's suggestion, I am getting ready to hack on Montezuma. First, I read the paper <An Object-Oriented Architecture for Text Retrieval>. It is a great paper: it uses an elegant, simple approach to accommodate a scalable, complex architecture. I also understand that Montezuma cannot use the code samples.

I suppose fixing bugs is a good way to get involved in an open source project, :).

standard tokenizer hangs on some input

As Edi Weitz pointed out, the culprit is the complex regular expression (the token-regexp method) in standard-tokenizer.lisp. I have reduced the problem to a simple case:

CL-USER> (cl-ppcre:scan
              (cl-ppcre:create-scanner
                 "(_\\w+)*\\@\\w+") "_______________________________________"
                          :start 0)
;; Evaluation aborted.

I speculate that the cause is that '\w' also matches the underscore, so the group and the '\w+' inside it compete for the same characters. Replacing the underscore with any other character that '\w' matches triggers the hang too:

CL-USER> (cl-ppcre:scan (cl-ppcre:create-scanner
               "(a\\w+)*\\@\\w+") "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
               :start 0)
;; Evaluation aborted.
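
A rough way to see why this could blow up: each pass through the group needs at least two characters (the leading character plus one or more for '\w+'), and a backtracking matcher may end up trying every way of cutting the run into such pieces before it gives up on the '@'. The sketch below only counts those cuts. count-splits is a made-up helper name, and this is a back-of-the-envelope model, not a description of how cl-ppcre actually searches.

```lisp
;; Back-of-the-envelope model (an assumption, not cl-ppcre internals):
;; count the ways a run of N word characters can be cut into consecutive
;; pieces of length >= 2, one piece per pass through a group like (_\w+).
(defun count-splits (n)
  "Number of ways to cut a run of N characters into pieces of length >= 2."
  (let ((ways (make-array (1+ n) :initial-element 0)))
    ;; An empty run can be cut in exactly one way: no pieces at all.
    (setf (aref ways 0) 1)
    (loop for len from 2 to n
          do (setf (aref ways len)
                   ;; Choose the length of the last piece, then count the
                   ;; ways to cut whatever is left in front of it.
                   (loop for piece from 2 to len
                         sum (aref ways (- len piece)))))
    (aref ways n)))

;; Print the count for a few run lengths.
(dolist (n '(5 10 20 40))
  (format t "~2D characters: ~D ways~%" n (count-splits n)))
```

The counts follow the Fibonacci sequence: (count-splits 10) is 34 and (count-splits 40) is 63245986, so a few dozen repeated characters already give tens of millions of candidate cuts.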

cl-ppcre is a Perl-compatible regular expression library, so I checked the same case in Perl. Maybe Perl is more efficient at regular expression matching: even after I raised the number of underscores, it finished without trouble.

$str = "john._______________________________________
__________________________________";

if ($str =~ m/(_*\w+)*\@\w+/)
{
   print "ok\n";
}

To conclude, this isn't Montezuma's bug but cl-ppcre's.
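
Whichever side is at fault, one possible workaround for the reduced pattern is to wrap the repeated group in an independent subexpression, '(?>...)', which cl-ppcre supports. Once the group has matched, the matcher does not go back into it when the '@' fails. This is only a sketch for the reduced case: the pattern and the scan-reduced helper are made up for illustration and are not Montezuma's token-regexp, and loading through Quicklisp is an assumption.

```lisp
;; Assumes Quicklisp is installed; any other way of loading cl-ppcre works too.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload "cl-ppcre" :silent t))

;; The reduced pattern with the repeated group made independent (atomic).
;; This is NOT Montezuma's token-regexp, only the reduced case from above.
(defparameter *reduced-scanner*
  (cl-ppcre:create-scanner "(?>(?:_\\w+)*)@\\w+"))

(defun scan-reduced (string)
  "Run the reduced scanner over STRING and return whatever cl-ppcre:scan returns."
  (cl-ppcre:scan *reduced-scanner* string))

;; A long run of underscores with no '@' should simply fail to match.
(print (scan-reduced (make-string 60 :initial-element #\_)))

;; An input that does contain an '@' part still matches.
(print (multiple-value-list (scan-reduced "_foo_bar@example")))
```

On the all-underscore string this should return NIL quickly, and "_foo_bar@example" matches from position 0 to 16. Whether the same change is safe for the full tokenizer expression would need checking against its unit tests.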

broken :must-not-occur or phrase query

I found that the query "html-template !\"edi weitz\"" works on my test corpus, but when I tried the query "html-template !edi !test", it told me:

"Invalid initialization argument: SCORER in call for class #<STANDARD-CLASS DISJUNCTION-SUM-SCORER>.".

Obviously there is no slot named scorer in the disjunction-sum-scorer class. It may be a typo, and scorer should be replaced by sub-scorers. I changed it at line 199 of boolean-scorer.lisp; the query now works and all unit tests pass too.
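
This kind of error is easy to reproduce in isolation, because make-instance rejects any initialization argument the class does not declare. A minimal sketch with a stand-in class follows; demo-sum-scorer and try-make-scorer are hypothetical names, and the real slot definitions in Montezuma may differ.

```lisp
;; Stand-in class (hypothetical): a single slot whose only initarg is
;; :sub-scorers, mirroring the situation described above.
(defclass demo-sum-scorer ()
  ((sub-scorers :initarg :sub-scorers :accessor sub-scorers)))

(defun try-make-scorer (initarg value)
  "Create a DEMO-SUM-SCORER with INITARG set to VALUE.
Return the stored sub-scorers, or :REJECTED if MAKE-INSTANCE signals an error."
  (handler-case
      (sub-scorers (make-instance 'demo-sum-scorer initarg value))
    (error () :rejected)))

;; The misspelled initarg is refused, the declared one is accepted.
(print (try-make-scorer :scorer '(a b)))
(print (try-make-scorer :sub-scorers '(a b)))
```

The first call returns :REJECTED and the second returns (A B), which matches the behaviour seen before and after renaming the argument.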
