Sunday, August 28, 2011

Thesaurus

Steve Hanov blogged about building a thesaurus using a "zero load time" file formats. Below, we translate his implementation into Factor.

You can download the 11 MB thesaurus data file we will be using (containing over 100,000 words and their lists of related words). It is implemented as a single file with a custom binary file format that looks like this:

[ header ]
4 bytes: number of words

[ index section ]
# The words are listed in alphabetical order, so you
# can look one up using binary search.
for each word:
    4 byte pointer to word record

[ word section ]
for each word:
   null terminated text
   4 bytes: number of related words
   for each link:
       pointer to linked word record

Build It

The data file consists of 4 byte "pointers" and null-terminated strings. We can build words to read an integer or a string from a particular location in the file:
: read-int ( ptr -- n )
    seek-absolute seek-input 4 read le> ;

: read-string ( ptr -- string )
    seek-absolute seek-input "\0" read-until drop >string ;
The number of words in the thesaurus is at the beginning of the file:
: #words ( -- n ) 0 read-int ;
The position of each word is found by reading the "nth" index:
: word-position ( n -- ptr ) 4 * 4 + read-int ;
The "nth" word is the string found at the specified word position:
: nth-word ( n -- word ) word-position read-string ;
Now for the fun part. Knowing that the index is sorted, we can build a word that performs a binary search for a particular word using the index.
:: find-word ( word -- n )
    #words :> high! -1 :> low! f :> candidate!
    [ high low - 1 > ] [
        high low + 2 /i :> probe
        probe nth-word candidate!
        candidate word <=> {
            { +eq+ [ probe high! probe low! ] }
            { +lt+ [ probe low! ] }
            [ drop probe high! ]
        } case
    ] while candidate word = [ high ] [ f ] if ;
Once we found the word that we are looking for, we can read its related words.
:: find-related ( word -- words )
    word find-word [
        word-position word length + 1 + :> ptr
        ptr read-int :> #related
        ptr #related [1,b] 4 v*n n+v
        [ read-int read-string ] map
    ] [ { } ] if* ;
Putting this all together, we can construct a file reader from the thesaurus file, a convenience word to run a quotation with the thesaurus as its input stream, and our "related words" function.
: <thesaurus-reader> ( -- reader )
    "vocab:thesaurus/thesaurus.dat" binary <file-reader> ;

: with-thesaurus ( quot -- )
    [ <thesaurus-reader> ] dip with-input-stream ; inline

: related-words ( word -- words )
    [ find-related ] with-thesaurus ;

Try It

If it is all working properly, you should be able to lookup the words that are related to any word that is in our thesaurus file.
( scratchpad ) "food" related-words .
{
    "aliment"
    "bread"
    "chow"
    "comestibles"
    "commons"
    "eatables"
    "eats"
    "edibles"
    "feed"
    "foodstuff"
    "foodstuffs"
    "grub"
    "meat"
    "nourishment"
    "nurture"
    "nutriment"
    "pabulum"
    "pap"
    "provender"
    "provisions"
    "rations"
    "scoff"
    "subsistence"
    "sustenance"
    "tuck"
    "viands"
    "victuals"
}
As for performance, it takes just over one millisecond on my laptop to lookup a single word. Not too shabby! The code for this is on my Github.

1 comment:

Unknown said...

Seems like a nice solution. FWIW, my old Aiksaurus program had a somewhat similar approach, navigating its two data files using fseek to avoid loading everything into memory. Of course, its data files were smaller than what you're dealing with here (about 500 KB together), but at the time this seemed like too much overhead. Ah, how things have come along. :)