NFreq is a character n-gram generator, designed to classify languages by n-gram frequency. It is written in Haskell and uses SQLite3 as its storage medium, which means that it can boast constant memory usage (and thus the ability to process corpora of indeterminate length) and reasonable speed. Because of this architecture it also comes with some handy n-gram generation and processing libraries, which have proven to be very fast when used on their own.
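As a rough illustration of what the n-gram generation step involves, here is a minimal Haskell sketch. It is not NFreq's own library code: the names `segments`, `ngrams`, and `countNgrams` are hypothetical, the ignore and break lists are copied from the defaults in the help output further down (the leading space in the ignore list is an assumption), and the order in which lowercasing, ignoring, and breaking are applied is a guess. Unlike the real tool, it keeps its counts in an in-memory map rather than in SQLite, so it does not share the constant-memory property.

```haskell
import Data.Char (toLower)
import Data.List (tails)
import qualified Data.Map.Strict as Map

-- Assumed defaults, read off the tool's help text (not taken from its source).
ignoreChars :: String
ignoreChars = " \t\n\r',:_;"

breakChars :: String
breakChars = ".@\"*\\/{}[]()!?-1234567890+=&%|"

-- Split a string at every break character, discarding the break characters.
splitOnBreaks :: String -> [String]
splitOnBreaks s = case break (`elem` breakChars) s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnBreaks rest

-- Lowercase, drop ignored characters, then cut into non-empty segments.
-- (This ordering is an assumption made for the sketch.)
segments :: String -> [String]
segments =
  filter (not . null) . splitOnBreaks . filter (`notElem` ignoreChars) . map toLower

-- All contiguous substrings of length n (n >= 1) within one segment.
ngrams :: Int -> String -> [String]
ngrams n = takeWhile ((== n) . length) . map (take n) . tails

-- Count every n-gram of each requested length; n-grams never span a break.
countNgrams :: [Int] -> String -> Map.Map String Int
countNgrams ns text =
  Map.fromListWith (+)
    [ (g, 1) | seg <- segments text, n <- ns, g <- ngrams n seg ]
```

For instance, `countNgrams [2] "Banana!"` counts the bigrams `an` and `na` twice each and `ba` once.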
By default the tool compiles with some runtime-configurable concurrency support through Haskell's sparks system, though in practice the overhead of this is enough to cancel out any benefit.
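For readers unfamiliar with sparks, the following sketch shows the general mechanism using only `par` and `pseq` from `GHC.Conc`. It is a generic example, not code from NFreq; `parSum` and its `threshold` parameter are hypothetical. Each `par` creates a spark that the runtime may or may not run on another core, and very small sparks cost more to schedule than they save, which is the sort of overhead described above.

```haskell
import GHC.Conc (par, pseq)

-- Sum a list by recursive halving. Below the threshold the work is done
-- sequentially; above it, the left half is sparked with `par` while the
-- current thread evaluates the right half first (`pseq`).
parSum :: Int -> [Int] -> Int
parSum threshold xs
  | len <= max 1 threshold = sum xs
  | otherwise              = left `par` (right `pseq` (left + right))
  where
    len      = length xs
    (ls, rs) = splitAt (len `div` 2) xs
    left     = parSum threshold ls
    right    = parSum threshold rs
```

With GHC, sparks are only run in parallel when the program is built with `-threaded`, and the number of cores is chosen at run time with `+RTS -N`, which is presumably what "runtime-configurable" refers to here.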
Download nfreq-0.2.tar.gz.
Run the tool with no options to see what to do. Stay tuned for some more helpful text, folks.
$ ./nfreq
nfreq version 0.2, (C) Stephen Wattam 2011

nfreq [COMMAND] ... [OPTIONS]
  NGram frequency based language classification tool

Common flags:
  -c --no-lcase           Don't cast to lowercase
  -i --input=FILE         Input files (default: stdin)
  -d --dbfile=FILE        Database file (default: "ngrams.db")
  -t --tbname=STRING      Table name (default: "ngrams")
  -n --n-list=INT         Which ngrams to generate (default:
                          [2,3,4,5])
  -l --ignore-list=ITEM   List of chars to ignore (default: "
                          \\t\\n\\r',:_;")
  -b --break-list=ITEM    List of chars to break on (default:
                          ".@\\\"*\\\\/{}[]()!?-1234567890+=&%|")
  -C --conf=ITEM --oconf  Read ignore, break lists from config file
  -? -h --help            Display help message
  -v --version            Print version information

nfreq build [OPTIONS]

  Examples:

nfreq compare [OPTIONS] [ITEM]

  -H --human-readable --ohuman  Enable human-readable output, rather than
                                JSON.
  -T --text-input --ointext     Use a temporary table to load and analyze
                                text in one step.
  -a --algorithm=ALG            Comparison algorithm (default: "freqsum",
                                possible: freqsum
  -r --ref-table=TABLE          Reference corpus: compare this to all the
                                others (i.e. the corpus you wish to
                                classify)

  Examples:
    nfreq compare -i ngrams.db -r unknown english french latin
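
The help text does not say how the "freqsum" algorithm scores a reference table against the others, so the following Haskell sketch is only one plausible frequency-sum scheme, not NFreq's implementation. All names (`Profile`, `relFreq`, `score`, `classify`) are hypothetical. Each n-gram occurrence in the unknown text contributes the relative frequency of that n-gram in the candidate language, and candidates are ranked by the total.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Map.Strict as Map

-- An n-gram table: n-gram -> raw count (what one database table might hold).
type Profile = Map.Map String Int

-- Convert raw counts to relative frequencies that sum to 1.
relFreq :: Profile -> Map.Map String Double
relFreq p = Map.map (\c -> fromIntegral c / total) p
  where
    total = fromIntegral (sum (Map.elems p))

-- Assumed scoring rule: sum, over the unknown text's n-grams, of
-- (count in unknown) * (relative frequency in candidate).
score :: Profile -> Profile -> Double
score unknown candidate =
  sum [ fromIntegral c * Map.findWithDefault 0 g rel
      | (g, c) <- Map.toList unknown ]
  where
    rel = relFreq candidate

-- Rank named candidate profiles, best match first.
classify :: Profile -> [(String, Profile)] -> [(String, Double)]
classify unknown candidates =
  sortOn (Down . snd) [ (name, score unknown p) | (name, p) <- candidates ]
```

In terms of the example invocation above, `unknown` would play the role of the `-r` reference table and `english`, `french`, and `latin` would be the candidate profiles passed to `classify`.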