NFreq

About


NFreq is a character n-gram generator designed to classify languages by n-gram frequency. It is written in Haskell and uses SQLite3 as its storage medium, which gives it constant memory usage (and thus the ability to process corpora of indeterminate length) at reasonable speed. As a by-product of this architecture it also comes with some handy n-gram generation and processing libraries, which have proven to be very fast when used on their own.
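To make the idea of character n-gram frequencies concrete, here is a minimal sketch in Haskell. It is not NFreq's own library code: the names `ngrams` and `ngramCounts` are hypothetical, and the counts are held in an in-memory `Data.Map` rather than in SQLite, so it does not share the constant-memory property described above.

```haskell
import           Data.Char       (toLower)
import           Data.List       (tails)
import qualified Data.Map.Strict as Map

-- | All contiguous character n-grams of length n, in order of occurrence.
-- (Hypothetical helper, not part of NFreq's API.)
ngrams :: Int -> String -> [String]
ngrams n s
  | n <= 0    = []
  | otherwise = [ g | t <- tails s, let g = take n t, length g == n ]

-- | Count every n-gram of each requested length, after lowercasing the text.
-- Assumption: counts live in memory here; NFreq itself stores them in SQLite.
ngramCounts :: [Int] -> String -> Map.Map String Int
ngramCounts ns text =
  Map.fromListWith (+) [ (g, 1) | n <- ns, g <- ngrams n lowered ]
  where
    lowered = map toLower text

main :: IO ()
main = mapM_ print (Map.toList (ngramCounts [2, 3] "Banana"))
```

Running `main` prints one `(ngram, count)` pair per line for the bigrams and trigrams of "banana". A frequency profile like this, built per language, is the kind of data a frequency-based classifier would compare.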

By default the tool compiles with runtime-configurable concurrency support through Haskell's spark mechanism, though in practice the overhead is large enough to cancel out any benefit.

Download

Download nfreq-0.2.tar.gz.

Usage

Run the tool with no options to see what to do. Stay tuned for some more helpful text, folks.

$ ./nfreq
nfreq version 0.2, (C) Stephen Wattam 2011

nfreq [COMMAND] ... [OPTIONS]
  NGram frequency based language classification tool

Common flags:
  -c    --no-lcase                 Don't cast to lowercase
  -i    --input=FILE               Input files (default: stdin)
  -d    --dbfile=FILE              Database file (default: "ngrams.db")
  -t    --tbname=STRING            Table name (default: "ngrams")
  -n    --n-list=INT               Which ngrams to generate (default:
                                   [2,3,4,5])
  -l    --ignore-list=ITEM         List of chars to ignore (default: "
                                   \\t\\n\\r',:_;")
  -b    --break-list=ITEM          List of chars to break on (default:
                                   ".@\\\"*\\\\/{}[]()!?-1234567890+=&%|")
  -C    --conf=ITEM --oconf        Read ignore, break lists from config file
  -? -h --help                     Display help message
  -v    --version                  Print version information

nfreq build [OPTIONS]

Examples:

nfreq compare [OPTIONS] [ITEM]

  -H    --human-readable --ohuman  Enable human-readable output, rather than
                                   JSON.
  -T    --text-input --ointext     Use a temporary table to load and analyze
                                   text in one step.
  -a    --algorithm=ALG            Comparison algorithm (default: "freqsum",
                                   possible: freqsum
  -r    --ref-table=TABLE          Reference corpus: compare this to all the
                                   others (i.e. the corpus you wish to
                                   classify)

Examples:
	nfreq compare -i ngrams.db -r unknown english french latin
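
The `-l` (ignore list) and `-b` (break list) options suggest a preprocessing step before n-grams are generated. The sketch below shows one plausible reading, not NFreq's confirmed behaviour: it assumes ignored characters are deleted outright, that break characters split the text into segments, and that n-grams never span a segment boundary. The function names and the short example lists are hypothetical.

```haskell
import Data.Char (toLower)
import Data.List (tails)

-- | Split text at any break character, dropping empty segments.
-- (Hypothetical helper; the break semantics are an assumption.)
splitOnAny :: [Char] -> String -> [String]
splitOnAny breaks s = case break (`elem` breaks) s of
  (seg, [])       -> [seg | not (null seg)]
  (seg, _ : rest) -> [seg | not (null seg)] ++ splitOnAny breaks rest

-- | Lowercase, delete ignored characters, then split on break characters.
-- Assumption: ignoring happens before breaking.
segments :: [Char] -> [Char] -> String -> [String]
segments ignore breaks =
  splitOnAny breaks . filter (`notElem` ignore) . map toLower

-- | Contiguous character n-grams within a single segment.
ngrams :: Int -> String -> [String]
ngrams n s
  | n <= 0    = []
  | otherwise = [ g | t <- tails s, let g = take n t, length g == n ]

main :: IO ()
main = print (concatMap (ngrams 2) (segments " ," ".!" "Hi, you. Go!"))
```

With a space in the ignore list, as in the tool's defaults, this reading lets n-grams run across word boundaries while still stopping at sentence punctuation and digits.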