Home > Uniclasser

Uniclasser

Uniclasser is a project mainly written in ..., it's free.

A command line tool for generating highly efficient classifiers of Unicode characters (e.g., by their General Category, or even arbitrary user-defined classification).

The uniclasser generator receives a Unicode general category name, and produces a classifier for this category:

    ./uniclasser Lu

This will create five files:

  • uniclasser_Lu.cpp - the classifier itself.
  • uniclasser.hpp - a header file for the classifier.
  • test_uniclasser_Lu.cpp - test suite for the classifier.
  • text_uniclasser.hpp - a header file for the test suite.
  • main.cpp - a simple file that runs the classifier against any supplied character, or runs the test suite if no characters were supplied.

The generated classifier performs a simple task: it takes a single Unicode character and returns a boolean that indicates whether or not this character is part of the classifier's character set:

    bool uniclasser_Lu(codevalue c) { .... }

If the supplied Unicode character c is included in the Lu ("Letter, Uppercase") general category, the classifier returns true.

There are several options to control how the classifier files are created:

  • -t causes the generator to not create the test suite files.
  • -c causes the generator to create the files in C rather than in C++.
  • -u <path> tells the generator to read the unicode data from the specified path (default: ./UnicodeData.txt). You can download the unicode data of the latest unicode version from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

If you specify several categories seperated by commas, the created classifier will include all characters within any of these categories. For example:

    ./uniclasser Lu,Ll

will produce a classifier that identifies whether a given character belongs either to the "Letter, Uppercase" or to the "Letter, Lowercase" general categories.

If you specify several arguments seperated by spaces, several classifiers will be created. For example:

    ./uniclasser Lu Ll

will produce two classifiers: one for the "Letter, Uppercase" category and one for the "Letter, Lowercase" category. By performing a logical OR between these two classifiers you could of course create the Lu,Ll classifier mentioned in the previous example. However, this combined classifier will be less efficient than the auto-generated one, and perform more JNE's (jump on not equal) than actually needed.

Using or modifying this project is governed by the MIT License.