State-of-the-Art Unsupervised Part-of-Speech Type-Level Tagger in 300 Lines of Clojure
This is a short self-contained Clojure implementation of:
Simple Type-Level Unsupervised POS Tagging. Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. To appear in Proceedings of EMNLP 2010.
Simply run the script with the following arguments:
infile: path to a file where each line is a sentence and tokens are space-separated
outfile: path to write mapping of words to tags (represented by an integer)
num-iters: number of iterations to run the Gibbs sampler
K: number of tag states to use
alpha: hyper-parameter for type-level distributions (try 1)
beta: hyper-parameter for token-level distributions (try 0.1)
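As a rough illustration of the infile format described above (one sentence per line, tokens separated by spaces), here is a minimal sketch of how such a file could be parsed. The function names `parse-sentences` and `vocabulary` are hypothetical and are not taken from the script itself. Since the tagger works at the type level, the set of distinct words is what the outfile mapping would cover.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of the script): turn corpus text into a
;; vector of sentences, each a vector of space-separated tokens.
;; Blank lines are skipped.
(defn parse-sentences
  [text]
  (->> (str/split-lines text)
       (map str/trim)
       (remove str/blank?)
       (mapv #(str/split % #" +"))))

;; Hypothetical helper: the set of distinct word types in the corpus.
(defn vocabulary
  [sentences]
  (set (apply concat sentences)))

;; Tiny in-memory example standing in for the contents of infile.
(def sample "the dog barks\nthe cat sleeps\n")

(def sentences (parse-sentences sample))
(def word-types (vocabulary sentences))
```

To try this on a real file, replace `sample` with `(slurp "path/to/infile")`, where the path is whatever you would pass as infile.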
If you want to test this on a corpus of an appropriate size, the Brown corpus (approximately 1 million tokens) is available to use as input at http://db.tt/aYl0tfx. It is available for non-commercial purposes only.
Aria Haghighi ([email protected])
My Website
Email the author with any issues.
Copyright (C) 2010 Aria Haghighi
Distributed under the Eclipse Public License, the same as Clojure uses. See the file License.