Lattice-parser is a project mainly written in JAVA and PYTHON, it's free.
Transition-based parsing of word lattices
by Benoit Favre [email protected] 2011-08-29
This program is a dependency parser using the transition-based framework by Nivre et al. that can process word lattices. For each path in the lattice, it produces the greedy 1-best sequence of transition, resulting in a parse tree. The output is represented as a minimal hypergraph containing all the parses.
License: GPL v3 Dependences: java 1.5+
2011-08-29: correct hypergraph output from tree-merger 2011-08-19: uniformized trainer and decoder, simplified data structures 2011-07-14: can parse macaon lattices in fr/en; can output a factorized tree (not a graph) 2011-07-13: factorize output using a subtree representation as hash keys 2011-07-12: lazy loading of the model (ModelOnDisk) 2011-07-08: level of functionality: can parse a lattice and output all parses need to come up with a storage methodology for the output
train: java LatticeParser --train projective-trees-conll05 model predict: java LatticeParser model < input evaluate: java LatticeParser --eval model test-trees-conll05
To train a model, you need to provide a projective dependency treebank (non-projective trees can be projectivized with the malt parser). Training requires a lot of memory and a fair amount of time, depending on the size of the treebank. The parser uses liblinear to train a model that predicts parsing labeled shift-reduce transitions. The model is saved in a fast-loading binary format documented in FORMAT.
For parsing lattices, you need a model trained with your parser and a lattice representation of your input. See Lattice file format for details. The output is a hypergraph representation of the input which includes arc ids to refer to the input. Arc ids are computed by topological sort.
See doc.pdf (that you can generate with "pdflatex doc.tex")
paper:
The file format is defined as a list of arcs, plus optional (ignored) end states.
Same as input format with additional -OR- label for alternation nodes in the hypergraph.
WARNING: liblinear fills the scores[] table with a different order than the label order. (file a bug report)
How do we include scores?
how many arcs? how many states can we process efficiently the resulting trees? => pattern matching => pruning paths => all parses through a word path; all words through a parse
java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.train.conll05 -o data/ftb6.train.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.dev.conll05 -o data/ftb6.dev.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.test.conll05 -o data/ftb6.test.conll05.projective -pp head
echo "Jean regarde l'homme qui mange une glace avec des jumelles." | txt2macaon | maca_segmenter | maca_tokenizer | maca_lexer | maca_tagger | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python2 fsm_draw.py echo "Jean likes to eat potatoes." | /data/demo-macaon/translation/maca_translate --text --model en-fr --prune 20 | maca_lexer -t -n -1 | maca_tagger -n 20 | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python view-hypergraph.py