Home > lattice-parser

lattice-parser

Lattice-parser is a project mainly written in JAVA and PYTHON, it's free.

Transition-based parsing of word lattices

Transition-based parser for word lattices

by Benoit Favre [email protected] 2011-08-29

This program is a dependency parser using the transition-based framework by Nivre et al. that can process word lattices. For each path in the lattice, it produces the greedy 1-best sequence of transition, resulting in a parse tree. The output is represented as a minimal hypergraph containing all the parses.

License: GPL v3 Dependences: java 1.5+

News

2011-08-29: correct hypergraph output from tree-merger 2011-08-19: uniformized trainer and decoder, simplified data structures 2011-07-14: can parse macaon lattices in fr/en; can output a factorized tree (not a graph) 2011-07-13: factorize output using a subtree representation as hash keys 2011-07-12: lazy loading of the model (ModelOnDisk) 2011-07-08: level of functionality: can parse a lattice and output all parses need to come up with a storage methodology for the output

Usage

train: java LatticeParser --train projective-trees-conll05 model predict: java LatticeParser model < input evaluate: java LatticeParser --eval model test-trees-conll05

To train a model, you need to provide a projective dependency treebank (non-projective trees can be projectivized with the malt parser). Training requires a lot of memory and a fair amount of time, depending on the size of the treebank. The parser uses liblinear to train a model that predicts parsing labeled shift-reduce transitions. The model is saved in a fast-loading binary format documented in FORMAT.

For parsing lattices, you need a model trained with your parser and a lattice representation of your input. See Lattice file format for details. The output is a hypergraph representation of the input which includes arc ids to refer to the input. Arc ids are computed by topological sort.

How does it work?

See doc.pdf (that you can generate with "pdflatex doc.tex")

Ideas

  • c implementation
  • incremental features
  • implement in malt

paper:

  • present algorithm
  • sparse representation?
  • minimal tree automaton?
  • what about arc eager parser?
  • subtrees: explicitly factor tree during parse (because we have all subtrees): use input/chilren/label as hash key incremental: a given tree is already factorized
  • same for common ancestor tree
  • tree automata literature

Lattice file format (input)

The file format is defined as a list of arcs, plus optional (ignored) end states.

word tag

Hypergraph file format (output)

Same as input format with additional -OR- label for alternation nodes in the hypergraph.

id:word:tag label -OR-

Notes

WARNING: liblinear fills the scores[] table with a different order than the label order. (file a bug report)

How do we include scores?

how many arcs? how many states can we process efficiently the resulting trees? => pattern matching => pruning paths => all parses through a word path; all words through a parse

Projectivize FTB using malt

java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.train.conll05 -o data/ftb6.train.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.dev.conll05 -o data/ftb6.dev.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.test.conll05 -o data/ftb6.test.conll05.projective -pp head

Examples

echo "Jean regarde l'homme qui mange une glace avec des jumelles." | txt2macaon | maca_segmenter | maca_tokenizer | maca_lexer | maca_tagger | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python2 fsm_draw.py echo "Jean likes to eat potatoes." | /data/demo-macaon/translation/maca_translate --text --model en-fr --prune 20 | maca_lexer -t -n -1 | maca_tagger -n 20 | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python view-hypergraph.py

Previous:Hello-World