Home > tree-merger

tree-merger

Tree-merger is a project mainly written in JAVA and PYTHON, it's free.

Converts a list of trees (n-best) to a compact hypergraph representation

TreeMerger by [email protected] 2011-08-23

This programs converts a list of trees (n-best) to a compact hypergraph representation.

Requires: java 1.5+ License: GPL v3

Two things need to be achieved to create a hypergraph from a list of trees:

  • factorize sub-trees (turns a tree to a DAG);
  • merge super-trees by creating OR nodes in the hypergraph.
  1. Factorizing sub-trees:

An easy way to do it is to hash subtrees by their string representation. It's not scalable as the hash table contains as many entries as there are subtrees. A search-based approach would be more appropriate but slower?

  1. Merging two trees:

We want to add OR nodes (+) to represent multiple trees.

A    A   =>   A
|    |        |
B    B        B  
|    |       / \ 

/ \ / \ / \ C D C D C D | | | | /+\ /+\ E F F E E F F E

In the example above, since the two trees on the left only differ by their leaves, it's tempting to put OR nodes to allow multiple representation of the leaves. Four trees would be represented here which is not correct.

Each OR node doubles the number of trees represented, so from the merger of two trees, there can be only one OR node. The only safe merge point is the lowest common parent of the nodes where differences occur. Here it's B.

  A
  |
  B
/ + \
|   |

/ \ / \ C D C D | | | | E F F E

  1. Merging more than two trees:

In our implementation we iteratively merge each tree with the hypertree. The only difference is that we need to compute the difference between a tree and an hypertree, which is straightforward.

However, the resulting hypertree depends on the order of the merges, and might become suboptimal. For instance, consider the trees above with the four leaf variants: EF, FE, EE, FF. Clearly, the hypertree should have OR nodes at leaf positions that can be E or F. If the trees are merged in the following order: EF, FE, EE, FF. Then, a branch with an OR node at B will be created and remain in the hypertree.

Therefore, we have to minimize the hypertree after each merge operation. For that matter, the children of OR nodes are tested for redundancy (can a subtree be represented by another subtree), and cascading OR nodes are merged.

  1. Input/Output formats:

As input, s-expressions (parenthesed trees), and CoNLL'05 one-word-per-line (with parent in 7th column) are supported. As output, when using s-expressions, the hypertree is represented with OR nodes labeled as "-OR-". In CoNLL output OR nodes have empty columns ("_") except for parent id and label "-OR-". In this format, word ids might be changed and trees are not projective. Conversion from one format to the other is possible but uses a non-standard encoding. Both formats are suboptimal because they only represent trees. Therefore, an fsm format is provided to represent the hypergraph.

  1. fsm output format:

The output file format is a list of arcs representing a finite state machine with special arc labels to denote OR nodes.

Each arc is a triple: