Computational-Linguistics

Computational-Linguistics is a project written mainly in Java and Ruby, based on the View license.

Collocation extraction from text corpora

== ComputationalLinguistics

The aim of this project is to extract collocations from the Brown text corpus.

Steps to do this:

ToDo (items marked + are done):

  1. (possible) Build a hash from NN_NN, JJ_NN, VB_NN pairs. The key is the first word in a collocation (head); the value is an array of tails.
  2. (+) Split line into words.
  3. (+) Split word into
  4. (+) Iterate through the corpus.
  5. Count the matches in collocation units.
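
The ToDo items above can be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes each line holds whitespace-separated `word/TAG` tokens, and the names `PATTERNS`, `split_token`, `build_collocation_hash` and `count_matches` are hypothetical.

```ruby
# Tag patterns of interest: NN_NN, JJ_NN, VB_NN (from the ToDo list).
PATTERNS = [%w[NN NN], %w[JJ NN], %w[VB NN]].freeze

# Assumption: a token looks like "word/TAG"; split on the last slash.
def split_token(token)
  word, _, tag = token.rpartition("/")
  [word.downcase, tag.upcase]
end

# Build a hash whose key is the head (first word of a collocation)
# and whose value is an array of tails (second words).
def build_collocation_hash(lines)
  tails_by_head = Hash.new { |hash, head| hash[head] = [] }
  lines.each do |line|
    tagged = line.split.map { |token| split_token(token) }
    # Walk over adjacent token pairs and keep those matching a pattern.
    tagged.each_cons(2) do |(head, head_tag), (tail, tail_tag)|
      next unless PATTERNS.include?([head_tag, tail_tag])
      tails_by_head[head] << tail
    end
  end
  tails_by_head
end

# Count how often each (head, tail) unit occurs.
def count_matches(tails_by_head)
  counts = Hash.new(0)
  tails_by_head.each do |head, tails|
    tails.each { |tail| counts[[head, tail]] += 1 }
  end
  counts
end
```

Keeping the tails in an array (rather than a set) preserves repeats, so the match count falls out of a single pass over the hash.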

Steps actually done:

  1. Open the XML document.
  2. Navigate the XML document to find the XML chunks.
  3. Go from the XML chunks to the sentences.
  4. Parse the sentences into bigrams.
  5. Build a hash of collocation starts (first words), each with a counter.
  6. Build a hash of collocation ends (second words), each with a counter.
  7. Count the simple frequency of each bigram in the document.
  8. Compute chi-square values for the bigrams.
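
Steps 4 to 8 might look roughly like the sketch below. It is an assumption-laden illustration: sentences are taken as plain strings (the XML handling is left out), the tokenizer is a simple regular expression, and the helper names `bigrams`, `count_bigrams` and `chi_square` are hypothetical. The chi-square value uses the standard 2x2 contingency-table form, which may differ from what the project computes.

```ruby
# Split a sentence into lowercase words and return adjacent word pairs.
def bigrams(sentence)
  sentence.downcase.scan(/[a-z']+/).each_cons(2).to_a
end

# Build three counters: bigram starts, bigram ends, and whole bigrams.
def count_bigrams(sentences)
  starts = Hash.new(0)
  ends   = Hash.new(0)
  pairs  = Hash.new(0)
  sentences.each do |sentence|
    bigrams(sentence).each do |first, second|
      starts[first] += 1
      ends[second] += 1
      pairs[[first, second]] += 1
    end
  end
  [starts, ends, pairs]
end

# Chi-square for one bigram from a 2x2 contingency table:
#   o11 = first followed by second     o12 = first followed by another word
#   o21 = another word before second   o22 = neither word in its position
def chi_square(pair, starts, ends, pairs)
  first, second = pair
  n   = pairs.values.sum          # total number of bigram tokens
  o11 = pairs[pair]
  o12 = starts[first] - o11
  o21 = ends[second] - o11
  o22 = n - o11 - o12 - o21
  denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
  return 0.0 if denominator.zero?
  n * (o11 * o22 - o12 * o21)**2 / denominator.to_f
end
```

The start and end counters give the row and column totals of the table, which is why only the bigram count itself has to be looked up per pair.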

Fixes:

  1. Fixed the 'one bigram in a file' problem. This problem appeared when a file had no period at the end of a sentence.
  2. Accounted for Hash#sort changing the type: it returns an array of [key, value] pairs, not a hash.

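The second fix concerns standard Ruby behaviour: sorting a hash yields an array of `[key, value]` pairs. A small sketch of one way to deal with it is shown here; the helper names `sort_by_count` and `sorted_hash` are hypothetical and not taken from the project.

```ruby
# Sort a counter hash by descending count.
# The result is an Array of [key, count] pairs, not a Hash.
def sort_by_count(counts)
  counts.sort_by { |_key, count| -count }
end

# Convert the sorted pairs back into a Hash when hash lookups are needed.
# Ruby hashes keep insertion order, so the sorted order is preserved.
def sorted_hash(counts)
  sort_by_count(counts).to_h
end
```

Iterating over the sorted array directly is often enough; the conversion back to a hash only matters when later code looks values up by key.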

TODO:

  1. How to measure the correlation?
  2. What about the context?