
abbot


TEI conversion utilities


[ASCII-art "abbot" banner: "Your Texts Reloaded"]

=== WHO? ===

Abbot was developed by Brian Pytlik-Zillig, Stephen Ramsay, and Martin Mueller for The MONK Project (http://www.monkproject.org/), and is distributed under the terms of an OSI-compatible license (see LICENSE for details).

=== WHAT? ===

Abbot is a suite of preprocessing scripts for the MONK project. In general, Abbot coordinates two phases of text preparation:

  1. Normalization of XML-like text collections into TEI-A -- an XML format designed to facilitate corpus-based text analysis.

  2. Validation of the converted files.

=== WHEN? ===

As soon as you're done reading this important announcement.

=== WHERE? ===

All of the code for Abbot is written using a combination of (bash) shell, Java, and XSLT. It was designed to run on UNIX-like systems and avails itself of a number of standard UNIX utilities (such as those found in the GNU coreutils package).

We would hesitate to run it using a version of Java lower than 1.5. We would also hesitate to run it on a system that did not have a fast processor and at least 8 gigs of RAM. Some text collections can take many hours to convert even with the latest server hardware. Your mileage may vary. A lot.

=== HOW? ===

The first phase of the conversion reads the DTD/Schema of the target collection and uses the information found within to generate a customized stylesheet that can effect the conversion from the target format to TEI-A. This method, which we call "schema harvesting," is remarkably robust, but it cannot perform miracles. If your texts do not parse or contain a lot of irregular constructions, you will probably need to do some pre-processing prior to the pre-processing with Abbot.
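To give a rough sense of what schema harvesting involves (this is a simplified sketch, not Abbot's actual generator; the file names are made up), imagine pulling the element declarations out of a DTD and emitting one XSLT template stub per element:

  # Sketch only: harvest element names from a DTD and emit XSLT stubs.
  # "target.dtd" and "harvested-templates.xsl" are illustrative names.
  grep -o '<!ELEMENT [A-Za-z0-9._:-]*' target.dtd | awk '{print $2}' |
  while read elem; do
    printf '<xsl:template match="%s">\n  <xsl:apply-templates/>\n</xsl:template>\n' "$elem"
  done > harvested-templates.xsl

The real stylesheet generator does a great deal more (attribute mapping, renaming to TEI-A equivalents, and so on), but the principle is the same.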

Abbot is set up as a pipeline (with abbot.sh as the main controlling file). If you look in that file, you'll see references to a series of modular shell scripts, most of which perform quick corrections on the converted files. We used Abbot to convert some very large and very well-known text collections, so these scripts contain adjustments for common errors and irregularities (including some that are unavoidably introduced through schema harvesting). You may find it useful to use these scripts as a guide, adding to the pipeline and adjusting the existing scripts for your own circumstances.
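To make that concrete, here is a made-up example of what one of those pipeline steps looks like (the entity substitution is purely illustrative; the real scripts target errors specific to the collections we converted):

  # fix-entities.sh (illustrative): normalize a stray named entity
  # in every converted file. Assumes GNU sed for in-place editing.
  for f in output/*.xml; do
    sed -i 's/&mdash;/\&#x2014;/g' "$f"
  done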

The second phase involves validation of the converted files against the TEI-A schema using Sun's Multi-Schema XML Validator (MSV). The TEI-A schema itself is located at http://segonku.unl.edu/teianalytics/. Documentation for the schema is located at http://segonku.unl.edu/teianalytics/TEIAnalytics_doc.html.
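If you are curious what the validation step amounts to (a hedged sketch; the jar and schema file names here are illustrative, so check abbot.sh for the actual invocation), MSV is driven from the command line roughly like this:

  # Validate each converted file against the TEI-A schema and move
  # failures aside. Assumes MSV exits non-zero on invalid documents.
  for f in output/*.xml; do
    java -jar lib/msv.jar teianalytics.rng "$f" || mv "$f" quarantine/
  done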

Running Abbot is simply a matter of:

abbot.sh [target_dir]

Abbot will write the results to the "output" directory. Invalid files are sent to the "quarantine" directory for review.
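A typical session (assuming a directory of XML files called mytexts/) looks like this:

  abbot.sh mytexts/
  ls output/        # converted files that validated against TEI-A
  ls quarantine/    # converted files that failed validation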

Please note that Abbot is self-contained and ships with all the necessary libraries. You should run it from the abbot directory and leave everything as it is.

=== HUH? ===

  1. Why do I need to leave everything as it is?

Abbot was developed for in-house use on the MONK project, and we didn't see any need to write an installer. We could make one if enough people think it would be useful. You can probably do it yourself, if you don't mind fooling around with CLASSPATH and PATH variables. You can also adjust the main script slightly so that it works more transparently with pipes and redirection operators.
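If you do want to go that route (an untested sketch; substitute wherever you unpacked Abbot for the path shown), the adjustment amounts to a couple of lines in your shell profile:

  # Illustrative only: put Abbot on your PATH and its jars on the
  # CLASSPATH. Classpath wildcards need Java 6+; on 1.5, list each jar.
  export PATH="$PATH:/path/to/abbot"
  export CLASSPATH="$CLASSPATH:/path/to/abbot/lib/*"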

  2. Shell scripts? Sed? Are you kidding?

Abbot spends most of its time controlling other applications (like Saxon and MSV), and we think the shell is ideal for exactly this sort of thing. Since speed becomes a factor with large text-processing jobs, we prefer small, fast apps like sed and grep (which are written in C) to their equivalents in Java (or Ruby, Python, etc.).

  3. Do my texts have to be valid? Well formed? Can they be in SGML?

Your texts don't have to be perfect. However, we found ourselves doing a lot of hacking (some prior to the run, some within the Abbot pipeline) to get it to work. If your texts are valid, well formed, and in XML, it should work right out of the box. We have used Abbot to transform several SGML collections, though a bit of preprocessing (via James Clark's osx utility) was required to get the files into XML first.
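For the SGML collections, that preprocessing step looked roughly like this (a sketch; osx ships with James Clark's SP/OpenSP package, and your DTD setup may require additional flags or catalog files):

  # Convert each SGML file to XML before handing it to Abbot.
  # Directory layout and file extensions here are illustrative.
  mkdir -p xml
  for f in sgml/*.sgm; do
    osx "$f" > "xml/$(basename "$f" .sgm).xml"
  done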

  4. Can I use Abbot to migrate my texts from P4 to P5?

Yes, you can, since TEI-A is P5 compliant. However, we consider this a side effect of Abbot, insofar as we did not explicitly create a tool for this purpose. Since your goals likely go beyond the needs of corpus-based text analysis, you may find that there are some gaps. That said, Abbot is a remarkably good P4 to P5 converter in most ordinary circumstances.

  5. What about part-of-speech tagging, lemmatization, tokenization, sentence splitting, and named-entity extraction? Don't you need all that for corpus-based text analysis?

Yes! And this is exactly what we did for MONK, but not with this tool. The next stage of the MONK pre-processing pipeline (after Abbot) used MorphAdorner -- an amazing program that does all of the above. MorphAdorner was written by the inimitable Phil "Pib" Burns at Northwestern. It is particularly adept when it comes to handling TEI-A files. You can read all about it at:

http://morphadorner.northwestern.edu/

  6. This is great! When are you going to turn it into a real userland tool?

Thanks, we are thinking about doing something like that. We'd also like it to take advantage of multi-core systems, maybe have some kind of GUI, produce better logging and reporting, and make it easier to customize.

  7. This sucks! When are you going to make it into a real userland tool?

We're sorry. If you have specific suggestions, bug reports, or feature requests, please feel free to contact the developers:

Brian Pytlik-Zillig «[email protected]»
Stephen Ramsay «[email protected]»
Martin Mueller «[email protected]»