Home > Wiktionary-dump-to-Redis

Wiktionary-dump-to-Redis

Wiktionary-dump-to-Redis is a project mainly written in Perl, it's free.

Take a Wiktionary dump file, import it into Redis to lookup definitions in your own projects.

This is a simple script designed to read one or more Wiktionary XML dump files and import definitions into Redis. It needs more far rigorous testing, but the results seem good enough to experiment with.

Usage

$ wiktionary_to_redis.pl <2nd_wiktionary_dump.xml> ...

You can import one or multiple files. Definitions in files listed first are used. This was designed to use definitions from Simple Wiktionary if available, then English Wiktionary as a fallback.

Make sure Redis (http://redis.io) is running first. Definitions are imported as a list with format :::.

After the import is finished, you can access definitions like this:

$ redis-cli redis> LRANGE sink 0 -1

  1. "noun:::A '''sink''' is a [[container]] for water to wash things in, usually with a [[drain]]."
  2. "noun:::A '''sink''' is a [[device]] that [[remove]]s [[heat]] or [[energy]]."
  3. "verb:::{{ti verb}} If something '''sinks''', it goes down, usually into water."

Queries with spaces need to use quote marks via the Redis CLI:

redis> LRANGE "Pacific Ocean" 0 -1

  1. "proper_noun:::The '''Pacific Ocean''' is a very large [[body]] of water east of [[Asia]] and west of the [[America]]s."

Getting the data

Dump files are available from:

http://dumps.wikimedia.org/

Currently, it's only been tested to work with the Simple English and English sites:

http://download.wikimedia.org/simplewiktionary/latest/simplewiktionary- latest-pages-articles.xml.bz2 http://download.wikimedia.org/enwiktionary/latest/enwiktionary-latest- pages-articles.xml.bz2

Wiki markup on other Wiktionary sites varies, so I can't guarantee it will work for anything else. By default, it will assume the same markup as English Wiktionary.

Updating the dictionary

If you're running the script with updated dump data, current definitions won't be updated, but new words will be added.

If you want to re-import a new dump into the dictionary, you'll need to clear the data already in Redis to avoid duplicate definitions:

$ redis-cli redis> flushall

Dependencies

Redis MediaWiki::DumpFile Data::Dumper