wikiextractor

Home > wikiextractor

Wikiextractor is a project mainly written in JAVA and SHELL, it's free.

Extraction of Wikipedia plain text bi-lingual page pairs

This project contains some code and scripts to extract plain text pairs of linked wikipedia articles in English and L2.

Get wikipedia dumps from wikimedia.org:
- Download (tools/download.pl) and unzip (tools/unzip.pl) wikipedia dumps for a set of languages.
Extracting interlinked plain text page pairs
- Compile the wikiextractor project (tools/build.sh). The script generates ../lib/wikiextractor.jar.
- Extract plain text pages (extractAll.pl) for a set of languages. Pages are extracted if they have inter-language links to corresponding pages in English. Filenames are the names of the corresponding English articles (both plain text articles and original markup are saved).
- Pair up the articles in English and L2 (allpairs.pl), saving names of the pairs in list.txt.

E.g. (for en-es):

cd tools perl download.pl languages dumps perl unzip.pl languages dumps

bash build.sh perl extractAll.pl languages dumps out

mkdir -p out/pagepairs/logs perl allpairs.pl en es out/pages out/pagepairs/en-es/pairs &> out/pagepairs/logs/en-es.pairs.log

Previous：PostHelp

Next：jNet