BitextMiner is a project mainly written in ..., it's free.
to mine Russ./Eng. sites for translations
CLMA Summer 2010 Project "Bitext Miner" Alexander Shchetnikov
Tasks
Crawling websites
Extracting bitext
Directories
Tasks
The goal of the project is to crawl websites containing English text with Russian translations, to extract such bitext data and postprocess it.
As initial seeds, four websites were chosen for crawling:
http://usefulenglish.ru http://englishouse.ru http://www.proz.com http://homeenglish.ru
They appeared to be most informative and they seemed to have maps favourable for processing. Two open source software were considered for crawling: Nutch and Wget. Wget appeared to be most suitable - because of its simplicity and crawling output options.
The two Wget commands were used: wget -i file (to download URLs specified in the file) wget -r -l1 -np http://page1/page2/ (to download recursively all URLs, 1 level down from page2) Minor modification of these commands were tested (increasing level of download, adding option -w 5, which is 5 seconds pause between retrievals, - useful to crawl http://www.proz.com which has retrievals per time limitations).
Downloaded URLs were organized into folders corresponding each website. Some files were put directly in the folder, some files, as a result of a recursive download, were put in a tree of subfolders.
Scripts for crawling are: crawl_englishouse_homeenglish.sh, crawl_proz_usefulenglish.sh - they require a file with urls as an argument (crawl_englishouse_homeenglish.sh urls_englishouse.txt), the files with urls are provided.
Processing downloaded files for bitext data had some challenges:
bitext_english
bitext_russian
bitext_english
(usefulenglish.ru)bitext_russian
(usefulenglish.ru)To extract as much data and, at the same time, keep the code as simple as possible, I wrote four different code - one for each website. If needed, these 4 codes could be combined.
Codes are written in Java, Filter.java, Filter1.java, Filter2.java, Filter3.java. All of them accept three files as input - path to the folder, name of the file for english output and name of the file for russian output. Some codes create working files (.txt) for intermediate processing.
Processing of the four sites yielded between 30 and 32 thousand bilingual pairs. "Bad" data (swapped languages, incorrect translation, no content, etc.) is estimated to be about 1% of it, and it is produced mostly after processing www.proz.com.
On patas directories are arranged in the following order:
The root directory is /home2/shchea/project. It contains scripts for crawling, code projects folders (Filter, Filter1, Filter2, Filter3), urls files (input for crawling), folders with downloaded raw data (englishouse.ru, homeenglish.ru, usefulenglish.ru, www.proz.com), output_data folder (with bitext data for each site in four subfolders - output_data/englishouse.ru, output_data/homeenglish.ru, etc.; I named each file with English data as bitext_en.txt and each file with Russian data as bitext_ru.txt); and this README file.
On github the folders are: