Home > flickorpus

flickorpus

Flickorpus is a project mainly written in PYTHON and SHELL, based on the MIT license.

flickorpus collects an image and tag corpus from flickr.

flickorpus

flickorpus collects an image and tag corpus from flickr. For different tag queries, it finds the N most relevant images, and retrieves the tags for those images. It never actually downloads images themselves, it merely has sufficient metadata to concoct the correct URL.

Written by Joseph Turian

Currently available at: http://github.com/turian/flickorpus/tree LICENSE ------- MIT license. See LICENSE.txt If you collect good data with this program, we encourage you to pay it forward and share with others. REQUIREMENTS ------------ * Python * Beej's Python Flickr API: * ZODB: * Our python common library, in your PYTHONPATH: git clone git://github.com/turian/common.git If you download 'common' into directory DIR, you have to put DIR in your PYTHONPATH: export PYTHONPATH=$PYTHONPATH:$DIR GETTING STARTED --------------- * Get the source code, in a location present in your PYTHONPATH: hg clone http://pylearn.org/hg/flickorpus * Install Beej's Python Flickr API: easy_install flickapr * Copy keys.py.tmpl to keys.py, and enter your flickr API keys. * ZODB: * Install ZODB. For example: easy_install ZODB3 mkdir db/ * If you want only one local ZODB instance (simpler): * Copy flickr-local.conf to flickr.conf * Or, if you want to allow multiple clients to save data to your ZODB instance (more flexible): * Get a ZODB server running, e.g. copy zeo.conf.tmpl to zeo.conf, (optionally) edit zeo.conf and zeo.sh, and run ./zeo.sh If you like, you can edit zeo.sh to explicitly specify the hostname. * Copy flickr.conf.tmpl to flickr.conf, and edit the HOSTNAME to match the host on which the ZEO server was launched. * Run ./searchspider.py to begin accumulating a corpus. If you do not see any progress within a few seconds, and you are using the client/server ZODB approach (above), try using the local ZODB instance. * Run ./dumpdatabase.py to dump the database in a JSON format. SOME DETAILS ------------ ZODB gives the program persistent state, so you can start and stop execution as you please. Work will continue where you left off. Flickr API queries are cached in ZODB. Note that search results may get stale over time. Flickr typically throttles you to two requests per second. It wouldn't be too hard to extend this program to operate concurrently. I have not done so, though. Email me for more details. ./searchspider.py The search spider begins with the seed queries in myflickr.seed_queries. For each query, it searches for the 1000 (= myflickr.total) most relevant results with this tag. This requires 2 = 1000/500 (= myflickr.total / myflickr.queries_per_page) API requests. For each of the 1000 results, it retrieves their metadata. In particular, it retrieves every tag for the image. If an image ID was returned previously in another query, then its metadata is already cached and does not need to be retrieved. Otherwise, an API request is issued. After all seed queries have been pulled, the spider goes over the entire database of stored image metadata. It determines the most common tag that has not already been queried and that is not present in myflickr.blacklist_queries. It then queries this tag. Lather, rinse repeat. ./getresults.py tag Search for the 1000 (= myflickr.total) most relevant results with this tag, and output the 75x75 images in HTML. ./tags.py For every image in the ZODB, retrieve its tags and output them. BUGS ---- Doomie found a bug in the database initialization: "So I managed to quickly hack a fix to the problem by simply initializing the dictionary of objects in the DB. Wonder if there's a cleverer fix for that :) " COMMENTS, QUESTIONS, ETC. ------------------------- Please email the author.
Previous:hamster-basecamp