Home > ir-lm

ir-lm

Ir-lm is a project mainly written in Emacs Lisp, based on the GPL-3.0 license.

Emacs extension realizing basic mixed language model in information retrieval system.

This is an Emacs Lisp extension realizing a simple mixed language model for information retrieval in paragraphs of files grouped in multiple directories. Paragraphs are assumed to be separated by a blank line. The formula used for sorting relevance is: P(w|p) = lambdaP(w|Mp) + (1 - lambda)P(w|Mc) where Mp is probabilities model for a paragraph, Mc - probabilities model for the whole collection and 0 < lambda < 1. It's only been tested on cvs version of GNU/Emacs 23.1.50 onwards.

Files: ir-lm.el (elisp source) optional: bg-stop-words.txt, stem-rules.txt, stem-rules2.txt, stem-rules3.txt

Adding (require 'ir-lm) to .emacs allows automatic loading of the extension (after adding to load-path). Optional files are grouped in directory "~/.ir/" on posix systems or "C:ir" on windows systems. Stop word files should be put in subdirectory "stop-words/". Files with stemming rules must be in subdirectory "stem-rules/". These directories are recursively scanned so any sort of subdirectory structures would suffice and all files will be used accordingly. Stop word files may contain different languages but stemming rules processing is tuned only for bulgarian at the moment (not really a problem to make a directory for each needed language with stemming rules and then adding some specific functions for loading these rules and stemming for each such subdirectory).

Commands: ir-lm-index Creates and loads an index of all files of directory and its subdirectories. Index is not saved on disk. The option for file types allows multiple glob filters (separated with space) to be applied to file names, thus indexing just specific files. The coding option allows choosing non-default encoding for all files. The option for adding to current index determines whether index is freshly loaded deleting current index or merging current and new indexes thus allowing search in multiple indexes (treated as one from that moment). If auxiliary files (stop words, stem rules) have not been loaded - attempts to load them.

ir-lm-write-index Writes current index to a chosen file.

ir-lm-load-index Loads an index from a chosen file. The option for adding to current configuration is analogous to the option in ir-lm-index, allowing merging of multiple indexes. Index information for files not present in the file system (as recorded in the index) is not loaded. When merging indexes having information for identical (according to path) files, most recently indexed version for such files is chosen. If auxiliary files (stop words, stem rules) have not been loaded - attempts to load them.

ir-lm-search Searches indexed paragraphs for words showing a line resumes in a new buffer and links to the result paragraphs. When a result is clicked (enter also suffices), marker is positioned upon the result paragraph and search terms are coloured.

ir-clear Freeing index data from memory. (there probably is a problem with the way it's done, as Emacs keeps showing as high memory use)

ir-lm-change-lambda Allowing modifying the lambda search parameter for this language model (defaults to 0.5)

ir-change-stem-level Changes stemming level (it only applies for newly indexing and when auxiliary hashes, stop-words and stem-rules have been cleared or not yet loaded).

ir-lm-change-max-results Changes maximum number of search results showed (defaults to 30).

ir-lm-change-min-words Changes the minimum number of words needed for a paragraph to be indexed (defaults to 20) on new indexing.

ir-lm A convenient interface for the above commands as well as links to all files in current index.