Home > search_in_memory_php

search_in_memory_php

Search_in_memory_php is a project mainly written in PHP, it's free.

SearchInMemory is a reseach about full text (fault-tolerant) searches

SearchInMemory as a reseach about full text/fault-tolerant searches

SearchInMemory is my first impression about how works and how should work full text search engines or fault-tolerant searches (FTS). It begins as a research about how stuff works.

It was also a test for: is it hard to write own full text search engine?

You may consider it as a experiment to uncover how full text search engines works.

I have also tried a BinaryTree implementation of index, but it wasn't so good for me as HashIndex.

But even when I finished a really huge part of code there is still place for improvements(you can use it as a roadmap for your own FTS), like:

  • n-gram indexing and searching with levenstein sorting
  • wildcard searching
  • excluding some phrases from results
  • caching generated indexed on some memory based structure on disk
  • steeming words to others
  • improved and more complex way to do faceting
  • a socket connector for searcher from a unix level
  • index updating and deleting particular records
  • whole phrases in " " signs, like: "billy bob" to match exactly this phrase (need to improve inverted indexes in HashIndex)
  • possibilities of import/export data from indexes using formats: json/xml/csv
  • tweaking results based on special criteria or queries

Cheers, Rafal "RaVbaker" Piekarski

Contact: web: http://about.me/ravbaker twitter: ravbaker github: https://github.com/RaVbaker

Great start for your own research:

  • http://en.wikipedia.org/wiki/Levenshtein_distance - a minimal knowlegde about comparing similar words

  • http://en.wikipedia.org/wiki/Inverted_index - goot start for building indexes - specially full inverted indexes

  • http://en.wikipedia.org/wiki/N-gram - N-grams, what it is and why?

  • http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html - quite old Google Research department post about n-grams in practise - with a large available dataset.

  • http://ngrams.googlelabs.com/ - a practise usage of ngrams with Books Ngram Viewer from Google.

  • http://today.java.net/pub/a/today/2005/08/09/didyoumean.html - great article about how Did you mean works with Lucene. Very inspiring post - but mainly about Java.

  • http://framework.zend.com/code/filedetails.php?repname=Zend+Framework&path=%2Ftrunk%2Flibrary%2FZend%2FSearch%2FLucene.php - sourcecode from Zend Framework with their PHP implementation of Lucene. Nice source of thougts.

  • http://www.ir.uwaterloo.ca/book/ - A book when you think BIG. It's about building your own service for full text search engine scalable almost like Google/Bing. Lots of theory and C/C++ code and algorithms. For very begining I suggest reading an excerpt from chapter 4 - Static inverted indicies - http://www.ir.uwaterloo.ca/book/04-static-inverted-indices.pdf

  • Helpful php functions: http://www.php.net/manual/en/function.levenshtein.php, http://www.php.net/manual/en/function.metaphone.php, http://php.net/manual/en/function.soundex.php, http://docs.php.net/manual/en/language.types.array.php :)

Previous:relationships