Levenshtein is a free project written mainly in Ruby.
Find words based on Levenshtein distance.
This application took ~19 seconds (loading the population + building connections + counting nodes for "causes"). The code was run on OS X with a 2.4GHz dual-core processor. Execution time depends on other CPU/memory-hogging applications running at the same time.
Keeping the base problem in mind, i.e. solving for a Levenshtein distance of 1, meant finding each word's children (one character removed), siblings (one character substituted) and parents (one character added).
Finding children is straightforward, as a word of length S can have at most S children. For example, the children of the word "causes" would be: "auses", "cuses", "cases", "caues", "causs", "cause", which can simply be looked up in the hash for words of length 5. Thus finding children can be reduced to creating S possible words and looking them up among the words of length S-1.
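As a minimal sketch, assuming a `words_by_length` hash keyed by word length (a hypothetical name, not necessarily the project's actual structure), child lookup might look like this:

```ruby
# Hypothetical sketch: generate the S deletion candidates for `word` and
# keep the ones that exist among the words of length S-1.
def children(word, words_by_length)
  shorter = words_by_length[word.length - 1] || {}
  (0...word.length)
    .map { |i| word[0...i] + word[(i + 1)..-1] } # drop the i-th character
    .uniq
    .select { |candidate| shorter.key?(candidate) }
end

# children("causes", 5 => { "cause" => true }) # => ["cause"]
```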
Finding siblings is a little less straightforward, but is similar to finding children. Two siblings will always have a child in common, obtained by eliminating the character that differs. However, the index position at which the character is eliminated is important: "causes" and "abuses" have a child "auses" but are not siblings, while "causes" and "pauses" have "auses" as a child and are siblings. This means siblings can be found using a children list, memorizing the position of the character eliminated, and connecting words that share the same child. For example: one possible substitution of "causes" is "_auses", where _ is any letter from a-z. Therefore, the hash for level 6 would contain {"auses1" => "causes"} - where 1 is the index position of the letter substituted and the value is the word. (Technically it holds the Node object for that word, but here I just use the word for simplicity.) So the hash for "causes" would look like this: {"auses1" => "causes", "cuses2" => "causes", "cases3" => "causes", "caues4" => "causes", "causs5" => "causes", "cause6" => "causes"}
Pretend the word "causes" has already been processed and the hash is loaded as above. Now, when the iteration comes up against the word "pauses", it creates the word-substitution combinations below. The hash for "pauses" would look like this: {"auses1" => "pauses", "puses2" => "pauses", "pases3" => "pauses", "paues4" => "pauses", "pauss5" => "pauses", "pause6" => "pauses"}
Except that instead of adding the key "auses1", which already exists, we have found a sibling of "pauses" at a Levenshtein distance of 1 and connect it with "causes".
On the other hand, "abuses" will create "auses2", which will not match "auses1", and therefore will not be linked to "causes" - which is correct, as the Levenshtein distance between "causes" and "abuses" is greater than one. It is worth noting that the above approach is more performant than iterating from a to z, substituting each letter of the alphabet at the respective index position and then checking whether the word exists in the population. The latter would take anywhere from S to S*26 iterations, where S is the length of the string, whereas the above approach takes at most S iterations.
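A hedged sketch of this sibling detection (the `seen` hash and method names are illustrative, not the project's actual API):

```ruby
# For each word, emit "child + eliminated index" keys such as "auses1";
# two words producing the same key are siblings and get doubly linked.
def connect_siblings(words)
  seen = Hash.new { |h, k| h[k] = [] }
  connections = Hash.new { |h, k| h[k] = [] }
  words.each do |word|
    word.length.times do |i|
      key = word[0...i] + word[(i + 1)..-1] + (i + 1).to_s # e.g. "auses1"
      seen[key].each do |other|
        connections[word] << other
        connections[other] << word
      end
      seen[key] << word
    end
  end
  connections
end

# connect_siblings(%w[causes pauses abuses])["causes"] # => ["pauses"]
```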
* What about parent nodes? Finding parent word combinations is the hardest puzzle to solve, since it is expensive to guess what the parent of a word should be. Given there are 26 letters in the alphabet, if a word is S letters long, that makes 26*(S+1) candidate parents (a letter inserted at any of S+1 positions) - times 260,000 words. We can get around this by building connections starting with the queue of the longest words, finding siblings and children, and then moving one level down and repeating. Thus we get the parents for free through finding the children, as sketched below.
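A sketch of that top-down sweep, reusing the hypothetical `words_by_length` and `children` from above:

```ruby
# Sweep from the longest words down; every child link recorded for a word
# is simultaneously a parent link for the shorter word, so parents never
# have to be guessed via 26*(S+1) insertion candidates.
def build_connections(words_by_length)
  connections = Hash.new { |h, k| h[k] = [] }
  words_by_length.keys.sort.reverse_each do |length|
    words_by_length[length].each_key do |word|
      children(word, words_by_length).each do |child|
        connections[word] << child # child of `word`
        connections[child] << word # parent of `child`, for free
      end
    end
  end
  connections
end
```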
Networks and Nodes: Each word is saved as a node object. When the population hash is loaded for the first time, each word is converted into a node representing that word. Each node has an attribute called first_degree_friends, to which it adds any children/siblings it runs across once connection building starts. Each time a node is added as a friend, it is doubly linked (see the add_friend method in the Node class). The reason for the two-way link is to be able to walk the network from any node without ending up at dead ends.
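A minimal sketch of such a node (`add_friend`, `first_degree_friends` and `touched` are named in the text; the rest of the shape is assumed):

```ruby
class Node
  attr_reader :word, :first_degree_friends
  attr_accessor :touched

  def initialize(word)
    @word = word
    @first_degree_friends = []
    @touched = 0 # set to 1 once the node has been counted
  end

  # Two-way link, so the network can be walked from any node
  # without hitting dead ends.
  def add_friend(node)
    return if @first_degree_friends.include?(node)
    @first_degree_friends << node
    node.first_degree_friends << self
  end
end
```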
Walking the network: This is basically walking a doubly linked list. To make sure each node is counted once, and to avoid circular traversals, an attribute on each node called "touched" is set to 1 once the node has been counted.
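A sketch of that walk, written iteratively to avoid deep recursion on large networks:

```ruby
# Count every node reachable from `start`, marking `touched` so each node
# is counted exactly once and circular traversals terminate.
def count_network(start)
  stack = [start]
  count = 0
  while (node = stack.pop)
    next if node.touched == 1
    node.touched = 1
    count += 1
    node.first_degree_friends.each { |friend| stack << friend if friend.touched.zero? }
  end
  count
end
```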
Computational time: The complexity of the above algorithm is O(P + N) = O(P) (since N <= P), where P = size of the population and N = size of the network, versus O(P*N) = O(P^2) when iterating through the network for each word. Processing the population as above costs less than the combined cost of loading the population and then iterating through each word in the network to find the friends of a specific word.
Then I could split my input string in half, like so: "cau"/"ses" for even-length strings or "ca"/"use" for odd-length strings. Then all that was required was looking up strings that matched the first/last half of the input string and iterating through those. The idea was to do as few iterations as possible, since hash lookups are much cheaper than iterating through arrays. This approach got the run time down to 480 seconds, or 8 minutes (see the sketch after this list).
4. The problem still was that this type of algorithm had a complexity of P*N, where P = size of the population and N = size of the network (or P'*N, where P' = f(P) = the reduced size of the population after organizing by length/word splits etc.). That means as the population gets bigger, the network is likely to get bigger, which means the computational time grows quadratically rather than linearly. This led to the final solution, which does not depend on the network size. It is also worth noting that the final solution yields every possible network in the population, and can easily be scaled by processing word-length queues in parallel, since finding friends at each level is independent of the others.
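A rough sketch of that half-split bucketing (hypothetical names; this is the earlier, superseded approach):

```ruby
# Bucket every word by its first and last half. A single substitution can
# only touch one half, so candidate siblings of an input word share at
# least one exact half with it and only those buckets need a full check;
# length-changing edits would need the neighboring lengths' buckets too.
def half_buckets(words)
  buckets = Hash.new { |h, k| h[k] = [] }
  words.each do |word|
    mid = word.length / 2
    buckets[[:first, word[0...mid]]] << word
    buckets[[:last, word[mid..-1]]] << word
  end
  buckets
end

# buckets = half_buckets(%w[causes pauses causal])
# buckets[[:last, "ses"]] # => ["causes", "pauses"]
```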