Lately I have been stopping by Quora to answer questions as best I humbly can and to follow recommendations and other views on my hobbies. Today I found an interesting topic: an algorithm called Aho-Corasick, used to find patterns in a text. You may be thinking that we already have engines like Lucene for that, and yes, that is true, but Lucene is optimized for finding a very small number of words in a data set. What if we want to find an exorbitant number, say 100,000 words, in a text? That is what this algorithm is for.
Here is an answer from an engineer who has an implementation of the algorithm on GitHub, Robert Bor. By the way, thank you, Robert.
“The algorithm is strongest if you look for multiple words in a text. The reason for this is that Aho-Corasick converts the text you are searching in into a kind of index (the Trie). Then it runs through that index exactly once for all the words you are searching for.
The best way to benchmark the algorithm would be to set up a single Trie of a large body of text (for example, Shakespeare) and then use a large set of different words (for example, a dictionary) to look for matches. Basic String.indexOf would repeatedly search the same text, whereas AC uses the Trie and runs through it just once. The more words you are searching for, the faster it becomes in comparison.”
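To make the idea concrete, here is a minimal sketch of Aho-Corasick in Python (my own toy version, not Robert's library): a trie is built from the search words, failure links are added with a breadth-first pass, and then the text is scanned exactly once, however many words we are looking for.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: build a trie of the search
    words, wire up failure links, then scan the text in one pass."""

    def __init__(self, words):
        self.goto = [{}]       # node -> {char: child node}
        self.fail = [0]        # node -> failure link
        self.output = [[]]     # node -> words that end here
        for word in words:
            self._add_word(word)
        self._build_failure_links()

    def _add_word(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.output[node].append(word)

    def _build_failure_links(self):
        # BFS from the root: a node's failure link points to the longest
        # proper suffix of its path that is also a path in the trie.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                fail = self.fail[node]
                while fail and ch not in self.goto[fail]:
                    fail = self.fail[fail]
                self.fail[child] = self.goto[fail].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.output[child] += self.output[self.fail[child]]

    def search(self, text):
        """Return (start index, word) for every match, in one pass."""
        node = 0
        matches = []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.output[node]:
                matches.append((i - len(word) + 1, word))
        return matches

ac = AhoCorasick(["he", "she", "his", "hers"])
print(ac.search("ushers"))  # → [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Note that a naive String.indexOf loop would rescan the text once per word, for roughly O(words × text) work; the automaton above does one pass over the text regardless of how many words are in the dictionary.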
It would be tremendously interesting to be able to use this algorithm in distributed environments where we could load truly gigantic dictionaries, I am thinking terabytes, petabytes, exabytes of data in text form: perhaps genomic information, particle-collision data from an accelerator, data from every human being, something really gigantic.
Have fun and take care.