Searching is hard... Pt. 1

Note: This blog post is part one of a two-part series. The second part has not been posted yet.


I won't mince words — searching for things is hard. It ranks up there with password hashing, and given the opportunity, I'd farm out the process to a library that I know does the job.

Nobody I know personally has built a search index by hand, because it'd be a complete waste of time (especially since you'd be reinventing the wheel trying to solve a problem that's been solved multiple times already). So where does that leave us when it comes to forum software?

Typically, full-text search is used. All forum software prior to the 2000s* is based on a relational database, and the simplest way to enable searching is to use the built-in language constructs:

SELECT post_id, content FROM posts WHERE content='%term%';

It's simple and it does the job. Problem solved, right?

The main advantage to this method is its simplicity (barely any developer overhead and maintenance), combined with a pretty good method of finding things. If you like overused business buzzwords, it's the lowest hanging fruit on the tree of search indexing. The highest fruit is a search algorithm on-par with Google.

The problem with this method of searching (and it's a bit of a stretch to call it a problem, since in 90% of cases, it is sufficient) is that it doesn't scale well. Let's run through an example:

  • In a naive search, you are looping through every single entry, looking for a match to the search term. The length of time it takes to complete this search goes up linearly with the amount of content available (in Big-O notation, that'd be O(n))
    • If you have multiple search terms and want to conduct the search so that Farmer John returns results for Farmer or John, then you'd have to run the search twice (so, O(2n), or twice as complex), not to mention that you'd also have some post-processing to do, so you can reduce the two disparate datasets into one result set. You could minimise the post-processing by doing a search for each term (and of course, every combinations of each term, right?), but at the point, you're probably better off taking a good long vacation.
  • Alrighty, so you can increase the efficiency of the search by creating an index, so instead of iterating through the content itself, you iterate through a comparatively small index. The part some people forget is that the building of the index must be done every time the searchable collection is updated (e.g. a new post is made), meaning the scaling issue is still prevalent, just postponed.
    • It's safe to assume that a new entry would not necessitate the complete deletion of the existing index, followed by a re-index, so using an index is universally a better idea than not having one at all.

Either way, search is slow, so much so that some forums restrict searches to registered users only, and even registered users could only search once every x seconds/minutes (!!)**.

Another interesting point to make is that during the initial development phase of many existing forums, there was no precedent for interoperability. Almost all applications relied on themselves, and the implementation of APIs was often a cumbersome process that needed much fine-tuning (anybody remember SOAP with XML data?). Using a full-text search made available through the database was pretty much the easiest and only option at the time.

tl;dr We have a unique opportunity to revolutionise the way search is conducted in forums.

Our Experiment

In the same vein as lots of our other implementations in NodeBB, we wanted to experiment with search. We also had the restriction that our data set was stored in Redis. This severely limited our options to:

  • An off-the-shelf search module
  • Hand-rolling our own search system and indexing

We found Reds, and implemented it soon thereafter. We were happy with the results — it was quick and simple to implement, used Redis, and it even had phonetic search, so even if you misspelled a word, you'd still find what you were looking for! Neat!

Problems arise

We first started noticing problems when we received reports that the search functionality was returning false positives. The phonetic indexing works in this way:

Currently reds strips stop words and applies the metaphone and porter stemmer algorithms to the remaining words before mapping the constants in Redis sets ... [This] means that phonetically similar words will match, for example "stefen", "stephen", "steven" and "stefan" all resolve to the constant "STFN". Reds takes this further and applies the porter stemming algorithm to "stem" words, for example "counts", and "counting" become "count".

This method does fall short given that there are many homonyms in English, and many words that share the same letters when normalised. For example, forum and farm share the same normalised entry: FRM, despite having radically different meanings. When working with full-text searching where strict relevence-based results are desired, the convenience of being able to misspell search terms did not outweigh the rate of false positives.

The final nail in the coffin was the internationalization effort of NodeBB. Reds, being phonetic-based, simply could not reduce any language that used a non-latin character set. I'm pretty sure it also choked on accented characters as well, though I could be wrong. This limited Reds' applicability to forums that spoke only English, with reduced search performance on other languages, or complete unusability on Asian and Cyrillic languages.

Clearly, a solution was called for, which I will expand on further in the next part of this post.

Notes

* Needs citation -- I'm only generalising
** Imagine if a user wanted to DDoS a php-backed forum running on a server with only 3 worker threads. That user would only need to contunuously execute three search queries in order to completely block access to any other user!

Julian Lam

Read more posts by this author.

Toronto, Ontario