When was the last time you went on YouTube and clicked through the menus to find the video you were looking for? You don’t remember? Neither do I. What you do when you go to a site like YouTube is type in some keywords and click the search icon. That’s full-text search. YouTube will return the most relevant videos based on your keywords. The same is true of Amazon, eBay, and many other eCommerce sites. In fact, full-text search is so common that we take it for granted. Without it, we’d find ourselves clicking link after link like we’re back in the ’90s.
Despite how common site search functionality is, too many of us tend to rely on a pre-packaged solution. As a result, it’s hard to find a detailed tutorial on how to implement one yourself. In this post, I want to help you do precisely that. I should note, though, that when I say implement, I don’t mean writing a search engine from scratch. It’s neither practical nor logical to do that. Instead, I’m referring to integrating a battle-tested search engine into your website.
About a year ago, I started working on Fooreviews, a site that attempts to take Amazon reviews and group them by product features. I wanted full-text search to be a core feature of the website because that way, the experience of getting the information you’re looking for would be familiar. Namely, you’d type the name of a product in the search bar and get the most relevant results just like you would on Amazon. Because I was writing a Django site, I had several options for making that happen. In the following section, I’ll briefly discuss those options before jumping into the tutorial.
Overview of Existing Solutions
The very first search engine solution that I considered is called Whoosh. Whoosh is a pure Python library that implements text scoring and indexing, allowing you to make relatively fast queries. Because it’s written in Python, you can integrate it natively into any Python project that requires search capability. But as with most things written in pure Python, its emphasis on simplicity and native behavior comes at the cost of speed. Personally, I’m not aware of any traffic-heavy site using Whoosh. Although it may be great for a small project or a small business, it may not be the best choice for anything more.
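To make "text scoring and indexing" a little more concrete, here is a minimal sketch of the general idea behind a library like Whoosh: an inverted index that maps each word to the documents containing it, plus a simple TF-IDF score for ranking. This is not Whoosh's API or its actual implementation. The names TinyIndex and tokenize are made up for illustration, and a real engine adds stemming, phrase queries, on-disk storage, and far better scoring.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        # term -> {doc_id: how many times the term occurs in that document}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one document under the given id."""
        for term, freq in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = freq
        self.doc_count += 1

    def search(self, query):
        """Return matching doc ids ranked by a simple TF-IDF score."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue  # the term appears in no indexed document
            # Terms that appear in fewer documents get a higher weight.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, freq in docs.items():
                scores[doc_id] += freq * idf
        return sorted(scores, key=scores.get, reverse=True)


index = TinyIndex()
index.add(1, "Wireless noise cancelling headphones")
index.add(2, "Wired headphones with microphone")
index.add(3, "Wireless mouse")
results = index.search("wireless headphones")
```

Here the first document matches both query words, so it ranks ahead of the two documents that match only one.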
Xapian is a C++ search engine library with bindings in several languages, including Python. As the main project site says, it supports complex boolean queries, which is a necessity for handling complex search and indexing scenarios. Although Xapian is versatile and can be adapted to a wide range of usage levels, it lacks the support that most beginning developers need. It can also be relatively painful to set up, or to remove without leaving a trace.
Sphinx is another search engine written in C++. But unlike most other search engines, it’s a server, not a library, which makes it easier to interface with your project through an external API. It’s a highly scalable search engine (used by Craigslist). Unfortunately, there is very little support for Python users. There is a small wrapper called sphinxsearch on the Python Package Index, but you’d be left on your own trying to make use of the more advanced features of Sphinx.
Haystack is not a search engine, per se. Rather, it’s an API that allows you to integrate a wide range of search engines into your Django project. It supports Whoosh, Xapian, Solr, and Elasticsearch. Changing your search backend is as simple as changing a single line of code in your settings file. Plus, it provides an index creation API that will feel familiar to Django users. However, that simplicity comes at a cost. Haystack attempts to do the indexing for you so that if you decide to change your search backend at some point, you don’t have to worry about how the rest of your search-related code is implemented. Although this is nice, it limits what you can do, especially if you want to do some sort of aggregation-based search (which I’ll discuss in a future post). In those instances, you want direct access to the server, not a wrapper.
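As a rough illustration of that backend switch, here is a sketch of what the relevant part of a Django settings file might look like. The HAYSTACK_CONNECTIONS setting and the engine paths come from Haystack, but the exact engine path depends on your Haystack and Elasticsearch versions, so check them against your installation. The index path, URL, index name, and the two helper variables (WHOOSH_CONNECTION and ELASTICSEARCH_CONNECTION) are placeholders I made up for this example.

```python
# settings.py (fragment): a sketch with placeholder paths, URLs, and names.

# Option 1: Whoosh, which keeps its index in a local directory.
WHOOSH_CONNECTION = {
    "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
    "PATH": "/path/to/whoosh_index",  # placeholder directory
}

# Option 2: Elasticsearch, which runs as a separate server.
ELASTICSEARCH_CONNECTION = {
    "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
    "URL": "http://127.0.0.1:9200/",  # placeholder server address
    "INDEX_NAME": "haystack",  # placeholder index name
}

# Switching backends comes down to changing which option is used here.
HAYSTACK_CONNECTIONS = {"default": WHOOSH_CONNECTION}
```

In practice you would still need to rebuild the index after switching, since each backend stores its data differently.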
Elasticsearch is one of the most widely used search engines among online retailers and other large websites. It’s implemented in Java, highly scalable, and versatile. Because it’s a server that stores your data as JSON documents, much like a NoSQL database, you can use it for text search, search analytics, and faceted search. To use it in a Django project, you can rely on a thin wrapper called elasticsearch-dsl. Unlike most wrappers, the dsl offers little more than a means of expressing your queries as Python objects that translate into the JSON Elasticsearch expects. In theory, you could use Elasticsearch without any wrapper through its REST API, but you’d end up writing some sort of code to structure your queries, which would end up doing the same thing as the dsl. Therefore, Elasticsearch is ideal for developers who want to take advantage of the full suite of services and features offered by a mature and well-supported search backend.
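To give a feel for the query structuring mentioned above, here is a minimal sketch that builds a search request body by hand using only the standard library. The multi_match query type is part of Elasticsearch's query language, but the helper name build_search_body and the field names title and description are assumptions made up for this example. It is not how elasticsearch-dsl is implemented.

```python
import json


def build_search_body(text, fields, size=10):
    """Build a JSON-serializable body for Elasticsearch's _search endpoint.

    Uses a multi_match query, which runs the same text against several fields.
    """
    return {
        "size": size,  # maximum number of hits to return
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        },
    }


# "title" and "description" are hypothetical field names.
body = build_search_body("wireless headphones", ["title", "description"])
print(json.dumps(body, indent=2))
```

You would send this body to an index's _search endpoint over the REST API. Once you need filters, sorting, and aggregations, hand-built dictionaries like this get unwieldy quickly, which is the gap the dsl fills.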
Elasticsearch may not be the right choice for a small project, where it would be overkill. In addition, it’s resource intensive. In fact, it can cripple your machine if not configured properly. Nevertheless, I decided to use it for Fooreviews because of its versatility and the learning opportunity it presented. In the next few posts, I will talk about how to integrate it into a Django project. Along the way, I will show you how to overcome some of its shortcomings.