Document Indexes: Solr, Elasticsearch, and OpenSearch
Getting visitors to the right content and products for their needs is a key concern for any platform, and search plays a crucial but complicated role.
Introduction
I first encountered document indexes in 2012 while working at a startup that provided product reviews and ratings for e-commerce sites. We were dealing with a lot of data from thousands of online sellers, and as we scaled, the complexity of managing this data grew. This led me to explore Apache Solr, based on Apache Lucene.
A bit about search
Search is something we take for granted today, but it’s pretty complex. In an online store we expect to search not just for exact keywords, but for individual words from the title and the descriptions, and we expect the results to be smart and relevant.
There’s a lot we can do, both to our documents and to users’ queries, to make search results more useful (and more profitable).
Stemming and Tokenization: Imagine searching for “running” but also finding results for “run” or “runner.” Our search engine needs to break text down into words (tokens) and reduce them to their root forms (stems), for broader and more relevant results.
Weighting/Ranking: We may want documents where the query appears in the title to rank higher than those where it only appears in the description, or we may want to “boost” certain products.
Synonyms: There are many ways of saying the same thing, such as describing a couch as a sofa. The same applies when brand names become generic: a search for Thermos should rank products from the Thermos brand highly, but also treat “thermos” as a synonym for insulated flask.
Multilingual Search: Different languages have different rules, so the input needs analyzing appropriately. A stemmer designed for English will give terrible results if applied to French.
Fuzziness: Typos happen. Search engines need to handle a certain degree of fuzziness, understanding that “shirst” might be a misspelling of “shirt” and still returning relevant results.
All of the above, and more, just help us take input text and return matching documents.
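To make a few of these ideas concrete, here is a minimal sketch of tokenization, stemming, synonyms, and fuzziness over a tiny inverted index. It is a toy, not how Lucene works: the suffix list, the synonym table, the sample documents, and the 0.8 similarity cutoff are all assumptions chosen for illustration, and real engines use proper language-specific stemmers.

```python
import re
from difflib import get_close_matches

# Hypothetical synonym table mapping a term to the canonical term we index under.
SYNONYMS = {"sofa": "couch"}

# Naive suffix list, checked in order; real engines use proper stemmers.
SUFFIXES = ("ning", "ner", "ing", "er", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, stem, then map synonyms onto one canonical term."""
    stems = (stem(token) for token in tokenize(text))
    return [SYNONYMS.get(s, s) for s in stems]


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching every query term, tolerating typos."""
    results = None
    for term in analyze(query):
        if term not in index:
            # Fuzziness: fall back to the closest indexed term, if one is close enough.
            close = get_close_matches(term, list(index), n=1, cutoff=0.8)
            term = close[0] if close else term
        ids = index.get(term, set())
        results = ids if results is None else results & ids
    return results or set()


DOCS = {1: "Red cotton shirt", 2: "Trail runner shoes", 3: "Grey sofa"}
INDEX = build_index(DOCS)

print(search(INDEX, "running"))  # matches the "runner" document
print(search(INDEX, "shirst"))   # the typo still finds the shirt
print(search(INDEX, "couch"))    # the synonym finds the sofa
```

The key design point is that documents and queries go through the same `analyze` step, so both sides agree on what a term looks like before any matching happens.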
On top of this, our customers want to be able to filter, seeing only products in their price range or with a certain review count. Every time they change those filters, we need to update the results, the counts, and the other options available to them.
In short, displaying a good, filterable search interface is hard: it requires a lot of queries and a lot of setup.
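As a rough illustration of what a filter change involves, the sketch below filters an in-memory product list and then recomputes the counts shown next to each brand. The catalogue, the field names, and the `apply_filters` and `facet_counts` helpers are hypothetical; a search engine does the same job against an index rather than a Python list.

```python
from collections import Counter

# Hypothetical catalogue; field names are illustrative only.
PRODUCTS = [
    {"name": "Red cotton shirt", "brand": "Acme", "price": 20.0, "reviews": 12},
    {"name": "Blue linen shirt", "brand": "Acme", "price": 45.0, "reviews": 3},
    {"name": "Grey wool jumper", "brand": "Birch", "price": 60.0, "reviews": 40},
    {"name": "Green rain jacket", "brand": "Birch", "price": 35.0, "reviews": 25},
]


def apply_filters(products, max_price=None, min_reviews=None):
    """Keep only the products that satisfy every filter that is set."""
    kept = []
    for product in products:
        if max_price is not None and product["price"] > max_price:
            continue
        if min_reviews is not None and product["reviews"] < min_reviews:
            continue
        kept.append(product)
    return kept


def facet_counts(products, field):
    """Count how many of the given products fall under each value of a field."""
    return Counter(product[field] for product in products)


# Each time a filter changes, both the hits and the facet counts are recomputed.
hits = apply_filters(PRODUCTS, max_price=50.0, min_reviews=10)
print([product["name"] for product in hits])
print(facet_counts(hits, "brand"))
```

Even in this toy version, one user action triggers two computations (the result list and the counts), which is why a filterable interface ends up issuing so many queries.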
My first crush: Solr
My journey with document indexing began in 2012 at a SaaS startup grappling with massive product review data from thousands of online sellers. Relational databases, segregated by customer, made gaining a system-wide view difficult. Apache Solr, built on Apache Lucene, was a great solution: it enabled efficient ad-hoc queries, KPI generation, and facet pivoting, and made for excellent product search.
In 2013, I joined a leading e-commerce giant facing challenges with product exports for price comparison sites. Solr once again proved its worth. Using its “response writers” to transform query outputs, we slashed processing time and system load, generating product export CSVs in seconds rather than tens of minutes. Having the index also allowed us to massively overhaul the shop’s functionality, improving user experience with a huge array of different filters and search options, which had a noticeable effect on the bottom line. The ability to perform complex, flexible queries on a huge dataset with good performance was an absolute game changer.
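For readers who haven’t used response writers, here is a minimal sketch of how a CSV export request might be built. Solr selects a response writer with the `wt` parameter, and `wt=csv` asks for CSV output. The host, the `products` core name, the field names, and the `export_url` helper are assumptions for illustration, not the setup described above.

```python
from urllib.parse import urlencode

# Hypothetical Solr host and core name; adjust to your own deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def export_url(fields, filter_query=None, rows=1000):
    """Build a select URL asking Solr to render the results as CSV."""
    params = {
        "q": "*:*",              # match every document
        "fl": ",".join(fields),  # columns to include in the export
        "rows": rows,            # maximum number of rows to return
        "wt": "csv",             # pick the CSV response writer
    }
    if filter_query:
        params["fq"] = filter_query  # e.g. restrict the export to in-stock items
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(export_url(["id", "name", "price"], filter_query="in_stock:true"))
```

Because the index itself renders the CSV, the application no longer needs to load rows and serialize them, which is where most of the time in an export like this tends to go.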
Enter Elasticsearch
The following year, a new project presented itself: a distributed multilingual e-commerce platform with a focus on filterability. While I initially considered Solr, the project’s use of CouchDB (known for excellent master-master replication) drew me towards Elasticsearch. This rising star, also built on Lucene, offered a similar feature set to Solr with some distinct advantages.
Elasticsearch quickly became my go-to choice. Its user-friendliness, powerful stemming/tokenization/mapping capabilities, and multilingual search were impressive. The excellent documentation and an HTTP API for everything from sharding to data management made it incredibly flexible.
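To give a flavour of that HTTP API, the sketch below builds two JSON request bodies: one that creates an index with shard settings and per-language analyzers, and one that runs a boosted, typo-tolerant search with a price filter and brand facet counts. It assumes a recent Elasticsearch version (7.x or later, with typeless mappings); the field names and the helper functions are hypothetical, and nothing is actually sent over the network.

```python
import json


def index_definition(shards=3, replicas=1):
    """Body for creating an index (sent as PUT /<index-name>)."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {
                # One text field per language, each with a built-in language analyzer.
                "title_en": {"type": "text", "analyzer": "english"},
                "description_en": {"type": "text", "analyzer": "english"},
                "title_fr": {"type": "text", "analyzer": "french"},
                "brand": {"type": "keyword"},  # exact values, suited to filters and facets
                "price": {"type": "float"},
            }
        },
    }


def search_body(text, max_price):
    """Body for a search (sent as POST /<index-name>/_search)."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": text,
                        # Title matches count three times as much as description matches.
                        "fields": ["title_en^3", "description_en"],
                        "fuzziness": "AUTO",  # tolerate small typos
                    }
                },
                # Filters narrow the result set without affecting relevance scores.
                "filter": {"range": {"price": {"lte": max_price}}},
            }
        },
        # Facet counts per brand for the matching documents.
        "aggs": {"brands": {"terms": {"field": "brand"}}},
    }


print(json.dumps(index_definition(), indent=2))
print(json.dumps(search_body("shirst", 50), indent=2))
```

Everything here is plain JSON over HTTP, so the same bodies can be sent with any HTTP client; OpenSearch accepts the same general shape for these requests.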
A Word on OpenSearch
OpenSearch grew out of a licensing disagreement around Amazon’s managed Elasticsearch service: when Elastic moved Elasticsearch away from the Apache 2.0 license in 2021, AWS forked it as OpenSearch. OpenSearch retains most core Elasticsearch features, making it a strong alternative, particularly for AWS users seeking an open-source option.
Search as a service
While Solr and Elasticsearch/OpenSearch excel in self-hosted deployments, a range of “search-as-a-service” providers exist. One popular option I’ve worked with is Algolia, which offers a robust solution for many use cases.
Making the Right Choice
Assuming “search as a service” isn’t an option for you, both Solr and Elasticsearch/OpenSearch are excellent Lucene-based search engines, but cater to different needs. Here’s a breakdown to guide your decision:
Feature | Solr | Elasticsearch / OpenSearch |
---|---|---|
Dataset Size | Smaller datasets | Large datasets, highly scalable |
Ease of Use | More user-friendly interface | Requires more technical expertise |
Integrations | Integrates well with Apache ecosystem | Richer plugin and integration ecosystem |
Key Strength | Faceted browsing, response writers | Performance, complex queries, scalability |
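Licensing | Open source (Apache 2.0) | Elasticsearch: Elastic License or SSPL since 2021, with AGPL added as an option in 2024; OpenSearch: open source (Apache 2.0) |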
Ideal for | Smaller e-commerce sites, ease of use | Large e-commerce sites, complex queries, high scalability |
Cloud Adoption | Vendor-agnostic | Integrates seamlessly with AWS (OpenSearch) |
Conclusion
Over the past decade, Elasticsearch and OpenSearch have become my preferred tools for search, data discovery, and beyond, often used in conjunction with Kibana for visualizations (ELK Stack). While I’ve transitioned away from Solr in most cases, it remains a valuable solution for specific scenarios.
About James Babington
A cloud architect and engineer with a wealth of experience across AWS, web development, and security, James enjoys writing about the technical challenges and solutions he's encountered, but most of all he loves it when a plan comes together and it all just works.