Document Indexes: Solr, Elasticsearch, and OpenSearch

Introduction

I first encountered document indexes in 2012 while working at a startup that provided product reviews and ratings for e-commerce sites. We were dealing with a lot of data from thousands of online sellers, and as we scaled, the complexity of managing this data grew. This led me to explore Apache Solr, based on Apache Lucene.

A bit about search

Search is something we take for granted today, but it’s pretty complex. We expect to search an online store not just by exact keywords but by individual words from titles and descriptions, and we expect the results to be smart and relevant.
There’s lots we can do, to both our documents and our users’ queries, to increase the usefulness (and profitability) of the results.

Stemming and Tokenization: Imagine searching for “running” but also finding results for “run” or “runner.” Our search engine needs to split text into individual words (tokens) and reduce them to their root forms (stems), for broader and more relevant results.
Weighting/Ranking: We may want documents where the query appears in the title to rank higher than those where it appears only in the description, or we may want to “boost” certain products.
Synonyms: There are loads of ways of saying the same thing, such as describing a couch as a sofa. The same applies when brand names become generic: a search for Thermos should rank products from the Thermos brand highly, but also treat thermos as a synonym for insulated flask.
Multilingual Search: Different languages have different rules, so input needs to be analyzed appropriately. A stemmer designed for English will give terrible results if applied to French.
Fuzziness: Typos happen. Search engines need to handle a certain degree of fuzziness, understanding that “shirst” might be a misspelling of “shirt” and still returning relevant results.
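To make the ideas above concrete, here is a toy sketch in Python of an analysis pipeline feeding a tiny inverted index. It is not how Lucene does it: the suffix list, the synonym table, the sample documents, and the 0.8 similarity cutoff are all assumptions made up for illustration, and real stemmers are far more careful than this one.

```python
import difflib
import re

# Hypothetical synonym table; a real deployment would load a curated list.
SYNONYMS = {"couch": "sofa"}

# Deliberately naive suffix list, checked in order; for illustration only.
SUFFIXES = ("ning", "ing", "ner", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, stem, then map each term through the synonym table."""
    stems = [stem(token) for token in tokenize(text)]
    return [SYNONYMS.get(term, term) for term in stems]


def build_index(docs):
    """Build an inverted index: analyzed term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching any query term, tolerating typos."""
    results = set()
    for term in analyze(query):
        if term not in index:
            # Fuzzy fallback: pick the closest known term, if one is close enough.
            close = difflib.get_close_matches(term, list(index), n=1, cutoff=0.8)
            if not close:
                continue
            term = close[0]
        results |= index[term]
    return sorted(results)


# Hypothetical sample documents.
DOCS = {
    1: "Blue cotton shirt",
    2: "Running shoes for every runner",
    3: "Three-seat leather sofa",
}
INDEX = build_index(DOCS)

print(search(INDEX, "run"))     # [2]
print(search(INDEX, "shirst"))  # [1]
print(search(INDEX, "couch"))   # [3]
```

The key point is that the same analysis runs on both the documents and the query, which is why “run” finds a document that only ever said “running”.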

All of the above, and more, just help us take input text and return matching documents.
On top of this, our customers want to filter, seeing only products in their price range or with a certain review count. Every time they change those filters, we need to update the results, the counts, and the other options available to them.
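The filter-then-recount cycle described here can be sketched in a few lines of Python. The product list, field names, and helper functions are hypothetical, and a search engine would normally compute these counts as part of the query itself rather than in application code.

```python
from collections import Counter

# Hypothetical sample catalogue.
PRODUCTS = [
    {"name": "Blue shirt", "brand": "Acme", "price": 20, "reviews": 12},
    {"name": "Red shirt", "brand": "Acme", "price": 25, "reviews": 3},
    {"name": "Rain jacket", "brand": "Bolt", "price": 80, "reviews": 40},
    {"name": "Wool socks", "brand": "Bolt", "price": 8, "reviews": 25},
]


def filter_products(products, max_price=None, min_reviews=None):
    """Keep only products that satisfy every filter that was supplied."""
    kept = []
    for product in products:
        if max_price is not None and product["price"] > max_price:
            continue
        if min_reviews is not None and product["reviews"] < min_reviews:
            continue
        kept.append(product)
    return kept


def facet_counts(products, field):
    """Count how many of the given products fall under each value of a field."""
    return Counter(product[field] for product in products)


# Each time a filter changes, both the result list and the counts are recomputed.
cheap = filter_products(PRODUCTS, max_price=30)
print([p["name"] for p in cheap])    # ['Blue shirt', 'Red shirt', 'Wool socks']
print(facet_counts(cheap, "brand"))  # Counter({'Acme': 2, 'Bolt': 1})
```

Doing this over millions of documents, for several fields at once, on every click, is the part that an index makes cheap.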

In short, building a good, filterable search interface is hard: it requires a lot of queries and a lot of setup.

My first crush: Solr

My journey with document indexing began in 2012 at a SaaS startup grappling with massive product review data from thousands of online sellers. Relational databases, segregated by customer, made gaining a system-wide view difficult. Apache Solr, built on Apache Lucene, was a great solution: it enabled efficient ad-hoc queries, KPI generation, and facet pivoting, and it made for excellent product search.

In 2013, I joined a leading e-commerce giant facing challenges with product exports for price comparison sites. Solr once again proved its worth. Using its “response writers” to transform query outputs, we were able to slash processing time and system load, generating product export CSVs in seconds rather than tens of minutes. Having the index also allowed us to massively overhaul the shop’s functionality, improving user experience with a huge array of different filters and search options, which had a noticeable effect on the bottom line. The ability to perform complex, flexible queries on a huge dataset with good performance was an absolute game changer.
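As a rough illustration of the response writer idea, Solr selects the output format with the `wt` query parameter, and `wt=csv` asks for CSV. The sketch below only builds such a URL; the host, core name, field names, and helper function are hypothetical, and this is not necessarily how the export described above was wired up.

```python
from urllib.parse import urlencode


def solr_csv_export_url(base_url, core, query, fields, rows):
    """Build a Solr select URL that asks the CSV response writer for output."""
    params = {
        "q": query,              # which documents to export
        "fl": ",".join(fields),  # which fields become CSV columns
        "rows": rows,            # how many documents to return
        "wt": "csv",             # select the CSV response writer
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = solr_csv_export_url(
    "http://localhost:8983", "products", "in_stock:true", ["sku", "name", "price"], 1000
)
print(url)
```

Because the index already holds the flattened product data, the export becomes a single query rather than a batch job walking the relational database.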

Enter Elasticsearch

The following year, a new project presented itself: a distributed multilingual e-commerce platform with a focus on filterability. While I initially considered Solr, the project’s use of CouchDB (known for excellent master-master replication) drew me towards Elasticsearch. This rising star, also built on Lucene, offered a similar feature set to Solr with some distinct advantages.

Elasticsearch quickly became my go-to choice. Its user-friendliness, powerful stemming/tokenization/mapping capabilities, and multilingual search were impressive. The excellent documentation and an HTTP API for everything from sharding to data management made it incredibly flexible.
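To give a flavour of that JSON-over-HTTP style, here is a hedged sketch of two request bodies built as plain Python dictionaries: one to create an index with per-language analyzers, and one to search it with title boosting, fuzziness, a price filter, and brand counts. The index layout and field names are hypothetical, and the mapping syntax assumes a recent Elasticsearch or OpenSearch version (older releases used a different mapping format).

```python
import json


def index_body(shards=1, replicas=0):
    """Body for creating an index: shard settings plus per-language analyzers."""
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "properties": {
                # "english" and "french" are built-in language analyzers.
                "title_en": {"type": "text", "analyzer": "english"},
                "title_fr": {"type": "text", "analyzer": "french"},
                "description": {"type": "text"},
                "brand": {"type": "keyword"},
                "price": {"type": "float"},
            }
        },
    }


def search_body(text, max_price):
    """Body for a search: boosted, typo-tolerant text match, filter, and facet."""
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": text,
                        # "^3" weights title matches above description matches.
                        "fields": ["title_en^3", "title_fr^3", "description"],
                        "fuzziness": "AUTO",
                    }
                },
                "filter": [{"range": {"price": {"lte": max_price}}}],
            }
        },
        # Count matching products per brand, for the filter sidebar.
        "aggs": {"brands": {"terms": {"field": "brand"}}},
    }


print(json.dumps(search_body("shirst", 30), indent=2))
```

Assuming an index called `products`, the first body would be sent with `PUT /products` and the second with `POST /products/_search`; no client library is needed beyond something that can speak HTTP.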

A Word on OpenSearch

Amazon’s managed Elasticsearch service on AWS led to a licensing disagreement with Elastic, and in 2021 Amazon created OpenSearch, a fork of Elasticsearch released under the Apache 2.0 license. OpenSearch retains most core Elasticsearch features, making it a strong alternative, particularly for AWS users seeking an open-source option.

Search as a service

While Solr and Elasticsearch/OpenSearch excel in self-hosted deployments, a range of “search-as-a-service” providers exist. One popular option I’ve worked with is Algolia, which offers a robust solution for many use cases.

Making the Right Choice

Assuming “search as a service” isn’t an option for you, both Solr and Elasticsearch/OpenSearch are excellent Lucene-based search engines, but cater to different needs. Here’s a breakdown to guide your decision:

Feature	Solr	Elasticsearch / OpenSearch
Dataset Size	Smaller datasets	Large datasets, highly scalable
Ease of Use	More user-friendly interface	Requires more technical expertise
Integrations	Integrates well with Apache ecosystem	Richer plugin and integration ecosystem
Key Strength	Faceted browsing, response writers	Performance, complex queries, scalability
Licensing	Open source (Apache 2.0)	Elasticsearch: Elastic’s own license terms, which have changed over time; OpenSearch: open source (Apache 2.0)
Ideal for	Smaller e-commerce sites, ease of use	Large e-commerce sites, complex queries, high scalability
Cloud Adoption	Vendor-agnostic	Integrates seamlessly with AWS (OpenSearch)

Conclusion

Over the past decade, Elasticsearch and OpenSearch have become my preferred tools for search, data discovery, and beyond, often used in conjunction with Kibana for visualizations (as part of the ELK Stack). While I’ve transitioned away from Solr in most cases, it remains a valuable solution for specific scenarios.