Introduction
Apache Solr is a powerful, scalable, and flexible search platform. However, traditional keyword-based search often fails to deliver context-aware or semantically enriched results. This is where Natural Language Processing (NLP) comes in. In this blog, I’ll walk you through how I integrated NLP into Apache Solr to improve search relevance by leveraging spaCy for entity extraction and WordNet for synonym expansion.
This post outlines the entire process—right from data preprocessing to Solr schema enhancements and custom query pipelines. Whether you’re building enterprise search or intelligent assistants, these techniques can help you deliver smarter search experiences.
Why Integrate NLP with Solr?
By default, Solr performs lexical search, matching the literal tokens a user types. However, users often phrase queries in natural language or use synonyms and variations. Integrating NLP can help:
• Extract key entities from documents (using spaCy)
• Expand query terms with domain-specific or WordNet-based synonyms
• Normalize and clean text (lemmatization, stemming, stopword removal)
• Enable concept-based or semantic search
Architecture Overview
Here’s the high-level architecture of our NLP-enhanced Solr pipeline:
Raw Text (docs) --> NLP Preprocessing (spaCy) --> Tokenization, Lemmatization, Synonym Expansion (WordNet)
--> Enriched Documents --> Solr Indexing
User Query --> NLP Query Parser --> Synonym/Entity Expansion --> Solr Search
Tools Used
• Apache Solr 8.x/9.x
• spaCy (for entity recognition and lemmatization)
• NLTK/WordNet (for synonym expansion)
• Python (for preprocessing pipeline)
• Solr REST APIs (for indexing & querying)
Step-by-Step Implementation
Step 1: Set Up Solr
Install Solr and create a new core or collection (e.g., nlp_search):
bin/solr create -c nlp_search
Edit the schema to add custom fields. The NLP fields are marked multiValued because we'll index lists of lemmas, entities, and synonyms:
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="lemmas_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="entities_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="synonyms_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
Step 2: Preprocessing with spaCy
Install and load spaCy:
pip install spacy
python -m spacy download en_core_web_sm
Example Python code to extract lemmas and named entities:
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_nlp_features(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    entities = [ent.text for ent in doc.ents]
    return list(set(lemmas)), list(set(entities))
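A quick sanity check (exact output depends on the spaCy model version, and set() makes the order nondeterministic):
lemmas, entities = extract_nlp_features(
    "Apple announced a new iPhone with advanced camera features."
)
print(lemmas)    # e.g. ['announce', 'new', 'advanced', 'camera', 'feature', 'iPhone']
print(entities)  # e.g. ['Apple']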
Step 3: Synonym Expansion Using WordNet
Install NLTK, download the WordNet corpus, and extract synonyms:
pip install nltk
python -m nltk.downloader wordnet
from nltk.corpus import wordnet
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name().lower() != word.lower():
                synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)
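For example:
print(get_synonyms("announce"))
# e.g. ['declare', 'proclaim', 'herald', 'denote', 'annunciate'] (varies by WordNet version)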
Step 4: Indexing Enriched Data to Solr
Combine all the NLP features into one enriched document:
def enrich_text(text):
    lemmas, entities = extract_nlp_features(text)
    synonyms = []
    for lemma in lemmas:
        synonyms.extend(get_synonyms(lemma))
    return {
        "content": text,
        "lemmas_txt": list(set(lemmas)),
        "entities_txt": list(set(entities)),
        "synonyms_txt": list(set(synonyms))
    }
Use pysolr or plain REST API to index data:
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/nlp_search", always_commit=True)
data = enrich_text("Apple announced a new iPhone with advanced camera features.")
data["id"] = "doc-1"  # the default schema requires a unique "id" field
solr.add([data])
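If you’d rather use the plain REST API, the same document can be posted straight to the update handler. A sketch with requests (URL assumes the default local setup):
import requests

UPDATE_URL = "http://localhost:8983/solr/nlp_search/update?commit=true"

# Solr's JSON update handler accepts a list of documents.
resp = requests.post(UPDATE_URL, json=[data])
resp.raise_for_status()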
Sample enriched document:
{
  "id": "doc-1",
  "content": "Apple announced a new iPhone with advanced camera features.",
  "lemmas_txt": ["announce", "feature", "camera", "iphone"],
  "entities_txt": ["Apple"],
  "synonyms_txt": ["declare", "proclaim", "camera lens"]
}
Step 5: Enhancing Queries with NLP
Use the same preprocessing pipeline on incoming queries:
def preprocess_query(user_query):
    lemmas, entities = extract_nlp_features(user_query)
    synonyms = []
    for lemma in lemmas:
        synonyms.extend(get_synonyms(lemma))
    expanded_terms = lemmas + synonyms + entities
    return " ".join(set(expanded_terms))
Solr query example:
search_query = preprocess_query("latest iPhone camera")
results = solr.search(f"lemmas_txt:({search_query}) OR synonyms_txt:({search_query})")
This allows searching even if the document used terms like “announce” or “camera lens” but the user searched “launch” or “photography device.”
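One caveat: WordNet synonyms are often multi-word (“camera lens”), and raw user input can contain Lucene special characters. A safer variant quotes each expanded term so multi-word synonyms are searched as phrases (a sketch building on preprocess_query; build_query is my own name for it):
def build_query(user_query):
    lemmas, entities = extract_nlp_features(user_query)
    synonyms = [s for lemma in lemmas for s in get_synonyms(lemma)]
    # Quote every term: multi-word synonyms become phrase queries and
    # most Lucene special characters lose their meaning inside quotes.
    terms = " ".join(f'"{t}"' for t in set(lemmas + synonyms + entities))
    return f"lemmas_txt:({terms}) OR synonyms_txt:({terms})"

results = solr.search(build_query("latest iPhone camera"))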
Step 6: Optional — Custom Synonym File in Solr
You can configure a custom synonym filter in Solr:
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
Add domain-specific synonyms to synonyms.txt:
iphone, apple phone
camera, lens, shooter
launch, announce, release
Reload the core after making changes so the updated analyzer and synonym list take effect.
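Note that SynonymGraphFilterFactory only takes effect inside a field type’s analyzer chain, and graph synonym filters are usually applied at query time only. A sketch of such a field type (the text_syn name is illustrative):
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
To reload the core from the command line:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nlp_search"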
Benefits Achieved
• Context-aware Search: spaCy identifies people, organizations, dates, etc.
• Synonym Tolerance: WordNet expands user vocabulary automatically
• Better Recall: Users find relevant results even without exact term matches
• Scalable and Modular: NLP is decoupled from Solr, easy to evolve
Conclusion
Integrating NLP with Apache Solr elevates the search experience from literal matching to semantic understanding. By using spaCy for linguistic enrichment and WordNet for synonym expansion, we can build search solutions that understand the user’s intent and content meaning.
I’ve successfully used this architecture in multiple projects, including AI search, document Q&A, and enterprise search portals. You can further enhance it by integrating vector embeddings or LLMs, but this foundational integration forms a strong baseline for intelligent search.
Want Help with NLP + Solr?
Reach out if you’re building search applications, knowledge engines, or AI assistants. I offer consulting, POCs, and implementation guidance.