Introduction
Apache Solr is a powerful, scalable, and flexible search platform. However, traditional keyword-based search often fails to deliver context-aware or semantically enriched results. This is where Natural Language Processing (NLP) comes in. In this blog, I’ll walk you through how I integrated NLP into Apache Solr to improve search relevance by leveraging spaCy for entity extraction and WordNet for synonym expansion.
This post outlines the entire process—right from data preprocessing to Solr schema enhancements and custom query pipelines. Whether you’re building enterprise search or intelligent assistants, these techniques can help you deliver smarter search experiences.
Why Integrate NLP with Solr?
By default, Solr performs lexical search, matching the literal tokens a user types. However, users often phrase queries in natural language or use synonyms and variations. Integrating NLP can help:
• Extract key entities from documents (using spaCy)
• Expand query terms with domain-specific or WordNet-based synonyms
• Normalize and clean text (lemmatization, stemming, stopword removal)
• Enable concept-based or semantic search
Architecture Overview
Here’s the high-level architecture of our NLP-enhanced Solr pipeline:
Raw Text (docs) --> NLP Preprocessing (spaCy) --> Tokenization, Lemmatization, Synonym Expansion (WordNet)
--> Enriched Documents --> Solr Indexing
User Query --> NLP Query Parser --> Synonym/Entity Expansion --> Solr Search
Tools Used
• Apache Solr 8.x/9.x
• spaCy (for entity recognition and lemmatization)
• NLTK/WordNet (for synonym expansion)
• Python (for preprocessing pipeline)
• Solr REST APIs (for indexing & querying)
Step-by-Step Implementation
Step 1: Set Up Solr
Install Solr and create a new core or collection (e.g., nlp_search):
bin/solr create -c nlp_search
Edit the schema to add custom fields. The NLP fields are marked multiValued because we'll index lists of lemmas, entities, and synonyms:
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="lemmas_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="entities_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="synonyms_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
Step 2: Preprocessing with spaCy
Install and load spaCy:
pip install spacy
python -m spacy download en_core_web_sm
Example Python code to extract lemmas and named entities:
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_nlp_features(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    entities = [ent.text for ent in doc.ents]
    return list(set(lemmas)), list(set(entities))
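A quick sanity check (exact output depends on the spaCy model version, and set() makes the order nondeterministic):
lemmas, entities = extract_nlp_features(
    "Apple announced a new iPhone with advanced camera features."
)
print(lemmas)    # e.g. ['announce', 'new', 'advanced', 'camera', 'feature', 'iPhone']
print(entities)  # e.g. ['Apple']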
Step 3: Synonym Expansion Using WordNet
Install NLTK, download the WordNet corpus, and extract synonyms:
pip install nltk
python -m nltk.downloader wordnet
from nltk.corpus import wordnet
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.name().lower() != word.lower():
                synonyms.add(lemma.name().replace('_', ' '))
    return list(synonyms)
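For example:
print(get_synonyms("announce"))
# e.g. ['declare', 'proclaim', 'herald', 'denote', 'annunciate'] (varies by WordNet version)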
Step 4: Indexing Enriched Data to Solr
Combine all the NLP features into one enriched document:
def enrich_text(text):
    lemmas, entities = extract_nlp_features(text)
    synonyms = []
    for lemma in lemmas:
        synonyms.extend(get_synonyms(lemma))
    return {
        "content": text,
        "lemmas_txt": list(set(lemmas)),
        "entities_txt": list(set(entities)),
        "synonyms_txt": list(set(synonyms))
    }
Use pysolr or plain REST API to index data:
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/nlp_search", always_commit=True)
data = enrich_text("Apple announced a new iPhone with advanced camera features.")
data["id"] = "doc-1"  # the default schema requires a unique "id" field
solr.add([data])
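If you’d rather use the plain REST API, the same document can be posted straight to the update handler. A sketch with requests (URL assumes the default local setup):
import requests

UPDATE_URL = "http://localhost:8983/solr/nlp_search/update?commit=true"

# Solr's JSON update handler accepts a list of documents.
resp = requests.post(UPDATE_URL, json=[data])
resp.raise_for_status()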
Sample enriched document:
{
  "id": "doc-1",
  "content": "Apple announced a new iPhone with advanced camera features.",
  "lemmas_txt": ["announce", "feature", "camera", "iphone"],
  "entities_txt": ["Apple"],
  "synonyms_txt": ["declare", "proclaim", "camera lens"]
}
Step 5: Enhancing Queries with NLP
Use the same preprocessing pipeline on incoming queries:
def preprocess_query(user_query):
    lemmas, entities = extract_nlp_features(user_query)
    synonyms = []
    for lemma in lemmas:
        synonyms.extend(get_synonyms(lemma))
    expanded_terms = lemmas + synonyms + entities
    return " ".join(set(expanded_terms))
Solr query example:
search_query = preprocess_query("latest iPhone camera")
results = solr.search(f"lemmas_txt:({search_query}) OR synonyms_txt:({search_query})")
This allows searching even if the document used terms like “announce” or “camera lens” but the user searched “launch” or “photography device.”
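One caveat: WordNet synonyms are often multi-word (“camera lens”), and raw user input can contain Lucene special characters. A safer variant quotes each expanded term so multi-word synonyms are searched as phrases (a sketch building on preprocess_query; build_query is my own name for it):
def build_query(user_query):
    lemmas, entities = extract_nlp_features(user_query)
    synonyms = [s for lemma in lemmas for s in get_synonyms(lemma)]
    # Quote every term: multi-word synonyms become phrase queries and
    # most Lucene special characters lose their meaning inside quotes.
    terms = " ".join(f'"{t}"' for t in set(lemmas + synonyms + entities))
    return f"lemmas_txt:({terms}) OR synonyms_txt:({terms})"

results = solr.search(build_query("latest iPhone camera"))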
Step 6: Optional — Custom Synonym File in Solr
You can configure a custom synonym filter in Solr:
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
Add domain-specific synonyms to synonyms.txt:
iphone, apple phone
camera, lens, shooter
launch, announce, release
Reload the core after making changes so the updated analyzer and synonym list take effect.
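Note that SynonymGraphFilterFactory only takes effect inside a field type’s analyzer chain, and graph synonym filters are usually applied at query time only. A sketch of such a field type (the text_syn name is illustrative):
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
To reload the core from the command line:
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nlp_search"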
Benefits Achieved
• Context-aware Search: spaCy identifies people, organizations, dates, etc.
• Synonym Tolerance: WordNet expands user vocabulary automatically
• Better Recall: Users find relevant results even without exact term matches
• Scalable and Modular: NLP is decoupled from Solr, easy to evolve
Conclusion
Integrating NLP with Apache Solr elevates the search experience from literal matching to semantic understanding. By using spaCy for linguistic enrichment and WordNet for synonym expansion, we can build search solutions that understand the user’s intent and content meaning.
I’ve successfully used this architecture in multiple projects, including AI search, document Q&A, and enterprise search portals. You can further enhance it by integrating vector embeddings or LLMs, but this foundational integration forms a strong baseline for intelligent search.
Want Help with NLP + Solr?
Reach out if you’re building search applications, knowledge engines, or AI assistants. I offer consulting, POCs, and implementation guidance.