ElasticSearch Engine December 18, 2023

Elasticsearch Analyzers: A Guide to Field-Level Optimization and Indexing Strategies

Written by Mahipalsinh Rana


Elasticsearch analyzers are a fundamental aspect of text processing, shaping how data is indexed and searched within the system. In addition to the default analyzer, Elasticsearch offers a range of specialized analyzers tailored to specific needs. In this blog post, we will delve into analyzers such as Keyword, Language, Pattern, Simple, Standard, Stop, and Whitespace. Understanding when to use each analyzer will empower you to optimize your Elasticsearch setup for diverse scenarios.

What are Elasticsearch Analyzers?

Elasticsearch analyzers are a critical component of the Elasticsearch search engine, and they are meant to process and index text data for speedy and accurate search operations. Character filters, tokenizers, and token filters are the three primary components of an analyzer.

Tokenizers separate the text into individual tokens, while token filters change or filter these tokens. Elasticsearch can handle activities like stemming (reducing words to their root form), lowercasing, and deleting stop words using analyzers to improve the quality of search results.

Elasticsearch comes with default analyzers for a variety of languages, and users may also develop custom analyzers to meet specific indexing and search needs. Configuring analyzers well is critical for optimizing search functionality in Elasticsearch and increasing the relevancy of search results.
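As a concrete illustration of how the three components fit together, the sketch below (with hypothetical index and analyzer names) defines a custom analyzer that strips HTML markup with a character filter, splits text with the standard tokenizer, and then lowercases tokens and removes English stop words with token filters:

PUT /my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}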


What are the key features of Elasticsearch Analyzers?

  • Tokenization: Elasticsearch analyzers break text down into tokens, the smallest meaningful units. This process is essential for efficient search operations.
  • Character Filtering: Character filters in analyzers preprocess input text by applying transformations or substitutions to characters before tokenization, allowing for cleaner and standardized data.
  • Token Filtering: After tokenization, analyzers employ token filters to modify or filter tokens. This step includes actions like stemming, lowercasing, and removing stop words to improve the relevance of search results.
  • Multilingual Support: Elasticsearch analyzers are designed to handle diverse language datasets, providing support for multilingual text analysis and indexing.
  • Default Analyzers: Elasticsearch ships with default analyzers for various languages, offering convenient out-of-the-box solutions for common scenarios.
  • Index-Time and Query-Time Analysis: Analyzers operate at both index time and query time, which gives you flexibility in how text is processed during indexing and during user searches (see the mapping sketch after this list).
  • Stemming: Analyzers support stemming, the process of reducing words to their root form, which broadens search results by capturing variations of a word.
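As a sketch of the index-time/query-time split mentioned above, a text field can name one analyzer for indexing and a different one for queries via the search_analyzer mapping parameter (the field name and the particular analyzer pairing here are purely illustrative):

"mappings": {
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "standard",
      "search_analyzer": "simple"
    }
  }
}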

 

The sections that follow demonstrate each analyzer with practical code examples, showing how analyzers drive efficient indexing and powerful search in Elasticsearch. Mastering them will help you refine both your queries and your overall indexing strategy.

1. Simple analyzer

The simple analyzer breaks text into tokens at any non-letter character, such as numbers, spaces, hyphens, and apostrophes, discards non-letter characters, and changes uppercase to lowercase.

Internally, the simple analyzer consists of a single component: the lowercase tokenizer.
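Because it is built from just that one tokenizer, the simple analyzer can be rebuilt as a custom analyzer when you need a base to extend (a minimal sketch; the index and analyzer names are hypothetical):

PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}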

Example

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Use Case: Basic Tokenization

Scenario: In situations where a simple tokenization approach is sufficient, such as when dealing with less structured or informal text, the simple analyzer provides a straightforward solution without extensive filtering.

Mapping:

"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "simple"
    }
  }
}

2. Standard analyzer

The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

Example 

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Use Case: Common English Words Inclusion

Scenario: Use the standard analyzer when you want general-purpose, grammar-based tokenization with lowercasing while keeping common words (stop words are not removed by default).

Mapping:

"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "standard"
    }
  }
}
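With this mapping, a match query is run through the standard analyzer at search time as well, so a differently cased query term still finds the original text (a minimal sketch; the index name and query text are hypothetical):

GET /my_index/_search
{
  "query": {
    "match": {
      "text_field": "QUICK foxes"
    }
  }
}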

3. Keyword analyzer

The keyword analyzer is a “noop” analyzer that returns the entire input string as a single token.

Example  

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

Use Case: Exact Match Searches

Scenario: You have identifiers like product codes, document IDs, or tags that should not be tokenized. The keyword analyzer is suitable for scenarios where you need to search for exact matches without breaking down the input into individual words.

Mapping:

"mappings": {
  "properties": {
    "keyword_field": {
      "type": "text",
      "analyzer": "keyword"
    }
  }
}

Note: a field of type keyword is stored verbatim and does not accept an analyzer, so to apply the keyword analyzer explicitly the field is mapped as text here; mapping the field as type keyword (with no analyzer) achieves the same single-token behavior for identifiers.
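Because the whole value is indexed as a single token, an exact match can be looked up with a term query, which is not analyzed and must therefore contain the value exactly as stored (a minimal sketch; the index name and value are hypothetical):

GET /products/_search
{
  "query": {
    "term": {
      "keyword_field": "SKU-12345-BLUE"
    }
  }
}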

4. Whitespace analyzer 

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Use Case: Maintain Text Structure

Scenario: Your data has distinct terms separated by whitespace, and you want to preserve this structure. The whitespace analyzer tokenizes the input based on whitespace characters, allowing you to index and search for terms as they appear in the original text.

Mapping:

"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "whitespace"
    }
  }
}

5. Pattern analyzer

The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. The regular expression defaults to \W+ (or all non-word characters).

Example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Use Case: Custom Text Formats

Scenario: You have structured data with specific patterns or custom text formats that need specialized parsing. The pattern analyzer allows you to define regular expressions for tokenization, making it suitable for scenarios where a predefined structure exists. Examples: emails, phone numbers, dates, etc.

Mapping:

A custom pattern cannot be set directly on a field in the mapping; instead, define a pattern analyzer in the index settings and reference it from the field (the index and analyzer names below are illustrative). This example tokenizes on commas with optional surrounding spaces:

PUT /pattern_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": "\\s*,\\s*"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "custom_field": {
        "type": "text",
        "analyzer": "comma_analyzer"
      }
    }
  }
}
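The custom analyzer can be verified with the _analyze API; using the hypothetical names above, a comma-separated string should come back as the tokens [ red, green, blue ]:

POST /pattern_example/_analyze
{
  "analyzer": "comma_analyzer",
  "text": "red, green ,blue"
}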

6. Stop analyzer 

The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop word list. Common stop words in English include is, on, the, a, an, etc.

Example 

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens generated

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Use Case: Stop Word Removal with Simple Tokenization

Scenario: You want simple, lowercased tokenization but need to exclude common stop words. The stop analyzer behaves like the simple analyzer while filtering out frequently occurring words that add little value to search results.

Mapping:

"mappings": {
  "properties": {
    "text_field": {
      "type": "text",
      "analyzer": "stop"
    }
  }
}
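If the default _english_ list is not what you need, the stop word list can be customized by defining an analyzer of type stop in the index settings (a minimal sketch; the index, analyzer name, and word list are hypothetical):

PUT /stop_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over", "a", "an"]
        }
      }
    }
  }
}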

7. Language analyzer

Language analyzers are tailored to specific languages (e.g., English, Spanish, French) and incorporate language-specific tokenization, stop words, and stemming rules for more accurate, context-aware indexing.

Example: adding a custom Bengali analyzer

PUT /bengali_example
{
  "settings": {
    "analysis": {
      "filter": {
        "bengali_stop": {
          "type": "stop",
          "stopwords": "_bengali_"
        },
        "bengali_keywords": {
          "type": "keyword_marker",
          "keywords": ["উদাহরণ"]
        },
        "bengali_stemmer": {
          "type": "stemmer",
          "language": "bengali"
        }
      },
      "analyzer": {
        "rebuilt_bengali": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "bengali_keywords",
            "indic_normalization",
            "bengali_normalization",
            "bengali_stop",
            "bengali_stemmer"
          ]
        }
      }
    }
  }
}

With this analyzer, you can analyze Bengali text using Bengali stop words and stemming.
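You can exercise the analyzer with the _analyze API against the index defined above (a minimal sketch; the sample text simply reuses the keyword from the configuration):

POST /bengali_example/_analyze
{
  "analyzer": "rebuilt_bengali",
  "text": "উদাহরণ"
}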

Use Case: Multilingual Content

Scenario: Your dataset includes documents in different languages. By using language-specific analyzers (e.g., English, Spanish, French), you can account for language-specific tokenization and stemming, improving the accuracy of search results in diverse linguistic contexts.

Conclusion 

Elasticsearch provides a rich set of analyzers catering to various use cases. Whether dealing with multilingual content, structured data, or specific tokenization needs, selecting the right analyzer is key to achieving efficient and accurate search results. By understanding the nuances of analyzers like Keyword, Language, Pattern, Simple, Standard, Stop, and Whitespace, you can fine-tune your Elasticsearch setup for optimal performance and relevance in diverse scenarios. Partnering with experts in ElasticSearch Consulting and Development Services can further amplify your Elasticsearch capabilities for tailored and effective solutions.
