OpenSearch plugins
Managed Service for OpenSearch has a number of pre-installed plugins.
Additional pre-installed plugins
- repository-s3: Adds support for AWS S3 as a repository for snapshots.
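As an illustration, a snapshot repository backed by an S3 bucket is registered through the snapshot API. This is a minimal sketch: the host, credentials, repository name, and bucket below are placeholders, not values from this documentation.

```
# Minimal sketch: registering an S3 bucket as a snapshot repository.
# Host, credentials, repository name, and bucket are hypothetical placeholders.
curl --user <user>:<password> \
  --request PUT "https://<opensearch-host>:9200/_snapshot/example_s3_repo" \
  --header "Content-Type: application/json" \
  --data '{
    "type": "s3",
    "settings": {
      "bucket": "example-bucket",
      "base_path": "snapshots"
    }
  }'
```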
Additional OpenSearch plugins
All supported additional plugins are listed here:
- analysis-icu: Adds the Lucene ICU module with extended Unicode support using the ICU libraries. It improves analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration. See the _analyze example after this list.
- analysis-kuromoji: Adds the Lucene kuromoji analysis module for the Japanese language.
- analysis-nori: Adds the Lucene nori analysis module for the Korean language. Uses the mecab-ko-dic dictionary.
- analysis-phonetic: Provides token filters that convert expressions to their phonetic representation using Soundex, Metaphone, and other algorithms.
- analysis-smartcn: Adds the Lucene Smart Chinese analysis module for Chinese or mixed Chinese-English text.
- analysis-stempel: Adds the Lucene Stempel analysis module for the Polish language.
- analysis-ukrainian: Adds the Lucene UkrainianMorfologikAnalyzer analysis module, which provides stemming for the Ukrainian language. Uses the Morfologik project.
- ingest-attachment: Extracts file attachments in common formats (such as PPT, XLS, and PDF) using the Apache Tika™ text extraction library. See the ingest pipeline example after this list.
- mapper-annotated-text: Indexes plain text combined with special markup. Such text is typically used to identify items of interest, e.g., people or organizations.
- mapper-murmur3: Computes hashes of field values at index time and stores them in the index. See the mapping example after this list.
- mapper-size: Provides the _size metadata field that indexes the size of the original _source field in bytes. See the mapping example after this list.
- repository-azure: Adds support for Azure Blob Storage as a repository for snapshots.
- repository-gcs: Adds support for Google Cloud Storage as a repository for snapshots.
- repository-hdfs: Adds support for the HDFS file system as a repository for snapshots.
- transport-nio: A server/client non-blocking networking library built with Netty. This plugin only supports OpenSearch 2.x.
- yandex-lemmer: Adds the yandex_lemmer filter that improves text search in Russian across OpenSearch documents. For more information, see this OpenSearch article.
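For example, the tokenizer and token filters that come with the analysis-icu plugin can be tried directly with the _analyze API. The icu_tokenizer and icu_folding names belong to the plugin; the host and credentials below are placeholders.

```
# Minimal sketch: Unicode-aware tokenization and case/diacritic folding
# from the analysis-icu plugin. Host and credentials are placeholders.
curl --user <user>:<password> \
  --request POST "https://<opensearch-host>:9200/_analyze" \
  --header "Content-Type: application/json" \
  --data '{
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_folding"],
    "text": "Büro Café 東京"
  }'
```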
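The ingest-attachment plugin is typically used through an ingest pipeline with the attachment processor, which reads a base64-encoded file from a document field and extracts its text and metadata. In this sketch the pipeline name and the field name are hypothetical.

```
# Minimal sketch: an ingest pipeline using the attachment processor from
# the ingest-attachment plugin. Pipeline and field names are hypothetical.
curl --user <user>:<password> \
  --request PUT "https://<opensearch-host>:9200/_ingest/pipeline/example_attachment" \
  --header "Content-Type: application/json" \
  --data '{
    "description": "Extract text from base64-encoded documents",
    "processors": [
      { "attachment": { "field": "data" } }
    ]
  }'
```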
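The mapper-murmur3 and mapper-size plugins are enabled per index through mappings. Below is a sketch of an index that stores a murmur3 hash of a text field and enables the _size metadata field; the index and field names are hypothetical.

```
# Minimal sketch: mappings that use the murmur3 field type (mapper-murmur3)
# and the _size metadata field (mapper-size). Names are hypothetical.
curl --user <user>:<password> \
  --request PUT "https://<opensearch-host>:9200/example-index" \
  --header "Content-Type: application/json" \
  --data '{
    "mappings": {
      "_size": { "enabled": true },
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "hash": { "type": "murmur3" }
          }
        }
      }
    }
  }'
```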
Why use the yandex-lemmer plugin
The yandex-lemmer filter analyzes a search query in Russian in more detail and makes search more effective. Without this filter, you may get irrelevant results in the following cases:
- The forms of the same word vary widely. For example, a search for pyos (Russian for dog) will not find a document with psa (the genitive case of pyos). Similarly, a search for idti (Russian for go, walk, etc.) will fail to retrieve a document with shli (a past tense form of idti).
- Words with different meanings have the same or similar spelling. As a result, the search may retrieve irrelevant documents that do not match your query. For example, a search for alaya (Russian for scarlet) will find documents that contain both the adjective (alaya) and the noun (the genitive case of Alai, which is the name of a river, e.g., na beregu Alaya, or on the bank of the Alai in Russian).
- There are typos in the text which are not immediately obvious. For example, in Cловарь (slovar, which is Russian for dictionary), the first letter is a Latin C. If you enter the whole word in Cyrillic in your query, the search will not find the document with the misspelled word.
How search with yandex-lemmer works:
- OpenSearch parses the query text with the Apache Lucene search library. During parsing, the tokenizer breaks the text down into individual tokens. A token usually consists of a word and its metadata. For example, a tokenizer can divide a sentence such as Odin v pole ne voin (a Russian proverb meaning One man in the field is no warrior) into tokens based on the spaces between words: Odin, v, pole, ne, voin.
- A sequence of tokens goes through a chain of filters. Each filter receives a token, analyzes it, and returns one or more tokens. For example, a filter converting text to lower case returns the same number of tokens as it received in the input: odin, v, pole, ne, voin. A token may also be removed from the output, e.g., after applying a stop word filter. See the _analyze sketch after this list.
- When the yandex_lemmer filter gets a token, it searches for the root form of the word. If it finds one or more root forms (e.g., for homonyms or homographs), the filter first returns the initial token and then the tokens with the word's root forms.

  Example of output tokens for idut dozhdi (it's been raining):

  ```
  {
    "tokens": [
      { "token": "idut",   "start_offset": 0, "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
      { "token": "idti",   "start_offset": 0, "end_offset": 4,  "type": "<ALPHANUM>", "position": 0 },
      { "token": "dozhdi", "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 },
      { "token": "dozhd",  "start_offset": 5, "end_offset": 10, "type": "<ALPHANUM>", "position": 1 }
    ]
  }
  ```

  In the example above, the search will find phrases with any form of idti (e.g., shyol, shli, or idut) and any form of dozhd (e.g., dozhdi or dozhdey).

  Each output token is assigned a rank indicating the word's position in the search phrase. If an input token has multiple associated tokens in the output, all of them will have the same rank. This helps to determine the distance between the words in a search query, e.g., if the query is enclosed in quotes. Thus, a search for "idut dozhdi" (it's been raining) will retrieve a document with idyot dozhd (it's raining), but not the one with idyot silny dozhd (it's raining hard).