Enterprise Search Engines

TODO

The Complete Guide to the ELK Stack https://logz.io/learn/complete-guide-elk-stack/

Enterprise Search Engines Concepts

Analogy for easy understanding

Enterprise Search Engines    Relational Database
Index                        Database
Document                     Row
Field                        Column
OpenSearch/Elasticsearch Terminology Description
Cluster A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes.
A cluster is identified by a unique name, which by default is ‘opensearch’ for OpenSearch clusters and ‘elasticsearch’ for Elasticsearch clusters.
The cluster name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster name in different environments; otherwise you might end up with nodes joining the wrong cluster.
For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
An OpenSearch cluster is one or more OpenSearch nodes with the same cluster identification.
Node A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities.
Just like a cluster, a node is identified by a name.
You can define any node name you want if you do not want the default.
This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your OpenSearch/Elasticsearch cluster.
An OpenSearch node is a single OpenSearch process, and the minimum number of nodes for a highly available OpenSearch cluster is three.
Index An index is a collection of documents that have somewhat similar characteristics.
In a single cluster, you can define as many indexes as you want.
An index is an equivalent of a relational database.
An OpenSearch index is a collection of documents in OpenSearch. Each index is split into shards.
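Type Type is the OpenSearch/Elasticsearch meta object where the mapping for an index is stored. Mapping types were deprecated in Elasticsearch 6.x and removed in later versions, so indexes in current Elasticsearch and OpenSearch releases hold a single mapping.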
Alias Alias is a reference to an OpenSearch/Elasticsearch index. An alias can be mapped to more than one index.
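As a rough sketch of how one alias can cover several indexes, the request body for the `_aliases` endpoint could be assembled as follows. The index names `logs-2024-01` and `logs-2024-02`, the alias `logs`, and the helper `build_alias_actions` are made up for illustration; the code only builds the JSON body and does not contact a cluster.

```python
import json


def build_alias_actions(alias, indexes):
    """Build a body for POST /_aliases that points one alias at several indexes."""
    # Each "add" action attaches the alias to one index.
    return {"actions": [{"add": {"index": name, "alias": alias}} for name in indexes]}


# Hypothetical monthly log indexes sharing one alias.
alias_body = build_alias_actions("logs", ["logs-2024-01", "logs-2024-02"])
print(json.dumps(alias_body, indent=2))
```

A search sent to the alias would then be expected to run against both underlying indexes.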
Document A document is a basic unit of information that can be indexed.
This document is expressed in JSON format.
In the PeopleSoft search integration described in the Oracle references under Reading material, a Connected Query returns parent and child rows. The child information is attached to the parent row and sent as one document.
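To make the "one document" idea concrete, here is a minimal sketch of folding a parent row and its child rows into a single JSON document. The helper `to_document` and the field names (`order_id`, `customer`, `sku`, `qty`, `lines`) are hypothetical and stand in for whatever the real query returns.

```python
import json


def to_document(parent_row, child_rows, child_field="children"):
    """Attach child rows to a copy of the parent row so they travel as one document."""
    document = dict(parent_row)  # copy, so the input row is left untouched
    document[child_field] = [dict(child) for child in child_rows]
    return document


# Hypothetical rows as a relational query might return them.
order = {"order_id": 1001, "customer": "ACME"}
lines = [
    {"sku": "A-1", "qty": 2},
    {"sku": "B-7", "qty": 1},
]

doc = to_document(order, lines, child_field="lines")
print(json.dumps(doc, indent=2))
```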
Shard OpenSearch/Elasticsearch provides the ability to subdivide your index into multiple pieces called shards.
OpenSearch shards enable parallelization of data processing across both single and multiple OpenSearch nodes.
By default, OpenSearch automatically manages shard allocation across the nodes of the cluster.
Optimizing shards is an important component of improving OpenSearch performance.
When you create an index, you can simply define the number of shards that you want.
Each shard is in itself a fully-functional and independent ‘index’ that can be hosted on any node in the cluster.
Replica OpenSearch/Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
OpenSearch replicas serve as a backup for shards and also aid in search performance by providing additional capacity.
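By default, OpenSearch and Elasticsearch 7.0 or later create one primary shard and one replica for every index; Elasticsearch versions before 7.0 created five primary shards and one replica.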
You can add or remove replicas at any time to scale out query processing.
After the index is created, you can change the number of replicas dynamically at any time, but you cannot change the number of primary shards in place; that requires reindexing into a new index or using the shrink or split APIs.
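The shard and replica settings can be sketched as plain request bodies. The setting names `number_of_shards` and `number_of_replicas` are the real index settings; the helper functions and the chosen counts are illustrative assumptions, and nothing is sent to a cluster.

```python
import json


def index_settings(primary_shards, replicas):
    """Body for creating an index (PUT /<index>) with explicit shard counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def replica_update(replicas):
    """Body for PUT /<index>/_settings; the replica count can change on a live index."""
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primary_shards, replicas):
    """Each primary has `replicas` copies, so the total is primaries * (1 + replicas)."""
    return primary_shards * (1 + replicas)


create_body = index_settings(primary_shards=3, replicas=1)
print(json.dumps(create_body, indent=2))
print(total_shard_copies(3, 1))  # 6
print(json.dumps(replica_update(2)))
```

Three primaries with one replica each means six shard copies for the cluster to place, which is worth keeping in mind when sizing nodes.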
Port The default OpenSearch HTTP (REST API) port is 9200/tcp; node-to-node transport traffic uses 9300/tcp by default.
The HTTP port can be modified with the http.port setting in the configuration file, opensearch.yml.
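As an illustration of the settings mentioned so far, the sketch below renders a minimal `opensearch.yml` fragment. `cluster.name`, `node.name`, and `http.port` are real settings; the helper `render_config`, the node name `node-1`, and the port 9201 are assumptions chosen for the example.

```python
def render_config(cluster_name, node_name, http_port=9200):
    """Return the text of a minimal opensearch.yml fragment."""
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        f"http.port: {http_port}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical development node listening on a non-default HTTP port.
config_text = render_config("logging-dev", "node-1", http_port=9201)
print(config_text)
```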
Query OpenSearch queries are sub-divided into two categories: leaf queries and compound queries.
OpenSearch leaf queries search for specific values within one or more fields.
OpenSearch compound queries combine multiple queries together.
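The difference between leaf and compound queries can be seen in a small query DSL sketch. `match`, `term`, and `bool` are standard query types; the field names `message` and `level` and their values are invented for the example.

```python
import json

# Leaf queries: each looks for a value in a single field.
match_query = {"match": {"message": "timeout"}}  # full-text match on an analyzed field
term_query = {"term": {"level": "error"}}        # exact value match

# Compound query: a bool query that combines the two leaf queries above.
bool_query = {"bool": {"must": [match_query], "filter": [term_query]}}

search_body = {"query": bool_query}
print(json.dumps(search_body, indent=2))
```

Placing the `term` clause under `filter` rather than `must` means it restricts the results without contributing to the relevance score.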
Pagination OpenSearch pagination limits how many results a single search request returns.
By default, a search returns the top 10 hits.
The page size can be changed by adding a size parameter to the search request, and the starting offset by adding a from parameter.
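A small sketch of page arithmetic for search requests follows. It assumes the stock defaults of 10 hits per request and an `index.max_result_window` of 10,000; the helper `page_params` is hypothetical and only computes the `from` and `size` values.

```python
MAX_RESULT_WINDOW = 10_000  # assumed default of index.max_result_window


def page_params(page, page_size=10):
    """Translate a 1-based page number into `from` and `size` search parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    if offset + page_size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds the result window")
    return {"from": offset, "size": page_size}


print(page_params(1))      # {'from': 0, 'size': 10}
print(page_params(3, 25))  # {'from': 50, 'size': 25}
```

The returned dictionary can be merged into a search body next to the `query` key.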
Managed OpenSearch Managed OpenSearch provides 24/7 monitoring, support, and maintenance to maximize performance and uptime.
Managed OpenSearch is typically provided by a team of engineers who have extensive experience with OpenSearch management.
Hosted OpenSearch Hosted OpenSearch is a type of Managed OpenSearch where the service provider hosts their clients’ clusters in the service provider’s own environment.
Hosted OpenSearch can add latency and cost compared with running OpenSearch in the client’s own environment. It can also expose the client to additional security risk.

Reading material

  1. https://en.wikipedia.org/wiki/Apache_Lucene
  2. https://docs.oracle.com/cd/F44947_01/pt858pbr3/eng/pt/tpst/concept_ElasticsearchConceptsAndTerminology.html
  3. https://docs.oracle.com/cd/F88569_01/pt861pbr1/eng/pt/tpst/ConceptsAndTerminology.html
  4. https://dattell.com/data-architecture-blog/opensearch-terms-and-definitions/

Tags

  1. Elasticsearch and OpenSearch
  2. Enterprise Search Engines - helpful queries and commands
  3. Enterprise Search Engines Java Clients
  4. Switching from using Elasticsearch to OpenSearch
