Enterprise Search Engines - helpful queries and commands

The operations of Elasticsearch are available as REST APIs. The primary functions are:

  1. storing documents in an index,
  2. searching the index with powerful queries to fetch those documents, and
  3. running analytic functions on the data.

The commands for inspecting data in OpenSearch are covered in the Docker quickstart of the OpenSearch documentation: https://opensearch.org/docs/latest/

The examples below perform different Elasticsearch operations using the available REST APIs. These commands can be run from the Kibana console.

https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html

Searching data in Amazon OpenSearch Service: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/searching.html

With Basic Authentication

If you have turned on security in Elasticsearch, you need to supply the user and password with every curl command, as shown below:

curl -X GET 'http://localhost:9200/_cat/indices?v' -u elastic:(password)
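
As a rough illustration of what the -u flag does, the sketch below builds the same request in Python using only the standard library. The function names (basic_auth_header, authed_request, send) are made up for this note, and the URL and credentials are placeholders. Nothing is sent until send() is called against a running cluster.

```python
import base64
import urllib.request


def basic_auth_header(user, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def authed_request(url, user, password):
    """Build (but do not send) a GET request that carries the credentials."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", basic_auth_header(user, password))
    return request


def send(request):
    """Send the request and return the response text (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

For example, send(authed_request("http://localhost:9200/_cat/indices?v", "elastic", "(password)")) should correspond to the curl command above.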

Pretty Print

Add ?pretty=true to any search to pretty print the JSON. Like this:

curl -X GET 'http://localhost:9200/(index)/_search?pretty=true'

Create an index

https://opensearch.org/docs/latest/field-types/

Indexing (Insert data)

adding single document

The API for adding individual documents accepts the document as the request body.

PUT /messages/_doc/1
{
  "message": "The Sky is blue today"
}
POST /movies/_doc
{
  "name" : "The Godfather",
  "director" : "Francis Ford Coppola",
  "age" : 52,
  "year" : "1972"
}
POST /movies/_doc
{
  "name" : "Goodfellas",
  "director" : "Martin Scorsese",
  "age" : 33,
  "year" : "1990"
}

Using curl

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/1 -d '{
    "school" : "Harvard"
}'
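
The two request shapes used above (PUT with an explicit id, POST without one) can be sketched in Python as follows. This is a minimal sketch using only the standard library; index_request and send are hypothetical helper names, not part of any client library, and the base URL is assumed to be a local cluster without security.

```python
import json
import urllib.request


def index_request(base_url, index, document, doc_id=None):
    """Build a request that indexes one document.

    With doc_id:    PUT  /<index>/_doc/<id>  (create or overwrite that document)
    Without doc_id: POST /<index>/_doc       (the server generates the _id)
    """
    if doc_id is None:
        method, path = "POST", f"/{index}/_doc"
    else:
        method, path = "PUT", f"/{index}/_doc/{doc_id}"
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send the request and return the decoded JSON response (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling send(index_request("http://localhost:9200", "samples", {"school": "Harvard"}, doc_id=1)) should be equivalent to the curl command above.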

How does _id get generated for the inserted document?

If we don’t specify a value for the id when inserting (indexing) a document, a random value is generated and assigned to it.

// No .id(...) is set here, so the server generates the _id.
IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>()
        .index(indexName2)
        .document(myDocument)
        .build();

log.info("indexRequest: {}", javaToJson(indexRequest));

OpenSearchClient openSearchClient = getOpenSearchClient();
IndexResponse indexResponse = openSearchClient.index(indexRequest);
return indexResponse.id();

However, if your model has a field that is going to be unique, use it as the value for _id. This will make updates much easier to deal with.

// The unique field of the model is used as the _id.
IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>()
        .index(indexName2)
        .id(myDocument.getDocId())
        .document(myDocument)
        .build();

log.info("indexRequest: {}", javaToJson(indexRequest));

OpenSearchClient openSearchClient = getOpenSearchClient();
IndexResponse indexResponse = openSearchClient.index(indexRequest);
return indexResponse.id();

With this approach, updating a document is simple: retrieve the document, change the fields that need to change, and call the save operation again.

If we rely on the randomly generated _id, we have to account for it when doing updates. If we don’t pass the generated _id when updating the document, we will see duplicate documents in OpenSearch/Elasticsearch, because a request without an _id is always treated as a new document.

If no field of the model can serve as the id, we have to add a field to the model to hold the randomly generated id. After the insertion, we read the id from the indexResponse and save it in the model.

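public String insertDocument(MyDocument myDocument) throws IOException, NoSuchAlgorithmException, KeyStoreException, KeyManagementException {

    // First write: if docId is not set yet, the server generates the _id.
    IndexResponse indexResponse = getIndexResponse(myDocument);

    if (StringUtils.isEmpty(myDocument.getDocId())) {
        // Store the generated _id in the model and index again under that same _id,
        // so that the stored document also carries its own id.
        myDocument.setDocId(indexResponse.id());
        getIndexResponse(myDocument);
    }

    return indexResponse.id();
}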

private IndexResponse getIndexResponse(MyDocument myDocument) throws NoSuchAlgorithmException, KeyStoreException, KeyManagementException, IOException {
    IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>().index(indexName2)
            .id(myDocument.getDocId())
            .document(myDocument)
            .build();

    log.info("indexRequest: {}", javaToJson(indexRequest));

    OpenSearchClient openSearchClient = getOpenSearchClient();
    IndexResponse indexResponse = openSearchClient.index(indexRequest);
    return indexResponse;
}
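
The duplicate-document behaviour described above can be illustrated with a toy in-memory model. This is only a sketch: ToyIndex is a plain dictionary standing in for an index, not a real client, and save mirrors the insertDocument logic under the assumption that the model keeps its id in a docId field.

```python
import uuid


class ToyIndex:
    """Toy stand-in for an index: a dict keyed by _id. Not a real client."""

    def __init__(self):
        self.docs = {}

    def index(self, document, doc_id=None):
        # Like the real API, a missing id means "new document with a fresh id".
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        self.docs[doc_id] = dict(document)
        return doc_id


def save(index, document):
    """Mirror insertDocument: keep the generated id in the document itself."""
    doc_id = index.index(document, document.get("docId"))
    if not document.get("docId"):
        document["docId"] = doc_id
        index.index(document, doc_id)  # second write stores docId in the saved copy
    return doc_id
```

Indexing the same document twice through ToyIndex.index without an id leaves two entries in docs, while calling save twice leaves one entry that holds the latest field values.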

adding documents in bulk

For bulk addition, we need to supply a JSON document containing entries similar to the following snippet:

POST /_bulk
{"index":{"_index":"productindex"}}
{"_class":"..Product","name":"Corgi Toys .. Car",..."manufacturer":"Hornby"}
{"index":{"_index":"productindex"}}
{"_class":"..Product","name":"CLASSIC TOY .. BATTERY"...,"manufacturer":"ccf"}
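
The bulk body is newline-delimited JSON: one action line, then one document line, repeated, with a newline at the very end. A minimal sketch for generating such a body from a list of documents is shown below; build_bulk_body is a hypothetical helper name, and the documents in the usage note are made up.

```python
import json


def build_bulk_body(index, documents):
    """Return an NDJSON _bulk body: an action line followed by a document line."""
    lines = []
    for document in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(document))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Writing build_bulk_body("productindex", [{"name": "a"}, {"name": "b"}]) to a file gives something that can be passed to curl with --data-binary.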

bulk load data in JSON format

export pwd="elastic:"
curl --user $pwd  -H 'Content-Type: application/x-ndjson' -XPOST 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/0/_bulk?pretty' --data-binary @<file>

fetching

https://opensearch.org/docs/latest/search-plugins/

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

Return All Documents From An Index

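I saw scenarios where a document would not show up in the results of this query, especially immediately after the document was added (from a Java application). Two things can explain this. First, search is near real-time: a newly indexed document only becomes visible to searches after the next index refresh, which by default happens about once a second. Second, only the first 10 hits are returned by default, so the new document may simply not be among them. In those scenarios, retrying shortly afterwards or using a more specific query helps.

By default, a search returns only 10 results (this is also what the query workbench shows). To see the rest, set the from and size parameters in the request, or narrow the query with specific criteria.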

GET /wikimedia/_search
{
    "query": {
        "match_all": {}
    }
}
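
Since only 10 hits come back by default, paging through a match_all result needs explicit from and size values. The sketch below only builds the request bodies; match_all_page is a hypothetical helper name. Note that from + size is capped by the index.max_result_window setting, which defaults to 10,000.

```python
import json


def match_all_page(page, page_size=10):
    """Build a match_all search body for a zero-based page number."""
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {"match_all": {}},
    }


def as_console_body(body):
    """Pretty-print a body so it can be pasted into the console."""
    return json.dumps(body, indent=4)
```

For example, as_console_body(match_all_page(2, 50)) produces a body that asks for hits 100 to 149.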

list all docs in index using curl

curl -X GET "https://{YOUR_SERVER}:9200/{YOUR_INDEX}/_search" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'

curl -X GET 'http://localhost:9200/sample/_search'

Return Documents matching a certain criteria

We send a query of type match to fetch documents matching the string “blue sky”. Queries for searching documents can be specified in multiple ways. Elasticsearch provides a JSON-based Query DSL (Domain Specific Language) to define queries.

You can query using parameters on the URL. But you can also use JSON, as shown in the next example. JSON would be easier to read and debug when you have a complex query than one giant string of URL parameters.

Query DSL

GET /messages/_search
{
    "query": {
        "match": {
            "message": "blue sky"
        }
    }
}
GET /messages/_search
{
    "query": {
        "match": {
            "_id": "bbe13bfa-be84"
        }
    }
}
GET /messages/_search
{
    "query": {
        "match": {
            "fields.my_custom_field_on_the_object": "bbe13bfa_be84"
        }
    }
}
curl -XGET --header 'Content-Type: application/json' http://localhost:9200/samples/_search -d '{
    "query" : {
        "match" : {
            "school": "Harvard"
        }
    }
}'

With curl, the same search can also be passed in the URL using Lucene query syntax, q=school:Harvard.

curl -X GET 'http://localhost:9200/samples/_search?q=school:Harvard'
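
To compare the two styles side by side, the sketch below builds both the URL form and the JSON form of the search for school = Harvard. It is a minimal sketch with hypothetical helper names (uri_search_url, match_body). For a simple single-word value like this one, the two requests should find roughly the same documents.

```python
from urllib.parse import urlencode


def uri_search_url(base_url, index, field, value):
    """Build a URL search using the q= parameter (Lucene query syntax)."""
    query = urlencode({"q": f"{field}:{value}"})
    return f"{base_url}/{index}/_search?{query}"


def match_body(field, value):
    """Build the Query DSL body for a match query on one field."""
    return {"query": {"match": {field: value}}}
```

The URL form percent-encodes the colon, which the server decodes again. The JSON form needs no escaping and stays readable as the query grows.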

Return only certain fields

To return only certain fields put them into the _source array:

GET filebeat-7.6.2-2020.05.05-000001/_search
{
    "_source": ["suricata.eve.timestamp","source.geo.region_name","event.created"],
    "query":      {
        "match" : { "source.geo.country_iso_code": "GR" }
    }
}

Query by a date range

When the field is of type date you can use date math, like this:

GET filebeat-7.6.2-2020.05.05-000001/_search
{
    "query": {
        "range" : {
            "event.created": {
                "gte" : "now-7d/d"
            }
        }
    }
}
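
The expression now-7d/d means "the current time minus seven days, rounded down to the start of that day". The sketch below emulates that arithmetic in Python to make the resulting lower bound concrete. It assumes UTC (the default when no time_zone is given) and a timezone-aware input; now_minus_days_rounded is a hypothetical name, and the real evaluation happens on the server.

```python
from datetime import datetime, timedelta, timezone


def now_minus_days_rounded(now, days):
    """Emulate `now-<days>d/d`: subtract days, then round down to midnight UTC."""
    shifted = now.astimezone(timezone.utc) - timedelta(days=days)
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


# Example with a fixed "now" so the result is reproducible.
example_now = datetime(2020, 5, 12, 15, 30, tzinfo=timezone.utc)
lower_bound = now_minus_days_rounded(example_now, 7)
```

With the fixed time above, lower_bound is midnight UTC on 2020-05-05, so the range query would match documents created at or after that instant.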

Putting it all together to form a slightly complex query

GET /indexName2/_search
{
    "from": 0,
    "size": 10000,
    "_source": ["docName", "userName", "createdOn", "status"],
    "sort": [
        {
            "createdOn": {
                "order": "desc"
            }
        }
    ],
    "query": {
        "term": {
            "userName.keyword": {
                "value": "superman1"
            }
        }
    }
}
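
When a query like this is issued from application code, the body is usually assembled from parameters. The sketch below builds the same body in Python; user_documents_query is a hypothetical helper name. It assumes, as the query above does, that userName has a keyword subfield (userName.keyword), so the term query matches the exact, un-analyzed value.

```python
import json


def user_documents_query(user_name, fields, size=10000, offset=0):
    """Build a search body: paging, source filtering, sorting and an exact-match term query."""
    return {
        "from": offset,
        "size": size,
        "_source": list(fields),
        "sort": [{"createdOn": {"order": "desc"}}],
        "query": {"term": {"userName.keyword": {"value": user_name}}},
    }


body = user_documents_query("superman1", ["docName", "userName", "createdOn", "status"])
request_json = json.dumps(body, indent=4)
```

The resulting request_json can be pasted into the console under GET /indexName2/_search.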

delete index

Below the index is named samples.

curl -X DELETE 'http://localhost:9200/samples'

delete documents in an index

POST /name-of-the-index/_delete_by_query
{
    "query": {
        "match": {
            "_id": "bbe13bfa-be84"
        }
    }
}
POST /name-of-the-index/_delete_by_query
{
    "query": {
        "match": {
            "fields.my_custom_field_on_the_object": "bbe13bfa_be84"
        }
    }
}

list all indexes

GET _cat/indices

Using curl

curl -X GET 'http://localhost:9200/_cat/indices?v'

list index mapping

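By default, all Elasticsearch fields are indexed. Requesting the index itself returns its aliases, settings, and mappings; the mappings section lists all fields and their types. To get only the mappings, append /_mapping to the index name.

curl -X GET http://localhost:9200/samples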

update Doc

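Update a specific field of one document in the index. The last path segment must be the _id of the document to update (here, the value of my_custom_field_on_the_object, assuming that field was used as the _id when the document was indexed).

POST /name_of_the_index/_update/my_custom_field_on_the_object
{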
    "doc" : {
        "fields" : {
            "another_field_in_the_pojo": "a_new_updated_value"
        }
    }
}

After running this, query the index for the documents again to double-check if the value of the other field got updated.

GET /name_of_the_index/_search
{
    "query": {
        "match": {
            "fields.my_custom_field_on_the_object": "bbe13bfa_be84"
        }
    }
}

update by query

https://opensearch.org/docs/latest/api-reference/document-apis/update-by-query/

See https://github.com/explorer436/programming-playground/blob/main/java-playground/elasticsearch-examples/elasticsearch-java-api-client/src/main/java/com/example/elasticsearchjavaapiclient/connector/ClientConnector.java#L99

POST test-index1/_update_by_query
{
  "query": {
    "term": {
      "oldValue": 10
    }
  },
  "script" : {
    "source": "ctx._source.oldValue += params.newValue",
    "lang": "painless",
    "params" : {
      "newValue" : 20
    }
  }
}
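
To make the effect of the request above concrete, the sketch below emulates it on a list of plain dictionaries. This is ordinary Python, not Painless, and update_by_query here is a hypothetical local function. The term query selects documents whose field equals the given value, and the script adds the parameter to that field.

```python
def update_by_query(documents, field, match_value, increment):
    """Plain-Python emulation of the request above; returns how many docs changed."""
    updated = 0
    for source in documents:
        if source.get(field) == match_value:   # the "term" query
            source[field] += increment         # ctx._source.oldValue += params.newValue
            updated += 1
    return updated


docs = [{"oldValue": 10}, {"oldValue": 5}, {"oldValue": 10}]
changed = update_by_query(docs, "oldValue", 10, 20)
```

After this runs, the two documents that had oldValue 10 now hold 30, and the document with oldValue 5 is untouched.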

Add fields to an existing document

First we create a new one. Then we update it.

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2 -d ' {
    "school": "Clemson"
}'

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_update/2 -d '{
    "doc" : {
        "students": 50000
    }
}'
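
A partial update with "doc" merges the given fields into the stored document instead of replacing it. The sketch below emulates that merge locally to show the expected result. merge_partial is a hypothetical helper, written under the assumption that nested objects are merged recursively while other values (including arrays) are replaced.

```python
def merge_partial(source, partial):
    """Recursively merge `partial` into a copy of `source`."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)  # merge nested objects
        else:
            merged[key] = value  # add a new field or replace the old value
    return merged


stored = {"school": "Clemson"}
after_update = merge_partial(stored, {"students": 50000})
```

Here after_update contains both school and students, which is what a search on the samples index should show for document 2 after the two curl commands.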

backup index

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/_reindex -d '{
    "source": {
        "index": "samples"
    },
    "dest": {
        "index": "samples_backup"
    }
}'

show cluster health

curl --user $pwd  -H 'Content-Type: application/json' -XGET 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/_cluster/health?pretty'

Tags

  1. Enterprise Search Engines - Aggregations

Reading material

Painless scripting language

  1. https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-walkthrough.html

  2. https://opensearch.org/docs/latest/api-reference/script-apis/exec-script/

  3. https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html

