Enterprise Search Engines - helpful queries and commands
The operations of Elasticsearch are available as REST APIs. The primary functions are:
- storing documents in an index,
- searching the index with powerful queries to fetch those documents, and
- running analytic functions on the data.
The commands for looking at data in OpenSearch are described in the Docker quickstart: https://opensearch.org/docs/latest/
The examples below perform different Elasticsearch operations using the available REST APIs. These commands can be run from the Kibana console.
https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html
Searching data in Amazon OpenSearch Service: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/searching.html
With Basic Authentication
If you have turned on security in Elasticsearch, you need to supply the user and password with every curl command, as shown below:
curl -X GET 'http://localhost:9200/_cat/indices?v' -u elastic:(password)
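When calling the REST API from a script instead of curl, the -u option corresponds to an HTTP Basic Authorization header. A minimal Python sketch of that header follows; the function name basic_auth_header and the credentials are placeholders for illustration, not part of any client library.

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build the header that curl's -u user:password option sends."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; substitute your own.
headers = basic_auth_header("elastic", "changeme")
print(headers)
```

The resulting dictionary can be passed as request headers to whichever HTTP library you use.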
Pretty Print
Add ?pretty=true to any search to pretty-print the JSON. Keep the parameter inside the quotes so the shell does not interpret the question mark:
curl -X GET 'http://localhost:9200/(index)/_search?pretty=true'
Create an index
https://opensearch.org/docs/latest/field-types/
Indexing (Insert data)
adding single document
Documents can be added one at a time or in bulk.
The API for adding individual documents accepts a document as the request body.
PUT /messages/_doc/1
{
"message": "The Sky is blue today"
}
POST /movies/_doc
{
"name" : "The Godfather",
"director" : "Francis Ford Coppola",
"age" : 52,
"year" : "1972"
}
POST /movies/_doc
{
"name" : "Goodfellas",
"director" : "Martin Scorsese",
"age" : 33,
"year" : "1990"
}
Using curl
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/1 -d '{
"school" : "Harvard"
}'
How does _id get generated for the inserted document?
If we don’t specify a value for the _id when inserting (indexing) a document, a random value is generated and assigned to it.
// No .id(...) is set, so the server generates a random _id.
IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>()
        .index(indexName2)
        .document(myDocument)
        .build();
log.info("indexRequest: {}", javaToJson(indexRequest));
OpenSearchClient openSearchClient = getOpenSearchClient();
IndexResponse indexResponse = openSearchClient.index(indexRequest);
// The generated _id is available on the response.
return indexResponse.id();
However, if your model has a field that is going to be unique, use it as the value for _id. This will make updates much easier to deal with.
// Use the model's unique field as the _id.
IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>()
        .index(indexName2)
        .id(myDocument.getDocId())
        .document(myDocument)
        .build();
log.info("indexRequest: {}", javaToJson(indexRequest));
OpenSearchClient openSearchClient = getOpenSearchClient();
IndexResponse indexResponse = openSearchClient.index(indexRequest);
return indexResponse.id();
With this approach, to update the document all we have to do is retrieve it, change the fields that need to be changed, and call the save operation again.
If we want to use the randomly generated _id, we have to account for it while doing updates. If we don’t use the generated _id while updating the document, we will see duplicate docs in OpenSearch/Elasticsearch. The reason is that a request without an _id is always treated as a new document by OpenSearch/Elasticsearch.
If we cannot use a field from the model as the id, we have to add a field to the model to hold the randomly generated id. After the insertion, we need to grab the id from the indexResponse and save it in the model.
public String insertDocument(MyDocument myDocument) throws IOException, NoSuchAlgorithmException, KeyStoreException, KeyManagementException {
    // First write: if docId is empty, the server generates the _id.
    IndexResponse indexResponse = getIndexResponse(myDocument);
    if (StringUtils.isEmpty(myDocument.getDocId())) {
        // Store the generated _id in the model and write again so the
        // stored document carries its own id.
        myDocument.setDocId(indexResponse.id());
        getIndexResponse(myDocument);
    }
    return indexResponse.id();
}

private IndexResponse getIndexResponse(MyDocument myDocument) throws NoSuchAlgorithmException, KeyStoreException, KeyManagementException, IOException {
    IndexRequest<MyDocument> indexRequest = new IndexRequest.Builder<MyDocument>()
            .index(indexName2)
            .id(myDocument.getDocId())
            .document(myDocument)
            .build();
    log.info("indexRequest: {}", javaToJson(indexRequest));
    OpenSearchClient openSearchClient = getOpenSearchClient();
    return openSearchClient.index(indexRequest);
}
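The duplicate-document behaviour described above can be reproduced without a cluster. The sketch below is a toy in-memory model in Python, not the real client: ToyIndex and its index method are hypothetical names, and uuid stands in for the server's id generation.

```python
import uuid
from typing import Optional


class ToyIndex:
    """In-memory stand-in for an index: a dict keyed by _id."""

    def __init__(self):
        self.docs = {}

    def index(self, doc: dict, doc_id: Optional[str] = None) -> str:
        # No id supplied: generate a random one, so the document counts as new.
        if not doc_id:
            doc_id = uuid.uuid4().hex
        # An id that already exists: the stored document is overwritten.
        self.docs[doc_id] = dict(doc)
        return doc_id


idx = ToyIndex()
doc = {"docName": "report", "status": "draft"}

doc_id = idx.index(doc)          # first insert, id is generated
doc["docId"] = doc_id            # keep the generated id in the model
idx.index(doc, doc["docId"])     # second write stores the id with the document

doc["status"] = "final"
idx.index(doc, doc["docId"])     # update with the id: overwrites in place
idx.index(doc)                   # update without the id: creates a duplicate

print(len(idx.docs))  # 2
```

The last call shows why the generated id has to travel with the model: leaving it out turns an intended update into a second document.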
adding documents in bulk
For bulk addition, we need to supply a JSON document containing entries similar to the following snippet:
POST /_bulk
{"index":{"_index":"productindex"}}
{"_class":"..Product","name":"Corgi Toys .. Car",..."manufacturer":"Hornby"}
{"index":{"_index":"productindex"}}
{"_class":"..Product","name":"CLASSIC TOY .. BATTERY"...,"manufacturer":"ccf"}
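The bulk body is newline-delimited JSON: one action line, then the document itself, repeated for each document, and the body has to end with a newline. A minimal Python sketch that builds such a body is shown here; bulk_body is a hypothetical helper and the two products are made-up sample data.

```python
import json


def bulk_body(index_name: str, docs: list) -> str:
    """Return an NDJSON body for POST /_bulk that indexes every doc."""
    lines = []
    for doc in docs:
        # Action line: tells the bulk API what to do with the next line.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample documents for illustration.
products = [
    {"name": "Toy car", "manufacturer": "Hornby"},
    {"name": "Toy battery", "manufacturer": "ccf"},
]
print(bulk_body("productindex", products), end="")
```

The output can be saved to a file and sent with curl's --data-binary option, which preserves the newlines.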
bulk load data in JSON format
export pwd="elastic:"
curl --user $pwd -H 'Content-Type: application/x-ndjson' -XPOST 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/0/_bulk?pretty' --data-binary @<file>
fetching
https://opensearch.org/docs/latest/search-plugins/
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
Return All Documents From An Index
I have seen cases where a document did not show up in the results of this query, especially immediately after the document was added (from a Java application). A likely cause is that newly indexed documents only become searchable after the next index refresh, which by default happens about once per second. In those cases, fine-tuned queries with specific search criteria helped.
By default, a search returns only the first 10 hits, which is also what the query workbench shows. To see the rest, set the size and from parameters in the request.
GET /wikimedia/_search
{
"query": {
"match_all": {}
}
}
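Since a search returns only 10 hits unless told otherwise, reading a whole index usually means paging with from and size. A small Python sketch that builds the request body for a given page is shown here; page_body is a hypothetical helper name.

```python
import json


def page_body(page: int, page_size: int = 10) -> dict:
    """Build a match_all search body for a zero-based page number."""
    return {
        "from": page * page_size,   # number of hits to skip
        "size": page_size,          # number of hits to return
        "query": {"match_all": {}},
    }


# Third page of 25 hits each: skips the first 50 hits.
print(json.dumps(page_body(2, 25)))
```

As far as I know, from plus size cannot exceed the index.max_result_window setting (10000 by default), so this approach suits browsing rather than exporting very large indexes.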
list all docs in index using curl
curl -X GET "https://{YOUR_SERVER}:9200/{YOUR_INDEX}/_search" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'
curl -X GET 'http://localhost:9200/samples/_search'
Return Documents matching a certain criteria
We send a query of type match for fetching documents matching the string “blue sky”. We can specify queries for searching documents in multiple ways. Elasticsearch provides a JSON-based Query DSL (Domain Specific Language) to define queries.
You can query using parameters on the URL. But you can also use JSON, as shown in the next example. JSON would be easier to read and debug when you have a complex query than one giant string of URL parameters.
Query DSL
GET /messages/_search
{
"query": {
"match": {
"message": "blue sky"
}
}
}
GET /messages/_search
{
"query": {
"match": {
"_id": "bbe13bfa-be84"
}
}
}
GET /messages/_search
{
"query": {
"match": {
"fields.my_custom_field_on_the_object": "bbe13bfa_be84"
}
}
}
curl -XGET --header 'Content-Type: application/json' http://localhost:9200/samples/_search -d '{
"query" : {
"match" : {
"school": "Harvard"
}
}
}'
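With curl, the same search can be written in Lucene query syntax as a URL parameter, q=school:Harvard. Quote the URL so the shell does not interpret the question mark.
curl -X GET 'http://localhost:9200/samples/_search?q=school:Harvard'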
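When the URL form is built in code rather than typed by hand, the q parameter should be URL-encoded. The Python sketch below builds both forms for the same search; uri_search_url and match_body are hypothetical helper names, and the two forms are roughly comparable for a simple single-word search rather than strictly identical.

```python
from urllib.parse import urlencode


def uri_search_url(base_url: str, index: str, field: str, value: str) -> str:
    """Build a URI search: /<index>/_search?q=<field>:<value>, URL-encoded."""
    return f"{base_url}/{index}/_search?" + urlencode({"q": f"{field}:{value}"})


def match_body(field: str, value: str) -> dict:
    """Build the JSON Query DSL body for a match query on one field."""
    return {"query": {"match": {field: value}}}


print(uri_search_url("http://localhost:9200", "samples", "school", "Harvard"))
print(match_body("school", "Harvard"))
```

The JSON body stays readable as the query grows, which is the main reason to prefer it over a long URL.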
Return only certain fields
To return only certain fields put them into the _source array:
GET filebeat-7.6.2-2020.05.05-000001/_search
{
"_source": ["suricata.eve.timestamp","source.geo.region_name","event.created"],
"query": {
"match" : { "source.geo.country_iso_code": "GR" }
}
}
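To see what the _source array does to each hit, here is a simplified Python model of the filtering. It only handles exact dotted paths (no wildcards), filter_source is a hypothetical name, and the sample document values are made up.

```python
def filter_source(doc: dict, fields: list) -> dict:
    """Keep only the listed dotted paths, preserving the nested structure."""
    result = {}
    for path in fields:
        keys = path.split(".")
        value = doc
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # path is missing in this document: skip it
            value = value[key]
        else:
            # Path found: rebuild the nesting in the result.
            target = result
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
    return result


# Made-up document for illustration.
hit = {
    "source": {"geo": {"region_name": "Attica", "country_iso_code": "GR"}},
    "event": {"created": "2020-05-05T10:00:00Z"},
}
print(filter_source(hit, ["source.geo.region_name", "event.created"]))
```

Fields that are not listed, such as source.geo.country_iso_code here, can still be used in the query; they are only left out of the returned _source.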
Query by a date range
When the field is of type date you can use date math, like this:
GET filebeat-7.6.2-2020.05.05-000001/_search
{
"query": {
"range" : {
"event.created": {
"gte" : "now-7d/d"
}
}
}
}
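The expression now-7d/d reads as: take the current time, subtract 7 days, then round to the day. As I understand the date math rules, with gte the rounding goes down to the start of that day, in UTC unless a time zone is specified. The Python sketch below computes the same boundary; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def now_minus_days_rounded(days: int, now: Optional[datetime] = None) -> datetime:
    """Approximate what 'now-<days>d/d' resolves to for a gte bound."""
    if now is None:
        now = datetime.now(timezone.utc)
    shifted = now - timedelta(days=days)
    # "/d" rounds down to the start of the day.
    return shifted.replace(hour=0, minute=0, second=0, microsecond=0)


fixed_now = datetime(2020, 5, 12, 15, 30, tzinfo=timezone.utc)
print(now_minus_days_rounded(7, fixed_now).isoformat())
```

So a query run on the afternoon of 12 May matches everything created from midnight on 5 May onward, not just the last 168 hours.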
Putting it all together to form a slightly complex query
GET /indexName2/_search
{
"from":0,
"size":10000,
"_source": ["docName","userName","createdOn","status"],
"sort": [
{
"createdOn": {
"order": "desc"
}
}
],
"query": {
"term": {
"userName.keyword": {
"value": "superman1"
}
}
}
}
delete index
In the example below, the index is named samples.
curl -X DELETE 'http://localhost:9200/samples'
delete documents in an index
POST /name-of-the-index/_delete_by_query
{
"query": {
"match": {
"_id": "bbe13bfa-be84"
}
}
}
POST /name-of-the-index/_delete_by_query
{
"query": {
"match": {
"fields.my_custom_field_on_the_object": "bbe13bfa_be84"
}
}
}
list all indexes
GET _cat/indices
Using curl
curl -X GET 'http://localhost:9200/_cat/indices?v'
list index mapping
By default, every field in Elasticsearch is indexed. The request below returns the index definition, including the mapping, which lists all fields and their types in the index.
curl -X GET http://localhost:9200/samples
update Doc
Update one of the documents in the index, setting a specific field to a specific value. The last path segment must be the _id of the document to update.
POST /name_of_the_index/_update/my_custom_field_on_the_object
{
"doc" : {
"fields" : {
"another_field_in_the_pojo": "a_new_updated_value"
}
}
}
After running this, query the index for the document again to double-check that the value of the other field was updated.
GET /name_of_the_index/_search
{
"query": {
"match": {
"fields.my_custom_field_on_the_object": "bbe13bfa_be84"
}
}
}
update by query
https://opensearch.org/docs/latest/api-reference/document-apis/update-by-query/
POST test-index1/_update_by_query
{
"query": {
"term": {
"oldValue": 10
}
},
"script" : {
"source": "ctx._source.oldValue += params.newValue",
"lang": "painless",
"params" : {
"newValue" : 20
}
}
}
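The request above combines two parts: the term query selects the documents, and the script runs once per selected document. The toy Python model below shows the effect on an in-memory list; update_by_query here is a hypothetical stand-in, not the real API.

```python
def update_by_query(docs: list, old_value: int, new_value: int) -> int:
    """Add new_value to oldValue on every doc whose oldValue equals old_value."""
    updated = 0
    for doc in docs:
        # Plays the role of the term query: exact match on oldValue.
        if doc.get("oldValue") == old_value:
            # Plays the role of: ctx._source.oldValue += params.newValue
            doc["oldValue"] += new_value
            updated += 1
    return updated


docs = [{"oldValue": 10}, {"oldValue": 5}, {"oldValue": 10}]
count = update_by_query(docs, old_value=10, new_value=20)
print(count, docs)
```

Only the documents with oldValue 10 change (to 30); the document with oldValue 5 is left untouched.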
Add fields to an existing document
First we create a new document, then we update it.
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/2 -d ' {
"school": "Clemson"
}'
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_update/2 -d '{
"doc" : {
"students": 50000
}
}'
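A partial update with "doc" merges the supplied fields into the stored document instead of replacing it. The Python sketch below is a simplified model of that merge, assuming nested objects are merged recursively and other values are overwritten; apply_partial_update is a hypothetical name.

```python
def apply_partial_update(stored: dict, partial: dict) -> dict:
    """Return a copy of stored with the fields of partial merged in."""
    merged = dict(stored)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge them field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # New field, or a plain value: add or overwrite it.
            merged[key] = value
    return merged


stored = {"school": "Clemson"}
print(apply_partial_update(stored, {"students": 50000}))
```

After the update, the document has both the original school field and the new students field.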
backup index
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/_reindex -d '{
"source": {
"index": "samples"
},
"dest": {
"index": "samples_backup"
}
}'
show cluster health
curl --user $pwd -H 'Content-Type: application/json' -XGET 'https://58571402f5464923883e7be42a037917.eu-central-1.aws.cloud.es.io:9243/_cluster/health?pretty'