Mr.好好吃的資料遊樂園: [Elasticsearch] Basic Syntax

Elasticsearch
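This post covers the Elasticsearch (ES) syntax used most often for create, delete, update, and search operations.

For time-based queries, see here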

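Short-form query syntax

Query parameters

q: the query string, e.g. q=aa or q=user:aa

df: the default field to search when q does not name one; used together with q

sort: sort order, asc or desc

timeout: search timeout; by default there is none

from, size: pagination

Advanced queries:

Regular expressions (regex)
Fuzzy query: name:john~1 also matches terms that differ from john by one character
Proximity search: a phrase followed by ~N, which tolerates the phrase terms being up to N positions apart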
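
The parameters above end up in the URL, so special characters in q have to be percent-encoded. A minimal sketch of assembling such a path with the Python standard library follows; the helper name uri_search_path is made up for illustration and is not part of any Elasticsearch client.

```python
from urllib.parse import urlencode


def uri_search_path(index, q, df=None, sort=None, from_=None, size=None, timeout=None):
    """Build a URI-search path such as 'test-charles/_search?q=...' (hypothetical helper)."""
    params = {"q": q}
    optional = {"df": df, "sort": sort, "from": from_, "size": size, "timeout": timeout}
    # Keep only the parameters the caller actually set.
    params.update({key: value for key, value in optional.items() if value is not None})
    # urlencode percent-encodes quotes and colons and turns spaces into '+'.
    return f"{index}/_search?{urlencode(params)}"


path = uri_search_path("test-charles", 'body:"hello world"', sort="age:desc", from_=0, size=10)
print(path)
```

The printed path can be pasted after the host and port of a cluster; the index name and field names are the ones used in this post.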

GET test-charles/_search?pretty //pretty-prints the JSON response, e.g. when viewed in a browser

GET test-charles/_search?q=hello_world

GET test-charles/_search?q=body:hello_world //search only the body field

GET test-charles/_search?q=hello_world&df=body //same as above
GET test-charles/_search?_source=list,body //return only the list and body fields of each hit

GET test-charles/_search?q="hello world" //phrase query

GET test-charles/_search?q=hello world //matches documents containing hello OR world; each word is searched as a separate term

GET test-charles/_search?q=job:(+java -python) //job field contains java but not python

GET test-charles/_search?q=age:[0 TO 28]

GET test-charles/_search?q=age:(>=1 AND <=24)

GET test-charles/_search?q=body:(hello_*)

GET test-charles/_search?q=hello_word
{
  "profile":true //show how the query was executed
}

Write syntax

Create, update, delete

POST cip-charles2/_search # search with POST
{
  "query": {
    "simple_query_string": {
      "query": "c",
      "fields": ["list"],
      "default_operator": "AND"
    }
  }
}

POST test-charles/_doc/2 # to modify a document, index it again; the new body overwrites the old one
{
  "body": "hello_word 2",
  "list": ["a","b","c"]
}

DELETE test-charles/_doc/2

# Bulk insert
POST mytest/_bulk

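# Create a document
{ "create" : { "_id": 1 } }
{ "color": "create black" }

# Create a document; this fails because a document with id=1 already exists
{ "create" : { "_id": 1 } }
{ "color": "create black2" }

# Index a document
{ "index" : { "_id": 2 } }
{ "color": "index red" }

# Index a document; index can create or update, so this succeeds
{ "index" : { "_id": 2 } }
{ "color": "index red2" }

# Index a document without setting an id (index can create or update and the id is optional, which makes it very convenient)
{ "index": {} }
{ "color": "index blue" }

# Delete a document; note that a delete action is not followed by a document line
{ "delete" : { "_id": "2" } }

# No document with this id exists any more, so the delete fails
{ "delete" : { "_id": "2" } }

# Update a document; note that the body has a different format (wrapped in "doc")
{ "update" : { "_id": 1 } }
{ "doc": { "color": "update green"} }

# Update a document; this fails because no document with this id exists
{ "update" : { "_id": 100 } }
{ "doc": { "color": "update green2"} }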

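bulk request template

Each operation has two parts: the action and its metadata
action: must be one of the following 4 options
- index (most common): creates the document if it does not exist, and overwrites it if it does
- create: creates the document if it does not exist, and returns an error if it does
  - Set _id in the metadata when using it, so that ES can tell whether the document already exists
- update: updates a document, and returns an error if the document does not exist
  - Also needs an _id, and the document line that follows has a different format from the others
- delete: deletes a document, and returns an error if no document with that id exists
  - Also needs the document _id in the metadata, and must not be followed by a document line, which would be meaningless because the deletion is done by _id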
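
The action/metadata layout described above can be sketched as a small serializer. This is only an illustration of the newline-delimited format, under the assumption that each operation is given as an (action, metadata, source) tuple; build_bulk_body is a hypothetical helper, not a client API.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a newline-delimited _bulk body."""
    lines = []
    for action, metadata, source in operations:
        if action not in ("index", "create", "update", "delete"):
            raise ValueError(f"unknown bulk action: {action}")
        # First line of every operation: the action with its metadata.
        lines.append(json.dumps({action: metadata}))
        if action == "delete":
            continue  # delete is identified by _id only and has no document line
        # update wraps the partial document in "doc"; index and create send it as-is.
        lines.append(json.dumps({"doc": source} if action == "update" else source))
    # The body of a bulk request has to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("create", {"_id": 1}, {"color": "create black"}),
    ("index", {}, {"color": "index blue"}),
    ("update", {"_id": 1}, {"color": "update green"}),
    ("delete", {"_id": "2"}, None),
])
print(body)
```

The resulting string is what would follow POST mytest/_bulk; unlike the annotated listing above, it contains no comments or blank lines between operations.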

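mapping - defining the schema

A mapping defines the structure of the data, much like a schema in a database

It defines the field names in an index
It defines each field's type (integer, string, boolean)
It defines how the inverted index is configured

Simple types

Text / Keyword
Date
Integer / Floating
Boolean
IPv4 & IPv6

Complex types

Object
List (an array; ES has no dedicated array type, and any field can hold multiple values)

Special types

geo_point & geo_shape (geographic data)
percolator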

GET test-charles/_mapping
PUT test-charles
{
    "mappings" : {
      "dynamic": true, #default is true; false: new fields are accepted but cannot be searched; strict: new fields are rejected with an error
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        # With index set to false, ES does not index this field's data
        "mobile" : {
          "type" : "text",
          "index": false
        }
      }
    }
}
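
To make the three values of dynamic concrete, here is a toy model of what happens to the mapping when a document with an unknown field arrives. It is a sketch only: apply_dynamic is a made-up function, and it maps every new field as text, whereas real ES infers the type from the value.

```python
def apply_dynamic(properties, document, dynamic=True):
    """Return the mapping properties after indexing `document` (toy model of `dynamic`)."""
    updated = dict(properties)
    for field in document:
        if field in updated:
            continue  # field is already mapped
        if dynamic == "strict":
            # strict: the whole document is rejected
            raise ValueError(f"strict mapping: unknown field [{field}] rejected")
        if dynamic is True:
            # true: the new field is added to the mapping and becomes searchable
            updated[field] = {"type": "text"}
        # false: the field is left out of the mapping, so it cannot be searched
    return updated


props = {"firstName": {"type": "text"}, "lastName": {"type": "text"}}
doc = {"firstName": "Ann", "nickname": "annie"}

print(sorted(apply_dynamic(props, doc, True)))
print(sorted(apply_dynamic(props, doc, False)))
```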

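Request Body Search query syntax

The URI query syntax has its limits; putting the query in the request body is clearer

term query is used to query structured data
Full-text search uses the match query
bool query is a compound query that can combine term queries and full-text queries

Term Query & Full Text Query

A term query does not analyze (tokenize) its input, whereas a full-text query does
For exact matching, query the [FIELD_NAME].keyword field
A constant_score query (rewrite the term clause as constant_score + filter + term) turns the query into a filter, which skips scoring and can use the cache to speed up the query
When you just need to quickly filter out unwanted documents, a constant_score query is a good choice

Term Query

A term is the smallest unit that carries meaning; both search and natural language processing work with terms
In the query syntax, using the term keyword (query >> term) means the query is a term query
Term-level queries include Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
When ES receives a term query, it does not analyze the input. It treats the input as a single unit, looks up that exact term in the inverted index, and scores the matches with the relevance scoring mechanism
If scoring is not needed, use constant_score to turn the query into a filter and take advantage of caching for better performance

The following examples show what to watch out for with term queries: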

//Create the index
PUT products
{
  "settings": {
    "number_of_shards": 1
  }
}

//Add three documents
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

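//Inspect the index settings and mapping
GET /products

//Searching "iPhone" finds nothing: the default analyzer lowercases every token at index time,
//but a term query bypasses the analyzer, so the search value is not lowercased
//Searching "iphone" (lowercase) finds the document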
POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        "value": "iPhone"
        //"value":"iphone"
      }
    }
  }
}

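//On the "desc.keyword" field, searching "iPhone" with its capital letters finds the document,
//because a keyword field keeps the original value unchanged
//Conversely, the lowercase "iphone" finds nothing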
POST /products/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone"
        //"value":"iphone"
      }
    }
  }
}

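//As in the first example, the search value with uppercase letters finds nothing,
//because the default analyzer splits the value on `-` and then lowercases each token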
POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

//With the keyword field, the original mixed-case value finds the document
POST /products/_search
{
  //"explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

//Use "constant_score" to turn the query into a filter, which speeds it up
//Scoring is skipped, so every hit gets a score of 1
POST /products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }

    }
  }
}
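
The difference between the analyzed text field and its keyword sub-field in the examples above can be reproduced offline with a toy inverted index. This is a rough sketch: toy_analyze only approximates the default analyzer (split on non-alphanumeric characters, then lowercase), and all function names are invented for illustration.

```python
import re


def toy_analyze(text):
    """Rough stand-in for the default analyzer: split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, analyzed):
    """Map each indexed term to the set of document IDs that contain it."""
    index = {}
    for doc_id, value in docs.items():
        # A keyword field keeps the whole value as one term; a text field is analyzed.
        terms = toy_analyze(value) if analyzed else [value]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, value):
    """A term query looks the value up as-is, without analyzing it."""
    return sorted(index.get(value, set()))


product_ids = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5", 3: "JODL-X-1937-#pV7"}
text_index = build_index(product_ids, analyzed=True)      # plays the role of "productID"
keyword_index = build_index(product_ids, analyzed=False)  # plays the role of "productID.keyword"

print(term_query(text_index, "XHDK-A-1293-#fJ3"))     # [] : no such term was indexed
print(term_query(text_index, "xhdk"))                 # [1]: one of the lowercase tokens
print(term_query(keyword_index, "XHDK-A-1293-#fJ3"))  # [1]: the original value, kept whole
```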

Full Text Query

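Full-text queries include the match query, match phrase query, query string query, and others; see the official documentation for the full list
Text is analyzed at both index time and query time: the query string is first passed to a suitable analyzer, which produces the list of terms to search for
Each term in that list is looked up in turn, the results are merged, and a score is computed for each document
slop is the maximum distance allowed between the positions of two terms; with slop=1, one other word may sit between the two terms

Here are a few simple examples:

//With only query set, the default operator is or, so many documents match
//Adding operator or minimum_should_match narrows the results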
POST /movies/_search
{
  "query": {
    "match": {
      "title": {
        //"operator": "and", 
        //"minimum_should_match": 2,
        "query": "Matrix reloaded"
      }
    }
  }
}

//"match_phrase" with "slop" is another way to narrow the results
POST /movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        //"slop": 1,
        "query": "Matrix reloaded"
      }
    }
  }
}
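
How operator and minimum_should_match narrow a match query can be sketched with a toy matcher. The assumptions are loud ones: the analyzer is replaced by lowercase plus whitespace splitting, minimum_should_match is only handled as a plain integer, no scoring is done, and the three titles are made-up sample data rather than the contents of the movies index.

```python
def toy_match(title, query, operator="or", minimum_should_match=None):
    """Decide whether `title` matches `query` under simplified match-query rules."""
    title_terms = set(title.lower().split())
    query_terms = query.lower().split()
    # Count how many of the query terms occur in the title.
    hits = sum(1 for term in query_terms if term in title_terms)
    if minimum_should_match is not None:
        return hits >= minimum_should_match
    if operator == "and":
        return hits == len(query_terms)
    return hits >= 1  # default operator "or": any single term is enough


titles = ["The Matrix", "The Matrix Reloaded", "Reloaded"]  # hypothetical sample titles

print([t for t in titles if toy_match(t, "Matrix reloaded")])
print([t for t in titles if toy_match(t, "Matrix reloaded", operator="and")])
print([t for t in titles if toy_match(t, "Matrix reloaded", minimum_should_match=2)])
```

With the default or, all three titles match; with and, or with a minimum of two matching terms, only the title containing both words remains.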

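Structured data search

Data such as dates, booleans, and numbers is structured
Some text is structured as well, for example colors (red, green, blue), tags (distributed, search), or specific codes. Any text produced according to a fixed rule can be treated as structured

Structured search is search over structured data

Structured text supports exact matching (term query) or partial matching (prefix query)
A structured search has only two outcomes per document, "true" or "false" (it matches or it does not), and scoring can be turned on or off as needed

Range
gt: (Optional) Greater than.
gte: (Optional) Greater than or equal to.
lt: (Optional) Less than.
lte: (Optional) Less than or equal to.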
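
The four range bounds can be combined freely. A small sketch of building such a request body, wrapped in constant_score so that no scoring takes place, is shown here; range_filter is a hypothetical helper and the price field is the one used in this post's sample data.

```python
import json

ALLOWED_BOUNDS = {"gt", "gte", "lt", "lte"}


def range_filter(field, **bounds):
    """Build a search body that filters `field` by the given gt/gte/lt/lte bounds."""
    unknown = set(bounds) - ALLOWED_BOUNDS
    if unknown:
        raise ValueError(f"unsupported range bounds: {sorted(unknown)}")
    if not bounds:
        raise ValueError("at least one of gt, gte, lt, lte is required")
    # constant_score + filter: matching documents are returned without relevance scoring.
    return {"query": {"constant_score": {"filter": {"range": {field: bounds}}}}}


body = range_filter("price", gte=20, lte=30)
print(json.dumps(body, indent=2))
```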

Structured search with exact matching

DELETE products

//Add data
//Not every document has a date field
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

//Inspect the mapping produced by dynamic mapping
GET products/_mapping

//Run a term query with scoring
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "term": {
      "avaliable": true
    }
  }
}

//Run the term query in filter context, so no score is computed
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": true
        }
      }
    }
  }
}

//Range query in filter context, skipping the scoring step
GET products/_search
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "range" : {
          "price" : {
            "gte" : 20,
            "lte"  : 30
          }
        }
      }
    }
  }
}

//Use date math units (y = years) for date queries
POST products/_search
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "range" : {
          "date" : {
            "gte" : "now-2y"
          }
        }
      }
    }
  }
}


// =========== Handling multi-value fields ===========

//When a field holds several text values
POST /movies/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Father of the Bridge Part II","year":1995, "genre":"Comedy"}
{ "index": { "_id": 2 }}
{ "title" : "Dave","year":1993,"genre":["Comedy","Romance"] }

//genre is a multi-value field,
//so the search below returns every document whose genre contains Comedy
//For a strictly exact match, add a separate count field and combine it with a bool query
POST /movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genre.keyword": "Comedy"
        }
      }
    }
  }
}
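
The count-field approach mentioned in the comments above could look roughly like this. It assumes each document was indexed with an extra numeric field holding the number of genres; the name genre_count is invented here and does not appear in the sample data.

```python
import json


def exact_genre_query(genres):
    """Build a bool query matching documents whose genre list is exactly `genres`."""
    # One term filter per required genre value.
    filters = [{"term": {"genre.keyword": genre}} for genre in genres]
    # The count filter rules out documents that carry additional genres.
    # "genre_count" is a hypothetical field that must be written at index time.
    filters.append({"term": {"genre_count": len(genres)}})
    return {"query": {"bool": {"filter": filters}}}


body = exact_genre_query(["Comedy"])
print(json.dumps(body, indent=2))
```

Under that assumption, a document with genre ["Comedy", "Romance"] would have genre_count 2 and would no longer match a search for exactly Comedy.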

Contains certain characters

GET cip-charles2/_search
{
  "query": {
    "query_string" : {
      "default_field" : "body", 
      "query" : "*hllo*"
     }
  } 
}

Matching the same keywords across multiple fields

GET /_search
{
  "query": {
    "simple_query_string" : {
        "query": "java +(python | ruby)",
        "fields": ["job"]
    }
  }
}

GET /_search
{
  "query": {
    "simple_query_string" : {
        "query": "\"fried eggs\" +(eggplant | potato) -frittata", // (1)
        "fields": ["title^5", "body"], // (2)
        "default_operator": "and" // (3)
    }
  }
}

(1): Required. The query string
(2): Optional. The fields to search (several may be given); ^5 means title is weighted five times as heavily as body
(3): Optional. The operator used where the query string does not specify one. Valid values are OR and AND; the default is OR

simple_query_string supports these operators:

+ means AND
| means OR
- means NOT (negation)
" wraps a string to make it a phrase query
* after a term makes it a prefix query
( and ) mark precedence
~N after a word sets the edit distance (fuzziness)
~N after a phrase sets the slop amount
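
As a small illustration of the three parameters and the ^ boost notation, the request body can be assembled from a mapping of field names to weights. The helper simple_query_string_body is hypothetical; only the JSON shape it produces comes from the example above.

```python
import json


def simple_query_string_body(query, field_boosts, default_operator="or"):
    """Build a simple_query_string body; `field_boosts` maps field name to weight."""
    # A weight of 1 needs no suffix; any other weight is written as field^weight.
    fields = [name if boost == 1 else f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {
        "query": {
            "simple_query_string": {
                "query": query,
                "fields": fields,
                "default_operator": default_operator,
            }
        }
    }


body = simple_query_string_body(
    '"fried eggs" +(eggplant | potato) -frittata',
    {"title": 5, "body": 1},
    default_operator="and",
)
print(json.dumps(body, indent=2))
```

json.dumps takes care of escaping the double quotes around the phrase, which is why the query can be written with plain quotes in Python.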

Bool Query

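filter: keeps only the documents that match; contributes nothing to the relevance score
must: every must clause has to match, and each one affects the relevance score. It selects the same documents as filter, but the score it adds affects the ordering
must_not: none of the must_not clauses may match. These clauses do not affect scoring; they only exclude documents
should: a document does not have to match these clauses (for example, it need not contain the terms brown or dog), but if it does, it is considered more relevant
A bool query is made up of one or more query clauses (which lets you build compound queries)

The scores of the individual clauses in a bool query are combined into the overall relevance score

Clauses may appear in any order

A clause can hold several queries by passing them as a list

If a bool query has no must or filter clause, at least one should clause has to match

Clause	Effect
`must`	(Query Context) Must match; contributes to the score
`should`	(Query Context) May match; contributes to the score
`must_not`	(Filter Context) Must not match; does not contribute to the score
`filter`	(Filter Context) Must match; does not contribute to the score

The bool query computes a relevance _score for each document by adding together the _score of every matching must and should clause. Older guides also describe dividing this sum by the total number of must and should clauses; the current documentation describes only the sum.

Relevance score: how well a document matches the query. The inverted index supplies the list of documents that match the query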

A Bool Query with multiple clauses

minimum_should_match: the minimum number of should clauses that have to match

//Add several documents
POST /products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10,"avaliable":true,"date":"2018-01-01", "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20,"avaliable":true,"date":"2019-01-01", "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30,"avaliable":true, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30,"avaliable":false, "productID" : "QQPX-R-3956-#aD8" }

//bool query with several clauses
POST /products/_search
{
  "query": {
    "bool" : {
      //Query Context
      "must" : {
        "term" : { "price" : "30" }
      },
      //Filter Context
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      //Filter Context
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      //Query Context
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}
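
To check which documents the bool query above selects, its matching logic (not its scoring) can be replayed in plain Python over the same four products. This is a sketch with clauses modeled as predicates; bool_match is an invented helper and says nothing about how ES evaluates the query internally.

```python
products = {
    1: {"price": 10, "avaliable": True, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "avaliable": True, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "avaliable": False, "productID": "QQPX-R-3956-#aD8"},
}


def bool_match(doc, must=(), filter_=(), must_not=(), should=(), minimum_should_match=0):
    """Return True if `doc` passes the clauses; each clause is a predicate on the doc."""
    if not all(clause(doc) for clause in must):
        return False  # every must clause has to match
    if not all(clause(doc) for clause in filter_):
        return False  # every filter clause has to match
    if any(clause(doc) for clause in must_not):
        return False  # no must_not clause may match
    should_hits = sum(1 for clause in should if clause(doc))
    return should_hits >= minimum_should_match


hits = [
    doc_id
    for doc_id, doc in products.items()
    if bool_match(
        doc,
        must=[lambda d: d["price"] == 30],
        filter_=[lambda d: d["avaliable"] is True],
        must_not=[lambda d: d["price"] <= 10],
        should=[
            lambda d: d["productID"] == "JODL-X-1937-#pV7",
            lambda d: d["productID"] == "XHDK-A-1293-#fJ3",
        ],
        minimum_should_match=1,
    )
]
print(hits)  # [3]
```

Document 4 also has a price of 30 but fails the filter, and document 1 matches a should clause but fails must, so only document 3 remains.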

Complex Bool Query

A bool query can contain another bool query (multi-level bool queries):

POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "price": "30"
        }
      },
      //Put another bool query inside a clause of the outer bool query
      "should": [
        {
          "bool": {
            "must_not": {
              "term": {
                "avaliable": "false"
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

Querying with boosting

DELETE news

//Add several documents
POST /news/_bulk
{ "index": { "_id": 1 }}
{ "content":"Apple Mac" }
{ "index": { "_id": 2 }}
{ "content":"Apple iPad" }
{ "index": { "_id": 3 }}
{ "content":"Apple employee like Apple Pie and Apple Juice" }

//Suppose we want results about Apple computers
//The last, food-related document comes first in the results because apple appears in it most often
POST /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"content":"apple"}
      }
    }
  }
}

//Exclude the food-related document entirely
POST /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match":{"content":"apple"}
      },
      "must_not": {
        "match":{"content":"pie"}
      }
    }
  }
}

//Keep every document that mentions apple,
//but use boosting to give the food-related ones a lower score
//Any hit that also matches 'pie' has its score reduced
POST /news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "pie"
        }
      },
      "negative_boost": 0.5
    }
  }
}
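
The effect of negative_boost can be sketched with a toy score. The assumption is a deliberately crude one: the positive score here is just the number of times the term occurs, which is not how ES scores documents, so the absolute numbers and the final ranking will differ from a real cluster. Only the multiplication step is the point.

```python
def toy_score(content, term):
    """Toy relevance: how many times `term` occurs in `content` (not the real ES scoring)."""
    return float(content.lower().split().count(term))


def boosting_score(content, positive, negative, negative_boost):
    """Score by the positive term, then demote documents that match the negative term."""
    score = toy_score(content, positive)
    if negative in content.lower().split():
        # A match on the negative query does not exclude the document;
        # it only multiplies its score by negative_boost.
        score *= negative_boost
    return score


news = {
    1: "Apple Mac",
    2: "Apple iPad",
    3: "Apple employee like Apple Pie and Apple Juice",
}

scores = {doc_id: boosting_score(text, "apple", "pie", 0.5) for doc_id, text in news.items()}
print(scores)  # {1: 1.0, 2: 1.0, 3: 1.5}
```

Document 3 is still returned, unlike with must_not, but its toy score drops from 3.0 to 1.5.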

Posted on Friday, February 5, 2021

Ref:

【尚硅谷】ElasticSearch-LogStash-Kibana全套视频

ElasticSearch - 批量操作 bulk
