Section 5: Distributed Search and Relevance Scoring

1. How Distributed Search Works

An Elasticsearch search runs in two phases:
- Phase 1: Query
- Phase 2: Fetch
Together they are known as Query-then-Fetch.

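2. Query Phase

A user sends a search request to an ES node. The node that receives the request acts as the Coordinating node. In this example the index has 6 shard copies in total (primaries plus replicas); the Coordinating node picks 3 of them, one copy of each shard (either the primary or a replica), and forwards the query to them.
Each selected shard executes the query and sorts its results, then returns the document IDs and sort values of its top From + Size documents to the Coordinating node.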
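The shard-side work described above can be sketched in Python. This is a minimal illustration, not Elasticsearch code: `query_phase`, the `shard` dictionary, and the scores in it are hypothetical names and values chosen for the example.

```python
import heapq


def query_phase(shard_docs, from_, size):
    """Return the shard-local top (from_ + size) hits as (doc_id, score) pairs."""
    k = from_ + size
    # Only IDs and sort values leave the shard; the document bodies stay behind.
    return heapq.nlargest(k, shard_docs.items(), key=lambda item: item[1])


# Hypothetical shard content: document ID -> score computed on this shard.
shard = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1, "e": 0.5}

hits = query_phase(shard, from_=1, size=2)
print(hits)  # [('a', 0.9), ('c', 0.7), ('e', 0.5)]
```

Note that the shard returns 3 entries even though the page size is 2: it cannot know which of its hits will survive the global sort, so it must return everything up to From + Size.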

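3. Fetch Phase

The Coordinating node merges the sorted document ID lists returned by each shard in the Query phase and re-sorts them, then selects the IDs of the documents in positions From through From + Size.
It then retrieves the full documents from the relevant shards with a multi-get request.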
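A rough Python sketch of the coordinator's side follows. It is a simplified model under stated assumptions: `fetch_phase`, `shard_results`, and `shard_store` are hypothetical, the dictionary lookup stands in for the multi-get request, and the tie-breaking rule is chosen only to keep the example deterministic.

```python
def fetch_phase(shard_results, shard_store, from_, size):
    """Merge per-shard (doc_id, score) lists, pick the page, then fetch sources."""
    merged = [
        (score, shard_id, doc_id)
        for shard_id, hits in shard_results.items()
        for doc_id, score in hits
    ]
    # Global re-sort by score, descending; ties broken by shard and ID (assumption).
    merged.sort(key=lambda h: (-h[0], h[1], h[2]))
    page = merged[from_:from_ + size]
    # Stand-in for the multi-get: read each document from the shard that holds it.
    return [
        {"_id": doc_id, "_score": score, "_source": shard_store[shard_id][doc_id]}
        for score, shard_id, doc_id in page
    ]


# Hypothetical Query-phase results from two shards, with from=1 and size=2.
shard_results = {
    0: [("a", 0.9), ("c", 0.7), ("e", 0.5)],
    1: [("x", 0.8), ("y", 0.6), ("z", 0.2)],
}
shard_store = {
    0: {"a": {"content": "doc a"}, "c": {"content": "doc c"}, "e": {"content": "doc e"}},
    1: {"x": {"content": "doc x"}, "y": {"content": "doc y"}, "z": {"content": "doc z"}},
}

page = fetch_phase(shard_results, shard_store, from_=1, size=2)
print([hit["_id"] for hit in page])  # ['x', 'c']
```

Of the 6 entries the coordinator received, only 2 documents are actually fetched; the rest were needed solely to establish the global order.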

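4. Potential Problems with Query Then Fetch

Performance
- Each shard must collect from + size documents
- The coordinating node must then process number_of_shards * (from + size) entries
- Deep pagination therefore becomes expensive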
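To make the cost concrete, here is a small arithmetic sketch in Python. The function name and the shard counts and offsets are hypothetical, picked only to show how the formula above scales.

```python
def coordinator_entries(number_of_shards, from_, size):
    """Entries the coordinating node must sort for one page request."""
    per_shard = from_ + size          # what every shard has to return
    return number_of_shards * per_shard


# First page of 10 on a 5-shard index.
print(coordinator_entries(5, 0, 10))      # 50
# Page 1000 of the same query: same page size, far more work.
print(coordinator_entries(5, 9990, 10))   # 50000
```

The page size is identical in both calls; only the offset changes, yet the coordinator handles a thousand times more entries.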
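Relevance scoring
- Each shard computes relevance using only its own data, and scoring on one shard is independent of the others. This can skew scores, especially when there is little data: with a small total number of documents and more than one primary shard, the more primary shards there are, the less accurate the scores become.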

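5. Ways to Fix Inaccurate Scoring

When the data volume is small, set the number of primary shards to 1.

When the data volume is large enough, scores are generally not skewed as long as documents are evenly distributed across the shards.

Use DFS Query Then Fetch

Specify the parameter in the search URL: "_search?search_type=dfs_query_then_fetch"
This first collects term frequencies and document frequencies from every shard, then performs one complete relevance calculation using the combined statistics. It costs more CPU and memory and performs poorly, so it is generally not recommended.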
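The statistics-gathering step can be pictured with a short Python sketch. It is a simplified model, not the actual Elasticsearch data structures: `ShardStats`, its three fields, and `merge_stats` are hypothetical names, and the per-shard numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShardStats:
    doc_count: int    # documents on this shard that have the field
    doc_freq: int     # documents on this shard that contain the term
    total_terms: int  # sum of field lengths on this shard


def merge_stats(per_shard):
    """Sum shard-local statistics into one global view before scoring."""
    return ShardStats(
        doc_count=sum(s.doc_count for s in per_shard),
        doc_freq=sum(s.doc_freq for s in per_shard),
        total_terms=sum(s.total_terms for s in per_shard),
    )


# Hypothetical statistics for one term on three shards holding one document each.
per_shard = [ShardStats(1, 1, 1), ShardStats(1, 1, 2), ShardStats(1, 1, 3)]

global_stats = merge_stats(per_shard)
avg_field_len = global_stats.total_terms / global_stats.doc_count
print(global_stats.doc_count, global_stats.doc_freq, avg_field_len)  # 3 3 2.0
```

Every shard then scores against the same document count, document frequency, and average field length, which is what removes the per-shard skew. The price is an extra round trip to all shards before the query itself runs.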

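6. Relevance Scoring Demo

Write 3 documents: "Good" / "Good morning" / "good morning everyone"
Test with 1 primary shard: "Good" should rank first, and the DF of "good" should be 3
Then test with 20 primary shards
With multiple primary shards, all 3 documents get the same score. The Explain API can be used to analyze why
Running DFS Query Then Fetch on the multi-shard index gives the same result as a single shard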

DELETE message
PUT message
{
  "settings": {
    "number_of_shards": 20
  }
}

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "message"
}

GET message

Output

{
  "message" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1603408677309",
        "number_of_shards" : "20",
        "number_of_replicas" : "1",
        "uuid" : "QKfnIdUsR9Gj8giWp7Kr0g",
        "version" : {
          "created" : "7090199"
        },
        "provided_name" : "message"
      }
    }
  }
}

Write the data

POST message/_doc?routing=1
{
  "content":"good"
}

POST message/_doc?routing=2
{
  "content":"good morning"
}

POST message/_doc?routing=3
{
  "content":"good morning everyone"
}

POST message/_search
{
  "explain": true,
  "query": {
    "match_all": {}
  }
}

Output

"_shard" : "[message][2]",
"_score" : 1.0,
"_routing" : "3",
"_source" : {
  "content" : "good morning everyone"
},
...
"_shard" : "[message][14]",
"_score" : 1.0,
"_routing" : "2",
"_source" : {
  "content" : "good morning"
},
...
"_shard" : "[message][17]",
"_score" : 1.0,
"_routing" : "1",
"_source" : {
  "content" : "good"
},

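When the total number of documents is small and there is more than one primary shard, the more primary shards there are, the less accurate the relevance scores become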

POST message/_search
{
  "explain": true,
  "query": {
    "term": {
      "content": {
        "value": "good"
      }
    }
  }
}

Output:

"_score" : 0.2876821,
"_source" : {
  "content" : "good morning everyone"
},
...
"_score" : 0.2876821,
"_source" : {
  "content" : "good morning"
},
...
"_score" : 0.2876821,
"_source" : {
  "content" : "good"
},

Use DFS Query Then Fetch


POST message/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "term": {
      "content": {
        "value": "good"
      }
    }
  }
}

Output:

"hits" : [
      {
        "_index" : "message",
        "_type" : "_doc",
        "_id" : "rX2gUnUBaxLQOLM-iUzd",
        "_score" : 0.16786805,
        "_routing" : "1",
        "_source" : {
          "content" : "good"
        }
      },
      {
        "_index" : "message",
        "_type" : "_doc",
        "_id" : "sH2gUnUBaxLQOLM-p0xC",
        "_score" : 0.13353139,
        "_routing" : "2",
        "_source" : {
          "content" : "good morning"
        }
      },
      {
        "_index" : "message",
        "_type" : "_doc",
        "_id" : "tX2gUnUBaxLQOLM-vkyS",
        "_score" : 0.110856235,
        "_routing" : "3",
        "_source" : {
          "content" : "good morning everyone"
        }
      }
    ]
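The scores in the two term-query outputs above can be reproduced by hand. The Python sketch below assumes the default BM25 similarity with k1 = 1.2 and b = 0.75, and a constant factor of k1 + 1 = 2.2 in the final score; these are assumptions about the cluster's configuration, not something the outputs state directly. The function `bm25` and the `doc_lengths` table are illustrative names.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults


def bm25(freq, doc_len, avg_len, doc_count, doc_freq):
    """BM25 score of one term in one document, given the statistics in scope."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq / (freq + K1 * (1 - B + B * doc_len / avg_len))
    return (K1 + 1) * idf * tf


# Field length (in terms) of each demo document; "good" occurs once in each.
doc_lengths = {"good": 1, "good morning": 2, "good morning everyone": 3}

# Shard-local view: each document is alone on its shard, so N = 1, df = 1,
# and the average field length equals the document's own length.
local = {text: bm25(1, n, n, 1, 1) for text, n in doc_lengths.items()}

# Global view (what DFS provides): N = 3, df = 3, average length = 2.
avg_len = sum(doc_lengths.values()) / len(doc_lengths)
dfs = {text: bm25(1, n, avg_len, 3, 3) for text, n in doc_lengths.items()}

for text in doc_lengths:
    print(f"{text!r}: local={local[text]:.6f} dfs={dfs[text]:.6f}")
```

With shard-local statistics every document sees the same inputs, so all three scores collapse to about 0.287682. With global statistics the shorter field wins, giving roughly 0.167868, 0.133531, and 0.110856, which matches the DFS ordering shown above to within floating-point rounding.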
