第六节 ElasticSearch Mapping 介绍

1、Dynamic Mapping和常见字段类型

本节知识点

Mapping&Dynamic Mapping
字段类型自动识别
控制Mapping的Dynamic属性

1-1 什么是Mapping

Mapping类似数据库中的schema的定义，作用如下
- 定义索引中的字段的名称
- 定义字段的数据类型，例如字符串，数字，布尔……
- 字段，倒排索引的相关配置，(Analyzed or Not Analyzed Analyzer)
Mapping会把JSON文档映射成Lucene所需要的扁平格式
一个Mapping属于一个索引的Type
- 每个文档都属于一个Type
- 一个Type有一个Mapping定义
- 7.0开始，不需要在Mapping定义中指定type信息

Mapping中的字段一旦设定后，禁止直接修改。因为倒排索引生成后不允许直接修改。需要重新建立新的索引，做reindex操作。

类似数据库中的表结构定义，主要作用

定义所以下的字段名字
定义字段的类型
定义倒排索引相关的配置（是否被索引？采用的Analyzer）

对新增字段的处理 true false strict

在object下，支持做dynamic的属性的定义

1-2 字段的数据类型

简单类型
- Text/Keyword
- Date
- Integer/Floating
- Boolean
- IPv4&IPv6
复杂类型一对象和嵌套对象
- 对象类型／嵌套类型
特殊类型
- geo_point&geo_shape/percolator

1-3 什么是Dynamic Mapping

在写入文档时候，如果索引不存在，会自动创建索引
Dynamic Mapping的机制，使得我们无需手动定义Mappings
Elasticsearch会自动根据文档信息，推算出字段的类型
但是有时候会推算的不对，例如地理位置信息
当类型如果设置不对时，会导致一些功能无法正常运行，例如 Range查询

Alt Image Text

1-4 类型的自动识别

Alt Image Text

Demo

写入文档，看Elasticsearch会如何推导数据类型
- 数字用引号，默认当下TEXT
- 日期格式会推导成Date
有一些类型会推导出错，例如地理位置信息。

写入文档，查看 Mapping

#写入文档，查看 Mapping
PUT mapping_test/_doc/1
{
  "firstName":"Chan",
  "lastName": "Jackie",
  "loginDate":"2018-07-24T10:29:48.103Z"
}

Output

{
  "_index" : "mapping_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

查看 Mapping文件

#查看 Mapping文件
GET mapping_test/_mapping

Output

{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "loginDate" : {
          "type" : "date"
        }
      }
    }
  }
}

Delete index

DELETE mapping_test

{
  "acknowledged" : true
}

dynamic mapping，推断字段的类型

PUT mapping_test/_doc/1
{
    "uid" : "123",
    "isVip" : false,
    "isAdmin": "true",
    "age":19,
    "heigh":180
}

{
  "_index" : "mapping_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

#查看 Dynamic
GET mapping_test/_mapping

{
  "mapping_test" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"
        },
        "heigh" : {
          "type" : "long"
        },
        "isAdmin" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "isVip" : {
          "type" : "boolean"
        },
        "uid" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

1-5 能否更改Mapping的字段类型

两种情况

新增加字段
- Dynamic设为true时，一旦有新增字段的文档写入，Mapping也同时被更新
- Dynamic设为false, Mapping不会被更新，新增字段的数据无法被索引 ,但是信息会出现在_source中
- Dynamic设置成Strict，文档写入失败
对已有字段，一旦已经有数据写入，就不再支持修改字段定义
- Lucene实现的倒排索引，一旦生成后，就不允许修改
如果希望改变字段类型，必须Reindex API，重建索引

原因

如果修改了字段的数据类型，会导致已被索引的属于无法被搜索
但是如果是增加新的字段，就不会有这样的影响

1-6 控制Dynamic Mappings

当dynamic被设置成false时候，存在新增字段的数据写入，该数据可以被索引但是新增字段被丢弃
当设置成Strict模式时候，数据写入直接出错

Alt Image Text

默认Mapping支持dynamic，写入的文档中加入新的字段

PUT dynamic_mapping_test/_doc/1
{
  "newField":"someValue"
}

GET dynamic_mapping_test/_mapping

{
  "dynamic_mapping_test" : {
    "mappings" : {
      "properties" : {
        "newField" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

该字段可以被搜索，数据也在`_source`中出现

POST dynamic_mapping_test/_search
{
  "query":{
    "match":{
      "newField":"someValue"
    }
  }
}

Output

"hits" : [
      {
        "_index" : "dynamic_mapping_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "newField" : "someValue"
        }
      }
    ]

修改为dynamic false

PUT dynamic_mapping_test/_mapping
{
  "dynamic": false
}

{
  "acknowledged" : true
}

新增 anotherField

PUT dynamic_mapping_test/_doc/10
{
  "anotherField":"someValue"
}

"dynamic": false: 可以插入，无法被搜索到

POST dynamic_mapping_test/_search
{
  "query":{
    "match":{
      "anotherField":"someValue"
    }
  }
}

该字段不可以被搜索，因为dynamic已经被设置为false

{
  "took" : 168,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

GET dynamic_mapping_test/_doc/10

{
  "_index" : "dynamic_mapping_test",
  "_type" : "_doc",
  "_id" : "10",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "anotherField" : "someValue"
  }
}

修改为strict

PUT dynamic_mapping_test/_mapping
{
  "dynamic": "strict"
}

{
  "acknowledged" : true
}

写入数据出错，HTTP Code 400

PUT dynamic_mapping_test/_doc/12
{
  "lastField":"value"
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "strict_dynamic_mapping_exception",
        "reason" : "mapping set to strict, dynamic introduction of [lastField] within [_doc] is not allowed"
      }
    ],
    "type" : "strict_dynamic_mapping_exception",
    "reason" : "mapping set to strict, dynamic introduction of [lastField] within [_doc] is not allowed"
  },
  "status" : 400
}

GET dynamic_mapping_test/_doc/12

{
  "_index" : "dynamic_mapping_test",
  "_type" : "_doc",
  "_id" : "12",
  "found" : false
}

2、显式Mapping设置与常见参数介绍

2-1 如何显示定义一个Mapping

Alt Image Text

2-2 自定义Mapping的一些建议

可以参考API手册，纯手写
为了减少输入的工作量，减少出错概率，可以依照以下步骤
- 创建一个临时的index，写入一些样本数据
- 通过访问Mapping API获得该临时文件的动态Mapping定义
- 修改后用，使用该配置创建你的索引
- 删除临时索引

#设置 Mobile index 为 false
DELETE users


PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "text",
          "index": false
        }
      }
    }
}

Output :

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "users"
}

PUT users/_doc/1
{
  "firstName":"Jacob",
  "lastName": "Xi",
  "mobile": "12345678"
}

POST /users/_search
{
  "query": {
    "match": {
      "mobile":"12345678"
    }
  }
}

Alt Image Text

2-3 Index Options

四种不同级别的Index Options配置，可以控制倒排索引记录的内容
- docs——记录doc id
- freqs——记录doc id和term frequencies
- positions一记录doc id / term frequencies / term position
- offsets一doc id/term frequencies/term posistion/character offects
Text类型默认记录postions，其他默认为docs
记录内容越多，占用存储空间越大

Alt Image Text

2-4 null_values

需要对Null值实现搜索
只有Keyword类型支持设定null_value

#设定Null_value
DELETE users


PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "keyword",
          "null_value": "NULL"
        }

      }
    }
}

Output

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "users"
}

PUT users/_doc/1
{
  "firstName":"Jacob",
  "lastName": "Xi",
  "mobile": null
}

PUT users/_doc/2
{
  "firstName":"Jacob2",
  "lastName": "xi2"

}

GET users/_search
{
  "query": {
    "match": {
      "mobile":"NULL"
    }
  }

}

GET users/_search?q=mobile:NULL

Alt Image Text

2-5 copy_to 设置

_all在7中被copy_to所替代
满足一些特定的搜索需求
copy_to将字段的数值拷贝到目标字段，实现类似_all的作用
copy_to的目标字段不出现在_source中

#设置 Copy to
DELETE users
PUT users
{
  "mappings": {
    "properties": {
      "firstName":{
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName":{
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

PUT users/_doc/1
{
  "firstName":"Jacob",
  "lastName": "Xi"
}

GET users/_search?q=fullName:(Jacob Xi)

POST users/_search
{
  "query": {
    "match": {
       "fullName":{
        "query": "Jacob Xi",
        "operator": "and"
      }
    }
  }
}

Alt Image Text

2-6 数组类型

Elasticsearch中不提供专门的数组类型。但是任何字段，都可以包含多个相同类类型的数值

#数组类型
PUT users/_doc/1
{
  "name":"onebird",
  "interests":"reading"
}

PUT users/_doc/1
{
  "name":"twobirds",
  "interests":["reading","music"]
}

POST users/_search
{
  "query": {
        "match_all": {}
    }
}

Alt Image Text

GET users/_mapping

{
  "users" : {
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text",
          "copy_to" : [
            "fullName"
          ]
        },
        "fullName" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "interests" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastName" : {
          "type" : "text",
          "copy_to" : [
            "fullName"
          ]
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

3、多字段特性及Mapping中配置自定义Analyzer

3-1 多字段类型

多字段特性
- 厂商名字实现精确匹配
  - 增加一个keyword字段
- 使用不同的analyzer
  - 不同语言
  - pinyin字段的搜索
  - 还支持为搜索和索引指定不同的analyzer

Alt Image Text

3-2 Exact Values v.s Full Text

Exact Values v.s Full Text
- Exact Value:包括数字／日期／具体一个字符串（例如"Apple Store")
  - Elasticseach中的keyword
- 全文本，非结构化的文本数据
  - Elasticsearch中的text

Alt Image Text

3-3 Exact Values不需要被分词

Elasticsearch为每一个字段创建一个倒排索引
- Exact Value在索引时，不需要做特殊的分词处理

Alt Image Text

PUT logs/_doc/1
{"level":"DEBUG"}

GET /logs/_mapping

Output

{
  "logs" : {
    "mappings" : {
      "properties" : {
        "level" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

3-4 自定义分词

当Elasticsearch自带的分词器无法满足时，可以自定义分词器。通过自组合不同的组件实现
- Character Filter
- Tokenizer
- Token Filter

3-5 Character Filters

在Tokenizer之前对文本进行处理，例如增加删除及替换字符。可以配置多个Character Filters。会影响Tokenizer的position和offset信息
一些自带的Character Filters
- HTML strip —— 去除html标签
- Mapping —— 字符串替换
- Pattern replace —— 正则匹配替换

3-5-1 HTML strip —— 去除html标签

POST _analyze
{
  "tokenizer":"keyword",
  "char_filter":["html_strip"],
  "text": "<b>hello world</b>"
}

{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

3-5-2 Mapping —— 字符串替换

#使用char filter进行替换
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

Output :

{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}

//char filter 替换表情符号
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ ":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
}

Output :

...
 {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 105
    },
 ...

felling \ happy \ Feeling \ sad

3-5-3 Pattern replace —— 正则匹配替换

//正则表达式
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "pattern_replace",
        "pattern" : "http://(.*)",
        "replacement" : "$1"
      }
    ],
    "text" : "http://www.elastic.co"
}

Output

{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

3-6 Tokenizer

将原始的文本按照一定的规则，切分为词（term or token)
Elasticsearch内置的Tokenizers
- whitespace/standard/uax_url_email/pattern/keyword/path hierarchy
可以用Java开发插件，实现自己的Tokenizer

3-6-1 whitespace

// white space and snowball
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "gilr",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
...

The \ gilr \ China \ play \ game!

// whitespace与stop
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

The \ rain \ Spain \ falls \ mainly \ plain

3-6-2 `path hierarchy`

POST _analyze
{
  "tokenizer":"path_hierarchy",
  "text":"/user/ymruan/a/b/c/d/e"
}

Alt Image Text

3-7 Token Filters

将Tokenizer输出的单词（term)，进行增加，修改，删除
自带的Token Filters
- Lowercase/stop/synonym（添加近义词）

//remove 加入lowercase后，The被当成 stopword删除
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}

gilr \ china \ play \ game

3-8 设置一个Custom Analyzer

Alt Image Text

第六节 ElasticSearch Mapping 介绍

1、Dynamic Mapping和常见字段类型

1-1 什么是Mapping

1-2 字段的数据类型

1-3 什么是Dynamic Mapping

1-4 类型的自动识别

写入文档，查看 Mapping

查看 Mapping文件

Delete index

dynamic mapping，推断字段的类型

1-5 能否更改Mapping的字段类型

1-6 控制Dynamic Mappings

默认Mapping支持dynamic，写入的文档中加入新的字段

该字段可以被搜索，数据也在_source中出现

修改为dynamic false

新增 anotherField

该字段不可以被搜索，因为dynamic已经被设置为false

修改为strict

写入数据出错，HTTP Code 400

2、显式Mapping设置与常见参数介绍

2-1 如何显示定义一个Mapping

2-2 自定义Mapping的一些建议

2-3 Index Options

2-4 null_values

2-5 copy_to 设置

2-6 数组类型

3、多字段特性及Mapping中配置自定义Analyzer

3-1 多字段类型

3-2 Exact Values v.s Full Text

3-3 Exact Values不需要被分词

3-4 自定义分词

3-5 Character Filters

3-5-1 HTML strip —— 去除html标签

3-5-2 Mapping —— 字符串替换

3-5-3 Pattern replace —— 正则匹配替换

3-6 Tokenizer

3-6-1 whitespace

3-6-2 path hierarchy

3-7 Token Filters

3-8 设置一个Custom Analyzer

该字段可以被搜索，数据也在`_source`中出现

3-6-2 `path hierarchy`