
Elasticsearch Analyzers

escray
Published on: March 5, 2021

Notes on Elasticsearch analyzers, covering the default analyzer and custom analyzers. The material comes from the core-knowledge part of 中华石杉's "Elasticsearch 顶尖高手系列" course on Bilibili; the English excerpts come from Elasticsearch: The Definitive Guide [2.x].

The default analyzer


The standard analyzer, which is the default analyzer used for full-text fields, is a good choice for most Western languages. It consists of the following:


  • standard tokenizer: splits the input text on word boundaries

  • standard token filter: intended to tidy up the tokens emitted by the tokenizer (but currently does nothing)

  • lowercase token filter: converts all tokens into lowercase

  • stop token filter (disabled by default): removes stopwords - common words that have little impact on search relevance, such as a, the, and, is


By default, the stopwords filter is disabled.
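
A quick way to see these steps in action is to run a sample sentence through the standard analyzer with the _analyze API (a minimal sketch; the sample sentence is just an illustration):

GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes jumped."
}

This returns the tokens the, quick, brown, foxes, jumped: split on word boundaries, lowercased, punctuation dropped, and the stopword the kept because the stop filter is disabled.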


Changing the analyzer settings


This creates a new analyzer called es_std, which enables the Spanish stopwords token filter:


PUT /spanish_docs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_spanish_"
        }
      }
    }
  }
}


The es_std analyzer is not global - it exists only in the spanish_docs index where we have defined it.


The abbreviated results show that the Spanish stopword El has been removed correctly:


GET /spanish_docs/_analyze
{
  "analyzer": "es_std",
  "text": "El veloz zorro marrón"
}

{
  "tokens": [
    { "token": "veloz",  "position": 1 },
    { "token": "zorro",  "position": 2 },
    { "token": "marrón", "position": 3 }
  ]
}
Custom Analyzers


  • Character filters: An analyzer may have zero or more character filters, which tidy up a string before it is tokenized. For example, the html_strip character filter can be used to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á. (A combined _analyze sketch follows this list.)


  • Tokenizers: An analyzer must have a single tokenizer. The Tokenizer breaks up the string into individual terms or tokens.

  • standard tokenizer

  • keyword tokenizer: outputs exactly the same string as it received, without any tokenization

  • whitespace tokenizer

  • the pattern tokenizer: splits text on a matching regular expression


  • Token filters: After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified. Token filters may change, add, or remove tokens.

  • lowercase

  • stop token filters

  • Stemming token filters: "stem" words to their root form

  • asciifolding filter: removes diacritics, converting a term like "très" into "tres"

  • ngram

  • edge_ngram token filters
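
All three building blocks can be tried out together against the _analyze API without defining an index first (a minimal sketch; the sample HTML snippet and filter choices are just an illustration):

GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "<p>Déjà vu, the <b>Quick</b> Fox</p>"
}

The html_strip character filter removes the tags, the standard tokenizer splits the remaining text into words, and the lowercase and asciifolding token filters turn them into deja, vu, the, quick, fox.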


Custom analyzer syntax


PUT /my_index{  "settings": {    "analysis": {      "char_filter": { ... custom character filters ...},      "tokenizer": { ... custom tokenizers ...},      "filter": { ... custom token filters ...},      "analyzer": { ... custom analyzers ... }    }  }}


Example:


PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and " ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [ "the", "a" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "The quick & brown fox",
  "analyzer": "my_analyzer"
}

{
  "tokens": [
    { "token": "quick", "position": 1 },
    { "token": "and",   "position": 2 },
    { "token": "brown", "position": 3 },
    { "token": "fox",   "position": 4 }
  ]
}


The analyzer is not much use unless we tell Elasticsearch where to use it.


PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}


Because of version upgrades, this request now fails. Two things need to change:


  • Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true.

  • No handler for type [string] declared on field [title]: the string field type no longer exists; it was replaced by text (and keyword).


PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
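
Once the mapping is in place, the field can be checked directly with the _analyze API to confirm it picks up the custom analyzer (a quick sanity check; the sample text is arbitrary):

GET /my_index/_analyze
{
  "field": "title",
  "text": "The quick & brown fox"
}

Analyzing against the title field uses my_analyzer and should return the same quick, and, brown, fox tokens as before.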

