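Elasticsearch Analyzers
Notes on Elasticsearch analyzers, covering the default analyzer and custom analyzers. The material comes from the core-knowledge part of 中华石杉's Elasticsearch expert course series on Bilibili (B 站); the English excerpts come from Elasticsearch: The Definitive Guide [2.x].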
Default Analyzer
The standard analyzer, which is the default analyzer used for full-text fields, is a good choice for most Western languages. It consists of the following:
standard tokenizer: splits the input text on word boundaries
standard token filter: intended to tidy up the tokens emitted by the tokenizer (but currently does nothing)
lowercase token filter: converts all tokens into lowercase
stop token filter: removes stopwords, common words that have little impact on search relevance, such as a, the, and, is
By default, the stopwords filter is disabled.
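The chain described above can be approximated locally. This is only a rough sketch: the regular expression stands in for the real standard tokenizer (which follows Unicode word-boundary rules), and `ENGLISH_STOPWORDS` and `standard_analyze` are hypothetical names with a deliberately tiny stopword list, not anything taken from Elasticsearch.

```python
import re

# Tiny illustrative stopword list (an assumption); the real English list is longer.
ENGLISH_STOPWORDS = {"a", "the", "and", "is", "it"}


def standard_analyze(text, stopwords=None):
    """Approximate the standard analyzer: tokenize, lowercase, optionally drop stopwords."""
    # Stand-in for the standard tokenizer: runs of word characters.
    tokens = re.findall(r"\w+", text)
    # lowercase token filter
    tokens = [t.lower() for t in tokens]
    # stop token filter, applied only when a stopword list is supplied (disabled by default)
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


print(standard_analyze("The QUICK brown fox is fast"))
# ['the', 'quick', 'brown', 'fox', 'is', 'fast']
print(standard_analyze("The QUICK brown fox is fast", ENGLISH_STOPWORDS))
# ['quick', 'brown', 'fox', 'fast']
```

With no stopword list the words "the" and "is" survive, which mirrors the default behavior; passing a list removes them.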
Modifying the Analyzer Settings
Create a new analyzer called es_std, based on the standard analyzer, that enables the predefined Spanish stopword list.
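A minimal sketch of the index-creation body for this step, written as a Python dictionary so it can be printed as JSON and sent with any HTTP client. The layout follows the example in the Definitive Guide as I recall it; the variable name `spanish_docs_settings` is hypothetical.

```python
import json

# Body for: PUT /spanish_docs
spanish_docs_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type": "standard",        # start from the standard analyzer
                    "stopwords": "_spanish_",  # enable the predefined Spanish stopword list
                }
            }
        }
    }
}

print(json.dumps(spanish_docs_settings, indent=2))
```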
The es_std analyzer is not global; it exists only in the spanish_docs index where we have defined it. To test it with the analyze API, the index name must be specified.
The abbreviated results show that the Spanish stopword El has been removed correctly.
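A sketch of the analyze request that exercises es_std against the index. The sample sentence is the one I recall from the Definitive Guide and is an assumption here. The JSON-body form shown is what recent Elasticsearch versions accept; the 2.x guide passed the analyzer as a query-string parameter instead.

```python
import json

# Path and body for: GET /spanish_docs/_analyze
# The index name is part of the path because es_std is defined only on that index.
analyze_path = "/spanish_docs/_analyze"
analyze_body = {
    "analyzer": "es_std",
    "text": "El veloz zorro marrón",  # sample sentence (assumed, from the guide)
}

print("GET", analyze_path)
print(json.dumps(analyze_body, ensure_ascii=False, indent=2))
```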
Custom Analyzers
Character filters: An analyzer may have zero or more character filters. They tidy up a string before it is tokenized. For example, the html_strip character filter removes all HTML tags and converts HTML entities like &Aacute; into the corresponding Unicode character Á.
Tokenizers: An analyzer must have a single tokenizer. The tokenizer breaks up the string into individual terms or tokens.
standard tokenizer
keyword tokenizer: outputs exactly the same string as it received, without any tokenization
whitespace tokenizer
pattern tokenizer: splits text on a matching regular expression
Token filters: After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified. Token filters may change, add, or remove tokens.
lowercase
stop token filters
Stemming token filters: "stem" words to their root form
asciifolding filter: removes diacritics, converting a term like "très" into "tres"
ngram and edge_ngram token filters
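The three stages above run in a fixed order: character filters, then one tokenizer, then token filters. The sketch below imitates that order in plain Python. Every function is a hypothetical stand-in for the similarly named Elasticsearch component and only approximates its behavior (for instance, the real html_strip filter treats some tags specially).

```python
import html
import re
import unicodedata


def html_strip(text):
    """Character filter: drop tags, then decode entities such as &Aacute;."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace only."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    """Token filter: strip diacritics by removing combining marks after NFKD."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def edge_ngram(tokens, min_gram=1, max_gram=3):
    """Token filter: emit the prefixes of each token, min_gram to max_gram characters."""
    out = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            out.append(t[:n])
    return out


def analyze(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then the tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Tr&egrave;s BIEN</p>", [html_strip], whitespace_tokenizer, [lowercase, ascii_fold]))
# ['tres', 'bien']
print(analyze("Quick", [], whitespace_tokenizer, [lowercase, edge_ngram]))
# ['q', 'qu', 'qui']
```

Because token filters are applied in the order given, swapping lowercase and edge_ngram in the second call would produce capitalized prefixes that are lowercased afterwards; the order matters more once stop or stemming filters are involved.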
Custom Analyzer Syntax
Example:
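A sketch of a custom analyzer definition, modeled on the example in the Definitive Guide as I recall it. The names `my_index`, `&_to_and`, `my_stopwords`, and `my_analyzer` are illustrative assumptions, not fixed by Elasticsearch; the body is built as a Python dictionary so that it can be checked and printed as JSON.

```python
import json

# Body for: PUT /my_index
custom_settings = {
    "settings": {
        "analysis": {
            # custom character filter: rewrite "&" as " and "
            "char_filter": {
                "&_to_and": {"type": "mapping", "mappings": ["&=> and "]}
            },
            # custom token filter: a stop filter with its own word list
            "filter": {
                "my_stopwords": {"type": "stop", "stopwords": ["the", "a"]}
            },
            # the analyzer wires the pieces together in pipeline order
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip", "&_to_and"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stopwords"],
                }
            },
        }
    }
}

print(json.dumps(custom_settings, indent=2))
```

The analysis section has one key per component kind (char_filter, tokenizer, filter, analyzer); built-in components such as html_strip and lowercase are referenced by name without being declared.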
The analyzer is not much use unless we tell Elasticsearch where to use it.
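After a version upgrade, the 2.x-style mapping request fails. Two things need to change:
Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true. Fix: remove the type name from the mapping request path.
No handler for type [string] declared on field [title]. Fix: the string type no longer exists; use text for full-text fields.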
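The two fixes can be sketched as follows. The index, type, and analyzer names (`my_index`, `my_type`, `my_analyzer`) are assumed from the Definitive Guide example, and `upgrade_mapping` is a hypothetical helper that only covers the full-text case; a string field that was not analyzed would map to keyword instead.

```python
import json

# Old 2.x-style request that now fails:
#   PUT /my_index/_mapping/my_type
old_body = {"properties": {"title": {"type": "string", "analyzer": "my_analyzer"}}}


def upgrade_mapping(body):
    """Return a copy of a mapping body with 'string' fields switched to 'text'."""
    upgraded = {"properties": {}}
    for name, field in body["properties"].items():
        field = dict(field)  # copy so the original body is left untouched
        if field.get("type") == "string":
            field["type"] = "text"
        upgraded["properties"][name] = field
    return upgraded


new_path = "/my_index/_mapping"  # fix 1: no type name in the path
new_body = upgrade_mapping(old_body)  # fix 2: string -> text

print("PUT", new_path)
print(json.dumps(new_body, indent=2))
```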
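Copyright notice: This is an original article by InfoQ author escray.
Original link: http://xie.infoq.cn/article/e617dd4ec974f89d5853d0ff9
This article is licensed under CC-BY 4.0. When republishing, please keep the original source and this copyright notice.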