
Elasticsearch Analyzers

escray · Published 2021-02-12

Notes on Elasticsearch analyzers. The Chinese content comes from the 中华石杉 advanced Elasticsearch course on Bilibili; the English content comes from the official documentation.

What is an analyzer (tokenization)?


Word segmentation plus normalization (which improves recall): given a sentence, the analyzer splits it into individual words and normalizes each one (tense reduction, singular/plural conversion, and so on).


Recall: at search time, increasing the number of relevant results that can be found.
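As a quick illustration of how normalization helps recall, the built-in english analyzer reduces "Dogs" to "dog", so a query for either form matches the same term in the index. A minimal sketch using the real _analyze API:

```
GET _analyze
{
  "analyzer": "english",
  "text": "Dogs"
}
```

The response should contain a single token, dog, which is what actually lands in the inverted index.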


character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and mapping characters such as & --> and (I&you --> I and you).
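Both preprocessing steps correspond to real character filters: html_strip removes tags, and a mapping char filter can rewrite & to and. A sketch with _analyze (the mapping rule and sample text are illustrative, not from the course):

```
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    "html_strip",
    { "type": "mapping", "mappings": [ "& => and" ] }
  ],
  "text": "<span>hello</span> & goodbye"
}
```

This should produce the tokens hello, and, goodbye.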


tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me
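The same split is easy to reproduce with the standard tokenizer (a sketch):

```
GET _analyze
{
  "tokenizer": "standard",
  "text": "hello you and me"
}
```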


token filter: lowercase, stop word, synonym, stemming, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> dropped, mother --> mom, small --> little
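Most of these transformations map onto built-in token filters; lowercase, stop, and stemmer all ship with Elasticsearch (the mother --> mom and small --> little mappings would need a configured synonym filter, omitted here). A sketch:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "stemmer" ],
  "text": "Tom liked the dogs"
}
```

This should come back as tom, like, dog: lowercase handles Tom --> tom, stop drops the, and stemmer reduces liked and dogs.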


The analyzer matters: it runs a piece of text through all of these stages, and only the final output is used to build the inverted index.
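Putting the three stages together, here is a sketch of a custom analyzer wired into an index mapping, using ES 7+ syntax; my_index, my_analyzer, and the body field are made-up names for illustration:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

Any text indexed into body is run through this pipeline before the resulting terms are written to the inverted index.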

Introduction to the built-in analyzers


Set the shape to semi-transparent by calling set_trans(5)


standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)


simple analyzer:set, the, shape, to, semi, transparent, by, calling, set, trans


whitespace analyzer:Set, the, shape, to, semi-transparent, by, calling, set_trans(5)


language analyzer (a language-specific analyzer, e.g. english for English): set, shape, semi, transpar, call, set_tran, 5
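These outputs are easy to verify with the _analyze API; swap the analyzer name among standard, simple, whitespace, and english to reproduce each line above (a sketch, assuming a reasonably recent Elasticsearch):

```
GET _analyze
{
  "analyzer": "english",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}
```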

Text analysis


Text analysis is the process of converting unstructured text, like the body of an email or a product description, into a structured format that's optimized for search.


Elasticsearch performs text analysis when indexing or searching text fields.
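Because analysis runs at both index time and search time, a text field can even specify a different analyzer for each; search_analyzer is a real mapping parameter, while my_index_2 and title are illustrative names:

```
PUT my_index_2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}
```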


Tokenization: breaking a text down into smaller chunks, called tokens.


An analyzer is a package of three building blocks: character filters, a tokenizer, and token filters.


  • Character filters: a character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. An analyzer may have zero or more character filters, which are applied in order.

  • Tokenizer: A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. An analyzer must have exactly one tokenizer.

  • Token filters: a token filter receives the token stream and may add, remove, or change tokens (e.g. lowercase, stop, synonym). An analyzer may have zero or more token filters, which are applied in order.
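The zero-or-more / exactly-one rule shows up directly in the _analyze API, which lets you compose an ad-hoc analyzer without creating an index (a sketch; the component choices here are arbitrary):

```
GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "<b>Quick</b> Brown FOXES"
}
```

This should yield quick, brown, foxes.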


Stemming is the process of reducing a word to its root form... It is language-dependent but often involves removing prefixes and suffixes from words.
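A sketch of stemming in isolation, using the built-in stemmer token filter with its english language option (truncated outputs like quickli are a known quirk of Porter-style stemming, not typos):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stemmer", "language": "english" }
  ],
  "text": "the foxes jumping quickly"
}
```

Expected tokens: the, fox, jump, quickli.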


Chinese does not seem to have concepts like word roots, prefixes, or suffixes.


By default, Elasticsearch uses the standard analyzer for all text analysis.
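Consistent with that default, calling _analyze without naming an analyzer falls back to standard (a sketch):

```
GET _analyze
{
  "text": "The 2 QUICK Brown-Foxes"
}
```

Expected tokens: the, 2, quick, brown, foxes.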

