1、默认的分词器
standard
standard tokenizer:以单词为边界进行切分
standard token filter:什么都不做
lowercase token filter:将所有字母转换为小写
stop token filter(默认被禁用):移除停用词,比如a an the it等等
2、修改分词器的设置
需求:上面说了stop token filter默认被禁用。 现在需要启用english的停用词token filter
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"es_std" : {
"type" : "standard",
"stopwords" : "_english_"
}
}
}
}
}
测试分词
(1)先用标准的standard分词查看结果
GET /my_index/_analyze
{
"analyzer": "standard",
"text": "a dog is in the house"
}
返回结果
{
"tokens": [
{
"token": "a",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dog",
"start_offset": 2,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "is",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "in",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "the",
"start_offset": 12,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "house",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 5
}
]
}
结果表明没有自动去掉a,the,in等词项。
(2)用我们修改过的分词器
GET /my_index/_analyze
{
"analyzer": "es_std",
"text": "a dog is in the house"
}
结果
{
"tokens": [
{
"token": "dog",
"start_offset": 2,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "house",
"start_offset": 16,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 5
}
]
}
结果表明已经去掉了a the in这些。
3、定制化自己的分词器
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": ["&=> and"]
}
},
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": ["the", "a"]
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": ["html_strip", "&_to_and"],
"tokenizer": "standard",
"filter": ["lowercase", "my_stopwords"]
}
}
}
}
}
测试
GET /my_index/_analyze
{
"text": "tom&jerry are a friend in the house, <a>, HAHA!!",
"analyzer": "my_analyzer"
}
结果
{
"tokens": [
{
"token": "tomandjerry",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "are",
"start_offset": 10,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "friend",
"start_offset": 16,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "in",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "house",
"start_offset": 30,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "haha",
"start_offset": 42,
"end_offset": 46,
"type": "<ALPHANUM>",
"position": 7
}
]
}
可以发现自动将&转换成了and,自动去掉了are a等
可以为我们定制的分词器添加field
PUT /my_index/_mapping/my_type
{
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
若有兴趣,欢迎来加入群,【Java初学者学习交流群】:458430385,此群有Java开发人员、UI设计人员和前端工程师。有问必答,共同探讨学习,一起进步!
欢迎关注我的微信公众号【Java码农社区】,会定时推送各种干货:
qrcode_for_gh_577b64e73701_258.jpg