13.1.6 Pattern Analyzer
An analyzer of type pattern that flexibly splits text into terms via a regular expression. It accepts the following settings:

Setting | Description |
---|---|
lowercase | Whether terms should be lowercased. Defaults to true. |
pattern | The regular expression used to split the text. Defaults to \W+. |
flags | The regular expression flags. |
stopwords | A list of stopwords to initialize the stop filter with. Defaults to an empty list. See the Stop Analyzer for more details. |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS". See the Java Pattern API for more details about the available flags.
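None of the examples below use the flags or stopwords settings. As a hedged sketch of how they might be combined (the analyzer name, pattern, flag, and stopword list here are purely illustrative, not taken from the original documentation), an analyzer that splits on the word "and" regardless of case and then drops a custom stopword could look like this:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "and_splitter": {
          "type": "pattern",
          "pattern": "\\s*and\\s*",
          "flags": "CASE_INSENSITIVE",
          "lowercase": true,
          "stopwords": ["spam"]
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=and_splitter&text=Ham AND spam and eggs
# "ham", "eggs"

The CASE_INSENSITIVE flag lets both "AND" and "and" act as separators, the resulting tokens are lowercased, and "spam" is then removed by the stop filter.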
Pattern Analyzer Examples
In order to try out these examples, you should delete the test index before running each example.
Whitespace tokenizer
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}
GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
Non-word character tokenizer
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+"
        }
      }
    }
  }
}
GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"
GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
CamelCase tokenizer
DELETE test
PUT /test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}
GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
The regex above is easier to understand as:
  ([^\p{L}\d]+)                  # swallow non letters and numbers,
| (?<=\D)(?=\d)                  # or non-number followed by number,
| (?<=\d)(?=\D)                  # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])     # or lower case
  (?=\p{Lu})                     #   followed by upper case,
| (?<=\p{Lu})                    # or upper case
  (?=\p{Lu}                      #   followed by upper case
    [\p{L}&&[^\p{Lu}]]           #   then lower case
  )
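As one more hedged illustration (the input string is my own, not from the original documentation), applying the same camel analyzer to another mixed-case identifier splits at the lower-to-upper, upper-run-to-word, and letter-to-digit boundaries described above:

GET /test/_analyze?analyzer=camel&text=MyHTTPServer2
# "my","http","server","2"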