User Story: As a searcher, I expect tokens in alphabetic scripts to be split on whitespace or punctuation, not internally, and especially not dependent on the context of other possibly distant or unrelated tokens.
The icu_tokenizer splits tokens on "script changes", but oddly this can span many tokens. For example, Д12345 67890 98765 43210 3x will split between the 3 and x because of the Д (numbers belong to no particular script, so they inherit the script that precedes them, even across spaces, which really looks like an error). This effect can extend across long spans of text and across sentence boundaries (e.g., "... for 'peace' is 'мир'. 3a illustrates ..." will split 3a into 3 + a because of мир).
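For illustration, here is a minimal, hypothetical sketch (assuming Lucene's analysis-icu module, which backs the icu_tokenizer, is on the classpath; the class name and setup are made up for the demo) that runs the underlying ICUTokenizer directly on the example above and prints each token with the script it was assigned, which makes the unexpected 3 + x split easy to see:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class IcuSplitDemo {
    public static void main(String[] args) throws Exception {
        // Example string from this task: the leading Д makes the digit runs
        // "inherit" Cyrillic, which later forces a split between 3 and x.
        ICUTokenizer tokenizer = new ICUTokenizer();
        tokenizer.setReader(new StringReader("Д12345 67890 98765 43210 3x"));

        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        ScriptAttribute script = tokenizer.addAttribute(ScriptAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.printf("%-10s script=%-10s offsets=%d-%d%n",
                    term, script.getName(), offset.startOffset(), offset.endOffset());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```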
Genuine multi-script tokens like NGiИX are also split (in this case, into NGi + И + X). Homoglyph tokens (e.g., chocоlate, where the second о is Cyrillic) are also split up.
In the case of genuine multi-script tokens, this impairs precision, because it's possible to match the pieces when they are not part of the whole (NGi + И + X is not so likely, but other cases are more likely to produce false positives). Homoglyph tokens cannot be repaired (choc, о, and late are all separate tokens and not detectable by the homoglyph plugin), impairing recall. In the case of number+letter tokens (like 3x above), recall is also impaired because the parts do not match the whole (3x in a query does not match 3 + x in an article).
The standard tokenizer doesn't do this (it also doesn't handle many Asian scripts well at all, which is why we have put up with the icu_tokenizer's idiosyncrasies).
This is a known problem for mixes of Latin, Greek, Cyrillic, and numbers, though other script combinations should be investigated, too.
Fixing these splits also requires updating all the relevant bookkeeping information (offsets, position increments, token types), so that proximity searches and ranking are not adversely affected. The cjk_bigram filter can serve as a model (it combines and splits tokens based on script and offsets).
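As a very rough sketch of the repair idea (not the actual icu_token_repair design; AdjacentTokenJoinFilter is a made-up name), here is a minimal Lucene TokenFilter that re-joins tokens whose offsets are directly adjacent in the original text, i.e., where the tokenizer split without consuming any character in between, while keeping offsets and position increments consistent. A real implementation would additionally check token types and scripts, respect configurable limits, and only rejoin splits the ICU tokenizer itself caused:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * Sketch only: re-joins tokens whose offsets are directly adjacent in the
 * original text. Ignores token types, scripts, and keyword flags, so it is
 * far more aggressive than a real icu_token_repair would be.
 */
public final class AdjacentTokenJoinFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    private AttributeSource.State pending;  // buffered look-ahead token, if any
    private boolean exhausted;

    public AdjacentTokenJoinFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Load the first piece of the (possibly merged) output token.
        if (pending != null) {
            restoreState(pending);
            pending = null;
        } else if (exhausted || !input.incrementToken()) {
            return false;
        }

        StringBuilder term = new StringBuilder(termAtt.toString());
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        int posInc = posIncAtt.getPositionIncrement();

        // Keep absorbing following tokens while they start exactly where the
        // previous one ended (no whitespace/punctuation was skipped between).
        while (true) {
            if (!input.incrementToken()) {
                exhausted = true;
                break;
            }
            if (offsetAtt.startOffset() == end) {
                term.append(termAtt.toString());
                end = offsetAtt.endOffset();
            } else {
                pending = captureState();   // not adjacent; emit it on the next call
                break;
            }
        }

        termAtt.setEmpty().append(term);
        offsetAtt.setOffset(start, end);
        posIncAtt.setPositionIncrement(posInc);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
        exhausted = false;
    }
}
```

With something along these lines in place, 3x would be emitted as a single token again, while 12345 and 67890 would stay separate because a space sits between them (their offsets are not adjacent).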
Acceptance criteria:
- Create a plugin with an icu_token_repair filter that detects inappropriately split tokens and correctly repairs them.
- Update AnalysisConfigBuilder to automatically include icu_token_repair when the icu_tokenizer is used and the plugin is available (moved to T356643; a hypothetical wiring sketch follows below).
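For context, here is a hypothetical sketch of where such a filter would sit in the analysis chain, using the AdjacentTokenJoinFilter sketch above as a stand-in for icu_token_repair (RepairedIcuAnalyzer is a made-up name); the intent is that AnalysisConfigBuilder wires the repair filter in immediately after the icu_tokenizer in each affected analyzer:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Hypothetical analyzer wiring: the repair filter runs directly on the
// tokenizer output, before any other token filters, so downstream filters
// (lowercasing, stemming, the homoglyph plugin, etc.) see whole tokens.
public class RepairedIcuAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new ICUTokenizer();
        TokenStream repaired = new AdjacentTokenJoinFilter(source);
        return new TokenStreamComponents(source, repaired);
    }
}
```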
Related Objects:
| Status | Assigned | Task |
|---|---|---|
| Resolved | TJones | T219550 [EPIC] Harmonize language analysis across languages |
| Resolved | TJones | T332337 Repair multi-script tokens split by the ICU tokenizer |
| Resolved | RKemper | T356651 Rebuild and deploy textify plugin |
| Resolved | TJones | T356643 Enable icu_tokenizer (almost) everywhere and update AnalysisConfigBuilder to use icu_token_repair |
| Resolved | EBernhardson | T342444 Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair |
| Resolved | TJones | T359100 Analyze results of harmonization |