Description

The hindi_normalization filter is present in our analysis chain because it is part of Elastic's monolithic Hindi analyzer, and it has been installed for a long time, whether as part of the monolithic analyzer or the unpacked one. However, it strips viramas, which seems suboptimal:

When unpacking Hindi (written in Devanagari), ICU folding did not require any exceptions (i.e., enabling ICU folding did not seem to have any effect on Hindi tokens). At the time, I didn't think anything of it. Here, every Indic script seems to need at least an exception for its virama; in fact, Marathi, which is also written in Devanagari, needs an exception for the virama. It turns out that the hindi_normalization filter, provided by Lucene via Elastic, strips viramas as part of a larger set of normalizations. This seems contrary to my understanding of Indic scripts in general, and to the advice I got from Santhosh on viramas. It is also reminiscent of the overly aggressive normalization done by the bengali_normalization filter, which we disabled. This needs to be investigated and at least briefly discussed with Hindi-speaking wiki users.
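For a concrete check, here is a minimal sketch (assuming a local dev Elasticsearch instance at localhost:9200 with the built-in hindi_normalization token filter) that runs a virama-bearing word through the filter via the _analyze API and prints the code points of the output token:

```python
import json
import urllib.request

# "दिल्ली" (Delhi) contains U+094D DEVANAGARI SIGN VIRAMA between the two LAs.
body = json.dumps({
    "tokenizer": "standard",
    "filter": ["hindi_normalization"],
    "text": "दिल्ली",
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:9200/_analyze",  # assumed local dev instance
    data=body,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    tokens = json.loads(resp.read())["tokens"]

for t in tokens:
    # If the filter strips the virama, U+094D will be absent from the output.
    print(t["token"], ["U+%04X" % ord(c) for c in t["token"]])
```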

We should look at its other mappings, too, and decide whether any of them are also suboptimal.
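One way to survey those mappings empirically is to run Devanagari words through the filter and report any that change. A sketch, again assuming a local instance; the sample words are illustrative only, and a real survey would use tokens from a Hindi wiki dump:

```python
import json
import urllib.request

ES_ANALYZE = "http://localhost:9200/_analyze"  # assumed local dev instance

def analyze(text):
    """Run `text` through hindi_normalization and return the output tokens."""
    body = json.dumps({
        "tokenizer": "standard",
        "filter": ["hindi_normalization"],
        "text": text,
    }).encode("utf-8")
    req = urllib.request.Request(ES_ANALYZE, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.loads(resp.read())["tokens"]]

# Illustrative sample: nukta forms (क़िला, ज़रूरी) and viramas (विद्या, कृष्ण).
samples = ["क़िला", "ज़रूरी", "विद्या", "कृष्ण"]
for word in samples:
    normalized = analyze(word)
    if normalized != [word]:
        print(f"{word} -> {normalized}")
```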

We could create mapping filters (slower, but no extra dependencies) or a plugin (more complex, and adds a plugin dependency) to provide the best subset of the mappings, if needed.
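For the mapping-filter route, here is a sketch of what a partial replacement could look like: a `mapping` char filter that keeps some nukta normalizations but deliberately omits virama removal. The index name, analyzer name, and the specific mappings are illustrative, not a vetted subset:

```python
import json
import urllib.request

# Replicate a subset of hindi_normalization with a `mapping` char filter.
# The mappings shown (nukta consonants -> base consonants) are examples only;
# there is intentionally no mapping for U+094D (virama), so it is preserved.
settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "hindi_norm_subset": {
                    "type": "mapping",
                    "mappings": [
                        "\u0958=>\u0915",  # क़ => क (qa => ka)
                        "\u095B=>\u091C",  # ज़ => ज (za => ja)
                        "\u095E=>\u092B",  # फ़ => फ (fa => pha)
                    ],
                },
            },
            "analyzer": {
                "text_hindi": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["hindi_norm_subset"],
                    "filter": ["lowercase"],
                },
            },
        }
    }
}

req = urllib.request.Request(
    "http://localhost:9200/hindi_test",  # hypothetical test index
    data=json.dumps(settings).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```

A plugin version would instead port the pieces of Lucene's normalizer we want to keep and skip the virama case, at the cost of the extra dependency noted above.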