User:TJones (WMF)/Notes/Normalization for Arabic Script Across Languages

April 2024 — See TJones_(WMF)/Notes for other projects. See also T72899. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

The Phab ticket is from 2014 (hence only having a 5-digit number!), and some things have changed since it was opened—like our migration to Elasticsearch (huge!) and generally improved normalization across languages (comparatively small), but let's see where we are now....

As noted in the ticket, different languages that use the Arabic script have different preferred versions of the Arabic letters kaf, yeh, and heh.

  • kaf: ك Arabic U+0643; ڪ Urdu U+06AA; ﻙ Pashto U+FED9; ﻚ Uyghur U+FEDA; ک Persian U+06A9
  • yeh: ي Arabic U+064A; ی Persian U+06CC; ى Urdu U+0649; ۍ Pashto U+06CD; ې Uyghur U+06D0
  • heh: ہ Pashto U+06C1; ە Kurdish U+06D5; ه Persian U+0647

Depending on their position in a word, these characters can look identical, similar, or distinctly different. Many of the letters also have separate Unicode code points for their isolated, initial, medial, and final forms. These presentation forms are often used to show the shape the main character takes in a given context when it appears outside that context, though sometimes they are also used as-is, especially in older texts from the stone age of computing, when font support was not what it is today. On the other hand, the Uyghur kaf is the Unicode final form of kaf, presumably so that it doesn't vary its shape by context.

Existing Analyzers

Arabic (ar), Moroccan Arabic (ary), Egyptian Arabic (arz), Persian (fa), and Sorani (ckb) all have existing language-specific analyzers, which include various normalization filters. Some components of these analyzers follow a reliable pattern:

  • lowercase (upgraded to icu_normalizer) → decimal_digit → stop word filtering → stemmer → icu_folding.
    • decimal_digit normalizes lots of digit forms, including Arabic-Indic digits (١٢٣ → 123)
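
As a rough sketch, that shared skeleton corresponds to something like the following Elasticsearch settings. (The analyzer and filter names here are made up, the real per-language configs differ in their details, and this just shows the shape of the chain, using Arabic as the example.)

  {
    "analysis": {
      "analyzer": {
        "arabic_like_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "icu_normalizer",
            "decimal_digit",
            "arabic_like_stop",
            "arabic_like_stemmer",
            "icu_folding"
          ]
        }
      },
      "filter": {
        "arabic_like_stop": { "type": "stop", "stopwords": "_arabic_" },
        "arabic_like_stemmer": { "type": "stemmer", "language": "arabic" }
      }
    }
  }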

The language-specific normalization steps are, literally, all over the place:

  • The Arabic varieties add arabic_normalization after stop word filtering.
  • Persian uses both arabic_normalization and persian_normalization, after decimal_digit and before stop word filtering.
    • Also, Persian doesn't have a stemmer!
  • Sorani uses sorani_normalization before lowercasing/icu_normalizer.
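
For instance, the Persian chain corresponds to something like this sketch (the analyzer and stop filter names are made up; note the lack of a stemmer):

  {
    "analysis": {
      "analyzer": {
        "persian_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "icu_normalizer",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop_filter",
            "icu_folding"
          ]
        }
      },
      "filter": {
        "persian_stop_filter": { "type": "stop", "stopwords": "_persian_" }
      }
    }
  }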

They also map different characters differently. Of the relevant characters above:

  • arabic_normalization does very little here, converting only Urdu U+0649 ى to Arabic U+064A ي.
  • persian_normalization converts Persian U+06A9 ک to Arabic U+0643 ك, Persian U+06CC ی to Arabic U+064A ي, and Pashto U+06C1 ہ to Persian U+0647 ه.
  • sorani_normalization prefers Persian forms more than persian_normalization does, converting Arabic U+0643 ك to Persian U+06A9 ک, Arabic U+064A ي and Urdu U+0649 ى both to Persian U+06CC ی, and Persian U+0647 ه to Kurdish U+06D5 ە (Sorani is a dialect of Kurdish).
  • icu_normalizer prefers Arabic forms, converting Pashto U+FED9 ﻙ and Uyghur U+FEDA ﻚ both to Arabic U+0643 ك.
  • icu_folding also prefers Arabic forms, converting Persian U+06A9 ک, Pashto U+FED9 ﻙ, and Uyghur U+FEDA ﻚ all to Arabic U+0643 ك, and Persian U+06CC ی and Urdu U+0649 ى both to Arabic U+064A ي.
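
You can see what any of these filters does to a particular character by running it through Elasticsearch's _analyze API. For example, this request body (POSTed to _analyze; the whitespace tokenizer just passes the character through intact) shows persian_normalization converting Persian U+06CC ی:

  {
    "tokenizer": "whitespace",
    "filter": ["persian_normalization"],
    "text": "ی"
  }

The single token that comes back is Arabic U+064A ي.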

Ignoring Sorani for a moment, all of these generally point to converting everything that can be converted to the standard Arabic form, and everything else to the standard Persian form.

That approach won't work for Sorani, though, especially since its normalization comes so early in the analysis chain. Converting to Arabic forms in Sorani text will break its stop word filtering and stemming.

Other Arabic-Script Languages

Other languages that use the Arabic script but do not have language-specific analysis chains are Urdu (ur), Pashto (ps), Uyghur (ug), Kurdish (ku), South Azerbaijani (azb), Gilaki (glk), Kashmiri (ks), Mazanderani (mzn), Western Punjabi (pnb), Sindhi (sd), and Saraiki (skr). We'll have to pay a little extra attention to those.

Mega-Mappings

For the Arabic/Persian/default/non-Sorani mappings, all of the following get merged together:

  • 10 for kaf (ك): ﮎ, ﮏ, ﮐ, ﮑ, ک, ڪ, ﻛ, ﻜ, ﻙ, ﻚ
  • 16 for yeh (ي): ﯼ, ﯽ, ﯾ, ﯿ, ی, ﯨ, ﯩ, ﻯ, ﻰ, ى, ۍ, ﯤ, ﯥ, ﯦ, ﯧ, ې
  • 9 for heh (ه): ﮦ, ﮧ, ﮨ, ﮩ, ہ, ۀ, ﮤ, ﮥ, ە
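
In Elasticsearch terms, these merges can be implemented as a mapping character filter that runs before tokenization. Here's a partial sketch, with a handful of the mappings above written as JSON Unicode escapes (the filter name is made up, and the full version has all 35 mappings):

  {
    "char_filter": {
      "arabic_script_norm": {
        "type": "mapping",
        "mappings": [
          "\u06A9 => \u0643",
          "\uFED9 => \u0643",
          "\uFEDA => \u0643",
          "\u06CC => \u064A",
          "\u0649 => \u064A",
          "\u06C1 => \u0647",
          "\u06D5 => \u0647"
        ]
      }
    }
  }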

I created a similar mapping for Sorani, mapping many variants to keheh (ک), Farsi yeh (ی), and ae (ە)—though it turned out not to be necessary...

Results

I have a huge pile of data from Wikipedias and queries from the Harmonization work, so I tested my samples from many languages.

A fair number of languages didn't have any changes, because they didn't have any non-standard Arabic characters in their samples. Others showed some effects. (In each group below, the Arabic-script languages are listed first.)

  • Languages that showed some changes in tokens output (meaning that relevant text appears in the sample): Persian, Kurdish, English, Afrikaans, Alemannic, Asturian, Assamese, Belarusian, Breton, Welsh, Greek, French, Irish, Galician, Gujarati, Hebrew, Hindi, Indonesian, Igbo, Italian, Georgian, Kannada, Korean, Kyrgyz, Latin, Macedonian, Malayalam, Marathi, Malay, Nepali, Oriya, Polish, Russian, Scots, Sinhala, Slovenian, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Cantonese, Amharic, Aramaic, Gothic, Lao
  • Languages that had some token mergers (i.e., tokens that previously didn't match now do): Moroccan Arabic, Urdu, Pashto, Uyghur, South Azerbaijani, Gilaki, Kashmiri, Mazanderani, Western Punjabi, Sindhi, Saraiki, Azerbaijani, Belarusian-Taraškievica, Bangla, Luxembourgish, Mongolian, Punjabi, Albanian, Tajik, Uzbek, Divehi, N’Ko, Santali, Standard Moroccan Tamazight, Chinese
    • Chinese is a weird case, since it breaks up Arabic (and other non-Chinese, non-Latin) text into single characters.
  • Languages that had some token mergers, as above, and some token splits (meaning that some tokens that previously matched no longer do): Arabic, Egyptian Arabic, Sorani

In the case of Arabic and Egyptian Arabic, the splits were more specifically token losses, due to words being normalized to stop words.

Most samples looked either really good (mostly made up of visually identical forms being merged) or largely unaffected—except Sorani, which was a mess! There were loads of token splits, generally all bad, and almost no token mergers—so I gave up and just ignored Sorani. It doesn't benefit from the Arabic/Persian/default normalization, and doesn't need its own (additional) normalization.

Arabic Stop Words

In the Arabic samples, a small number of words got normalized to stop words, and a few stop words got normalized to non–stop words. After looking them up and consulting online translators, I just added them to the Arabic extra stop word list.

I also decided to be bold, and apply the extra stop words from Egyptian and Moroccan Arabic to Standard Arabic (and to simplify their shared language analysis config).
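
Mechanically, the extra words just ride along as an additional stop filter in the chain; a structural sketch, with a made-up filter name and placeholders rather than the actual words:

  {
    "filter": {
      "arabic_extra_stop": {
        "type": "stop",
        "stopwords": ["<extra word 1>", "<extra word 2>", "<extra word 3>"]
      }
    }
  }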

Timing Results and Caveats

I ran some timing tests on loading ~90MB samples of Arabic, English, Russian, Chinese, Japanese, and Thai. The increase in load time ranged from slightly less than zero (noise) to +3.5% for Arabic, with an overall average under +2%. Note that Arabic picked up another stop word filter as well as a normalization character filter.

It's possible (almost guaranteed) that the normalization for Urdu, Pashto, Uyghur, Kurdish, South Azerbaijani, Gilaki, Kashmiri, Mazanderani, Western Punjabi, Sindhi, and Saraiki is not ideal. For now, it doesn't matter, because the mappings only affect internal representations, and merge tokens we want merged. However, if we ever pick up additional language analysis (stop word filters or stemmers) for any of these languages, those intermediate internal representations will matter, and we may have to revisit and customize the mappings to get good results, especially from a stemmer! (You can always pile extra forms into a stop word list, within reason.)