-
Notifications
You must be signed in to change notification settings - Fork 111
Feature: Handle Arabic #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@antonlindstrom @navruzm Your review please |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #90 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 2 2
Lines 153 167 +14
=========================================
+ Hits 153 167 +14
🚀 New features to boost your workflow:
|
|
@antonlindstrom @navruzm @matrixik @kvalev ur review please |
|
Hello, thank you for this change. Please revert change to Also please lower the number of examples in From my google search Arabic doesn't have upper letters but could you add one test for it like: |
|
Ok let me do it and sorry for the long time delay... |
- Revert BenchmarkMakeShortSymbols to use original special symbols string - Reduce Arabic examples in doc.go to 1 (trickiest case with plurals) - Add Arabic test case with Lowercase=false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds Arabic language support to the slug generation library by implementing character-to-Latin transliteration mappings and automatic detection of Arabic text. The implementation includes both individual character mappings and whole-word pattern substitutions to handle Arabic's complex morphology.
Key changes:
- Automatic detection of Arabic text in the
Make()function - Arabic character-to-Latin transliteration mappings (
arSub) - Pattern-based word/phrase substitutions (
alSub) for common Arabic words - Comprehensive test coverage for Arabic text scenarios
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| slug.go | Adds automatic Arabic detection to Make() function and implements pattern-based substitution logic in the Arabic case of MakeLang() |
| slug_test.go | Adds test cases for Arabic text in both TestSlugMake and TestSlugMakeLang functions |
| languages_substitution.go | Introduces arSub (rune mappings) and alSub (pattern mappings) for Arabic transliteration, adds arSub to the init function |
| doc.go | Adds example demonstrating Arabic language support usage |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func Make(s string) (slug string) { | ||
| // Check if the text contains Arabic characters | ||
| for _, r := range s { | ||
| if r >= '\u0600' && r <= '\u06FF' { |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Arabic Unicode range check r >= '\u0600' && r <= '\u06FF' only covers the basic Arabic block (U+0600 to U+06FF). However, Arabic script extends beyond this range. Arabic Supplement (U+0750-U+077F), Arabic Extended-A (U+08A0-U+08FF), Arabic Extended-B (U+0870-U+089F), and Arabic Presentation Forms (U+FB50-U+FDFF, U+FE70-U+FEFF) are not covered. This means Arabic text using characters from these extended ranges won't be automatically detected as Arabic, leading to incorrect processing.
| if r >= '\u0600' && r <= '\u06FF' { | |
| if (r >= '\u0600' && r <= '\u06FF') || // Arabic | |
| (r >= '\u0750' && r <= '\u077F') || // Arabic Supplement | |
| (r >= '\u0870' && r <= '\u089F') || // Arabic Extended-B | |
| (r >= '\u08A0' && r <= '\u08FF') || // Arabic Extended-A | |
| (r >= '\uFB50' && r <= '\uFDFF') || // Arabic Presentation Forms-A | |
| (r >= '\uFE70' && r <= '\uFEFF') { // Arabic Presentation Forms-B |
| {"ar", "مرحبا بالعالم", "mrhba-balalm", false}, | ||
| {"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true}, | ||
| {"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true}, | ||
| {"ar", "مَكْتَبَة", "mktba", true}, |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case has a duplicate with identical input and expected output, differing only in the lowercase parameter. Both cases expect "mrhba-balalm" regardless of whether lowercase is true or false. This suggests either the expected output for the false case is incorrect (should it preserve case?), or this is genuinely a duplicate test case that adds no value. Looking at other language tests (e.g., Bulgarian on lines 97-98), when lowercase is false, the output preserves the original casing pattern.
| {"ar", "مرحبا بالعالم", "mrhba-balalm", false}, | |
| {"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true}, | |
| {"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true}, | |
| {"ar", "مَكْتَبَة", "mktba", true}, | |
| {"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true}, | |
| {"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true}, | |
| {"ar", "مَكْتَبَة", "mktba", true}, | |
| {"ar", "مَكْتَبَة", "mktba", true}, |
| for _, pattern := range []string{ | ||
| // Common words and phrases | ||
| "المعلمون والمعلمات", | ||
| "شركة القاصة للخدمات الالكترونية", | ||
| "جامعة الكوفة", | ||
| // Words with diacritics | ||
| "السَّلامُ", | ||
| "عَلَيْكُمْ", | ||
| "اللُّغَة", | ||
| "العَرَبِيَّة", | ||
| "بَيْت", | ||
| "مَكْتَبَة", | ||
| "كِتَاب", | ||
| "قَلَم", | ||
| // Words without diacritics | ||
| "مكتبة", | ||
| "بيت", | ||
| "كتاب", | ||
| "قلم", | ||
| "سيف", | ||
| "حاكم", | ||
| "هدى", | ||
| "الهدى", | ||
| "شركة", | ||
| "القاصة", | ||
| "للخدمات", | ||
| "الالكترونية", | ||
| "جامعة", | ||
| "الكوفة", | ||
| "المعلمون", | ||
| "المعلمات", | ||
| // Basic patterns | ||
| "و", | ||
| "ال", | ||
| } { | ||
| if v, ok := alSub[pattern]; ok { | ||
| slug = strings.ReplaceAll(slug, pattern, v) | ||
| } | ||
| } |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hardcoded list of Arabic patterns in the case statement is inefficient and unmaintainable. This approach iterates through every pattern string for every Arabic text conversion, even though these patterns are already defined in the alSub map. The code essentially duplicates all the keys from alSub, creating maintenance burden when new patterns need to be added.
Instead, the patterns from alSub should be applied directly using the existing Substitute function (which already handles string-to-string mappings), similar to how CustomSub is used for custom substitutions. This would eliminate the redundant list and ensure consistency between pattern definitions and their usage.
| for _, pattern := range []string{ | ||
| // Common words and phrases | ||
| "المعلمون والمعلمات", | ||
| "شركة القاصة للخدمات الالكترونية", | ||
| "جامعة الكوفة", | ||
| // Words with diacritics | ||
| "السَّلامُ", | ||
| "عَلَيْكُمْ", | ||
| "اللُّغَة", | ||
| "العَرَبِيَّة", | ||
| "بَيْت", | ||
| "مَكْتَبَة", | ||
| "كِتَاب", | ||
| "قَلَم", | ||
| // Words without diacritics | ||
| "مكتبة", | ||
| "بيت", | ||
| "كتاب", | ||
| "قلم", | ||
| "سيف", | ||
| "حاكم", | ||
| "هدى", | ||
| "الهدى", | ||
| "شركة", | ||
| "القاصة", | ||
| "للخدمات", | ||
| "الالكترونية", | ||
| "جامعة", | ||
| "الكوفة", | ||
| "المعلمون", | ||
| "المعلمات", | ||
| // Basic patterns | ||
| "و", | ||
| "ال", | ||
| } { | ||
| if v, ok := alSub[pattern]; ok { | ||
| slug = strings.ReplaceAll(slug, pattern, v) | ||
| } | ||
| } |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string pattern matching approach is highly inefficient due to repeated lookups and replacements. For each input string, the code iterates through 39 hardcoded patterns, performs a map lookup for each pattern, and then does a string replacement. This creates O(n*m) complexity where n is the number of patterns and m is the length of the input string.
The existing Substitute function already handles string-to-string mappings efficiently by sorting keys and applying them in order. The Arabic case should simply call slug = Substitute(slug, alSub) before the rune substitution, eliminating the manual iteration and improving performance.
| {"Dobrosław Żybort", "dobroslaw-zybort"}, | ||
| {"Ala ma 6 kotów.", "ala-ma-6-kotow"}, | ||
|
|
||
| { "المعلمون والمعلمات", "almalmon-waalmalmat"}, |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent spacing detected. The opening brace has an extra space before the string literal, which differs from the formatting pattern used in other test cases in this file. The format should be {"المعلمون والمعلمات", "almalmon-waalmalmat"}, without the extra space after the opening brace.
| { "المعلمون والمعلمات", "almalmon-waalmalmat"}, | |
| {"المعلمون والمعلمات", "almalmon-waalmalmat"}, |
| var alSub = map[string]string{ | ||
| // Test case patterns | ||
| "السَّلامُ": "alsalam", // the peace with diacritics | ||
| "عَلَيْكُمْ": "aalykm", // upon you with diacritics | ||
| "اللُّغَة": "allgh", // the language with diacritics | ||
| "العَرَبِيَّة": "alaarby", // the Arabic with diacritics | ||
| "بَيْت": "bayt", // house with diacritics | ||
| "مَكْتَبَة": "mktba", // library with diacritics | ||
| "كِتَاب": "ktab", // book with diacritics | ||
| "قَلَم": "qlm", // pen with diacritics | ||
| "سيف": "saif", // sword | ||
| "مرحبا": "mrhba", // hello | ||
| "بالعالم": "balalm", // in the world | ||
| "حاكم": "haikm", // ruler | ||
| "هدى": "huda", // guidance | ||
| "الهدى": "alhuda", // the guidance | ||
| "شركة": "shrka", // company | ||
| "القاصة": "alqaseh", // clearing | ||
| "للخدمات": "llkhdmat", // for services | ||
| "الالكترونية": "alalktrnaia", // electronic | ||
| "جامعة": "jamat", // university | ||
| "الكوفة": "alkufa", // Kufa | ||
| "المعلمون": "almalmon", // the teachers (m) | ||
| "المعلمات": "almalmat", // the teachers (f) | ||
| "و": "wa", // and | ||
|
|
||
| // Common word endings | ||
| "ية": "ia", // feminine ending | ||
| "ات": "at", // feminine plural | ||
| "ون": "on", // masculine plural | ||
| "ين": "in", // masculine plural/dual | ||
|
|
||
| // Common prefixes | ||
| "ال": "al", // the | ||
| "بال": "bal", // with the | ||
| "كال": "kal", // like the | ||
| "فال": "fal", // so the | ||
|
|
||
| // Common patterns with ain | ||
| "عا": "aa", // ain + alif | ||
| "عي": "ee", // ain + yaa | ||
| "عو": "oo", // ain + waw | ||
|
|
||
| // Special combinations | ||
| "الله": "allah", // Allah | ||
| "عبد": "abd", // Abd (servant) | ||
| "محمد": "muhammad", // Muhammad | ||
| "احمد": "ahmad", // Ahmad | ||
| } |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alSub map contains overlapping patterns that can cause unpredictable substitution results. For example, "ية" (feminine ending), "ات" (feminine plural), and "ال" (the) are short patterns that will match substrings within longer words. When "المعلمات" is being processed, these smaller patterns could incorrectly match parts of it.
The Substitute function processes patterns in alphabetical order, which means shorter generic patterns might be applied before or after longer specific patterns depending on their alphabetical ordering, not their logical precedence. This can lead to incorrect transliterations. Consider organizing patterns by specificity (longer patterns first) or ensuring the patterns don't have unintended overlaps.
| // Add custom substitutions for common patterns | ||
| var alSub = map[string]string{ |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name alSub is ambiguous and inconsistent with the naming pattern used for other language substitution maps. Other languages use the language code as the prefix (e.g., bgSub, csSub, deSub), but alSub appears to stand for "Arabic Language Substitution" or possibly "al" (the Arabic definite article), neither of which is clear.
For consistency with the existing pattern, this should be named something like arStringSub or arPatternSub to distinguish it from arSub (which handles rune-to-string mappings) while maintaining the language code prefix convention.
| // Add custom substitutions for common patterns | |
| var alSub = map[string]string{ | |
| // Add custom substitutions for common Arabic string patterns | |
| var arStringSub = map[string]string{ |
| // Check if the text contains Arabic characters | ||
| for _, r := range s { | ||
| if r >= '\u0600' && r <= '\u06FF' { | ||
| return MakeLang(s, "ar") | ||
| } | ||
| } |
Copilot
AI
Dec 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The automatic Arabic detection changes the behavior of the Make function in a way that may be unexpected for users. Previously, Make consistently used English substitution. Now it special-cases Arabic, which creates an inconsistency - why is Arabic automatically detected but not other languages like Chinese, Cyrillic, or Greek?
This also has implications for mixed-language text. If a user has text that contains both English and Arabic characters, the entire string will be processed as Arabic, which may not be the desired behavior. The automatic detection should either be documented more prominently as a breaking change, or this feature should be opt-in rather than automatic.
This Close #89