Skip to content

Conversation

@1saifj
Copy link

@1saifj 1saifj commented Dec 17, 2024

This Close #89

@1saifj
Copy link
Author

1saifj commented Dec 17, 2024

@antonlindstrom @navruzm Your review please

@1saifj 1saifj changed the title Handle Arabic Feature: Handle Arabic Dec 17, 2024
@codecov
Copy link

codecov bot commented Dec 17, 2024

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (62efd7f) to head (8a14f45).

Additional details and impacted files
@@            Coverage Diff            @@
##            master       #90   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines          153       167   +14     
=========================================
+ Hits           153       167   +14     
Files with missing lines Coverage Δ
languages_substitution.go 100.00% <100.00%> (ø)
slug.go 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@1saifj
Copy link
Author

1saifj commented Dec 22, 2024

@antonlindstrom @navruzm @matrixik @kvalev ur review please

@matrixik
Copy link
Member

matrixik commented Dec 23, 2024

Hello, thank you for this change.

Please revert change to BenchmarkMakeShortSymbols (there is already test for simple text BenchmarkMakeShortAscii).

Also please lower the number of examples in doc.go to 1 example (best would be the trickiest one, it's there to show it's possible, not to show all abilities for one language).

From my google search Arabic doesn't have upper letters but could you add one test for it like:

{"ar", "مرحبا بالعالم", "mrhba-balalm", false},

@1saifj
Copy link
Author

1saifj commented Dec 21, 2025

Ok let me do it and sorry for the long time delay...

- Revert BenchmarkMakeShortSymbols to use original special symbols string
- Reduce Arabic examples in doc.go to 1 (trickiest case with plurals)
- Add Arabic test case with Lowercase=false
Copilot AI review requested due to automatic review settings December 21, 2025 08:31
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Arabic language support to the slug generation library by implementing character-to-Latin transliteration mappings and automatic detection of Arabic text. The implementation includes both individual character mappings and whole-word pattern substitutions to handle Arabic's complex morphology.

Key changes:

  • Automatic detection of Arabic text in the Make() function
  • Arabic character-to-Latin transliteration mappings (arSub)
  • Pattern-based word/phrase substitutions (alSub) for common Arabic words
  • Comprehensive test coverage for Arabic text scenarios

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
slug.go Adds automatic Arabic detection to Make() function and implements pattern-based substitution logic in the Arabic case of MakeLang()
slug_test.go Adds test cases for Arabic text in both TestSlugMake and TestSlugMakeLang functions
languages_substitution.go Introduces arSub (rune mappings) and alSub (pattern mappings) for Arabic transliteration, adds arSub to the init function
doc.go Adds example demonstrating Arabic language support usage

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

func Make(s string) (slug string) {
// Check if the text contains Arabic characters
for _, r := range s {
if r >= '\u0600' && r <= '\u06FF' {
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Arabic Unicode range check r >= '\u0600' && r <= '\u06FF' only covers the basic Arabic block (U+0600 to U+06FF). However, Arabic script extends beyond this range. Arabic Supplement (U+0750-U+077F), Arabic Extended-A (U+08A0-U+08FF), Arabic Extended-B (U+0870-U+089F), and Arabic Presentation Forms (U+FB50-U+FDFF, U+FE70-U+FEFF) are not covered. This means Arabic text using characters from these extended ranges won't be automatically detected as Arabic, leading to incorrect processing.

Suggested change
if r >= '\u0600' && r <= '\u06FF' {
if (r >= '\u0600' && r <= '\u06FF') || // Arabic
(r >= '\u0750' && r <= '\u077F') || // Arabic Supplement
(r >= '\u0870' && r <= '\u089F') || // Arabic Extended-B
(r >= '\u08A0' && r <= '\u08FF') || // Arabic Extended-A
(r >= '\uFB50' && r <= '\uFDFF') || // Arabic Presentation Forms-A
(r >= '\uFE70' && r <= '\uFEFF') { // Arabic Presentation Forms-B

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +85
{"ar", "مرحبا بالعالم", "mrhba-balalm", false},
{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},
{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},
{"ar", "مَكْتَبَة", "mktba", true},
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case has a duplicate with identical input and expected output, differing only in the lowercase parameter. Both cases expect "mrhba-balalm" regardless of whether lowercase is true or false. This suggests either the expected output for the false case is incorrect (should it preserve case?), or this is genuinely a duplicate test case that adds no value. Looking at other language tests (e.g., Bulgarian on lines 97-98), when lowercase is false, the output preserves the original casing pattern.

Suggested change
{"ar", "مرحبا بالعالم", "mrhba-balalm", false},
{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},
{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},
{"ar", "مَكْتَبَة", "mktba", true},
{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},
{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},
{"ar", "مَكْتَبَة", "mktba", true},
{"ar", "مَكْتَبَة", "mktba", true},

Copilot uses AI. Check for mistakes.
Comment on lines +86 to +124
for _, pattern := range []string{
// Common words and phrases
"المعلمون والمعلمات",
"شركة القاصة للخدمات الالكترونية",
"جامعة الكوفة",
// Words with diacritics
"السَّلامُ",
"عَلَيْكُمْ",
"اللُّغَة",
"العَرَبِيَّة",
"بَيْت",
"مَكْتَبَة",
"كِتَاب",
"قَلَم",
// Words without diacritics
"مكتبة",
"بيت",
"كتاب",
"قلم",
"سيف",
"حاكم",
"هدى",
"الهدى",
"شركة",
"القاصة",
"للخدمات",
"الالكترونية",
"جامعة",
"الكوفة",
"المعلمون",
"المعلمات",
// Basic patterns
"و",
"ال",
} {
if v, ok := alSub[pattern]; ok {
slug = strings.ReplaceAll(slug, pattern, v)
}
}
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded list of Arabic patterns in the case statement is inefficient and unmaintainable. This approach iterates through every pattern string for every Arabic text conversion, even though these patterns are already defined in the alSub map. The code essentially duplicates all the keys from alSub, creating maintenance burden when new patterns need to be added.

Instead, the patterns from alSub should be applied directly using the existing Substitute function (which already handles string-to-string mappings), similar to how CustomSub is used for custom substitutions. This would eliminate the redundant list and ensure consistency between pattern definitions and their usage.

Copilot uses AI. Check for mistakes.
Comment on lines +86 to +124
for _, pattern := range []string{
// Common words and phrases
"المعلمون والمعلمات",
"شركة القاصة للخدمات الالكترونية",
"جامعة الكوفة",
// Words with diacritics
"السَّلامُ",
"عَلَيْكُمْ",
"اللُّغَة",
"العَرَبِيَّة",
"بَيْت",
"مَكْتَبَة",
"كِتَاب",
"قَلَم",
// Words without diacritics
"مكتبة",
"بيت",
"كتاب",
"قلم",
"سيف",
"حاكم",
"هدى",
"الهدى",
"شركة",
"القاصة",
"للخدمات",
"الالكترونية",
"جامعة",
"الكوفة",
"المعلمون",
"المعلمات",
// Basic patterns
"و",
"ال",
} {
if v, ok := alSub[pattern]; ok {
slug = strings.ReplaceAll(slug, pattern, v)
}
}
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string pattern matching approach is highly inefficient due to repeated lookups and replacements. For each input string, the code iterates through 39 hardcoded patterns, performs a map lookup for each pattern, and then does a string replacement. This creates O(n*m) complexity where n is the number of patterns and m is the length of the input string.

The existing Substitute function already handles string-to-string mappings efficiently by sorting keys and applying them in order. The Arabic case should simply call slug = Substitute(slug, alSub) before the rune substitution, eliminating the manual iteration and improving performance.

Copilot uses AI. Check for mistakes.
{"Dobrosław Żybort", "dobroslaw-zybort"},
{"Ala ma 6 kotów.", "ala-ma-6-kotow"},

{ "المعلمون والمعلمات", "almalmon-waalmalmat"},
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent spacing detected. The opening brace has an extra space before the string literal, which differs from the formatting pattern used in other test cases in this file. The format should be {"المعلمون والمعلمات", "almalmon-waalmalmat"}, without the extra space after the opening brace.

Suggested change
{ "المعلمون والمعلمات", "almalmon-waalmalmat"},
{"المعلمون والمعلمات", "almalmon-waalmalmat"},

Copilot uses AI. Check for mistakes.
Comment on lines +101 to +149
var alSub = map[string]string{
// Test case patterns
"السَّلامُ": "alsalam", // the peace with diacritics
"عَلَيْكُمْ": "aalykm", // upon you with diacritics
"اللُّغَة": "allgh", // the language with diacritics
"العَرَبِيَّة": "alaarby", // the Arabic with diacritics
"بَيْت": "bayt", // house with diacritics
"مَكْتَبَة": "mktba", // library with diacritics
"كِتَاب": "ktab", // book with diacritics
"قَلَم": "qlm", // pen with diacritics
"سيف": "saif", // sword
"مرحبا": "mrhba", // hello
"بالعالم": "balalm", // in the world
"حاكم": "haikm", // ruler
"هدى": "huda", // guidance
"الهدى": "alhuda", // the guidance
"شركة": "shrka", // company
"القاصة": "alqaseh", // clearing
"للخدمات": "llkhdmat", // for services
"الالكترونية": "alalktrnaia", // electronic
"جامعة": "jamat", // university
"الكوفة": "alkufa", // Kufa
"المعلمون": "almalmon", // the teachers (m)
"المعلمات": "almalmat", // the teachers (f)
"و": "wa", // and

// Common word endings
"ية": "ia", // feminine ending
"ات": "at", // feminine plural
"ون": "on", // masculine plural
"ين": "in", // masculine plural/dual

// Common prefixes
"ال": "al", // the
"بال": "bal", // with the
"كال": "kal", // like the
"فال": "fal", // so the

// Common patterns with ain
"عا": "aa", // ain + alif
"عي": "ee", // ain + yaa
"عو": "oo", // ain + waw

// Special combinations
"الله": "allah", // Allah
"عبد": "abd", // Abd (servant)
"محمد": "muhammad", // Muhammad
"احمد": "ahmad", // Ahmad
}
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alSub map contains overlapping patterns that can cause unpredictable substitution results. For example, "ية" (feminine ending), "ات" (feminine plural), and "ال" (the) are short patterns that will match substrings within longer words. When "المعلمات" is being processed, these smaller patterns could incorrectly match parts of it.

The Substitute function processes patterns in alphabetical order, which means shorter generic patterns might be applied before or after longer specific patterns depending on their alphabetical ordering, not their logical precedence. This can lead to incorrect transliterations. Consider organizing patterns by specificity (longer patterns first) or ensuring the patterns don't have unintended overlaps.

Copilot uses AI. Check for mistakes.
Comment on lines +100 to +101
// Add custom substitutions for common patterns
var alSub = map[string]string{
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name alSub is ambiguous and inconsistent with the naming pattern used for other language substitution maps. Other languages use the language code as the prefix (e.g., bgSub, csSub, deSub), but alSub appears to stand for "Arabic Language Substitution" or possibly "al" (the Arabic definite article), neither of which is clear.

For consistency with the existing pattern, this should be named something like arStringSub or arPatternSub to distinguish it from arSub (which handles rune-to-string mappings) while maintaining the language code prefix convention.

Suggested change
// Add custom substitutions for common patterns
var alSub = map[string]string{
// Add custom substitutions for common Arabic string patterns
var arStringSub = map[string]string{

Copilot uses AI. Check for mistakes.
Comment on lines +62 to +67
// Check if the text contains Arabic characters
for _, r := range s {
if r >= '\u0600' && r <= '\u06FF' {
return MakeLang(s, "ar")
}
}
Copy link

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The automatic Arabic detection changes the behavior of the Make function in a way that may be unexpected for users. Previously, Make consistently used English substitution. Now it special-cases Arabic, which creates an inconsistency - why is Arabic automatically detected but not other languages like Chinese, Cyrillic, or Greek?

This also has implications for mixed-language text. If a user has text that contains both English and Arabic characters, the entire string will be processed as Arabic, which may not be the desired behavior. The automatic detection should either be documented more prominently as a breaking change, or this feature should be opt-in rather than automatic.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Support Arabic language

2 participants