Feature: Handle Arabic #90

1saifj · 2024-12-17T08:23:34Z

This Close #89

1saifj · 2024-12-17T08:24:26Z

@antonlindstrom @navruzm Your review please

codecov · 2024-12-17T08:26:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (62efd7f) to head (8a14f45).

Additional details and impacted files

@@            Coverage Diff            @@
##            master       #90   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            2         2           
  Lines          153       167   +14     
=========================================
+ Hits           153       167   +14

Files with missing lines	Coverage Δ
languages_substitution.go	`100.00% <100.00%> (ø)`
slug.go	`100.00% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

1saifj · 2024-12-22T21:45:23Z

@antonlindstrom @navruzm @matrixik @kvalev ur review please

matrixik · 2024-12-23T17:57:18Z

Hello, thank you for this change.

Please revert change to BenchmarkMakeShortSymbols (there is already test for simple text BenchmarkMakeShortAscii).

Also please lower the number of examples in doc.go to 1 example (best would be the trickiest one, it's there to show it's possible, not to show all abilities for one language).

From my google search Arabic doesn't have upper letters but could you add one test for it like:

{"ar", "مرحبا بالعالم", "mrhba-balalm", false},

1saifj · 2025-12-21T08:26:18Z

Ok let me do it and sorry for the long time delay...

- Revert BenchmarkMakeShortSymbols to use original special symbols string - Reduce Arabic examples in doc.go to 1 (trickiest case with plurals) - Add Arabic test case with Lowercase=false

Copilot

Pull request overview

This PR adds Arabic language support to the slug generation library by implementing character-to-Latin transliteration mappings and automatic detection of Arabic text. The implementation includes both individual character mappings and whole-word pattern substitutions to handle Arabic's complex morphology.

Key changes:

Automatic detection of Arabic text in the Make() function
Arabic character-to-Latin transliteration mappings (arSub)
Pattern-based word/phrase substitutions (alSub) for common Arabic words
Comprehensive test coverage for Arabic text scenarios

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
slug.go	Adds automatic Arabic detection to `Make()` function and implements pattern-based substitution logic in the Arabic case of `MakeLang()`
slug_test.go	Adds test cases for Arabic text in both `TestSlugMake` and `TestSlugMakeLang` functions
languages_substitution.go	Introduces `arSub` (rune mappings) and `alSub` (pattern mappings) for Arabic transliteration, adds `arSub` to the init function
doc.go	Adds example demonstrating Arabic language support usage

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-21T08:35:38Z

slug.go

 func Make(s string) (slug string) {
+	// Check if the text contains Arabic characters
+	for _, r := range s {
+		if r >= '\u0600' && r <= '\u06FF' {


The Arabic Unicode range check r >= '\u0600' && r <= '\u06FF' only covers the basic Arabic block (U+0600 to U+06FF). However, Arabic script extends beyond this range. Arabic Supplement (U+0750-U+077F), Arabic Extended-A (U+08A0-U+08FF), Arabic Extended-B (U+0870-U+089F), and Arabic Presentation Forms (U+FB50-U+FDFF, U+FE70-U+FEFF) are not covered. This means Arabic text using characters from these extended ranges won't be automatically detected as Arabic, leading to incorrect processing.

Suggested change

if r >= '\u0600' && r <= '\u06FF' {

if (r >= '\u0600' && r <= '\u06FF') || // Arabic

(r >= '\u0750' && r <= '\u077F') || // Arabic Supplement

(r >= '\u0870' && r <= '\u089F') || // Arabic Extended-B

(r >= '\u08A0' && r <= '\u08FF') || // Arabic Extended-A

(r >= '\uFB50' && r <= '\uFDFF') || // Arabic Presentation Forms-A

(r >= '\uFE70' && r <= '\uFEFF') { // Arabic Presentation Forms-B

Copilot · 2025-12-21T08:35:38Z

slug_test.go

+		{"ar", "مرحبا بالعالم", "mrhba-balalm", false},
+		{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},
+		{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},
+		{"ar", "مَكْتَبَة", "mktba", true},


The test case has a duplicate with identical input and expected output, differing only in the lowercase parameter. Both cases expect "mrhba-balalm" regardless of whether lowercase is true or false. This suggests either the expected output for the false case is incorrect (should it preserve case?), or this is genuinely a duplicate test case that adds no value. Looking at other language tests (e.g., Bulgarian on lines 97-98), when lowercase is false, the output preserves the original casing pattern.

Suggested change

{"ar", "مرحبا بالعالم", "mrhba-balalm", false},

{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},

{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},

{"ar", "مَكْتَبَة", "mktba", true},

{"ar", "السَّلامُ عَلَيْكُمْ", "alsalam-aalykm", true},

{"ar", "اللُّغَة العَرَبِيَّة", "allgh-alaarby", true},

{"ar", "مَكْتَبَة", "mktba", true},

{"ar", "مَكْتَبَة", "mktba", true},

Copilot · 2025-12-21T08:35:39Z

slug.go

+		for _, pattern := range []string{
+			// Common words and phrases
+			"المعلمون والمعلمات",
+			"شركة القاصة للخدمات الالكترونية",
+			"جامعة الكوفة",
+			// Words with diacritics
+			"السَّلامُ",
+			"عَلَيْكُمْ",
+			"اللُّغَة",
+			"العَرَبِيَّة",
+			"بَيْت",
+			"مَكْتَبَة",
+			"كِتَاب",
+			"قَلَم",
+			// Words without diacritics
+			"مكتبة",
+			"بيت",
+			"كتاب",
+			"قلم",
+			"سيف",
+			"حاكم",
+			"هدى",
+			"الهدى",
+			"شركة",
+			"القاصة",
+			"للخدمات",
+			"الالكترونية",
+			"جامعة",
+			"الكوفة",
+			"المعلمون",
+			"المعلمات",
+			// Basic patterns
+			"و",
+			"ال",
+		} {
+			if v, ok := alSub[pattern]; ok {
+				slug = strings.ReplaceAll(slug, pattern, v)
+			}
+		}


The hardcoded list of Arabic patterns in the case statement is inefficient and unmaintainable. This approach iterates through every pattern string for every Arabic text conversion, even though these patterns are already defined in the alSub map. The code essentially duplicates all the keys from alSub, creating maintenance burden when new patterns need to be added.

Instead, the patterns from alSub should be applied directly using the existing Substitute function (which already handles string-to-string mappings), similar to how CustomSub is used for custom substitutions. This would eliminate the redundant list and ensure consistency between pattern definitions and their usage.

Copilot · 2025-12-21T08:35:39Z

slug.go

+		for _, pattern := range []string{
+			// Common words and phrases
+			"المعلمون والمعلمات",
+			"شركة القاصة للخدمات الالكترونية",
+			"جامعة الكوفة",
+			// Words with diacritics
+			"السَّلامُ",
+			"عَلَيْكُمْ",
+			"اللُّغَة",
+			"العَرَبِيَّة",
+			"بَيْت",
+			"مَكْتَبَة",
+			"كِتَاب",
+			"قَلَم",
+			// Words without diacritics
+			"مكتبة",
+			"بيت",
+			"كتاب",
+			"قلم",
+			"سيف",
+			"حاكم",
+			"هدى",
+			"الهدى",
+			"شركة",
+			"القاصة",
+			"للخدمات",
+			"الالكترونية",
+			"جامعة",
+			"الكوفة",
+			"المعلمون",
+			"المعلمات",
+			// Basic patterns
+			"و",
+			"ال",
+		} {
+			if v, ok := alSub[pattern]; ok {
+				slug = strings.ReplaceAll(slug, pattern, v)
+			}
+		}


The string pattern matching approach is highly inefficient due to repeated lookups and replacements. For each input string, the code iterates through 39 hardcoded patterns, performs a map lookup for each pattern, and then does a string replacement. This creates O(n*m) complexity where n is the number of patterns and m is the length of the input string.

The existing Substitute function already handles string-to-string mappings efficiently by sorting keys and applying them in order. The Arabic case should simply call slug = Substitute(slug, alSub) before the rune substitution, eliminating the manual iteration and improving performance.

Copilot · 2025-12-21T08:35:39Z

slug_test.go

 		{"Dobrosław Żybort", "dobroslaw-zybort"},
 		{"Ala ma 6 kotów.", "ala-ma-6-kotow"},

+		{ "المعلمون والمعلمات", "almalmon-waalmalmat"},


Inconsistent spacing detected. The opening brace has an extra space before the string literal, which differs from the formatting pattern used in other test cases in this file. The format should be {"المعلمون والمعلمات", "almalmon-waalmalmat"}, without the extra space after the opening brace.

Suggested change

{ "المعلمون والمعلمات", "almalmon-waalmalmat"},

{"المعلمون والمعلمات", "almalmon-waalmalmat"},

Copilot · 2025-12-21T08:35:39Z

languages_substitution.go

+var alSub = map[string]string{
+	// Test case patterns
+	"السَّلامُ":    "alsalam",   // the peace with diacritics
+	"عَلَيْكُمْ":   "aalykm",    // upon you with diacritics
+	"اللُّغَة":     "allgh",     // the language with diacritics
+	"العَرَبِيَّة": "alaarby",   // the Arabic with diacritics
+	"بَيْت":        "bayt",      // house with diacritics
+	"مَكْتَبَة":    "mktba",     // library with diacritics
+	"كِتَاب":       "ktab",      // book with diacritics
+	"قَلَم":        "qlm",       // pen with diacritics
+	"سيف":          "saif",      // sword
+	"مرحبا":        "mrhba",     // hello
+	"بالعالم":      "balalm",    // in the world
+	"حاكم":         "haikm",     // ruler
+	"هدى":          "huda",      // guidance
+	"الهدى":        "alhuda",    // the guidance
+	"شركة":         "shrka",     // company
+	"القاصة":       "alqaseh",   // clearing
+	"للخدمات":      "llkhdmat",  // for services
+	"الالكترونية":  "alalktrnaia", // electronic
+	"جامعة":        "jamat",     // university
+	"الكوفة":       "alkufa",    // Kufa
+	"المعلمون":     "almalmon",  // the teachers (m)
+	"المعلمات":     "almalmat",  // the teachers (f)
+	"و":            "wa",        // and
+
+	// Common word endings
+	"ية": "ia",  // feminine ending
+	"ات": "at",  // feminine plural
+	"ون": "on",  // masculine plural
+	"ين": "in",  // masculine plural/dual
+
+	// Common prefixes
+	"ال": "al",  // the
+	"بال": "bal", // with the
+	"كال": "kal", // like the
+	"فال": "fal", // so the
+
+	// Common patterns with ain
+	"عا": "aa", // ain + alif
+	"عي": "ee", // ain + yaa
+	"عو": "oo", // ain + waw
+
+	// Special combinations
+	"الله": "allah",    // Allah
+	"عبد":  "abd",      // Abd (servant)
+	"محمد": "muhammad", // Muhammad
+	"احمد": "ahmad",    // Ahmad
+}


The alSub map contains overlapping patterns that can cause unpredictable substitution results. For example, "ية" (feminine ending), "ات" (feminine plural), and "ال" (the) are short patterns that will match substrings within longer words. When "المعلمات" is being processed, these smaller patterns could incorrectly match parts of it.

The Substitute function processes patterns in alphabetical order, which means shorter generic patterns might be applied before or after longer specific patterns depending on their alphabetical ordering, not their logical precedence. This can lead to incorrect transliterations. Consider organizing patterns by specificity (longer patterns first) or ensuring the patterns don't have unintended overlaps.

Copilot · 2025-12-21T08:35:39Z

languages_substitution.go

+// Add custom substitutions for common patterns
+var alSub = map[string]string{


The variable name alSub is ambiguous and inconsistent with the naming pattern used for other language substitution maps. Other languages use the language code as the prefix (e.g., bgSub, csSub, deSub), but alSub appears to stand for "Arabic Language Substitution" or possibly "al" (the Arabic definite article), neither of which is clear.

For consistency with the existing pattern, this should be named something like arStringSub or arPatternSub to distinguish it from arSub (which handles rune-to-string mappings) while maintaining the language code prefix convention.

Suggested change

// Add custom substitutions for common patterns

var alSub = map[string]string{

// Add custom substitutions for common Arabic string patterns

var arStringSub = map[string]string{

Copilot · 2025-12-21T08:35:40Z

slug.go

+	// Check if the text contains Arabic characters
+	for _, r := range s {
+		if r >= '\u0600' && r <= '\u06FF' {
+			return MakeLang(s, "ar")
+		}
+	}


The automatic Arabic detection changes the behavior of the Make function in a way that may be unexpected for users. Previously, Make consistently used English substitution. Now it special-cases Arabic, which creates an inconsistency - why is Arabic automatically detected but not other languages like Chinese, Cyrillic, or Greek?

This also has implications for mixed-language text. If a user has text that contains both English and Arabic characters, the entire string will be processed as Arabic, which may not be the desired behavior. The automatic detection should either be documented more prominently as a breaking change, or this feature should be opt-in rather than automatic.

1saifj changed the title ~~Handle Arabic~~ Feature: Handle Arabic Dec 17, 2024

matrixik added the enhancement label Dec 23, 2024

1saifj added 6 commits December 21, 2025 11:30

feat(supports-arabic): support comprehensive Arabic texts

5155d40

add one more test case

0e7ccb8

add check for arabic chars and special cases

c8d548b

handle more speicla cases

9a48b02

add more test cases

1fdaf8d

Address PR review feedback

8a14f45

- Revert BenchmarkMakeShortSymbols to use original special symbols string - Reduce Arabic examples in doc.go to 1 (trickiest case with plurals) - Add Arabic test case with Lowercase=false

1saifj force-pushed the master branch from 604410a to 8a14f45 Compare December 21, 2025 08:31

Copilot AI review requested due to automatic review settings December 21, 2025 08:31

Copilot started reviewing on behalf of 1saifj December 21, 2025 08:32 View session

Copilot AI reviewed Dec 21, 2025

View reviewed changes

-		if r >= '\u0600' && r <= '\u06FF' {
+		if (r >= '\u0600' && r <= '\u06FF') || // Arabic
+			(r >= '\u0750' && r <= '\u077F') || // Arabic Supplement
+			(r >= '\u0870' && r <= '\u089F') || // Arabic Extended-B
+			(r >= '\u08A0' && r <= '\u08FF') || // Arabic Extended-A
+			(r >= '\uFB50' && r <= '\uFDFF') || // Arabic Presentation Forms-A
+			(r >= '\uFE70' && r <= '\uFEFF') { // Arabic Presentation Forms-B

	{ "المعلمون والمعلمات", "almalmon-waalmalmat"},
	{"المعلمون والمعلمات", "almalmon-waalmalmat"},

		// Add custom substitutions for common patterns
		var alSub = map[string]string{

Feature: Handle Arabic #90

Are you sure you want to change the base?

Feature: Handle Arabic #90

Uh oh!

Conversation

1saifj commented Dec 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

1saifj commented Dec 17, 2024

Uh oh!

codecov bot commented Dec 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

1saifj commented Dec 22, 2024

Uh oh!

matrixik commented Dec 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

1saifj commented Dec 21, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 21, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1saifj commented Dec 17, 2024 •

edited

Loading

codecov bot commented Dec 17, 2024 •

edited

Loading

matrixik commented Dec 23, 2024 •

edited

Loading