Configuring the locale for language and encoding‐aware operations

IS4 edited this page Jun 8, 2024 · 16 revisions

PawnPlus can make use of the system's cultural settings (the "locale") through the mechanisms exposed by std::locale in C++. These settings are used for formatting and for character conversion and comparison.

Overview

When loaded, the plugin sets the global locale (via std::locale::global) to the invariant one (std::locale::classic, commonly identified as "C" or "POSIX"), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through pp_locale. Note that other C++ modules may share the same global locale, so these settings affect them as well. The C locale (used by modules written in C and set by std::setlocale) is not affected.

The locale can be applied or changed for any number of distinct categories, represented by locale_category:

enum locale_category (<<= 1)
{
    locale_none = 0,
    locale_collate = 1,
    locale_ctype,
    locale_monetary,
    locale_numeric,
    locale_time,
    locale_messages,
    locale_all = -1,
}

These categories affect the following areas of the plugin:

  • locale_collate controls character equivalence and comparisons. It is used only for regular expressions when regex_collate is set. In such a case, character ranges (e.g. [a-z]) will use the order of characters imposed by the locale.
  • locale_ctype specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for str_to_lower/str_set_to_lower, str_to_upper/str_set_to_upper, and regular expressions, either when character classes like \s or [[:alpha:]] are used, or with regex_icase.
  • locale_numeric defines how numbers are formatted, for example which character is used for the decimal point (e.g. . or ,). It is used by str_format and similar, including tag_op_string and tag_op_format.

To make the script encoding-aware, only locale_ctype is necessary. PawnPlus uses cell strings (with 32-bit instead of 8-bit characters), but only the ANSI character range (0 to 255) receives any special treatment. Values outside this range remain unassigned, so Unicode characters stored outside that range get no special support either.
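Because locale_category is declared with a `<<= 1` increment, its values are bit flags and can be combined. The following is a minimal sketch of applying different locales to different categories. It assumes a SA-MP script with the `a_samp` and `PawnPlus` includes, and that `pp_locale` accepts a name followed by a category mask (as in the `pawn_try_call_native` call later on this page). `print_s` and the `%f` specifier of `str_format` are used here as assumptions about the PawnPlus string API, and the printed text depends on the system's native locale.

```pawn
#include <a_samp>
#include <PawnPlus>

public OnGameModeInit()
{
    // "" requests the system's native locale. The category values are bit
    // flags, so character handling and number formatting are set together.
    pp_locale("", locale_ctype | locale_numeric);

    // Collation is not part of the mask above; it is explicitly kept on
    // the invariant "C" locale here.
    pp_locale("C", locale_collate);

    // With locale_numeric set, the decimal separator of the formatted
    // number follows the native locale (for example "," instead of ".").
    print_s(str_format("%f", 1.5));
    return 1;
}
```

Setting only the categories that are needed keeps the rest of the plugin on the invariant locale, which limits the effect on other C++ modules sharing the global locale.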

Obtaining a locale name

The pp_locale function needs a pre-existing locale name. It may be empty ("") to use the system's native locale, "C" to use the invariant locale, or any system-provided locale name, which can be found on POSIX systems by running locale -a.

Windows

On Windows, the locale name uses the formats <language>, <language>-<REGION>, <language>-<Script>, or <language>-<Script>-<REGION>. A locale corresponds to a particular code page used when interpreting ANSI text. An overview of common locales and their code pages can be found here.

As an example, the locale name cs-CZ corresponds to the Czech language and regional settings, and uses the encoding Windows-1250.

Linux

On Linux, the set of supported locales can be extended by running the localedef command with pre-existing language and character mapping definitions.

For example, localedef -i cs_CZ -f CP1250 cs_CZ.CP1250 creates a new locale named cs_CZ.CP1250 using the Windows-1250 encoding.

Trying multiple locale names

When called with a nonexistent locale name, pp_locale raises an error. It can be combined with pawn_try_call_native to try several locale names in turn and warn if none is found:

new result;
// Try the Windows-style name first, then the Linux one.
// && short-circuits, so the second name is only tried if the first one fails.
if(
    (pawn_try_call_native(pawn_nameof(pp_locale), result, "sd", "cs-CZ", locale_ctype) != amx_err_none) &&
    (pawn_try_call_native(pawn_nameof(pp_locale), result, "sd", "cs_CZ.cp1250", locale_ctype) != amx_err_none)
)
{
    print("Warning: No character locale data can be set!");
}
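When more than two candidate names are involved, the same check can be moved into a loop. The sketch below is one possible way to do that. `TrySetLocale` is a hypothetical helper, not part of PawnPlus, and the script is assumed to run in a SA-MP server with the `a_samp` and `PawnPlus` includes available.

```pawn
#include <a_samp>
#include <PawnPlus>

// Hypothetical helper: tries each candidate name in order and stops at the
// first one that pp_locale accepts. Returns true if a locale was applied.
stock bool:TrySetLocale(const names[][], count, locale_category:category)
{
    new result;
    for(new i = 0; i < count; i++)
    {
        if(pawn_try_call_native(pawn_nameof(pp_locale), result, "sd", names[i], category) == amx_err_none)
        {
            return true;
        }
    }
    return false;
}

public OnGameModeInit()
{
    // Windows-style name first, then the Linux locale created with localedef.
    new const names[][] = {"cs-CZ", "cs_CZ.cp1250"};
    if(!TrySetLocale(names, sizeof(names), locale_ctype))
    {
        print("Warning: No character locale data can be set!");
    }
    return 1;
}
```

Ordering the list from most to least specific lets the same script run unchanged on both Windows and Linux hosts.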