Configuring the locale for language and encoding‐aware operations

IS4 edited this page Jun 29, 2024 · 16 revisions

PawnPlus can make use of the system's cultural settings (the "locale") through the mechanisms exposed by std::locale in C++, which it uses for formatting and for character conversion and comparison.

Overview

When loaded, the plugin sets the global locale (via std::locale::global) to the invariant one (std::locale::classic, commonly identified as "C" or "POSIX"), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through pp_locale. Note that other C++ modules may share the same global locale, so this setting affects them as well. The C locale (used by modules in C and set by std::setlocale) is not affected.
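The reset described above can be pictured with a few lines of standard C++. This is only a sketch of the mechanism, not the plugin's actual code; reset_global_locale and current_global_name are hypothetical helper names introduced for illustration.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: installs the invariant locale as the global one,
// as the text says the plugin does on load. Returns the name of the
// locale that was global before the call.
std::string reset_global_locale()
{
    // std::locale::global installs the new locale and returns the old one.
    std::locale previous = std::locale::global(std::locale::classic());
    return previous.name();
}

// Hypothetical helper: a default-constructed std::locale is a copy of the
// current global locale, which is why other C++ modules in the same
// process can observe the change.
std::string current_global_name()
{
    return std::locale().name();
}
```

After reset_global_locale() has run, current_global_name() reports "C" regardless of what was configured before.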

The locale can be applied or changed for any number of distinct categories, represented by locale_category:

enum locale_category (<<= 1)
{
    locale_none = 0,
    locale_collate = 1,
    locale_ctype,
    locale_monetary,
    locale_numeric,
    locale_time,
    locale_messages,
    locale_all = -1,
}
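The (<<= 1) increment makes every constant without an explicit value the previous one shifted left by one bit, so each category occupies its own bit. The C++ mirror below spells the values out; it assumes that categories are meant to be combined with the bitwise OR operator, and has_category is a hypothetical helper, not part of PawnPlus.

```cpp
#include <cstdint>

// C++ mirror of the Pawn enum with the shifted values written explicitly.
enum locale_category : std::int32_t
{
    locale_none     = 0,
    locale_collate  = 1 << 0, // 1
    locale_ctype    = 1 << 1, // 2
    locale_monetary = 1 << 2, // 4
    locale_numeric  = 1 << 3, // 8
    locale_time     = 1 << 4, // 16
    locale_messages = 1 << 5, // 32
    locale_all      = -1,     // all bits set
};

// Hypothetical helper: checks whether a combined mask includes a category.
constexpr bool has_category(std::int32_t mask, locale_category category)
{
    return (mask & category) != 0;
}
```

Under that assumption, a mask such as locale_ctype | locale_numeric selects exactly those two categories, while locale_all matches every one of them.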

These categories affect the following areas of the plugin:

  • locale_collate controls character equivalence and comparisons. It is used only for regular expressions when regex_collate is set. In such a case, character ranges (e.g. [a-z]) will use the order of characters imposed by the locale.
  • locale_ctype specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for str_to_lower/str_set_to_lower, str_to_upper/str_set_to_upper, and regular expressions, either when character classes like \s or [[:alpha:]] are used, or with regex_icase.
  • locale_numeric defines how numbers are formatted, for example which character is used for the decimal point (e.g. . or ,). It is used by str_format and similar, including tag_op_string and tag_op_format.
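To illustrate what the numeric category changes, the C++ sketch below formats the same number with the invariant locale and with a locale whose decimal point is a comma. The comma_decimal facet is a hypothetical stand-in for a real system locale, so the example does not depend on which locales are installed; it is not how the plugin itself is implemented.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a system locale with a decimal comma.
struct comma_decimal : std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats a number through a stream imbued with the given locale.
std::string format_with(const std::locale &loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}

// Formats with the invariant ("C") locale.
std::string format_classic(double value)
{
    return format_with(std::locale::classic(), value);
}

// Formats with the invariant locale plus the decimal-comma facet.
std::string format_comma(double value)
{
    // The std::locale object takes ownership of the facet.
    std::locale loc(std::locale::classic(), new comma_decimal);
    return format_with(loc, value);
}
```

Only the numeric facet differs between the two locales, which mirrors how a single category can be changed while the rest stay invariant.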

To make the script encoding-aware, only locale_ctype is necessary. Some functions also allow entering the encoding manually.

Locale identifier format

Functions taking a locale or encoding use a unified format to identify it:

encoding+locale name;parameters;…|…

The whole identifier consists of |-separated alternatives. If the locale name does not identify a valid system locale, the next alternative is used instead (with its own encoding and parameters).
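The structure of the identifier can be made concrete with a small parser. The following C++ sketch only illustrates the format as described here; the plugin's real parser may treat edge cases differently (for example a + inside a locale name), and locale_alternative, split and parse_identifier are hypothetical names.

```cpp
#include <string>
#include <vector>

// One |-separated alternative of the identifier.
struct locale_alternative
{
    std::string encoding;                // part before '+', "ansi" if absent
    std::string name;                    // locale name
    std::vector<std::string> parameters; // ;-separated options
};

// Splits text on a separator; always returns at least one element.
static std::vector<std::string> split(const std::string &text, char separator)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for(;;)
    {
        std::string::size_type end = text.find(separator, start);
        if(end == std::string::npos)
        {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

// Parses "encoding+locale name;parameters;...|..." into its alternatives.
std::vector<locale_alternative> parse_identifier(const std::string &identifier)
{
    std::vector<locale_alternative> result;
    for(const std::string &alternative : split(identifier, '|'))
    {
        std::vector<std::string> fields = split(alternative, ';');
        const std::string &head = fields[0];
        locale_alternative parsed;
        std::string::size_type plus = head.find('+');
        if(plus == std::string::npos)
        {
            parsed.encoding = "ansi"; // documented default
            parsed.name = head;
        }
        else
        {
            parsed.encoding = head.substr(0, plus);
            parsed.name = head.substr(plus + 1);
        }
        parsed.parameters.assign(fields.begin() + 1, fields.end());
        result.push_back(parsed);
    }
    return result;
}
```

For instance, "utf8+cs-CZ;trunc;ucs|cs_CZ.cp1250" yields two alternatives: the first with encoding utf8, name cs-CZ and two parameters, the second with the default ansi encoding and no parameters.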

Encoding may be one of the following:

  • ansi ‒ strings use the narrow (8-bit) character set defined by the locale (commonly referred to as the ANSI encoding). Only character codes 0‒255 are assigned.
  • unicode ‒ strings use the wide character set defined by the locale. This is generally equivalent to utf16 on Windows, and utf32 on Linux.
  • utf8 ‒ strings use the UTF-8 encoding ‒ characters 0‒127 are encoded directly, while higher characters are broken into multiple cells taking the range 128‒255.
  • utf16 ‒ strings use the UCS-2 or UTF-16 encoding. Only code points 0‒0xFFFF are assigned.
  • utf32 ‒ strings use the UTF-32 encoding.

If the encoding is omitted (including the +), it defaults to ansi.
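As an illustration of how utf8 spreads a character over several cells, the C++ sketch below encodes one code point into its UTF-8 bytes, one byte per cell. It follows the standard UTF-8 layout and is not taken from the plugin; utf8_cells is a hypothetical name and the input is assumed to be a valid code point up to U+10FFFF.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-8, storing each byte in its own cell.
std::vector<std::int32_t> utf8_cells(std::uint32_t code_point)
{
    std::vector<std::int32_t> cells;
    auto push = [&cells](std::uint32_t byte)
    {
        cells.push_back(static_cast<std::int32_t>(byte));
    };
    if(code_point < 0x80)
    {
        // 0-127: stored directly.
        push(code_point);
    }
    else if(code_point < 0x800)
    {
        // Two cells, both in the range 128-255.
        push(0xC0 | (code_point >> 6));
        push(0x80 | (code_point & 0x3F));
    }
    else if(code_point < 0x10000)
    {
        // Three cells.
        push(0xE0 | (code_point >> 12));
        push(0x80 | ((code_point >> 6) & 0x3F));
        push(0x80 | (code_point & 0x3F));
    }
    else
    {
        // Four cells.
        push(0xF0 | (code_point >> 18));
        push(0x80 | ((code_point >> 12) & 0x3F));
        push(0x80 | ((code_point >> 6) & 0x3F));
        push(0x80 | (code_point & 0x3F));
    }
    return cells;
}
```

With this layout, 'A' stays a single cell 0x41, while U+010D (č) becomes the two cells 0xC4 0x8D.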

Note that using an encoding other than ansi or unicode outside of str_convert or str_set_convert does not bring any improvement to character manipulation. UTF is implemented only for compatibility in such cases, always resorting back to the system's native unicode support, and character conversions or comparisons are meaningful only for characters occupying a single cell. This has the following implications:

  • UTF-8 has access only to ASCII characters (0‒127). Multi-byte characters are opaque to all operations.
  • UTF-16 does not recognize surrogate pairs as single characters, being limited to the Basic Multilingual Plane.
  • UTF-32 on Windows does not recognize any character outside of the BMP either, as no facilities are provided to access such characters.
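The BMP limitation follows from how UTF-16 stores higher code points: anything above U+FFFF is split into two surrogate cells, and neither cell is a character on its own. The C++ sketch below shows the standard split; utf16_cells is a hypothetical name, it is not the plugin's code, and the input is assumed to be a valid code point that is not itself a surrogate.

```cpp
#include <cstdint>
#include <vector>

// Encodes one code point as UTF-16, storing each 16-bit unit in its own cell.
std::vector<std::int32_t> utf16_cells(std::uint32_t code_point)
{
    std::vector<std::int32_t> cells;
    if(code_point < 0x10000)
    {
        // Basic Multilingual Plane: one cell.
        cells.push_back(static_cast<std::int32_t>(code_point));
    }
    else
    {
        // Supplementary planes: a high and a low surrogate.
        std::uint32_t offset = code_point - 0x10000;
        cells.push_back(static_cast<std::int32_t>(0xD800 + (offset >> 10)));
        cells.push_back(static_cast<std::int32_t>(0xDC00 + (offset & 0x3FF)));
    }
    return cells;
}
```

U+1F600, for example, becomes the pair 0xD83D 0xDE00; operations that work cell by cell see two unrelated values rather than one character.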

Parameters are ;-separated options that affect the concrete behaviour of functions. They may be omitted (including the leading ;). They can be one of the following:

  • trunc ‒ this changes the semantics of cells storing characters outside of the range defined by the encoding. By default, undefined characters are treated as opaque by case conversion and comparison functions. When this parameter is set, such cells are truncated to the character bit size. For example, a cell value 0x8800 | 'A' is not treated as a letter by default, but with trunc, it is recognized as a letter, and could be converted to lowercase 0x8800 | 'a'.
  • ucs ‒ switches from UTF-16 to UCS-2-compatible behaviour. In practical terms, this means that surrogate characters are treated as regular characters: when converting from UTF-16 to UTF-8, surrogate pairs take up two characters; when using UTF-32, U+D800 to U+DFFF (the surrogate range) are valid individually.
  • bom ‒ when converting from a Unicode encoding, the byte order mark (BOM) can be encountered (for UTF-16, it may be used to specify the endianness); when converting to a Unicode encoding, the BOM is generated.
  • maxrange ‒ by default, Unicode only defines characters up to U+10FFFF. With this option set, UTF-32 accepts even characters outside this range.
  • fallback=X ‒ sets X as the fallback character, used when conversion fails. By default, ? is used. The character must be within the range 0‒255.
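The effect of trunc on the 0x8800 | 'A' example can be sketched in C++ as follows. The sketch assumes an 8-bit (ansi) character size and that the bits above the character are carried over unchanged, as the example in the list suggests; cell_to_lower is a hypothetical function, not the plugin's implementation.

```cpp
#include <cstdint>
#include <locale>

// Lowercases a single cell, assuming an 8-bit character size.
std::int32_t cell_to_lower(std::int32_t cell, bool trunc,
                           const std::locale &loc = std::locale::classic())
{
    const std::int32_t char_mask = 0xFF;
    const std::int32_t high_bits = cell & ~char_mask;
    if(high_bits != 0 && !trunc)
    {
        // Outside the encoding's range: treated as opaque by default.
        return cell;
    }
    // With trunc, only the low 8 bits are interpreted as the character.
    char narrow = static_cast<char>(static_cast<unsigned char>(cell & char_mask));
    char lowered = std::tolower(narrow, loc);
    // The remaining bits are kept as they were.
    return high_bits | static_cast<unsigned char>(lowered);
}
```

Without trunc the cell 0x8800 | 'A' is returned untouched; with trunc it becomes 0x8800 | 'a'.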

Only trunc and ucs make sense in pp_locale and can be globally set. The other parameters need to be specified explicitly during each conversion.

Determining the locale name

The locale name used by locale-aware functions needs to be pre-defined. It may be empty ("") to use the system's native locale, "C" to use the invariant locale, or any system-provided locale name, which can be found on POSIX systems by running locale -a.

Windows

On Windows, the locale name uses the formats <language>, <language>-<REGION>, <language>-<Script>, or <language>-<Script>-<REGION>. A locale corresponds to a particular code page used when interpreting ANSI text. An overview of common locales and their code pages can be found here.

As an example, the locale name cs-CZ corresponds to the Czech language and regional settings, and uses the encoding Windows-1250.

A locale can also be identified just by specifying a codepage, in the form .<codepage>, for example .1250 to have a similar effect as above when dealing with encoding.

Linux

On Linux, the set of supported locales can be extended by running the localedef command by using pre-existing language and character mapping definitions.

For example, localedef -i cs_CZ -f CP1250 cs_CZ.CP1250 creates a new locale named cs_CZ.CP1250 using the Windows-1250 encoding.

Trying multiple locale names

When called with a locale name that does not exist on the system, pp_locale raises an error. It can be used together with pawn_try_call_native to attempt to set the locale and warn if none is found:

new result;
if(
    // Try the Windows-style name first; && skips the second call if it succeeds.
    (pawn_try_call_native(pawn_nameof(pp_locale), result, "sd", "cs-CZ", locale_ctype) != amx_err_none) &&
    // Otherwise fall back to the Linux-style name.
    (pawn_try_call_native(pawn_nameof(pp_locale), result, "sd", "cs_CZ.cp1250", locale_ctype) != amx_err_none)
)
{
    print("Warning: No character locale data can be set!");
}
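The same try-in-order idea can be expressed against std::locale directly, which throws std::runtime_error when a name is not available. The C++ sketch below is an analogy for the Pawn snippet above rather than plugin code; first_available_locale is a hypothetical helper.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Stores the first candidate name that std::locale accepts in 'chosen'.
// Returns false (leaving 'chosen' untouched) if none is available.
bool first_available_locale(const std::vector<std::string> &candidates,
                            std::string &chosen)
{
    for(const std::string &name : candidates)
    {
        try
        {
            // Constructing the locale validates the name.
            std::locale probe(name.c_str());
            chosen = name;
            return true;
        }
        catch(const std::runtime_error &)
        {
            // Not available on this system; try the next candidate.
        }
    }
    return false;
}
```

A caller would pass the platform-specific spellings in order of preference, for example {"cs-CZ", "cs_CZ.cp1250"}, and emit a warning when the function returns false.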