Configuring the locale for language and encoding‐aware operations

IS4 edited this page Jul 3, 2024 · 16 revisions

PawnPlus can make use of the system's cultural settings (the "locale") through the mechanisms exposed by std::locale in C++. The locale is used for formatting, character conversion, and comparison.

Overview

When loaded, the plugin sets the global locale (via std::locale::global) to the invariant one (std::locale::classic, commonly identified as "C" or "POSIX"), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through pp_locale. Note that other C++ modules may share the same global locale, so this setting affects them as well. The C locale (used by modules written in C and set by std::setlocale) is not affected.
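
As a rough illustration of the reset described above, the following C++ sketch replaces the global locale with the invariant one. The function names reset_global_locale and demo_reset are hypothetical and are not part of PawnPlus. In standard C++, std::locale::global applied to a named locale also calls std::setlocale, so this sketch does not attempt to reproduce how the plugin keeps the C locale untouched.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: reset the process-wide C++ locale to the invariant
// one and return the name of the locale that was active before.
std::string reset_global_locale()
{
    std::locale previous = std::locale::global(std::locale::classic());
    return previous.name();
}

// After the reset, default-constructed locales are copies of "C".
void demo_reset()
{
    reset_global_locale();
    assert(std::locale().name() == "C");
    assert(std::locale() == std::locale::classic());
}
```

Any C++ module that default-constructs a std::locale after this point observes the invariant locale, which is why the setting is shared across modules.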

The locale can be applied or changed for any number of distinct categories, represented by locale_category:

enum locale_category (<<= 1)
{
    locale_none = 0,
    locale_collate = 1,
    locale_ctype,
    locale_monetary,
    locale_numeric,
    locale_time,
    locale_messages,
    locale_all = -1,
}
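
Because the enum is declared with (<<= 1), each constant after locale_collate is the previous one shifted left by one bit, so categories can be combined as a bit mask. The C++ mirror below spells out the resulting values. The helper has_category and the function demo_categories are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// C++ mirror of the Pawn enum above, with the shifts written out.
enum locale_category : int
{
    locale_none     = 0,
    locale_collate  = 1,
    locale_ctype    = locale_collate  << 1, // 2
    locale_monetary = locale_ctype    << 1, // 4
    locale_numeric  = locale_monetary << 1, // 8
    locale_time     = locale_numeric  << 1, // 16
    locale_messages = locale_time     << 1, // 32
    locale_all      = -1,                   // every bit set
};

// Hypothetical helper: true when the mask contains the given category bit.
constexpr bool has_category(int mask, locale_category category)
{
    return (mask & category) != 0;
}

void demo_categories()
{
    int mask = locale_ctype | locale_numeric; // 2 | 8
    assert(mask == 10);
    assert(has_category(mask, locale_ctype));
    assert(!has_category(mask, locale_collate));
    assert(has_category(locale_all, locale_time));
}
```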

These categories affect the following areas of the plugin:

  • locale_collate controls character equivalence and comparisons. It is used only for regular expressions when regex_collate is set. In such a case, character ranges (e.g. [a-z]) will use the order of characters imposed by the locale.
  • locale_ctype specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for str_to_lower/str_set_to_lower, str_to_upper/str_set_to_upper, and regular expressions, either when character classes like \s or [[:alpha:]] are used, or with regex_icase.
  • locale_numeric defines how numbers are formatted, for example which character is used for the decimal point (e.g. . or ,). It is used by str_format and similar, including tag_op_string and tag_op_format.

To make the script encoding-aware, only locale_ctype is necessary. Some functions also allow entering the encoding manually.
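
To show what the numeric category changes at the std::locale level, the sketch below installs a custom std::numpunct facet that uses , as the decimal point. This avoids depending on any locale being installed on the system. The names comma_decimal and format_with_comma are hypothetical, and the sketch says nothing about how str_format is implemented internally.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical numeric facet that only changes the decimal point.
struct comma_decimal : std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return ','; }
};

// Format a number with a locale whose numeric category uses ','.
std::string format_with_comma(double value)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new comma_decimal));
    out << value;
    return out.str();
}

void demo_numeric()
{
    assert(format_with_comma(3.5) == "3,5");
}
```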

Locale identifier format

Functions taking a locale or encoding use a unified format to identify it:

encoding+locale name;parameters;…|…

The whole identifier consists of |-separated alternatives. They are tried in order until the locale name of an alternative identifies a valid system locale; the encoding and parameters of that alternative are then used. Any component may be omitted together with its separator (when the parameters are omitted, the preceding ; must be omitted too).
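
The structure of the identifier can be sketched with a small parser. This is not the plugin's parser: the type locale_alternative and the functions split_on and parse_identifier are hypothetical, and the sketch assumes that a head without + is a locale name with the encoding left at its default of ansi.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation of one |-separated alternative.
struct locale_alternative
{
    std::string encoding;                // part before '+', "ansi" if omitted
    std::string name;                    // locale name, may be empty
    std::vector<std::string> parameters; // ;-separated options
};

// Split text on a separator; always returns at least one element.
std::vector<std::string> split_on(const std::string &text, char separator)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;)
    {
        std::size_t end = text.find(separator, start);
        if (end == std::string::npos)
        {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

std::vector<locale_alternative> parse_identifier(const std::string &identifier)
{
    std::vector<locale_alternative> result;
    for (const std::string &alternative : split_on(identifier, '|'))
    {
        std::vector<std::string> fields = split_on(alternative, ';');
        const std::string &head = fields[0];
        locale_alternative parsed;
        std::size_t plus = head.find('+');
        if (plus == std::string::npos)
        {
            parsed.name = head; // assumption: no '+' means locale name only
        }
        else
        {
            parsed.encoding = head.substr(0, plus);
            parsed.name = head.substr(plus + 1);
        }
        if (parsed.encoding.empty())
        {
            parsed.encoding = "ansi"; // documented default
        }
        parsed.parameters.assign(fields.begin() + 1, fields.end());
        result.push_back(parsed);
    }
    return result;
}
```

For instance, parse_identifier("utf8+cs_CZ.cp1250;trunc|cs-CZ") yields two alternatives: the first with encoding utf8, name cs_CZ.cp1250 and the parameter trunc, the second with the default encoding and the name cs-CZ.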

Encoding

Encoding may be one of the following:

  • ansi ‒ strings use the narrow (8-bit) character set defined by the locale (commonly referred to as the ANSI encoding). Only codes 0‒255 are assigned. Multi-byte encodings (where a character may be stored in multiple cells) are permitted.
  • unicode ‒ strings use the wide character set defined by the locale. This is generally equivalent to utf16 on Windows, and utf32 on Linux.
  • utf8 ‒ strings use the UTF-8 encoding ‒ characters 0‒127 are encoded directly, while higher characters are broken into multiple cells taking the range 128‒255.
  • utf16 ‒ strings use the UCS-2 or UTF-16 encoding. Only code units 0‒0xFFFF are assigned.
  • utf32 ‒ strings use the UTF-32 encoding.

If omitted, it defaults to ansi.

Note that using an encoding other than ansi or unicode outside of str_convert or str_set_convert does not bring any improvement to character manipulation. In such cases, UTF is implemented only for compatibility and always falls back to the system's native unicode support.

Character conversions or comparisons are meaningful only for characters occupying a single cell. This has these implications:

  • UTF-8 has access only to ASCII characters (0‒127). Multi-byte characters are opaque to all operations (likewise for all general multi-byte encodings).
  • UTF-16 does not recognize surrogate pairs as single characters, being limited to the Basic Multilingual Plane.
  • UTF-32 on Windows does not recognize any character outside of the BMP either, as no facilities are provided to access such characters.
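
The first implication can be illustrated by converting UTF-8 text cell by cell. In the sketch below, which uses the invariant locale and a hypothetical function upper_cells, the ASCII letters change case while the two bytes of a multi-byte character pass through untouched.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Uppercase every cell (byte) independently, as a per-cell operation would.
std::string upper_cells(std::string text)
{
    const std::locale &invariant = std::locale::classic();
    for (char &cell : text)
    {
        cell = std::toupper(cell, invariant);
    }
    return text;
}

void demo_utf8_cells()
{
    // "héllo" in UTF-8: 'é' (U+00E9) is stored as the bytes 0xC3 0xA9.
    std::string input = std::string("h") + "\xC3\xA9" + "llo";
    std::string expected = std::string("H") + "\xC3\xA9" + "LLO";
    assert(upper_cells(input) == expected);
}
```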

Parameters

Parameters are ;-separated options that affect the concrete behaviour of functions. They can be one of the following:

  • trunc ‒ this changes the semantics of cells storing characters outside of the range defined by the encoding. By default, such cells are treated as opaque by case conversion and comparison functions. When this parameter is set, they are truncated to the code unit bit size. For example, a cell value 0x8800 | 'A' is not treated as a letter by default in the ANSI encoding in the C locale, but with ;trunc, it is recognized as a letter, and may be converted to lowercase 0x8800 | 'a'.
  • ucs ‒ switches from UTF-16 to UCS-2-compatible behaviour. In practical terms, surrogate code units are treated as regular characters: when converting from UTF-16 to UTF-8, a surrogate pair takes up two characters; when using UTF-32, the code points U+D800 to U+DFFF (surrogates) are individually valid.
  • bom ‒ when converting from a Unicode encoding, the byte order mark (BOM) is recognized (for UTF-16, it may be used to specify the endianness); when converting to a Unicode encoding, the BOM is generated.
  • maxrange ‒ by default, only characters up to U+10FFFF, the maximum defined by Unicode, are permitted. With this option set, UTF-32 also accepts characters outside this range.
  • native ‒ for conversion between UTF-8 and UTF-16/32, use the locale to perform the conversion instead of a unified implementation. There is generally no reason to use this parameter, since there should be no locale-based variations in the encoding, and other Unicode-affecting parameters may not be respected for the conversion.
  • fallback=X ‒ sets X as the fallback character, used when conversion fails. By default, ? is used. This character must be within 0‒255.
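
The effect of trunc on the 0x8800 | 'A' example can be sketched as follows. The sketch assumes an 8-bit code unit (the ANSI encoding) and the C locale, where only A to Z have lowercase forms. The type alias cell and the function to_lower_cell are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using cell = std::int32_t; // a Pawn cell, assumed to be 32 bits wide here

// Lowercase one cell under an 8-bit encoding in the C locale.
cell to_lower_cell(cell value, bool trunc)
{
    const cell unit_mask = 0xFF;    // bits covered by the code unit
    cell high = value & ~unit_mask; // bits outside the encoding's range
    if (high != 0 && !trunc)
    {
        return value;               // opaque by default
    }
    cell unit = value & unit_mask;  // truncate to the code unit size
    if (unit >= 'A' && unit <= 'Z')
    {
        unit += 'a' - 'A';
    }
    return high | unit;             // the upper bits are kept as they were
}

void demo_trunc()
{
    assert(to_lower_cell(0x8800 | 'A', false) == (0x8800 | 'A'));
    assert(to_lower_cell(0x8800 | 'A', true) == (0x8800 | 'a'));
}
```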

Only trunc and ucs make sense in pp_locale and can be globally set. The other parameters need to be specified explicitly during each conversion.

Determining the locale name

The locale name used by locale-aware functions needs to be pre-defined. It may be empty ("") to use the default locale, "C" to use the invariant locale, or any other system-provided locale name, which can be found on POSIX systems by running locale -a.

Windows

On Windows, the locale name uses the formats <language>, <language>-<REGION>, <language>-<Script>, or <language>-<Script>-<REGION>. A locale corresponds to a particular code page used when interpreting ANSI text. An overview of common locales and their code pages can be found here.

As an example, the locale name cs-CZ corresponds to the Czech language and regional settings, and uses the encoding Windows-1250.

A locale can also be identified just by its code page, in the form .<codepage>. For example, .1250 can be used to a similar effect as above when dealing with encodings.

Linux

On Linux, the set of supported locales can be extended with the localedef command, which builds a new locale from pre-existing language and character mapping definitions.

For example, localedef -i cs_CZ -f CP1250 cs_CZ.CP1250 creates a new locale named cs_CZ.CP1250 using the Windows-1250 encoding.

Examples

cs_CZ.cp1250|cs-CZ
A locale identifier with two alternatives. If `cs_CZ.cp1250` is not found, `cs-CZ` is attempted next.
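
Trying alternatives in order can be sketched in C++ by attempting to construct a std::locale from each name and moving on when construction fails. The function first_valid_locale is hypothetical. It returns -1 when no name is valid, which is a choice made for this sketch and not a statement about what the plugin does in that situation.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Return the index of the first name accepted by the system, or -1.
int first_valid_locale(const std::vector<std::string> &names)
{
    for (std::size_t i = 0; i < names.size(); ++i)
    {
        try
        {
            std::locale candidate(names[i]); // throws if the name is unknown
            return static_cast<int>(i);
        }
        catch (const std::runtime_error &)
        {
            // Not a valid system locale; try the next alternative.
        }
    }
    return -1;
}
```

Calling first_valid_locale({"cs_CZ.cp1250", "cs-CZ"}) returns 0 on a Linux system where the locale from the localedef example has been created, 1 on a Windows system that recognises cs-CZ, and -1 where neither name is known.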