Configuring the locale for language and encoding‐aware operations
PawnPlus can make use of the system's cultural settings (the "locale") through the mechanisms exposed by `std::locale` in C++, which are used for formatting, character conversion, and comparison.
When loaded, the plugin sets the global locale (via `std::locale::global`) to the invariant one (`std::locale::classic`, commonly identified as `"C"` or `"POSIX"`), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through `pp_locale`. Note that other C++ modules may share the same global locale, so this setting affects them as well. The C locale (used by modules written in C and set by `std::setlocale`) is not affected.
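The startup behaviour described above can be illustrated with plain standard C++. This is only a sketch of the `std::locale` calls involved, not the plugin's actual code; the function name `reset_global_locale` is hypothetical.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: replaces the global C++ locale with the invariant
// "C" locale and returns the name of the locale that was active before.
std::string reset_global_locale()
{
    // std::locale::global installs the new locale and returns the old one.
    std::locale previous = std::locale::global(std::locale::classic());

    // A default-constructed std::locale is a copy of the current global locale.
    assert(std::locale() == std::locale::classic());
    return previous.name();
}
```

Any module in the same process that default-constructs a `std::locale` after this call observes the invariant locale, which is why the setting is shared between C++ modules.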
The locale can be applied or changed for any number of distinct categories, represented by `locale_category`:
    enum locale_category (<<= 1)
    {
        locale_none = 0,
        locale_collate = 1,
        locale_ctype,
        locale_monetary,
        locale_numeric,
        locale_time,
        locale_messages,
        locale_all = -1,
    }
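Because the enum uses the `(<<= 1)` increment, each category after `locale_collate` doubles the previous value, so the categories behave as bit flags. The following C++ mirror is only an illustration of the resulting values; the helper `has_category` is hypothetical, and combining categories with `|` is an assumption based on the flag-like values.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative C++ mirror of the Pawn enum; not part of PawnPlus itself.
enum locale_category : std::int32_t
{
    locale_none     = 0,
    locale_collate  = 1,
    locale_ctype    = locale_collate  << 1, // 2
    locale_monetary = locale_ctype    << 1, // 4
    locale_numeric  = locale_monetary << 1, // 8
    locale_time     = locale_numeric  << 1, // 16
    locale_messages = locale_time     << 1, // 32
    locale_all      = -1,                   // every bit set
};

// Hypothetical helper: true when every bit of category c is set in mask.
inline bool has_category(std::int32_t mask, locale_category c)
{
    assert(c != locale_none && "locale_none has no bits to test");
    return (mask & c) == c;
}
```

Under that assumption, a mask such as `locale_ctype | locale_numeric` would select exactly two categories, while `locale_all` covers all of them.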
These categories affect the following areas of the plugin:
- `locale_collate` controls character equivalence and comparisons. It is used only for regular expressions when `regex_collate` is set. In such a case, character ranges (e.g. `[a-z]`) will use the order of characters imposed by the locale.
- `locale_ctype` specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for `str_to_lower`/`str_set_to_lower`, `str_to_upper`/`str_set_to_upper`, and regular expressions, either when character classes like `\s` or `[[:alpha:]]` are used, or with `regex_icase`.
- `locale_numeric` defines how numbers are formatted, for example which character is used for the decimal point (e.g. `.` or `,`). It is used by `str_format` and similar, including `tag_op_string` and `tag_op_format`.
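In standard C++, these three categories correspond to the `std::collate`, `std::ctype`, and `std::numpunct` facets of a `std::locale`. The sketch below shows how each facet can be queried directly; the helper names are hypothetical, and how the plugin uses the facets internally is not shown here.

```cpp
#include <cassert>
#include <locale>
#include <string>

// ctype: converts every character to uppercase according to the locale.
std::string to_upper(std::string text, const std::locale& loc)
{
    assert(std::has_facet<std::ctype<char>>(loc));
    const auto& ct = std::use_facet<std::ctype<char>>(loc);
    if (!text.empty())
        ct.toupper(&text[0], &text[0] + text.size());
    return text;
}

// numeric: the character the locale uses as the decimal point.
char decimal_point(const std::locale& loc)
{
    return std::use_facet<std::numpunct<char>>(loc).decimal_point();
}

// collate: negative, zero, or positive depending on the locale's ordering.
int compare(const std::string& a, const std::string& b, const std::locale& loc)
{
    const auto& co = std::use_facet<std::collate<char>>(loc);
    return co.compare(a.data(), a.data() + a.size(),
                      b.data(), b.data() + b.size());
}
```

With `std::locale::classic()`, `decimal_point` yields `.` and only ASCII letters change case; a locale such as Czech would typically report `,` as the decimal point instead.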
To make the script encoding-aware, only `locale_ctype` is necessary. Some functions also allow entering the encoding manually.
Functions taking a locale or encoding use a unified format to identify it:
encoding+locale name;parameters;…|…
The whole identifier consists of `|`-separated alternatives, which are tried in order until an alternative's locale name identifies a valid system locale; the encoding and parameters of that alternative are then used. Any of the components may be omitted, along with their separators (when the parameters are omitted, the preceding `;` must be omitted too).
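To make the layout of the identifier concrete, here is a minimal sketch of how such a string could be split into its parts. It is not PawnPlus's parser: the type `locale_alternative` and the function names are hypothetical, and the rule that a lone component counts as an encoding only when it matches one of the documented encoding names is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct locale_alternative
{
    std::string encoding = "ansi";       // documented default when omitted
    std::string locale_name;             // empty means the default locale
    std::vector<std::string> parameters; // e.g. "trunc", "fallback=X"
};

// Splits text on sep; always returns at least one (possibly empty) part.
static std::vector<std::string> split(const std::string& text, char sep)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;)
    {
        std::size_t end = text.find(sep, start);
        if (end == std::string::npos)
        {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

static bool is_encoding_name(const std::string& s)
{
    return s == "ansi" || s == "unicode" || s == "utf8"
        || s == "utf16" || s == "utf32";
}

std::vector<locale_alternative> parse_locale_identifier(const std::string& spec)
{
    std::vector<locale_alternative> result;
    for (const std::string& alt : split(spec, '|'))
    {
        std::vector<std::string> fields = split(alt, ';');
        assert(!fields.empty());
        const std::string& head = fields[0];

        locale_alternative parsed;
        std::size_t plus = head.find('+');
        if (plus != std::string::npos)
        {
            if (plus > 0) parsed.encoding = head.substr(0, plus);
            parsed.locale_name = head.substr(plus + 1);
        }
        else if (is_encoding_name(head))
            parsed.encoding = head;    // assumption: bare encoding name
        else
            parsed.locale_name = head; // otherwise treat it as a locale name

        for (std::size_t i = 1; i < fields.size(); ++i)
            if (!fields[i].empty()) parsed.parameters.push_back(fields[i]);
        result.push_back(parsed);
    }
    return result;
}
```

For instance, `utf8+cs_CZ.cp1250;trunc;bom|cs-CZ` would yield two alternatives under these assumptions: the first with encoding `utf8` and two parameters, the second with the default `ansi` encoding and no parameters.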
Encoding may be one of the following:
- `ansi` ‒ strings use the narrow (8-bit) character set defined by the locale (commonly referred to as the ANSI encoding). Only codes 0‒255 are assigned. Multi-byte encodings (where a character may be stored in multiple cells) are permitted.
- `unicode` ‒ strings use the wide character set defined by the locale. This is generally equivalent to `utf16` on Windows, and `utf32` on Linux.
- `utf8` ‒ strings use the UTF-8 encoding ‒ characters 0‒127 are encoded directly, while higher characters are broken into multiple cells taking the range 128‒255.
- `utf16` ‒ strings use the UCS-2 or UTF-16 encoding. Only code units 0‒0xFFFF are assigned.
- `utf32` ‒ strings use the UTF-32 encoding.
If omitted, it defaults to `ansi`.
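The way UTF-8 spreads a character over several cells can be shown with a small standalone encoder. This is a generic sketch of the UTF-8 bit layout, not the plugin's conversion routine; `utf8_encode` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-8 code units (one per cell).
std::vector<std::uint8_t> utf8_encode(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && "outside the Unicode range");
    std::vector<std::uint8_t> out;
    if (cp < 0x80)
    {
        // 0-127: stored directly in a single cell.
        out.push_back(static_cast<std::uint8_t>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

The letter `A` (U+0041) stays a single cell, while `č` (U+010D) becomes the two cells `0xC4 0x8D`, both in the range 128‒255.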
Note that using an encoding other than `ansi` or `unicode` outside of `str_convert` or `str_set_convert` does not bring any improvement to character manipulation. In such cases, UTF is implemented only for compatibility and always falls back to the system's native `unicode` support.
Character conversions or comparisons are meaningful only for characters occupying a single cell. This has these implications:
- UTF-8 has access only to ASCII characters (0‒127). Multi-byte characters are opaque to all operations (likewise for all general multi-byte encodings).
- UTF-16 does not recognize surrogate pairs as single characters, being limited to the Basic Multilingual Plane.
- UTF-32 on Windows does not recognize any character outside of the BMP either, as no facilities are provided to access such characters.
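The UTF-16 limitation follows from how characters outside the BMP are stored. The sketch below is a generic UTF-16 encoder (the names `utf16_encode` and `is_surrogate` are hypothetical) showing that such a character occupies two cells, neither of which is a character on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True for code units in the surrogate range U+D800 to U+DFFF.
bool is_surrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}

// Encodes one Unicode code point as UTF-16 code units (one per cell).
std::vector<std::uint16_t> utf16_encode(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && "outside the Unicode range");
    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: a single cell holds the character.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into two surrogates.
    std::uint32_t offset = cp - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (offset >> 10)),   // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))  // low surrogate
    };
}
```

U+1F600, for example, is stored as the cells `0xD83D 0xDE00`. An operation that inspects one cell at a time sees two surrogates rather than one character, so it cannot classify or convert it.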
Parameters are `;`-separated options that affect the concrete behaviour of functions. Each can be one of the following:
- `trunc` ‒ this changes the semantics of cells storing characters outside of the range defined by the encoding. By default, such cells are treated as opaque by case conversion and comparison functions. When this parameter is set, they are truncated to the code unit bit size. For example, a cell value `0x8800 | 'A'` is not treated as a letter by default in the ANSI encoding in the `C` locale, but with `;trunc`, it is recognized as a letter, and may be converted to lowercase `0x8800 | 'a'`.
- `ucs` ‒ switches from UTF-16 to UCS-2-compatible behaviour. In practical terms, this means that surrogate characters are treated as regular characters: when converting from UTF-16 to UTF-8, surrogate pairs take up two characters; when using UTF-32, `U+D800` to `U+DFFF` (the surrogate code points) are valid individually.
- `bom` ‒ when converting from a Unicode encoding, the byte order mark (BOM) is recognized (for UTF-16, it may be used to specify the endianness); when converting to a Unicode encoding, the BOM is generated.
- `maxrange` ‒ by default, only characters up to U+10FFFF, as defined by Unicode, are permitted. With this option set, UTF-32 accepts even characters outside this range.
- `native` ‒ for conversion between UTF-8 and UTF-16/32, use the locale to perform the conversion instead of a unified implementation. There is generally no reason to use this parameter, since there should be no locale-based variations in the encoding, and other Unicode-affecting parameters may not be respected for the conversion.
- `fallback=X` ‒ sets X as the fallback character, used when conversion fails. By default, `?` is used. This character must be within 0‒255.
Only `trunc` and `ucs` make sense in `pp_locale` and can be set globally. The other parameters need to be specified explicitly during each conversion.
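The effect of `trunc` on a single cell can be modelled as follows. This is a sketch of the described semantics for the 8-bit ANSI case only; the function `cell_to_lower` is hypothetical, and the assumption that the bits above the code unit are preserved in the result is taken from the `0x8800 | 'a'` example above.

```cpp
#include <cassert>
#include <cstdint>
#include <locale>

// Lowercases one cell under an 8-bit encoding.
// Without trunc, a cell with bits set above the low byte is left untouched.
// With trunc, only the low byte is examined and the upper bits are kept.
std::int32_t cell_to_lower(std::int32_t cell, bool trunc,
                           const std::locale& loc = std::locale::classic())
{
    const std::int32_t low  = cell & 0xFF;   // the 8-bit code unit
    const std::int32_t high = cell & ~0xFF;  // everything above it
    assert((low | high) == cell);

    if (high != 0 && !trunc)
        return cell; // opaque: outside the range defined by the encoding

    const char ch = static_cast<char>(static_cast<unsigned char>(low));
    const char lowered = std::use_facet<std::ctype<char>>(loc).tolower(ch);
    return high | static_cast<unsigned char>(lowered);
}
```

Under these assumptions, `0x8800 | 'A'` is returned unchanged when `trunc` is off and becomes `0x8800 | 'a'` when it is on.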
The locale name used by locale-aware functions needs to be pre-defined. It may be empty (`""`) to use the default locale, `"C"` to use the invariant locale, or any other system-provided locale name, which can be found on POSIX systems by running `locale -a`.
On Windows, the locale name uses the formats `<language>`, `<language>-<REGION>`, `<language>-<Script>`, or `<language>-<Script>-<REGION>`. A locale corresponds to a particular code page used when interpreting ANSI text. An overview of common locales and their code pages can be found here.
As an example, the locale name `cs-CZ` corresponds to the Czech language and regional settings, and uses the encoding Windows-1250.
A locale can also be identified just by specifying a code page, in the form `.<codepage>`. For example, `.1250` can be used to a similar effect as above when dealing with encodings.
On Linux, the set of supported locales can be extended by running the `localedef` command with pre-existing language and character mapping definitions.
For example, `localedef -i cs_CZ -f CP1250 cs_CZ.CP1250` creates a new locale named `cs_CZ.CP1250` using the Windows-1250 encoding.
cs_CZ.cp1250|cs-CZ
- A locale identifier with two alternatives. If `cs_CZ.cp1250` is not found, `cs-CZ` is attempted next.
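Trying alternatives in order can be sketched in standard C++ by attempting to construct a `std::locale` from each name and moving on when construction fails. This illustrates the fallback idea only; `first_valid_locale` is a hypothetical helper, not a PawnPlus function.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Tries each name in order and stores the first locale the system accepts.
// Returns the index of the accepted name, or -1 if none is valid.
int first_valid_locale(const std::vector<std::string>& names, std::locale& out)
{
    for (std::size_t i = 0; i < names.size(); ++i)
    {
        try
        {
            // Throws std::runtime_error when the name is not a known locale.
            out = std::locale(names[i].c_str());
            return static_cast<int>(i);
        }
        catch (const std::runtime_error&)
        {
            // Not available on this system; fall through to the next name.
        }
    }
    assert(names.empty() || true); // reaching here means nothing matched
    return -1;
}
```

Called with `{"cs_CZ.cp1250", "cs-CZ"}`, this would typically pick the first name on a Linux system where that locale was generated with `localedef`, and the second on Windows.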