This is probably a better venue for discussion than `-c davidben`. Alright, so a while ago there were some ramblings on `-c zephyr-dev` about the `z_charset` field and how messed up it is. From memory, here's a summary of the situation:
- Zephyr 3 added a charset field to notices with the character set of the message. It can have values UTF-8, ISO-8859-1, and UNKNOWN.
- Prior to that, zephyrgrams didn't really have any character set associated with them.
- Old zwrite does not set the charset field and just dumps the bytes it receives over the wire.
- BarnOwl ignores the charset field on receive and sniffs for valid UTF-8. Yes => UTF-8, no => ISO-8859-1.
- New zwgc interprets the charset field and converts accordingly before displaying the notice.
- Empirically, from trying to do the right thing in Roost, there exist senders which tag as ISO-8859-1 and send as UTF-8. I think it was mostly bots, but I forget if any humans managed it too. I had to back out of doing it in Roost. Roost currently blindly assumes all messages are UTF-8, which seems to work pretty much okay, though it should grow BarnOwl's sniffing logic as I have seen ISO-8859-1 messages in the wild. (Unfortunately, I'm dumb and fail to save either the z_charset field or the original bytes, so we don't have historical data here.)
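For concreteness, here is a rough sketch of the kind of sniffing described above: valid UTF-8 means UTF-8, anything else is treated as ISO-8859-1. This is not BarnOwl's actual code; the function names `is_valid_utf8` and `sniff_charset` are made up for illustration, and the validator follows the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) { i++; continue; }                       /* ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return false;               /* stray continuation or bad lead */

        if (len - i - 1 < need) return false;          /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return false;     /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;   /* overlong or too big */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* UTF-16 surrogate */
        i += need + 1;
    }
    return true;
}

/* Hypothetical helper: the sniff rule. Valid UTF-8 => UTF-8, else ISO-8859-1. */
static const char *sniff_charset(const unsigned char *buf, size_t len)
{
    return is_valid_utf8(buf, len) ? "UTF-8" : "ISO-8859-1";
}
```

Pure ASCII passes the validator, so it is reported as UTF-8, which is harmless since ASCII decodes identically under both encodings.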
In addition, I discovered a new issue today. z_charset is endian-confused over the wire! This line (and a corresponding one for formatting notices) shouldn't be there.
https://github.com/zephyr-im/zephyr/blob/master/lib/ZParseNot.c#L292
So, to add to our situation list:
- Messages from a big-endian machine received on a little-endian machine and vice versa see the charset fields byte-swapped from each other.
- I assert the vast majority of zephyr senders and receivers are on little-endian machines.
- But there do exist multics.mit.edu users and perhaps others.
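To make the endian confusion concrete, here is a small model of it. It assumes the charset is written into the notice as the hex text of a host integer, with `htons` applied before formatting and `ntohs` applied after parsing, which is my reading of the linked line. Host byte order is passed in as a flag so the sketch behaves the same wherever it is compiled; all function names are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

static uint16_t sketch_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* What htons()/ntohs() do on a host of the given byte order:
 * a swap on little-endian, nothing on big-endian. */
static uint16_t sim_htons(uint16_t v, bool host_is_little_endian)
{
    return host_is_little_endian ? sketch_swap16(v) : v;
}

/* Number that ends up as hex text in the notice (assumed model). */
static uint16_t charset_on_wire(uint16_t charset, bool sender_is_le)
{
    return sim_htons(charset, sender_is_le);
}

/* Value the receiver holds after parsing the hex and applying ntohs(). */
static uint16_t charset_as_received(uint16_t charset, bool sender_is_le,
                                    bool receiver_is_le)
{
    return sim_htons(charset_on_wire(charset, sender_is_le), receiver_is_le);
}
```

Under this model, a value such as 0x006A survives between two hosts of the same byte order (the two swaps cancel, or no swap happens at all), but arrives as 0x6A00 whenever sender and receiver differ.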
This is a mess. It should get resolved.
So, I'm uneasy about switching Roost back over to assuming ISO-8859-1-tagged messages are actually telling the truth because I've been burned by that before. I also think protocols should minimize variability for the sake of sanity. (And for entirely selfish reasons that I'm working on a new from-scratch implementation and don't want more test vectors in my unit tests.) Here are two proposals I think I would be happy with to start things off:
Proposal davidben-there-is-no-multics
- UTF-8 is the One True Encoding.
- From now on, the correct encoding of the charset field is little-endian. Change the `htons` calls in `libzephyr` to something that byteswaps on big or little endian.
- All new senders send UTF-8 over the wire and write `ZCHARSET_UTF_8` into the charset field. If a sender doesn't know whether its input is UTF-8 or not, use `ZCHARSET_UNKNOWN` and make loud noises.
- All new receivers, when receiving a message:
  - If the charset field is `ZCHARSET_UTF_8`, assume the message is UTF-8.
  - If the charset field is missing or any other value, sniff. If valid UTF-8, it's UTF-8. Otherwise, it's ISO-8859-1.
- Senders and receivers dealing with non-UTF-8 have the responsibility to transcode. UTF-8-only senders and UTF-8-only receivers should not care about other encodings apart from the sniff. When the few senders producing non-UTF-8 get fixed, we can move to blindly assuming UTF-8.
  - Unfortunately, non-UTF-8 locale is not enough for `zwrite` to transcode. Presumably the people sending UTF-8-tagged ISO-8859-1 have some confused configuration that would also confuse the new `zwrite` too. To avoid introducing problems when they upgrade, `zwrite` does NOT tag with `ZCHARSET_UTF_8` if the system is on a non-ISO-8859-1 locale unless `-x` is explicitly passed and/or maybe some environment variable. Maybe print an angry message to `stderr` or something so we can get those setups fixed.
  - zwrite: Assume UTF-8 rather than ISO-8859-1 in an ASCII locale #132 should be sufficient to deal with this.
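A rough sketch of how the sender and receiver rules in this proposal could fit together. It reads "little-endian is the correct encoding" as "whatever little-endian hosts put on the wire today", which would amount to an unconditional 16-bit swap around the hex formatting; that reading is an assumption. The `SKETCH_` constants assume the charset values are 0, 4, and 106, and every name here is hypothetical rather than libzephyr API. The result of the UTF-8 sniff is passed in as a flag to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; check against zephyr.h before relying on them. */
enum {
    SKETCH_ZCHARSET_UNKNOWN    = 0,
    SKETCH_ZCHARSET_ISO_8859_1 = 4,
    SKETCH_ZCHARSET_UTF_8      = 106
};

typedef enum { ENC_UTF_8, ENC_ISO_8859_1 } sketch_encoding;

static uint16_t always_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Sender: swap unconditionally, so every host emits the same wire value
 * that little-endian hosts emit today (instead of calling htons). */
static uint16_t charset_to_wire(uint16_t charset)
{
    return always_swap16(charset);
}

/* Receiver: undo the unconditional swap (instead of calling ntohs). */
static uint16_t charset_from_wire(uint16_t wire)
{
    return always_swap16(wire);
}

/* Receiver rule: trust a UTF-8 tag; for a missing field or any other
 * value, fall back to the sniff result. */
static sketch_encoding receiver_encoding(bool has_charset, uint16_t wire,
                                         bool body_is_valid_utf8)
{
    if (has_charset && charset_from_wire(wire) == SKETCH_ZCHARSET_UTF_8)
        return ENC_UTF_8;
    return body_is_valid_utf8 ? ENC_UTF_8 : ENC_ISO_8859_1;
}
```

Note that an ISO-8859-1 tag is deliberately not trusted here: a message tagged ISO-8859-1 whose body is valid UTF-8 still comes out as UTF-8, which covers the mis-tagging senders described earlier.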
Proposal davidben-okay-maybe-multics-exists
Same as above but replace "`ZCHARSET_UTF_8`" in the receiver section with "`ZCHARSET_UTF_8` or `byteswap16(ZCHARSET_UTF_8)`". Big-endian senders still follow the rule about little-endian being the correct encoding. Transition back to davidben-there-is-no-multics when all big-endian machines are updated.
When shiny new Roost finally happens, we can get data on when and how often the backwards-compatibility cases occur to guide when we can drop them.
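For reference, the relaxed receiver check in davidben-okay-maybe-multics-exists might look something like the following. `byteswap16` is the helper named in the proposal, written out here as a plain 16-bit swap; `SKETCH_ZCHARSET_UTF_8` (assumed to be 106) and `charset_means_utf8` are hypothetical names for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value; check against zephyr.h before relying on it. */
#define SKETCH_ZCHARSET_UTF_8 ((uint16_t)106)

static uint16_t byteswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Accept the UTF-8 tag in either byte order, so notices from
 * not-yet-updated big-endian senders still count as tagged UTF-8. */
static bool charset_means_utf8(uint16_t charset)
{
    return charset == SKETCH_ZCHARSET_UTF_8 ||
           charset == byteswap16(SKETCH_ZCHARSET_UTF_8);
}
```

Counting how often the byte-swapped branch actually fires is the sort of data that would say when it is safe to drop it.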