Resolve z_charset confusion and byte-swapping issue.

This is probably a better venue for discussion than `-c davidben`. Alright, so a while ago there were some ramblings on `-c zephyr-dev` about the `z_charset` field and how messed up it is. From memory, here's a summary of the situation:
- Zephyr 3 added a charset field to notices with the character set of the message. It can have values UTF-8, ISO-8859-1, and UNKNOWN.
- Prior to that, zephyrgrams didn't really have any character set associated with them.
- Old zwrite does not set the charset field and just dumps the bytes it receives onto the wire unchanged.
- BarnOwl ignores the charset field on receive and sniffs for valid UTF-8. Yes => UTF-8, no => ISO-8859-1.
- New zwgc interprets the charset field and converts accordingly before displaying the notice.
- Empirically, from trying to do the right thing in Roost, there exist senders which tag as ISO-8859-1 and send as UTF-8. I think it was mostly bots, but I forget if any humans managed it too. I had to back out of doing it in Roost. Roost currently blindly assumes all messages are UTF-8, which seems to work pretty much okay, though it should grow BarnOwl's sniffing logic as I have seen ISO-8859-1 messages in the wild. (Unfortunately, I'm dumb and failed to save either the z_charset field or the original bytes, so we don't have historical data here.)

In addition, I discovered a new issue today. z_charset is endian-confused over the wire! This line (and a corresponding one for formatting notices) shouldn't be there.
https://github.com/zephyr-im/zephyr/blob/master/lib/ZParseNot.c#L292

So, to add to our situation list:
- Messages from a big-endian machine received on a little-endian machine and vice versa see the charset fields byte-swapped from each other.
- I assert the vast majority of zephyr senders and receivers are on little-endian machines.
- But there do exist multics.mit.edu users and perhaps others.
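
To make the byte-swapping concrete, here is a small Python simulation of the situation described above. It is only a model: it assumes `ZCHARSET_UTF_8` is 106 (0x006A, the IANA MIBenum for UTF-8) and treats `htons` as a swap on little-endian hosts and a no-op on big-endian ones. The helper names (`htons_on`, `field_as_received`) are made up for illustration and are not libzephyr functions.

```python
ZCHARSET_UTF_8 = 106  # assumed value (IANA MIBenum for UTF-8)


def byteswap16(x):
    """Swap the two bytes of a 16-bit value."""
    return ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)


def htons_on(host, x):
    """Model htons: byteswap on little-endian hosts, identity on big-endian."""
    if host == "little":
        return byteswap16(x)
    if host == "big":
        return x & 0xFFFF
    raise ValueError("host must be 'little' or 'big'")


def field_as_received(sender, receiver, value):
    """Charset value the receiver ends up with, in this simplified model.

    The sender runs the value through htons before formatting it, and the
    receiver runs the parsed value through htons again.
    """
    on_wire = htons_on(sender, value)
    return htons_on(receiver, on_wire)


for sender in ("little", "big"):
    for receiver in ("little", "big"):
        got = field_as_received(sender, receiver, ZCHARSET_UTF_8)
        print(f"{sender:>6} -> {receiver:<6}: 0x{got:04X}")
```

In this model, same-endian pairs round-trip to 0x006A, while mixed pairs end up with 0x6A00.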

This is a mess. It should get resolved.

So, I'm uneasy about switching Roost back over to assuming ISO-8859-1-tagged messages are actually telling the truth because I've been burned by that before. I also think protocols should minimize variability for the sake of sanity. (And for the entirely selfish reason that I'm working on a new from-scratch implementation and don't want more test vectors in my unit tests.) Here are two proposals I think I would be happy with to start things off:
##### Proposal davidben-there-is-no-multics
- UTF-8 is the One True Encoding.
- From now on, the correct encoding of the charset field is little-endian. Change the `htons` calls in `libzephyr` to something that byteswaps unconditionally, on both big- and little-endian hosts.
- All new senders send UTF-8 over the wire and write `ZCHARSET_UTF_8` into the charset field. If a sender doesn't know whether its input is UTF-8 or not, use `ZCHARSET_UNKNOWN` and make loud noises.
- All new receivers, when receiving a message:
  - If the charset field is `ZCHARSET_UTF_8`, assume the message is UTF-8.
  - If the charset field is missing or any other value, sniff. If valid UTF-8, it's UTF-8. Otherwise, it's ISO-8859-1.
- Senders and receivers dealing with non-UTF-8 have the responsibility to transcode. UTF-8-only senders and UTF-8-only receivers should not care about other encodings apart from the sniff. When the few senders producing non-UTF-8 get fixed, we can move to blindly assuming UTF-8.
  - Unfortunately, a non-UTF-8 locale is not enough for `zwrite` to transcode. Presumably the people sending ISO-8859-1-tagged UTF-8 have some confused configuration that would confuse the new `zwrite` too. To avoid introducing problems when they upgrade, `zwrite` does NOT tag with `ZCHARSET_UTF_8` if the system is on a non-UTF-8 locale unless `-x` is explicitly passed and/or maybe some environment variable is set. Maybe print an angry message to `stderr` or something so we can get those setups fixed.
  - https://github.com/zephyr-im/zephyr/pull/132 should be sufficient to deal with this.
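
The receiver rules in this proposal can be sketched as follows. This is a minimal illustration, not code from any existing client: `decode_body` is a hypothetical helper, the numeric constants are assumed to match the IANA MIBenum values (0, 4, 106), and replacing malformed sequences in a UTF-8-tagged message is one possible choice that the proposal does not specify.

```python
# Assumed constant values (IANA MIBenum numbers); check against zephyr.h.
ZCHARSET_UNKNOWN = 0
ZCHARSET_ISO_8859_1 = 4
ZCHARSET_UTF_8 = 106


def decode_body(charset, body):
    """Decode a message body per the proposed receiver rules.

    charset is the parsed charset field, or None if the field is missing.
    body is the raw message bytes. Returns (text, encoding_used).
    """
    if charset == ZCHARSET_UTF_8:
        # Tagged UTF-8: assume it is. Malformed bytes become U+FFFD
        # (an assumption of this sketch, not part of the proposal).
        return body.decode("utf-8", errors="replace"), "utf-8"
    # Missing field or any other value: sniff.
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return body.decode("iso-8859-1"), "iso-8859-1"


# A message tagged ISO-8859-1 but actually sent as UTF-8 still decodes correctly.
print(decode_body(ZCHARSET_ISO_8859_1, "café".encode("utf-8")))
# A genuine ISO-8859-1 message with no charset field falls through the sniff.
print(decode_body(None, b"caf\xe9"))
```

The sniff works because ISO-8859-1 text containing non-ASCII bytes is rarely valid UTF-8 by accident, which is also why mis-tagged UTF-8 is recovered here without trusting the tag.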
##### Proposal davidben-okay-maybe-multics-exists

Same as above but replace "`ZCHARSET_UTF_8`" in the receiver section with "`ZCHARSET_UTF_8` or `byteswap16(ZCHARSET_UTF_8)`". Big-endian senders still follow the rule about little-endian being the correct encoding. Transition back to **davidben-there-is-no-multics** when all big-endian machines are updated.
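
The only change on the receive side is the tag check, which might look like this sketch. `is_utf8_tag` is a hypothetical helper name, and `ZCHARSET_UTF_8 = 106` is an assumed value.

```python
ZCHARSET_UTF_8 = 106  # assumed value (IANA MIBenum for UTF-8)


def byteswap16(x):
    """Swap the two bytes of a 16-bit value."""
    return ((x & 0xFF) << 8) | ((x >> 8) & 0xFF)


def is_utf8_tag(charset):
    """True if the charset field says UTF-8 in either byte order."""
    return charset in (ZCHARSET_UTF_8, byteswap16(ZCHARSET_UTF_8))


print(is_utf8_tag(0x006A), is_utf8_tag(0x6A00), is_utf8_tag(4))
```

Anything that fails this check, including a missing field, still goes through the sniff.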

When shiny new Roost finally happens, we can get data on when and how often the backwards-compatibility cases occur to guide when we can drop them.
