Character encoding is scrambled and truncated #64

@Mirabel-le-Grand

Would it be possible to fix the scrambled and deficient character encoding of wpunix? All other WordPerfect versions that I use, including versions 6.0 and 6.2 for DOS, 8 for Windows, and 8.1 for Linux, use an identical character encoding scheme (with no difference, as far as I have discerned, other than the change of 4,72 to the Euro sign). While the common Latin letters and basic punctuation characters in wpunix mostly follow the standard encoding, the portability of documents with any amount of extended punctuation, technical symbols, or other alphabets is severely crippled. Opening documents in wpunix that I created in other versions of WordPerfect, and vice versa, results in mojibake. Amazingly, this faulty character encoding made it into the WordPerfect for UNIX User's Guide (Appendix P, starting on page 721) without any of the developers noticing.

Methods

Since I use technical symbols in my documents, I made a table comparing the mappings from /opt/wp80/wplib/charactrc.tst and CHARACTR.DOC (included with version 6.2 for DOS) and tested them in wpunix. I have corrected the table to reflect the aforementioned versions of WordPerfect and updated it with the correct Unicode characters that have been encoded since the 1990s.
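To make the idea of such a table concrete, here is a minimal sketch of how it could be stored and loaded as plain text. The line format (`set,code` followed by one or more `U+XXXX` code points) and the function name `parse_table` are my own assumptions for illustration, not the layout of charactrc.tst or CHARACTR.DOC. The only real entry is 4,72 as the Euro sign, as mentioned above; the `99,1` entry is a made-up placeholder showing a character that needs a base letter plus a combining mark.

```python
def parse_table(text):
    """Parse lines of the hypothetical form 'set,code U+XXXX [U+YYYY ...] # comment'.

    Returns a dict mapping (set, code) tuples to Unicode strings.
    """
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, *codepoints = line.split()
        charset, code = (int(part) for part in key.split(","))
        # Strip the 'U+' prefix and join all code points into one string.
        table[(charset, code)] = "".join(chr(int(cp[2:], 16)) for cp in codepoints)
    return table


SAMPLE = """
# set,code  Unicode
4,72  U+20AC          # Euro sign in later WordPerfect versions (per this report)
99,1  U+0061 U+0301   # made-up placeholder: base letter + combining acute
"""

table = parse_table(SAMPLE)
```

Keying on the (set, code) pair keeps the table close to how WordPerfect characters are written in this report (for example 1,234 or 4,72), which should make it easy to review by hand.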

Observations

  • Character sets 0 (ASCII), 3 (Box Drawing), and 7 (Math/Scientific Extension) are complete.
  • Character set 1 (Multinational): The last 8 characters from 1,234 to 1,241 are missing and cannot be entered at all, even by Control+v or Font -> Characters.
  • Character set 2 (Phonetic): This set has only 28 of the 144 characters in the standard character set, seemingly selected at random. None of the character codes match the standard set. Without the missing characters, these are useless for any representation of IPA.
  • Character set 4 (Typographic Symbols): The last 17 characters are missing.
  • Character set 5 (Iconic Symbols): Only 35 of the 254 standard characters are included. The character codes of four of them do not match the standard set.
  • Character set 6 (Math/Scientific): The last three characters are missing.
  • Character set 8 (Greek): Only the first 52 character codes match. The order of all the extended characters, including letters with the tonos and dialytika diacritics for monotonic Greek, and all letters with polytonic diacritics, is scrambled. Key characters needed for both orthographies are missing or have character names that do not match either the coded character or the standard encoding. This character set is unsuitable for either monotonic or polytonic Greek.
  • Character set 9 (Hebrew): Only the first 44 characters match the standard character set. Only 12 of the cantillation marks are included, but their names do not match the encoded characters, and they are useless without the missing cantillation marks. The extended Hebrew characters are also missing.
  • Character set 10 (Cyrillic): Only the letters of the basic Russian alphabet match those of the standard character set. All the other characters, required for Old Church Slavonic, Slavic, Turkic, Uralic, and other languages, are in a scrambled order and do not match the standard codes. Many key characters are missing. This character set is not suited for any Cyrillic alphabet other than basic Russian.
  • Character set 11 (Japanese Kana): This is the only character set to include more characters than the standard set (e.g., it adds hiragana and precomposed dakuten and handakuten). Not one of the characters in this set, however, matches the standard encoding.
  • Character sets 13 (Arabic 1) and 14 (Arabic Script) are absent.
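
Counts like the ones above could be reproduced mechanically from two mapping tables. The following is only a sketch under assumptions: both tables are dicts keyed by (set, code), `coverage` is a hypothetical helper name, and the sample data uses a made-up set 99 rather than real WordPerfect mappings.

```python
from collections import defaultdict


def coverage(standard, actual):
    """Tally, per character set, how many standard code slots exist in
    `actual` and how many of those map to the same character."""
    report = defaultdict(lambda: {"total": 0, "present": 0, "matching": 0})
    for (charset, code), char in standard.items():
        row = report[charset]
        row["total"] += 1
        if (charset, code) in actual:
            row["present"] += 1
            if actual[(charset, code)] == char:
                row["matching"] += 1
    return dict(report)


# Made-up data: set 99 is a placeholder, not a real WordPerfect set.
standard = {(99, 0): "A", (99, 1): "B", (99, 2): "C"}
actual = {(99, 0): "A", (99, 1): "X"}

for charset, row in sorted(coverage(standard, actual).items()):
    print(f"set {charset}: {row['present']}/{row['total']} present, "
          f"{row['matching']} matching")
```

Note that this counts code slots, so a character that exists in wpunix under a different code is reported as a mismatch rather than as present.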

Conclusions

I think that the idiosyncratic character encoding of wpunix was botched rather than a result of technical limitation or intention. This is evidenced by its departure from the standard followed by other versions of WordPerfect, haphazard omission of key characters, illogical re-orderings, and internally contradictory character names.

  1. I think that the character encoding of wpunix should be modified to follow the standard used across other versions of WordPerfect. Symbol insertion (via system keyboard entry, Control+v, and Font -> Characters) and importation of Unicode text and its proper conversion to the standard WordPerfect encoding should be assured.
  2. wpunix should now display more WordPerfect characters in the terminal. Many extended characters are substituted in the terminal display by unaccented letters or squares. More sophisticated terminals, expansions to Unicode, and better terminal display fonts since the 1990s mean that more characters should display in the interface, including those represented by Unicode as base letters with combining diacritics.
  3. I would happily share my conversion table to assist such an effort. It would also be useful to those using Unicode fonts in WordPerfect, editing printer drivers, or converting documents. I am new to GitHub: perhaps a collaboratively edited version could be hosted here?
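
As a rough illustration of point 2, the display fallback could try the full character first, then a base letter with combining diacritics, and only then the bare base letter. This is a sketch and not how wpunix works internally: `display_form` and the `can_render` callback are hypothetical names, and how a terminal's rendering ability would actually be detected is left open.

```python
import unicodedata


def display_form(char, can_render):
    """Pick the best displayable form of `char`.

    `can_render` is a hypothetical callback that reports whether the
    terminal can show a given code point.
    """
    if can_render(char):
        return char
    # Canonical decomposition: base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", char)
    if all(can_render(c) for c in decomposed):
        return decomposed
    # Last resort: drop the diacritics and keep the base letter(s).
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base and all(can_render(c) for c in base):
        return base
    return "?"


def ascii_only(c):
    return ord(c) < 128


def ascii_plus_marks(c):
    return ord(c) < 128 or unicodedata.combining(c) != 0


print(display_form("\u00e9", ascii_plus_marks))  # 'e' followed by U+0301
print(display_form("\u00e9", ascii_only))        # plain 'e'
```

The unaccented substitution that wpunix already does corresponds to the last step here; the middle step is the one that newer terminals make possible.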

It would be awesome to use WordPerfect in a character-based terminal like that of DOS, but free from the constraints of 8-bit code pages and utilising my custom Linux input methods.
