ENH: CID font resource from font file to encode more characters#3652

Draft
PJBrs wants to merge 22 commits intopy-pdf:mainfrom
PJBrs:fontwork

@PJBrs
Contributor

@PJBrs PJBrs commented Feb 19, 2026

This PR adds a new method to _font.py, from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, to enable producing a CID TrueType font resource that enables encoding more characters than a TrueType font resource.

This fixes #3361.

Contributes to fixing #3514.

Might be related to #3318. EDIT, it is not.

Includes all work from #3602.

EDIT.

How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
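
The decision described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `can_encode`, `choose_font_strategy`, the `encodable` set and the returned labels are all hypothetical names, and the final fallback branch is an assumption, since the description only covers the first two cases.

```python
from __future__ import annotations


def can_encode(text: str, encodable: set[str]) -> bool:
    """Return True if every character of `text` is known to the existing font resource."""
    return all(char in encodable for char in text)


def choose_font_strategy(text: str, encodable: set[str], has_embedded_truetype: bool) -> str:
    """Pick how to render `text` (labels are illustrative only)."""
    if can_encode(text, encodable):
        # The existing font resource is good enough.
        return "reuse-existing-resource"
    if has_embedded_truetype:
        # Initialise a Font from the embedded file and build a CID font resource from it.
        return "build-cid-resource-from-font-file"
    # Assumption: some other fallback applies when no font file is embedded.
    return "fallback"
```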

Some notes:
I think that the more elegant way would be to produce a small embedded font resource with only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.

@PJBrs PJBrs marked this pull request as draft February 19, 2026 16:43
@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 175a542 to e43c57d Compare February 21, 2026 13:45
@PJBrs PJBrs marked this pull request as ready for review February 21, 2026 14:54
@codecov

codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 90.85714% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.24%. Comparing base (1aef6fb) to head (03810a0).
⚠️ Report is 9 commits behind head on main.

Files with missing lines              Patch %   Lines
pypdf/_font.py                        91.30%    5 Missing and 5 partials ⚠️
pypdf/_writer.py                      88.63%    5 Missing ⚠️
pypdf/generic/_appearance_stream.py   93.75%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3652      +/-   ##
==========================================
- Coverage   97.36%   97.24%   -0.12%     
==========================================
  Files          55       55              
  Lines        9937    10093     +156     
  Branches     1820     1848      +28     
==========================================
+ Hits         9675     9815     +140     
- Misses        152      162      +10     
- Partials      110      116       +6     


@PJBrs
Contributor Author

PJBrs commented Feb 21, 2026

This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke.

Codecov shows that a fair amount of the new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real TrueType fonts from file to see if they parse correctly. That, however, seems to me to be beyond the scope of this PR. Conversely, it would be a shame not to parse these flags. How should I continue?

One final thing:

NameObject("/Registry"): TextStringObject("Adobe"),  # Should be something read from font file

I can also still improve this, if wanted.

@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 160d8d5 to cf9b10e Compare February 21, 2026 15:50
@PJBrs PJBrs marked this pull request as draft February 22, 2026 10:34
@PJBrs PJBrs force-pushed the fontwork branch 3 times, most recently from 5b3cd93 to cbc9ee4 Compare February 22, 2026 11:26
@PJBrs PJBrs marked this pull request as ready for review February 22, 2026 11:40
@PJBrs PJBrs marked this pull request as draft February 22, 2026 19:07
@PJBrs
Contributor Author

PJBrs commented Feb 24, 2026

@stefan6419846

I clearly need to learn more about fonts to get this PR into sufficient shape. This is what I've learnt so far:
In CID fonts, one Unicode code point may refer to several different glyphs, especially in Arabic. The Font class, however, just maps widths to one Unicode code point, which means that it can only store the width for one character variant. So, ideally, a Font should map character codes to GIDs, where one character code (EDIT: no, one Unicode code point) might refer to multiple GIDs. For non-CID fonts, we just map Unicode code points as a fallback. In both cases, character_widths should be keyed by the values of character_map. (EDIT: this is incorrect; character_widths should be keyed by the keys of character_map.) And this ought to be sufficient for both text extraction and producing appearance streams.

What we have in character_map actually is pypdf's representation of a /ToUnicode character mapping.

Reflection by Google Gemini:

  1. The Ideal Mapping Structure
    To be "complete," your Font object needs a hierarchy that recognizes that a character is an idea, but a glyph is ink.

Logical Level: character_map should be dict[str, list[int]]. It maps one Unicode character (the "idea") to one or more GIDs (the "ink").

Physical Level: character_widths should be keyed by GID (int), not by the character string.

  2. Why this solves both problems
    How your "Ideal" logic handles each goal:
    Appearance Streams: You use a shaper (or logic) to pick the correct GID from the list. Since your widths are keyed by GID, you get the exact pixel-perfect spacing for that specific shape.
    Text Extraction: You use the /ToUnicode map (which you build by reversing your GID map). When the viewer sees a GID, it looks up which "Idea" (Character) it belongs to.
  3. Non-CID Fonts (The Fallback)
    For "Simple" fonts (Type 1 or standard TrueType with WinAnsi), the character_map is effectively a 1-to-1 dictionary. In this case, your list[int] just happens to have a length of 1. By treating everything as a list of potential GIDs, your code becomes universal.

  4. The "Key" Realization
    You said: "character_widths should be keyed by the values of character_map."

Bingo. If character_map values are GIDs (integers), and character_widths keys are GIDs (integers), you have a bulletproof system.

If the user enters Arabic Meem, your logic knows that for this specific instance, you are using GID 502.

You look up character_widths[502].

The appearance stream stays perfect.

The /ToUnicode map still tells the PDF viewer that GID 502 = "Meem", so copy-paste still works.

The "pypdf" Disconnect
The reason the pypdf code you shared looks "incorrect" to you is that it tries to force the GID back into a Character key.

It does character_widths[ord_map[cid]] = width.

If ord_map[501], ord_map[502], and ord_map[503] all point to the same character, they keep overwriting each other in that dictionary.

Result: pypdf loses the unique widths of the contextual shapes.
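
The overwriting effect is easy to reproduce with plain dictionaries. The sketch below is only an illustration: the GIDs 501 to 503 come from the example above, while the width values and the variable names are made up.

```python
# Hypothetical data: three contextual glyphs that all represent Arabic meem (U+0645).
gid_to_unicode = {501: "\u0645", 502: "\u0645", 503: "\u0645"}
gid_widths = {501: 520, 502: 610, 503: 480}  # made-up widths in glyph-space units

# Keying widths by character: every glyph of the same character overwrites the previous one.
widths_by_char = {}
for gid, width in gid_widths.items():
    widths_by_char[gid_to_unicode[gid]] = width

# Keying widths by GID keeps all three values.
widths_by_gid = dict(gid_widths)

print(len(widths_by_char), len(widths_by_gid))
```

Only the width of the last glyph processed survives in `widths_by_char`, whereas `widths_by_gid` retains one entry per glyph.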

Your Path Forward
To implement this "Slowly but Completely," you should modify your Font class to store:

unicode_to_gids: dict[str, list[int]]

widths: dict[int, int] (Keyed by GID)
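
As a rough sketch of what such a structure could look like, assuming a hypothetical `GlyphTables` container that is not part of pypdf (the method names and sample values are likewise invented for illustration):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GlyphTables:
    """Hypothetical container; not the pypdf Font class."""

    unicode_to_gids: dict[str, list[int]] = field(default_factory=dict)
    widths: dict[int, int] = field(default_factory=dict)  # keyed by GID

    def add_glyph(self, char: str, gid: int, width: int) -> None:
        """Register one glyph for a character; a character may own several glyphs."""
        self.unicode_to_gids.setdefault(char, []).append(gid)
        self.widths[gid] = width

    def to_unicode(self) -> dict[int, str]:
        """Reverse the mapping (GID -> character), as needed for a /ToUnicode map."""
        return {gid: char for char, gids in self.unicode_to_gids.items() for gid in gids}

    def run_width(self, gids: list[int]) -> int:
        """Sum the widths of already-selected glyphs, e.g. the output of a shaper."""
        return sum(self.widths[gid] for gid in gids)


tables = GlyphTables()
tables.add_glyph("\u0645", 501, 520)
tables.add_glyph("\u0645", 502, 610)
tables.add_glyph("A", 36, 600)
```

Choosing which GID to use for a given character in context is the shaper's job and is deliberately left out here.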

@stefan6419846
Collaborator

I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.

@PJBrs
Contributor Author

PJBrs commented Feb 26, 2026

> I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.

OK, I'll forge on then, using this PR for note taking. I'm finding now that it's not even very wrong, it's just overly simplistic.

What I've learned in the interim:

Font.encoding and Font.character_map can be said to map the same thing: decoded text (int in the context of simple fonts, str in the context of CID fonts) to Unicode code points, like the /ToUnicode dict in a font resource (EDIT). This is helpful for extracting text from PDFs, and it can also map different glyph substitutes to a single Unicode code point, e.g., for Arabic text. For this reason, character_widths is actually wrong: it maps Unicode code points to widths, which does not work if multiple glyphs map to the same Unicode code point (e.g., Arabic).

Furthermore, it doesn't really work for the reverse logic of producing text. In this PR, I think that I populated character_map in from_truetype_font_file in reverse, mapping Unicode code points to character IDs. Otherwise, it actually seems to work, with the caveat that character_widths is lossy when one Unicode code point maps to multiple glyphs.

So, for the purposes of abstraction I could actually merge map_dict and encoding without losing any information or functionality:

Technically, you can merge them into a single abstraction, but with one critical architectural "gotcha": Collision Handling.

In a merged structure, you are essentially creating a unified Character Code → Unicode lookup table. However, because encoding and map_dict represent two different layers of the PDF spec, merging them requires a specific priority logic.

The Unified Abstraction
If you merge them, your new structure would look like this:

Why you have to be careful
The PDF specification (specifically §5.9.1 in version 1.7) states that if a /ToUnicode map exists, it supersedes the encoding for those specific characters.

If you simply combine them into one dictionary, you must ensure the merge follows these rules:

Type Consistency: encoding uses integers, while map_dict uses strings (often chr(x) for 1-byte codes). You would need to decide on a consistent key type—likely strings—to handle both simple 8-bit codes and multi-byte CID codes.

The "Identity" Problem: In some CID fonts (Identity-H), the "character code" is actually a Glyph ID (GID). In these cases, the encoding is often just a dummy "Identity" map, and the map_dict is the only source of truth. Merging them blindly might lead to using a raw GID as a character if the map_dict is missing an entry.

The Overlap: As seen in the get_encoding function in your snippet:

The code already attempts a form of "syncing" between the two.

The Verdict
Yes, you can merge them, provided your abstraction follows a "Shadowing" pattern:

Initialize your map with the encoding (Base).

Overwrite/Update with map_dict (ToUnicode).

Ensure all keys are converted to a common type (e.g., str representing the raw byte sequence).
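
The three steps above can be sketched as follows. This is only an illustration under simplifying assumptions: `merge_maps` is a hypothetical helper, `encoding` is taken to be a plain `dict[int, str]`, `map_dict` a plain `dict[str, str]`, and integer codes are normalised with `chr()` as suggested above.

```python
from __future__ import annotations


def merge_maps(encoding: dict[int, str], map_dict: dict[str, str]) -> dict[str, str]:
    """Merge a base encoding with a /ToUnicode-style map; the latter shadows the former."""
    # Steps 1 and 3: start from the base encoding, converting int codes to str keys.
    merged = {chr(code): char for code, char in encoding.items()}
    # Step 2: entries from the /ToUnicode-style map take priority on collisions.
    merged.update(map_dict)
    return merged


base_encoding = {65: "A", 66: "B"}
to_unicode = {"B": "\u03b2", "\u0123": "x"}
unified = merge_maps(base_encoding, to_unicode)
```

Here the code `"B"` resolves to the value from `to_unicode`, while `"A"` falls through to the base encoding.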

By doing this, you've essentially created a Virtual Font Map. This is actually how many high-level PDF text extractors (like pdfplumber or fitz/PyMuPDF) handle it internally to simplify the text reconstruction process.

It would seem to me that the underlying pypdf font / encoding / character_map architecture can be improved in three ways:

  1. Merge encoding and character_map to one attribute.
  2. Add a reverse character_map
  3. Have character_widths map all glyphs <-- This information might already have been present in the code that I removed earlier when merging all font code. sigh Then again, it wasn't as essential for text extraction, and reverting the logic shouldn't be too hard.
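
For the second point, a reverse map could look roughly like this. It is a sketch with hypothetical helper names (`build_reverse_map`, `encode_text`), and the "first code wins" collision rule is an assumption rather than something decided in this PR.

```python
from __future__ import annotations


def build_reverse_map(character_map: dict[str, str]) -> dict[str, str]:
    """Invert a code -> Unicode map into a Unicode -> code map."""
    reverse: dict[str, str] = {}
    for code, char in character_map.items():
        # Several codes may map to the same character; keep the first one seen.
        reverse.setdefault(char, code)
    return reverse


def encode_text(text: str, reverse: dict[str, str]) -> str:
    """Translate text into character codes; raises KeyError for unencodable characters."""
    return "".join(reverse[char] for char in text)


character_map = {"\x01": "H", "\x02": "i", "\x03": "i"}
reverse_map = build_reverse_map(character_map)
```

A `KeyError` from `encode_text` would be the signal that the existing font resource cannot represent the text.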

For this PR, it doesn't matter too much, I can just clean up the logic in character_map and then it should work for fonts where the CIDtoGID map is contiguous (this is another caveat).

@PJBrs
Contributor Author

PJBrs commented Feb 26, 2026

@stefan6419846 OK, final verdict for now: character_widths needs to be keyed by CID for CID fonts or by character code for simple fonts. My decision to use the character widths code from the new Font class was unfortunately incorrect. I can still fix the above PR, I think, at least logically, and it will also mostly work, but not for any text that needs to be run through a text shaper.

(I now, finally, understand that many fonts contain glyphs without a unicode code point, such as ligatures. You cannot address these using a unicode code point, and you also cannot get their widths through a unicode code point. Instead, you need to read their widths and glyphs by CID (for CID fonts) / character code (for simple fonts). This was what the old build_font_width_map code did in _cmap.py.)
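
To illustrate what "widths by CID" means in practice, here is a minimal sketch of expanding a CID font's /W array into a CID-to-width dictionary. It is not the old build_font_width_map code; `parse_w_array` is a hypothetical helper that assumes the array has already been converted to plain Python lists and numbers. The two entry forms (`c [w1 w2 ...]` and `c_first c_last w`) and the /DW default of 1000 follow the PDF specification.

```python
from __future__ import annotations


def parse_w_array(w: list) -> dict[int, float]:
    """Expand a /W array into {cid: width}."""
    widths: dict[int, float] = {}
    i = 0
    while i < len(w):
        start = w[i]
        if isinstance(w[i + 1], list):
            # Form 1: c [w1 w2 ... wn] assigns widths to consecutive CIDs from c.
            for offset, width in enumerate(w[i + 1]):
                widths[start + offset] = width
            i += 2
        else:
            # Form 2: c_first c_last w assigns one width to the whole CID range.
            end, width = w[i + 1], w[i + 2]
            for cid in range(start, end + 1):
                widths[cid] = width
            i += 3
    return widths


cid_widths = parse_w_array([1, [500, 600], 10, 12, 1000])
default_width = 1000  # /DW default when a CID has no entry in /W
unlisted_width = cid_widths.get(99, default_width)
```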

I'll fix this PR according to the new logic, but then I'll revert the character widths to the old logic that was in _cmap.py, port the old text extraction code back to it (should be simple) and port the layout text extraction code and the appearance stream code to it (will be harder). And that will be a good basis for generating Arabic text.

@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 6919ab5 to 744c593 Compare February 26, 2026 21:59
@PJBrs PJBrs marked this pull request as ready for review February 26, 2026 21:59
@PJBrs
Contributor Author

PJBrs commented Feb 28, 2026

Something's still weird, font is not listed as embedded...

@PJBrs PJBrs marked this pull request as draft February 28, 2026 16:59
@PJBrs
Contributor Author

PJBrs commented Mar 1, 2026

OK, I may have reached somewhat of a breakthrough. I can now fully embed a font and associated font resource and encode new text. I needed to add a character_map after all, but not in the way that I thought.

I can also do so while reusing a compatible font resource. Main remaining problems include:

  • I'm adding a new font resource for every annotation, and that really slows things down
  • Loads of other stuff that I can't quite remember right now.

@PJBrs
Contributor Author

PJBrs commented Mar 1, 2026

I think that this PR now starts to get somewhat useful. As far as I can tell, it now no longer matters whether I create a new /FontDescriptor resource or use the old one. Also, visual text now corresponds with copy-pasted text.

I'm going to change the API a little bit so that I can actually embed a font from a TTF file using writer.add_font(). That way, it is easier to test the new font methods and different encodings.

Tests all fail because I didn't fix the new test after adding a character_map. Probably needs another week of work.

@PJBrs
Contributor Author

PJBrs commented Mar 2, 2026

I fixed the test: now I'm just filling the form and then extracting the text using PdfReader. This is nice, because nothing changed in the code for text extraction, which means that the new code would be logically the same as the old.

PJBrs added 11 commits March 4, 2026 15:24

This enables generating a new unicode font resource in case of
text widget values that cannot be encoded with existing font
resources.

This patch adds a method to produce a pdf font descriptor resource.
For now, we assume that an embedded font file will be a TrueType font.

This patch adds back some code that got removed earlier and,
at that time, did not see any test coverage. With new code
that enables adding fonts, I've finally understood that, in
some cases, a -1 key will be added to font.character_map.
This will cause an encoding failure when generating
font_glyph_byte_map.

@PJBrs
Contributor Author

PJBrs commented Mar 6, 2026

@stefan6419846 Apologies for the spam. I'm still trying to get all code coverage done, which is difficult for the font flags. That said, I've temporarily changed _page.py to initialise fonts from embedded font files in case they are present when extracting text, and all but one of the tests appear to pass! I think that this means that the approach I've followed so far is actually rather solid.

Other than fixing the tests, one thing remains - I think I'm now not adding a font resource if that would overwrite an existing font resource, so I should add something to change the font_name when I'm adding a font.

Also, if you would, please comment on the writer.add_font() method. It's not strictly necessary, but it's fun!

P.S. Multiple tests fail when I'm using embedded font file information for text extraction instead of PDF font resources. I've omitted all tests that fail because of problems with the embedded font file. The following tests still fail:

FAILED tests/test_reader.py::test_extract_text_hello_world - AssertionError: assert ['Ă\x03\x04\x...x02\x08', ...] == ['English:', ...ет, мир', ...]
FAILED tests/test_text_extraction.py::test_multi_language[None] - AssertionError: English not correctly extracted
FAILED tests/test_text_extraction.py::test_multi_language[<lambda>] - AssertionError: English not correctly extracted
FAILED tests/test_text_extraction.py::test_rotated_layout_mode - AssertionError: Contents should be in expected layout
FAILED tests/test_workflows.py::test_text_extraction_encrypted - AssertionError: assert False
FAILED tests/test_writer.py::test_new_removes - assert 'Arbeitsschritt' in '\x01\x02\x03Ѕ\x03\x06\x07\x08\x06\x04\t\x03\x05\n\x05\x0b\x05\x0c\x0c\r\x08ȇĂ\x03Ѕ\x03\x06\x07\x08\x06\r\x0e\r\x06\x03\x05\x04\x04\x02\r\x03ȉ\x0f\x03\x0f\n\x07\x03\n\x01ထሓ\x14\...

So, something's not quite right. Then again, the goal of this PR is not to always use an embedded font file for text extraction. And on a more positive note, given how many text extractions pass, I'm now more confident that the character_map derived from a font file is correct, and ought to be compatible with existing code.


Development

Successfully merging this pull request may close these issues.

Corrupted unicode characters in form field

2 participants