ENH: CID font resource from font file to encode more characters#3652
ENH: CID font resource from font file to encode more characters#3652PJBrs wants to merge 22 commits intopy-pdf:mainfrom
Conversation
175a542 to
e43c57d
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3652 +/- ##
==========================================
- Coverage 97.36% 97.24% -0.12%
==========================================
Files 55 55
Lines 9937 10093 +156
Branches 1820 1848 +28
==========================================
+ Hits 9675 9815 +140
- Misses 152 162 +10
- Partials 110 116 +6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke. Codecov shows that quite some new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real truetype fonts from file to see if they parse correctly. That, however, would seem, to me, to be beyond the purposes of this PR. Conversely, it would seem a shame to me not to parse these flags. How should I continue? One final thing: I can also still improve this, if wanted. |
160d8d5 to
cf9b10e
Compare
5b3cd93 to
cbc9ee4
Compare
|
I clearly must need to learn more about fonts in order to get this PR sufficient. I've learnt the following now: What we have in character_map actually is pypdf's representation of a /ToUnicode character mapping. Reflection by Google Gemini:
|
|
I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well. |
OK, I'll forge on then, using this PR for note taking. I'm finding now that it's not even very wrong, it's just overly simplistic. What I've learned in the interim: Font.encoding and Font.character_map can be said to map the same thing: decoded text (int in the context of simple fonts, str in the context of CID fonts) to unicode code points. EDIT LIKE THE /ToUnicode DICT IN A FONT RESOURCE This is helpful for extracting text from pdfs and it can also map different glyph substitutes to single unicode code points, e.g., for Arabic text. For this reason, character_widths actually is wrong, because it maps unicode code points to widths, which does not work if multiple glyphs map to the same unicode code point (e.g., Arabic). Furthermore, it doesn't really work for the reverse logic of producing text. In this PR, I think that I populated character_map in from_truetype_font_file in reverse, mapping unicode code points to character IDs. Otherwise, it actually seems to work, with the caveats that the character_widths are lossy when one unicode code point maps to multiple glyphs. So, for the purposes of abstraction I could actually merge map_dict and encoding without losing any information or functionality:
It would seem to that the underlying pypdf font / encoding / character_map architecture can be improved in three ways:
For this PR, it doesn't matter too much, I can just clean up the logic in character_map and then it should work for fonts where the CIDtoGID map is contiguous (this is another caveat). |
|
@stefan6419846 OK, final verdict for now - character_widths need to be keyed by CID for cid fonts or by character code for simple fonts. My decision to use the character widths code from the new Font class was unfortunately incorrect. I can still fix the above PR, I think, at least logically, and it will also mostly work, but not for any text that needs to be run through a text shaper. (I now, finally, understand that many fonts contain glyphs without a unicode code point, such as ligatures. You cannot address these using a unicode code point, and you also cannot get their widths through a unicode code point. Instead, you need to read their widths and glyphs by CID (for CID fonts) / character code (for simple fonts). This was what the old build_font_width_map code did in _cmap.py.) I'll fix this PR according to the new logic, but then I'll revert the character widths to the old logic that was in _cmap.py, port the old text extraction code back to it (should be simple) and port the layout text extraction code and the appearance stream code to it (will be harder). And that will be a good basis for generating arabic code. |
6919ab5 to
744c593
Compare
|
Something's still weird, font is not listed as embedded... |
|
OK, I may have reached somewhat of a breakthrough. I can now fully embed a font and associated font resource and encode new text. I needed to add a character_map after all, but not in the way that I thought. I can also do so while reusing a compatible font resource. Main remaining problems include:
|
|
I think that this PR now starts to get somewhat useful. As far as I can tell, it now no longer matters whether I create a new /FontDescriptor resource or use the old one. Also, visual text now corresponds with copy-pasted text. I'm going to change the api a little bit so that I can actually embed a font from a ttf file using writer.add_font(). In that way, it is easier to test the new font methods and different encodings. Tests all fail because I didn't fix the new test after adding a character_map. Probably needs another week of work. |
|
I fixed the test, now I'm just filling the form and then extracting the text using PdfReader. This is nice, because nothing changed in the code for text extraction, which means that the new code would be logically the same as the old. |
This enables generating a new unicode font resource in case of text widget values that cannot be encoded with existing font resources.
This patch adds a method to produce a pdf font descriptor resource. For now, we assume that an embedded font file will be a TrueType font.
Also refactor to reduce complexity
This patch adds back some code that got removed earlier and, at that time, did not see any test coverage. With new code that enables adding fonts, I've finally understood that, in some cases, a -1 key will be added to font.character_map. This will cause an encoding failure when generating font_glyph_byte_map.
|
@stefan6419846 Apologies for the spam. I'm still trying to get all code coverage done, which is difficult for the font flags. That said, I've temporarily changed _page.py to initialise fonts from embedded font files in case they are present when extracting text, and all but one tests appear to pass! I think that this means that the approach I've followed so far is actually rather solid. Other than fixing the tests, one thing remains - I think I'm now not adding a font resource if that would overwrite an existing font resource, so I should add something to change the font_name when I'm adding a font. Also, if you would, please comment on the writer.add_font() method. It's not strictly necessary, but it's fun! P.S., multiple tests fail when I'm using embedded font file information for text extraction instead of pdf font resources.. I've omitted all tests that fail because of problems with the embedded font file. The following tests still fail: So, something's not quite right. Then again, the goal of this PR is not to always use an embedded font file for text extraction. And on a more positive note, given how many text extractions pass, I'm now more confident that the character_map derived from a font file is correct, and ought to be compatible with existing code |
This PR adds a new method to _font.py,
from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, to enable producing a CID TrueType font resource that enables encoding more characters than a TrueType font resource.This fixes #3361.
Contributes to fixing #3514.
Might be related to #3318. EDIT, it is not.
Includes all work from #3602.
EDIT.
How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
Some notes:
I think that the more elegant way would be produce a short embedded font resource with only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.