Include break nodes in DOCX documents #87

Merged
harshankur merged 1 commit into harshankur:master from matteolutz:feat/docx-page-breaks
Apr 21, 2026

Conversation

@matteolutz (Contributor) commented Apr 18, 2026

Description

This PR adds a way to deserialize line, page, and column breaks (w:br nodes) in DOCX documents. The feature is gated behind a new config flag.

Changes

  • Added new config flag includeBreakNodes that is set to false by default.
  • Added new union variant break to OfficeContentNodeType and a new metadata type BreakMetadata containing information about the break type (lineWrapping, page or column) and the break clear behaviour.
  • Added parsing logic in parseWord.
  • Updated the documentation to reflect those changes.
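
As a rough illustration of the changes listed above, here is a minimal sketch of how a w:br element might map to a break node. Only includeBreakNodes, the break variant, and BreakMetadata are named in this PR; the field names (breakType, clear), the helper functions, and the attribute-map input are assumptions made for this example and may not match the merged code. The sketch relies on WordprocessingML behaviour: w:type may be page, column, or textWrapping (the default when omitted), and w:clear may be none, left, right, or all. Mapping textWrapping to lineWrapping is likewise an assumption.

```typescript
// Hypothetical shapes: field names are assumptions, not the merged API.
type BreakType = "lineWrapping" | "page" | "column";
type BreakClear = "none" | "left" | "right" | "all";

interface BreakMetadata {
  breakType: BreakType;
  clear: BreakClear;
}

interface BreakNode {
  type: "break";
  metadata: BreakMetadata;
}

// Attributes of a single <w:br> element, keyed by qualified name.
type BrAttributes = Record<string, string | undefined>;

// A missing or unrecognised w:type is treated as a line-wrapping break.
function toBreakType(raw: string | undefined): BreakType {
  switch (raw) {
    case "page":
      return "page";
    case "column":
      return "column";
    default:
      return "lineWrapping";
  }
}

// A missing or unrecognised w:clear falls back to "none".
function toBreakClear(raw: string | undefined): BreakClear {
  switch (raw) {
    case "left":
      return "left";
    case "right":
      return "right";
    case "all":
      return "all";
    default:
      return "none";
  }
}

function toBreakMetadata(attrs: BrAttributes): BreakMetadata {
  return {
    breakType: toBreakType(attrs["w:type"]),
    clear: toBreakClear(attrs["w:clear"]),
  };
}

// Emits break nodes only when the flag is on; it defaults to false,
// matching the default described in this PR.
function collectBreaks(
  brElements: BrAttributes[],
  includeBreakNodes: boolean = false
): BreakNode[] {
  if (!includeBreakNodes) {
    return [];
  }
  return brElements.map(
    (attrs): BreakNode => ({ type: "break", metadata: toBreakMetadata(attrs) })
  );
}
```

With the flag left at its default, collectBreaks returns an empty list, so existing consumers would see no new node types unless they opt in.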

Which internal parser/component does this affect?

  • WordParser (DOCX)
  • PowerPointParser (PPTX)
  • ExcelParser (XLSX)
  • OpenOfficeParser (ODT/ODS/ODP)
  • PDFParser
  • OCR / Worker Pool
  • Infrastructure / Build / CLI
  • Other (Specify Below)

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Other (Specify Below)

Verification Results

Please describe the tests you ran to verify your changes.

  • Build Check: I have run npm run build and no errors occurred in Node or Browser bundles.
  • Parser Baselines: I have run npm test and there are no failures.
  • Manual Proof: Provide a brief snippet or screenshot of the new behavior/output.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@harshankur (Owner) left a comment

Thank you for the PR. I have requested a few changes; please address them.
Additionally, you updated the docs page for officeParser but not the README itself. Please update that too.
Further, please add tests for this feature. Once you have, run npm test baseline to generate the new baseline JSON file for DOCX as well. Manually verify that its contents are exactly what you expect, and commit that too.

And please squash your changes.

Comment thread src/parsers/WordParser.ts Outdated
Comment thread src/parsers/WordParser.ts Outdated
Comment thread src/parsers/WordParser.ts Outdated
Comment thread src/types.ts Outdated
Comment thread src/types.ts Outdated
@matteolutz force-pushed the feat/docx-page-breaks branch 2 times, most recently from f4abe27 to 9d600df, April 20, 2026 12:38
@matteolutz changed the title from "Include line and page breaks in DOCX documents" to "Include break nodes in DOCX documents", Apr 20, 2026
@matteolutz force-pushed the feat/docx-page-breaks branch from 9d600df to 4e19ff6, April 20, 2026 13:58
@harshankur harshankur merged commit 06036ca into harshankur:master Apr 21, 2026
4 checks passed

2 participants