Merged
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -6,18 +6,31 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [Unreleased]

## [0.7.0] - 2026-03-14

### Added
- Unit tests for `blocks.lua`, `academic.lua`, `reader_inlines.lua`, `reader_blocks.lua`, `reader_academic.lua` (235 new assertions, 505 total across 11 test files)
- Golden baseline comparison (`make test-golden`) in CI pipeline
- `make test-all` target for single-command full validation (lint + test + golden + reader + validate)
- Wildcard test discovery in `make test-unit` (auto-discovers new test files)
- Unit tests for task list checkboxes, subfigure extraction, LineBlock handler, inline edge cases (Quoted, SmallCaps, Span anchors), and reader block handler output (130 new assertions, 621 total)
- Reader round-trip golden baselines (`make test-reader-golden`, `make update-reader-golden`)
- Shared `THEOREM_VARIANTS` and `THEOREM_REF_PREFIXES` constants in `lib/utils.lua`
- SECURITY.md with supported version policy

### Changed
- **Content version** bumped from `0.7.0` to `0.7.1` to align with cdx-core v0.7.1
- Aligned `make lint` flags with CI (`--no-unused-args --no-max-line-length`)
- Updated CONTRIBUTING.md pre-PR command to `make test-all`
- Consolidated theorem variant and reference prefix definitions from `academic.lua`/`inlines.lua` into shared `utils.lua` constants
- Removed dead code: unused `caption` parameter from `blocks.image()`, unused `convert_inline`/`_handlers` exports from `inlines.lua`, `has_class` re-export from `inlines.lua`

### Fixed
- README Pandoc version requirement: corrected from 2.11+ to 3.0+
- Table cell complex content: now collects all child nodes from nested blocks, not just text nodes
- Measurement sentinel: defaults to `value=0`/`unit=""` with stderr warnings for unparseable input
- RawBlock format preservation: sets `language` field from block format (e.g., `html`, `latex`)
- Citation prefix/suffix: prefix applied only to first citation, suffix/locator only to last (multi-citation fix)

## [0.6.0] - 2026-02-17

27 changes: 26 additions & 1 deletion Makefile
@@ -9,7 +9,7 @@ TEST_INPUTS := $(wildcard tests/inputs/*.md)
 TEST_OUTPUTS := $(patsubst tests/inputs/%.md,tests/outputs/%.json,$(TEST_INPUTS))
 TEST_CDX := $(patsubst tests/inputs/%.md,tests/outputs/%.cdx,$(TEST_INPUTS))
 
-.PHONY: all test clean test-json test-cdx test-reader test-unit test-golden test-all help check-deps validate-schema lint
+.PHONY: all test clean test-json test-cdx test-reader test-unit test-golden test-reader-golden test-all help check-deps validate-schema lint
 
 all: test
 
@@ -139,6 +139,31 @@ update-golden: test-json
 	done
 	@echo "Golden baselines updated."
 
+# Compare reader round-trip outputs against golden baselines
+# Note: Pandoc Span/Div attribute ordering is non-deterministic between process
+# invocations, so this target is NOT included in test-all. Use update-reader-golden
+# to capture baselines, then run test-reader-golden in the same make invocation:
+#   make update-reader-golden && make test-reader-golden
+test-reader-golden: test-reader
+	@echo "Comparing reader round-trip against golden outputs..."
+	@fail=0; for f in tests/outputs/*.roundtrip.md; do \
+		base=$$(basename $$f); \
+		if [ -f tests/expected/$$base ]; then \
+			diff -q $$f tests/expected/$$base > /dev/null 2>&1 || { echo "DIFF: $$base"; diff tests/expected/$$base $$f | head -20; fail=1; }; \
+		fi; \
+	done; \
+	[ $$fail -eq 0 ] && echo "All reader golden tests passed." || { echo "Reader golden test failures detected."; exit 1; }
+
+# Regenerate reader round-trip golden baselines
+update-reader-golden: test-reader
+	@echo "Updating reader round-trip golden baselines..."
+	@mkdir -p tests/expected
+	@for f in tests/outputs/*.roundtrip.md; do \
+		base=$$(basename $$f); \
+		cp $$f tests/expected/$$base; \
+	done
+	@echo "Reader golden baselines updated."
+
 # Run all tests (unit + integration + golden + reader + lint + validate)
 test-all: lint test test-golden test-reader validate
 
6 changes: 5 additions & 1 deletion README.md
@@ -238,7 +238,7 @@ The writer produces a JSON structure with three sections:
     }
   },
   "content": {
-    "version": "0.7.0",
+    "version": "0.7.1",
     "blocks": [...]
   },
   "dublin_core": {
@@ -380,6 +380,10 @@ The reader converts Codex back to standard Pandoc elements. Most block types sur
 - [codex-file-format-spec](https://github.com/Entrolution/codex-file-format-spec) - Format specification
 - [cdx-core](https://github.com/Entrolution/cdx-core) - Rust library and CLI
 
+## Security
+
+See [SECURITY.md](SECURITY.md) for the supported version policy and how to report vulnerabilities.
+
 ## Contributing
 
 Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before submitting PRs.
29 changes: 29 additions & 0 deletions SECURITY.md
@@ -0,0 +1,29 @@
# Security Policy

## Supported Versions

| Version | Supported |
|---------|--------------------|
| 0.7.x | :white_check_mark: |
| < 0.7.0 | :x: |

Only the latest patch release of the current minor version receives security updates.

## Reporting a Vulnerability

If you discover a security vulnerability, please report it responsibly:

1. **Do not** open a public issue.
2. Email the maintainers or use [GitHub Security Advisories](https://github.com/Entrolution/cdx-pandoc/security/advisories/new) to report privately.
3. Include a description of the vulnerability, steps to reproduce, and any potential impact.

We will acknowledge receipt within 48 hours and aim to provide a fix or mitigation plan within 7 days.

## Scope

cdx-pandoc is a document conversion tool that processes untrusted input (Markdown, LaTeX, Word, etc.) via Pandoc. Security considerations include:

- **Input handling**: The Lua writer processes Pandoc AST structures. Malformed input is handled by Pandoc's parser before reaching this code.
- **Shell script**: `scripts/pandoc-to-cdx.sh` invokes external tools (`pandoc`, `jq`, `sha256sum`, `zip`). File paths are quoted to prevent injection.
- **No network access**: The writer and reader operate entirely offline with no network calls.
- **No code execution**: The writer produces static JSON output. No user-supplied code is evaluated.
2 changes: 1 addition & 1 deletion codex.lua
@@ -53,7 +53,7 @@ blocks.set_academic(academic)
 
 -- Spec version
 local CODEX_VERSION = "0.1"
-local CONTENT_VERSION = "0.7.0"
+local CONTENT_VERSION = "0.7.1"
 
 -- Generate ISO 8601 timestamp
 local function iso_timestamp()
36 changes: 23 additions & 13 deletions lib/academic.lua
@@ -23,19 +23,21 @@ function M.set_extension_tracker(tracker)
     track_extension = tracker or function() end
 end
 
--- Theorem variant set
-local theorem_variants = {
-    theorem = true, lemma = true, proposition = true, corollary = true,
-    definition = true, conjecture = true, remark = true, example = true
-}
+-- Theorem variant set (from shared constants)
+local theorem_variants = utils.THEOREM_VARIANTS
 
 -- Academic class set (all classes handled by this module)
-local academic_classes = {
-    theorem = true, lemma = true, proposition = true, corollary = true,
-    definition = true, conjecture = true, remark = true, example = true,
-    proof = true, exercise = true, ["exercise-set"] = true,
-    algorithm = true, abstract = true, ["equation-group"] = true
-}
+-- Derived from theorem variants plus non-theorem academic types
+local academic_classes = {}
+for k in pairs(theorem_variants) do
+    academic_classes[k] = true
+end
+academic_classes.proof = true
+academic_classes.exercise = true
+academic_classes["exercise-set"] = true
+academic_classes.algorithm = true
+academic_classes.abstract = true
+academic_classes["equation-group"] = true
 
 -- Classify a Div as an academic block type or nil
 -- @param classes Array of CSS classes
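
The derivation in the hunk above can be tried in isolation. This is a minimal sketch, assuming `utils.THEOREM_VARIANTS` holds the same eight variants as the literal table it replaces; `build_academic_classes` is a hypothetical helper name used only for illustration and is not part of the PR.

```lua
-- Assumption: mirrors the eight variants from the removed inline table.
local THEOREM_VARIANTS = {
    theorem = true, lemma = true, proposition = true, corollary = true,
    definition = true, conjecture = true, remark = true, example = true
}

-- Hypothetical helper: copy the shared set, then add the non-theorem classes.
-- Copying (rather than aliasing) keeps the shared constant unmodified.
local function build_academic_classes(variants)
    local classes = {}
    for name in pairs(variants) do
        classes[name] = true
    end
    local extras = {
        "proof", "exercise", "exercise-set",
        "algorithm", "abstract", "equation-group"
    }
    for _, name in ipairs(extras) do
        classes[name] = true
    end
    return classes
end

local academic_classes = build_academic_classes(THEOREM_VARIANTS)
```

Because the derived table is a copy, adding `proof` and the other extras does not leak back into the shared variant set that other modules read.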
@@ -363,12 +365,20 @@ function M.equation_group(text)
         return nil
     end
 
+    -- Split on \\ (double backslash) line breaks using plain string find
+    -- Cannot use gmatch character class here: [^\\\\]+ splits on every single
+    -- backslash, which would fragment LaTeX commands like \frac, \alpha, etc.
     local lines = {}
-    for line in inner:gmatch("[^\\\\]+") do
-        local trimmed = line:match("^%s*(.-)%s*$")
+    local pos = 1
+    while pos <= #inner do
+        local s, e = inner:find("\\\\", pos, true)
+        local segment = s and inner:sub(pos, s - 1) or inner:sub(pos)
+        local trimmed = segment:match("^%s*(.-)%s*$")
         if trimmed and trimmed ~= "" then
             table.insert(lines, trimmed)
         end
+        if not s then break end
+        pos = e + 1
     end
 
     return {
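
To see why the character-class approach fragments LaTeX commands, the two strategies can be compared side by side. This is a minimal sketch: `split_rows` and `split_on_any_backslash` are hypothetical standalone names, and the sample input is made up for illustration.

```lua
-- New behaviour: split only on the two-character sequence \\ (plain find).
local function split_rows(inner)
    local rows = {}
    local pos = 1
    while pos <= #inner do
        -- Fourth argument `true` disables pattern matching; the Lua literal
        -- "\\\\" is exactly two backslash characters.
        local s, e = inner:find("\\\\", pos, true)
        local segment = s and inner:sub(pos, s - 1) or inner:sub(pos)
        local trimmed = segment:match("^%s*(.-)%s*$")
        if trimmed ~= "" then
            rows[#rows + 1] = trimmed
        end
        if not s then break end
        pos = e + 1
    end
    return rows
end

-- Old behaviour: the class [^\\]+ matches runs of non-backslash characters,
-- so every single backslash acts as a separator.
local function split_on_any_backslash(inner)
    local parts = {}
    for part in inner:gmatch("[^\\\\]+") do
        local trimmed = part:match("^%s*(.-)%s*$")
        if trimmed ~= "" then
            parts[#parts + 1] = trimmed
        end
    end
    return parts
end

-- Two equation rows separated by a LaTeX line break.
local src = "a &= \\frac{1}{2} \\\\ b &= \\alpha"
local rows = split_rows(src)
local fragments = split_on_any_backslash(src)
```

With this input, `rows` holds the two intended equation lines, while `fragments` holds four pieces because `\frac` and `\alpha` are each cut at their leading backslash.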
22 changes: 8 additions & 14 deletions lib/blocks.lua
@@ -83,12 +83,16 @@ block_handlers.DefinitionList = function(block) return M.definition_list(block)
 block_handlers.Figure = function(block) return M.figure(block) end
 
 block_handlers.RawBlock = function(block)
-    return {
+    local result = {
         type = "codeBlock",
         children = {
             {type = "text", value = block.text}
         }
     }
+    if block.format and block.format ~= "" then
+        result.language = block.format
+    end
+    return result
 end
 
 block_handlers.LineBlock = function(block)
@@ -517,13 +521,11 @@ function M.table_cell(cell)
     if #blocks > 0 and blocks[1].type == "paragraph" then
         children = blocks[1].children
     else
-        -- Complex cell content - just use first text we find
+        -- Complex cell content - collect all children from converted blocks
         for _, b in ipairs(blocks) do
             if b.children then
                 for _, c in ipairs(b.children) do
-                    if c.type == "text" then
-                        table.insert(children, c)
-                    end
+                    table.insert(children, c)
                 end
             end
         end
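
The effect of dropping the `c.type == "text"` filter can be illustrated on plain tables. This is a minimal sketch: `collect_cell_children` and its `text_only` flag are hypothetical, and the `image` and `horizontalRule` node types in the sample data are illustrative assumptions rather than confirmed Codex block types.

```lua
-- Hypothetical helper: flatten the children of converted blocks.
-- text_only = true reproduces the old filter; false reproduces the new code.
local function collect_cell_children(blocks, text_only)
    local children = {}
    for _, b in ipairs(blocks) do
        if b.children then
            for _, c in ipairs(b.children) do
                if not text_only or c.type == "text" then
                    children[#children + 1] = c
                end
            end
        end
    end
    return children
end

-- Sample converted blocks (node types other than "text" are assumptions).
local blocks = {
    { type = "paragraph", children = {
        { type = "text", value = "a" },
        { type = "image", src = "x.png" },
    } },
    { type = "codeBlock", children = { { type = "text", value = "b" } } },
    { type = "horizontalRule" },  -- no children, skipped either way
}

local old_result = collect_cell_children(blocks, true)
local new_result = collect_cell_children(blocks, false)
```

Under these assumptions the old filter silently drops the non-text child, while the new loop keeps all three children in document order.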
@@ -807,7 +809,7 @@ function M.extract_subfigure(div, attrs, div_id)
 end
 
 -- Convert Image inline to image block (with optional dimensions)
-function M.image(img, caption)
+function M.image(img)
     local src = img.src or img.target or ""
     local alt = ""
 
@@ -840,14 +842,6 @@ function M.image(img, caption)
         end
     end
 
-    -- Caption passed from outside (legacy path)
-    if caption and caption.long then
-        local cap_text = pandoc.utils.stringify(caption.long)
-        if cap_text and cap_text ~= "" then
-            result.title = cap_text
-        end
-    end
-
     return result
 end
 
31 changes: 14 additions & 17 deletions lib/inlines.lua
@@ -4,6 +4,7 @@
 -- Load shared utilities
 local utils = dofile((PANDOC_SCRIPT_FILE and (PANDOC_SCRIPT_FILE:match("(.*/)" ) or "") or "") .. "lib/utils.lua")
 local deep_copy = utils.deep_copy
+local has_class = utils.has_class
 
 local M = {}
 
@@ -37,9 +38,6 @@ function M.reset_context()
     M._default_context = M.new_context()
 end
 
--- Check if a class list contains a specific class (delegates to utils)
-M.has_class = utils.has_class
-
 -- Check if two marks are equal
 local function marks_equal(m1, m2)
     if type(m1) ~= type(m2) then
@@ -199,11 +197,8 @@ inline_handlers.Code = function(inline, marks, ctx)
     return {M.text_node(inline.text, new_marks)}
 end
 
--- Academic cross-reference prefix patterns
-local theorem_ref_prefixes = {
-    ["thm-"] = true, ["lem-"] = true, ["prop-"] = true, ["cor-"] = true,
-    ["def-"] = true, ["conj-"] = true, ["rem-"] = true, ["ex-"] = true
-}
+-- Academic cross-reference prefix patterns (from shared constants)
+local theorem_ref_prefixes = utils.THEOREM_REF_PREFIXES
 
 local function detect_academic_ref(target)
     if not target or not target:match("^#") then
@@ -358,7 +353,7 @@ inline_handlers.Span = function(inline, marks, ctx)
     local has_semantic_mark = false
 
     -- Entity class
-    if M.has_class(classes, "entity") then
+    if has_class(classes, "entity") then
         track_extension(utils.EXT_SEMANTIC)
         local entity_mark = {type = "semantic:entity"}
         if attributes.uri then
@@ -377,7 +372,7 @@ inline_handlers.Span = function(inline, marks, ctx)
     end
 
     -- Glossary class
-    if M.has_class(classes, "glossary") then
+    if has_class(classes, "glossary") then
         track_extension(utils.EXT_SEMANTIC)
         local glossary_mark = {type = "semantic:glossary"}
         if attributes.ref then
@@ -391,10 +386,18 @@ inline_handlers.Span = function(inline, marks, ctx)
     end
 
     -- Measurement class
-    if M.has_class(classes, "measurement") then
+    if has_class(classes, "measurement") then
         local text = pandoc.utils.stringify(inline.content)
         local value = tonumber(attributes.value) or tonumber(text:match("([%d%.]+)"))
         local unit = attributes.unit or text:match("%d+%.?%d*%s*(%a+)")
+        if not value then
+            io.stderr:write("Warning: measurement missing parseable value: " .. text .. "\n")
+            value = 0
+        end
+        if not unit then
+            io.stderr:write("Warning: measurement missing parseable unit: " .. text .. "\n")
+            unit = ""
+        end
         return {{
             type = "measurement_sentinel",
             value = value,
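
The fallback in the hunk above can be exercised outside Pandoc. This is a minimal sketch: `parse_measurement` is a hypothetical standalone function, its `text` argument stands in for `pandoc.utils.stringify(inline.content)`, and the returned warning count is added here only to make the behaviour observable.

```lua
-- Hypothetical standalone version of the measurement parsing and defaults.
local function parse_measurement(text, attributes)
    attributes = attributes or {}
    -- Explicit attributes win; otherwise parse the visible text.
    local value = tonumber(attributes.value) or tonumber(text:match("([%d%.]+)"))
    local unit = attributes.unit or text:match("%d+%.?%d*%s*(%a+)")
    local warnings = 0
    if not value then
        io.stderr:write("Warning: measurement missing parseable value: " .. text .. "\n")
        value = 0
        warnings = warnings + 1
    end
    if not unit then
        io.stderr:write("Warning: measurement missing parseable unit: " .. text .. "\n")
        unit = ""
        warnings = warnings + 1
    end
    return value, unit, warnings
end

local v1, u1, w1 = parse_measurement("12.5 kg")
local v2, u2, w2 = parse_measurement("unknown")
local v3, u3, w3 = parse_measurement("n/a", { value = "3", unit = "m" })
```

Note that the unit pattern only matches letters that follow a number, so text without digits yields both defaults (`0` and `""`) unless the attributes supply them.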
@@ -452,12 +455,6 @@ convert_inline = function(inline, marks, ctx)
     return {}
 end
 
--- Export convert_inline as module function
-M.convert_inline = convert_inline
-
--- Expose handlers table for testing
-M._handlers = inline_handlers
-
 -- Check if a node is a sentinel (non-text node that needs block-level handling)
 local function is_sentinel(node)
     return node.type and node.type:match("_sentinel$")
31 changes: 25 additions & 6 deletions lib/metadata.lua
@@ -109,7 +109,8 @@ local function extract_authors(meta)
 
     local t = author.t or author.tag
 
-    if t == "MetaList" or (type(author) == "table" and #author > 0) then
+    -- MetaList is always a list of authors
+    if t == "MetaList" then
         local authors = {}
         for _, a in ipairs(author) do
             local author_obj = extract_single_author(a)
@@ -120,11 +121,29 @@ local function extract_authors(meta)
         if #authors > 0 then
             return authors
         end
-    else
-        -- Single author
-        local author_obj = extract_single_author(author)
-        if author_obj then
-            return { author_obj }
+        return nil
+    end
+
+    -- MetaInlines, MetaString, or Pandoc 3.x Inlines → single author
+    -- Try extracting as a single author first; this handles both tagged
+    -- MetaInlines/MetaString and Pandoc 3.x Inlines (which have no tag
+    -- but are stringify-able)
+    local single = extract_single_author(author)
+    if single then
+        return { single }
+    end
+
+    -- Fallback: untagged table that isn't stringify-able → try as list
+    if type(author) == "table" and not t and #author > 0 then
+        local authors = {}
+        for _, a in ipairs(author) do
+            local author_obj = extract_single_author(a)
+            if author_obj then
+                table.insert(authors, author_obj)
+            end
+        end
+        if #authors > 0 then
+            return authors
         end
     end
 
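
The new three-step dispatch order (tagged list, then single author, then untagged list fallback) can be sketched without Pandoc. Everything here is an assumption for illustration: `extract_single_author` is a stand-in that accepts only non-empty strings (the real helper is not shown in this diff), `collect` is a hypothetical helper, and returning `nil` at the end is a guess at what the code beyond the hunk does.

```lua
-- Stand-in (assumption): only non-empty strings count as a single author.
local function extract_single_author(value)
    if type(value) == "string" and value ~= "" then
        return { name = value }
    end
    return nil
end

-- Hypothetical helper: convert each list entry, dropping failures.
local function collect(list)
    local authors = {}
    for _, a in ipairs(list) do
        local obj = extract_single_author(a)
        if obj then
            authors[#authors + 1] = obj
        end
    end
    if #authors > 0 then
        return authors
    end
    return nil
end

local function extract_authors(author)
    local t = type(author) == "table" and (author.t or author.tag) or nil
    -- 1. A tagged MetaList is always treated as a list of authors.
    if t == "MetaList" then
        return collect(author)
    end
    -- 2. Anything that converts as a single author is returned as one.
    local single = extract_single_author(author)
    if single then
        return { single }
    end
    -- 3. Untagged, non-empty table that failed step 2: try it as a list.
    if type(author) == "table" and not t and #author > 0 then
        return collect(author)
    end
    return nil
end

local one = extract_authors("Ada Lovelace")
local tagged = extract_authors({ t = "MetaList", "Ada Lovelace", "Alan Turing" })
local untagged = extract_authors({ "Ada Lovelace", "Alan Turing" })
```

In real Pandoc 3.x the untagged `Inlines` case is caught by step 2 because the value is stringify-able; the stand-in above cannot model that, so its untagged table falls through to step 3 instead.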
3 changes: 2 additions & 1 deletion lib/reader_blocks.lua
@@ -188,7 +188,8 @@ function M.convert_list_item(item)
     for _, child in ipairs(item.children or {}) do
         local converted = M.convert_block(child)
         if converted then
-            if type(converted) == "table" and converted.tag then
+            -- Pandoc elements are userdata with t/tag accessors, not plain tables
+            if converted.t or converted.tag then
                 table.insert(blocks, converted)
             elseif type(converted) == "table" and #converted > 0 then
                 for _, b in ipairs(converted) do