Conversation

lokendersinghft
Contributor

@lokendersinghft lokendersinghft commented Sep 24, 2025

This PR introduces a Go-based bodyXML → content tree transformer. The new transformer produces output in 1:1 parity with the current Node.js implementation, with the following differences:

  • Experimental blocks: If an experimental block contains unrecognised or nested
    elements that do not conform to the content tree schema, this transformer returns an error.
  • Tweet: Unlike the existing Node.js transformer, this implementation supports "tweet" tags.
  • Clipset: A Clipset type is defined in Go, and transformation logic has been implemented. This logic can be refined further as needed.

JIRA --> https://financialtimes.atlassian.net/browse/UPPSF-6453

@lokendersinghft lokendersinghft requested review from a team as code owners September 24, 2025 13:37
@lokendersinghft lokendersinghft changed the title from "transforms body-tree format to external bodyXML in Go" to "transform body-tree format to external bodyXML in Go" Sep 25, 2025
@lokendersinghft lokendersinghft changed the title from "transform body-tree format to external bodyXML in Go" to "transform content-tree to external bodyXML in Go" Sep 25, 2025
Contributor

@epavlova epavlova left a comment

This is a very important bit of the journey 👏


// Transform converts an external XHTML-formatted document into a content tree.
// It returns an error if the input contains unsupported HTML elements
// or does not comply with the content tree schema.
Contributor

thought (non-blocking): This is just food for thought. The JSON schemas have stricter rules than what we represent with the Go types. So I am thinking that we may want to still validate what is produced by this Transform. But I think it's better to do so where this Transform is used, in our case in the Content Tree API.
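
One way to picture validating at the call site is a small wrapper that runs the transform, serialises the result, and hands the JSON to a validator. This is only a sketch: `transformFunc`, `validateFunc`, `transformAndValidate`, `toyTransform` and `requireVersion` are hypothetical names, and the toy validator stands in for a real JSON Schema check.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// transformFunc and validateFunc are hypothetical signatures, not the
// transformer's actual API.
type transformFunc func(bodyXML string) (any, error)
type validateFunc func(treeJSON []byte) error

// transformAndValidate runs the transform, marshals the result to JSON and
// passes it to the validator, so the stricter schema rules are enforced by
// the caller rather than inside the transformer.
func transformAndValidate(bodyXML string, transform transformFunc, validate validateFunc) ([]byte, error) {
	tree, err := transform(bodyXML)
	if err != nil {
		return nil, fmt.Errorf("transform: %w", err)
	}
	out, err := json.Marshal(tree)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}
	if err := validate(out); err != nil {
		return nil, fmt.Errorf("schema validation: %w", err)
	}
	return out, nil
}

// toyTransform ignores its input and returns a body without a version.
func toyTransform(string) (any, error) {
	return map[string]any{"type": "body", "children": []any{}}, nil
}

// requireVersion is a toy validator that only checks one required property.
func requireVersion(treeJSON []byte) error {
	var doc map[string]any
	if err := json.Unmarshal(treeJSON, &doc); err != nil {
		return err
	}
	if _, ok := doc["version"]; !ok {
		return errors.New(`missing required property "version"`)
	}
	return nil
}

func main() {
	_, err := transformAndValidate("<body/>", toyTransform, requireVersion)
	fmt.Println(err)
}
```

Injecting the validator keeps the transformer free of any schema dependency, which matches the idea of validating in the Content Tree API rather than in the library.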

"http://www.ft.com/ontology/content/CustomCodeComponent": "/content/{{id}}",
"http://www.ft.com/ontology/content/MediaResource": "/content/{{id}}",
"http://www.ft.com/ontology/content/Video": "/content/{{id}}",
"http://www.ft.com/ontology/company/PublicCompany": "/organisations/{{id}}",
Contributor

suggestion: I am pretty sure that some old content pieces will have "http://www.ft.com/ontology/company/Organisation". Let's add this one as well.

Contributor Author

@lokendersinghft lokendersinghft Sep 25, 2025

I mimicked the transformation we perform in content-public-read. I don't see this URL there in the configs.

content-public-read

Contributor Author

Should we map http://www.ft.com/ontology/company/Organisation to /organisations/{{id}}?

Contributor

I think my previous comment is not very useful. I saw that contentTypeTemplates is used in generateUrl and in mapping t.Tag == "concept". However, if we need to account for the <concept>/<ft-concept> tags, then we need to support a full mapping of all concepts that could go in the tag, like Organisation, Person, etc.
However, we decided that we are going to remove the tag as part of @vlasakievft's efforts to clean the historical content.
@lokendersinghft I suggest keeping the list as it is. At some point the transformer shouldn't need to work with <concept>/<ft-concept>.
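
For readers following this thread, the template lookup under discussion can be sketched roughly as below. The map entries are the four visible in the diff excerpt (the real map may hold more), and `generateURL` is a hypothetical stand-in for the transformer's generateUrl, whose actual signature is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// contentTypeTemplates maps an ontology type URI to a URL path template.
// Only the entries visible in the diff excerpt are listed.
var contentTypeTemplates = map[string]string{
	"http://www.ft.com/ontology/content/CustomCodeComponent": "/content/{{id}}",
	"http://www.ft.com/ontology/content/MediaResource":       "/content/{{id}}",
	"http://www.ft.com/ontology/content/Video":               "/content/{{id}}",
	"http://www.ft.com/ontology/company/PublicCompany":       "/organisations/{{id}}",
}

// generateURL (hypothetical) substitutes the id into the template for the
// given type. It reports false when the type has no template, leaving the
// caller to decide how to treat an unmapped type.
func generateURL(contentType, id string) (string, bool) {
	tmpl, ok := contentTypeTemplates[contentType]
	if !ok {
		return "", false
	}
	return strings.ReplaceAll(tmpl, "{{id}}", id), true
}

func main() {
	url, ok := generateURL("http://www.ft.com/ontology/company/PublicCompany", "1234")
	fmt.Println(url, ok) // prints "/organisations/1234 true"

	// Organisation is not in the list, so no URL is produced for it.
	_, ok = generateURL("http://www.ft.com/ontology/company/Organisation", "1234")
	fmt.Println(ok) // prints "false"
}
```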

@lokendersinghft lokendersinghft force-pushed the feature/UPPSF-6436-xml-to-content-tree-go branch from cb15bbd to 33fa4d2 Compare September 25, 2025 09:07
Comment on lines +37 to +40
out := &contenttree.Root{
	Type: contenttree.RootType,
	Body: m,
}

question, thought: Do we want to return this root node from the transformer? Isn't the idea to use this transformer to give old content (and new content published without a tree) a tree body representation? The current schema we use for validating payloads expects the following format:

{ "type": "body", "version": 1, "children": [ { ...

Contributor Author

We decided to keep it this way so that we can easily introduce stuff to it later.
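
To make the two shapes in this exchange concrete, here is a rough sketch of a root wrapper around a body. `Node`, `Body`, `Root` and `marshalBody` are simplified, hypothetical stand-ins for the contenttree types, and the "root" type value and "body" JSON key are assumptions. The point is only that a caller validating against the body schema can serialise the inner body on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node, Body and Root are simplified stand-ins for the contenttree types.
type Node struct {
	Type  string `json:"type"`
	Value string `json:"value,omitempty"`
}

type Body struct {
	Type     string `json:"type"`
	Version  int    `json:"version"`
	Children []Node `json:"children"`
}

// Root wraps the body so that further fields can be added next to it later.
type Root struct {
	Type string `json:"type"`
	Body *Body  `json:"body"`
}

// marshalBody serialises only the inner body, which is the shape the
// payload schema quoted above expects.
func marshalBody(root *Root) (string, error) {
	b, err := json.Marshal(root.Body)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	root := &Root{
		Type: "root",
		Body: &Body{Type: "body", Version: 1, Children: []Node{{Type: "text", Value: "hello"}}},
	}
	bodyJSON, err := marshalBody(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(bodyJSON)
}
```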

suggestion (non-blocking): Consider whether you can separate convertToContentTree into smaller methods, for example one for the *etree.Element case and one for the *etree.CharData case. This way you would avoid the nested switch statements.
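
A minimal sketch of that split is shown below. To stay self-contained it does not import etree: `token`, `element` and `charData` are simplified stand-ins for etree.Token, *etree.Element and *etree.CharData, and the `node` type and tag mapping are illustrative only.

```go
package main

import "fmt"

// token, element and charData are stand-ins for the etree types; they are
// not the etree API.
type token any

type element struct {
	Tag      string
	Children []token
}

type charData struct {
	Data string
}

// node is a hypothetical, simplified content tree node.
type node struct {
	Type     string
	Value    string
	Children []node
}

// convert only dispatches on the token kind. The per-kind logic lives in
// convertElement and convertCharData, so no switch is nested in another.
func convert(t token) (node, error) {
	switch v := t.(type) {
	case *element:
		return convertElement(v)
	case *charData:
		return convertCharData(v), nil
	default:
		return node{}, fmt.Errorf("unsupported token %T", t)
	}
}

// convertElement maps a tag to a node type and recurses into the children.
func convertElement(e *element) (node, error) {
	var n node
	switch e.Tag {
	case "p":
		n.Type = "paragraph"
	case "strong":
		n.Type = "strong"
	default:
		return node{}, fmt.Errorf("unsupported element <%s>", e.Tag)
	}
	for _, c := range e.Children {
		child, err := convert(c)
		if err != nil {
			return node{}, err
		}
		n.Children = append(n.Children, child)
	}
	return n, nil
}

// convertCharData turns character data into a text node.
func convertCharData(c *charData) node {
	return node{Type: "text", Value: c.Data}
}

func main() {
	doc := &element{Tag: "p", Children: []token{&charData{Data: "hello"}}}
	n, err := convert(doc)
	fmt.Println(n, err)
}
```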

Comment on lines +48 to +62
var contentType = struct {
	ImageSet            string
	Video               string
	Content             string
	Article             string
	CustomCodeComponent string
	ClipSet             string
}{
	ImageSet:            "http://www.ft.com/ontology/content/ImageSet",
	Video:               "http://www.ft.com/ontology/content/Video",
	Content:             "http://www.ft.com/ontology/content/Content",
	Article:             "http://www.ft.com/ontology/content/Article",
	CustomCodeComponent: "http://www.ft.com/ontology/content/CustomCodeComponent",
	ClipSet:             "http://www.ft.com/ontology/content/ClipSet",
}

suggestion(non-blocking): Consider this:

type contentType = string
const (
	ImageSet            contentType = "http://www.ft.com/ontology/content/ImageSet"
	Video                           = "http://www.ft.com/ontology/content/Video"
	Content                         = "http://www.ft.com/ontology/content/Content"
	Article                         = "http://www.ft.com/ontology/content/Article"
	CustomCodeComponent             = "http://www.ft.com/ontology/content/CustomCodeComponent"
	ClipSet                         = "http://www.ft.com/ontology/content/ClipSet"
)
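
A note on the snippet above: because `contentType` is declared with `=`, it is an alias for `string`, and only `ImageSet` is declared with it; the other constants are untyped string constants. If compile-time separation from plain strings is wanted, one possible variant is a defined type with the type repeated on each constant. The sketch below assumes that goal; `isMedia` and its grouping are hypothetical and exist only to show the type being enforced.

```go
package main

import "fmt"

// contentType is a defined type (no "="), so a plain string variable is not
// accepted where a contentType is expected without an explicit conversion.
type contentType string

const (
	ImageSet            contentType = "http://www.ft.com/ontology/content/ImageSet"
	Video               contentType = "http://www.ft.com/ontology/content/Video"
	Content             contentType = "http://www.ft.com/ontology/content/Content"
	Article             contentType = "http://www.ft.com/ontology/content/Article"
	CustomCodeComponent contentType = "http://www.ft.com/ontology/content/CustomCodeComponent"
	ClipSet             contentType = "http://www.ft.com/ontology/content/ClipSet"
)

// isMedia is a hypothetical helper; the grouping is chosen only for
// illustration.
func isMedia(t contentType) bool {
	switch t {
	case ImageSet, Video, ClipSet:
		return true
	}
	return false
}

func main() {
	fmt.Println(isMedia(Video)) // prints "true"

	// A value read from XML arrives as a string and must be converted.
	raw := "http://www.ft.com/ontology/content/Article"
	fmt.Println(isMedia(contentType(raw))) // prints "false"
}
```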


## Overview
The Transformer converts external XHTML-formatted document into content tree.
It supports format stored in the **internalComponent** collection as well as the one returned by the **Internal Content API**.

nitpick: internalcomponents collection.

nitpick: Maybe we can expand the description slightly to mention the difference between the two representations supported by the transformer.

Contributor Author

I have added a bit more information on this.

@lokendersinghft lokendersinghft force-pushed the feature/UPPSF-6436-xml-to-content-tree-go branch from 33fa4d2 to 807a56c Compare September 25, 2025 10:03
@lokendersinghft lokendersinghft requested review from epavlova and todor-videv1 and removed request for epavlova September 26, 2025 07:46
@epavlova epavlova requested a review from a team September 26, 2025 11:28
@lokendersinghft lokendersinghft changed the title from "transform content-tree to external bodyXML in Go" to "transform bodyXMLto external content-tree in Go" Sep 29, 2025
@lokendersinghft lokendersinghft force-pushed the feature/UPPSF-6436-xml-to-content-tree-go branch from 807a56c to 66c99a0 Compare September 29, 2025 07:40
@lokendersinghft lokendersinghft merged commit b71779d into main Sep 29, 2025
2 checks passed
@lokendersinghft lokendersinghft deleted the feature/UPPSF-6436-xml-to-content-tree-go branch September 29, 2025 07:41
@lokendersinghft lokendersinghft restored the feature/UPPSF-6436-xml-to-content-tree-go branch October 6, 2025 13:30
@lokendersinghft lokendersinghft deleted the feature/UPPSF-6436-xml-to-content-tree-go branch October 6, 2025 13:31
3 participants