Localization concept needs improvement #40
Hey,
after attending the first ever Schematron Users Meetup at XML Prague this year, I'm thrilled to see that Schematron is coming back to life. Thanks @rjelliffe, @AndrewSales and @tgraham-antenna for your work!
As a contributor to the EpubCheck project (EPUB validation) and the SQF Schematron QuickFix project, I'd like to open up this issue and start a discussion about improvements to the Schematron localization concepts — or at least for the Skeleton implementation.
The EpubCheck project uses Java properties files for localization, but it also has several Schematron checks which cannot be localized at the moment because the official Skeleton implementation used by the Jing validator does not support this. This has been under discussion since October 2014 in issue w3c/epubcheck#474.
And more recently, the SQF project struggled with this as well in schematron-quickfix/sqf#1.
Annex G of the ISO Schematron specification defines the use of multilingual Schematron as follows:
Diagnostics in multiple languages may be supported by using a different diagnostic element for each language, with the appropriate xml:lang language attribute, and referencing all the unique identifiers of the diagnostic elements in the diagnostics attribute of the assertion.
Annex G gives a simple example of a multi-lingual schema.
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:title>Example of Multi-Lingual Schema</sch:title>
  <sch:pattern>
    <sch:rule context="dog">
      <sch:assert test="bone" diagnostics="d1 d2">A dog should have a bone.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">A dog should have a bone.</sch:diagnostic>
    <sch:diagnostic id="d2" xml:lang="de">Ein Hund sollte ein Bein haben.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
However, this never worked in the original Skeleton implementation, as it would display both messages and not only the one from the current locale.
oXygen XML has implemented a workaround for this issue by tweaking the original Skeleton implementation so that it only shows the message for the current locale. Possibly they can contribute this change as a pull request.
However, there's another shortcoming of the diagnostic-based localization concept: the developer has to actively reference every language with a separate ID in the diagnostics attribute, which makes it hard to add new localizations.
At XML Prague, Octavian from oXygen XML (@octavianN), Nico from the SQF project (@nkutsche), Patrik (@PStellmann) & Vanessa (@vanessakastmann) from the DITA-SEMIA project and I sat together to discuss the SQF issue schematron-quickfix/sqf#1, but we quickly came to the conclusion that the localization support in the Schematron standard or the Skeleton implementation needs to be improved in order to properly resolve issues like the EpubCheck or SQF ones.
We discussed the following solutions, which I want to outline here as a basis for discussion. You should also know that we discussed this with the use case of externalizing the messages to separate files (e.g. for Translation Memory Systems) in mind.
Solution 1: Fix the Skeleton
The Skeleton should be fixed to at least support the Annex G example properly: only output the message in the current locale and not ALL diagnostic elements.
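To make the intended behaviour concrete, here is a minimal XSLT 2.0 sketch of the selection step. It is not the Skeleton's code: it is a standalone stylesheet that runs on the schema itself, does not evaluate any assertions, and the `locale` parameter and the `diagnostic-by-id` key are names made up for this illustration. It relies on the standard `lang()` function, which also honours an inherited `xml:lang` and sublanguages such as `de-DE`.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- Hypothetical parameter: the locale requested by the user -->
  <xsl:param name="locale" select="'en'"/>

  <xsl:output method="text"/>

  <!-- Index every diagnostic by its id -->
  <xsl:key name="diagnostic-by-id" match="sch:diagnostic" use="@id"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//sch:assert | //sch:report"/>
  </xsl:template>

  <xsl:template match="sch:assert | sch:report">
    <!-- All diagnostics referenced by this assertion -->
    <xsl:variable name="referenced"
        select="key('diagnostic-by-id',
                    tokenize(normalize-space(@diagnostics), ' '))"/>
    <!-- Keep only those whose xml:lang (own or inherited) matches the locale -->
    <xsl:variable name="localized" select="$referenced[lang($locale)]"/>
    <xsl:choose>
      <xsl:when test="$localized">
        <xsl:value-of select="$localized/normalize-space()" separator="&#10;"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- No matching diagnostic: fall back to the assertion text -->
        <xsl:value-of select="normalize-space()"/>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Run against the Annex G example with `locale` set to `de`, this should pick only the `d2` diagnostic instead of emitting both. A real fix would apply the same `[lang($locale)]` filter wherever the Skeleton expands the diagnostics attribute.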
Solution 2: Remove ID/IDREF constraint from Schematron schema
This is more like a long-term solution as the standardized schema would need to be changed.
What we would like to achieve is something like this:
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:title>Example of Multi-Lingual Schema</sch:title>
  <sch:pattern>
    <sch:rule context="dog">
      <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:diagnostics>
    <sch:diagnostic id="d1" xml:lang="en">English message.</sch:diagnostic>
    <sch:diagnostic id="d1" xml:lang="de">German message.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
- Only reference the message ID (which isn't of datatype ID anymore) once and let the Skeleton or any other implementation choose the proper diagnostic element.
- Schematron rule: Enforce the xml:lang attribute with different values when two or more diagnostic elements with the same id are present.
Current status: This does not validate because of the ID/IDREF datatypes.
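The Schematron rule mentioned in the list above could look roughly like the following. This is only a sketch of a check that would run against a schema once the ID/IDREF constraint has been relaxed; the variable names (`id`, `lang`, `twins`) and the message wording are my own, and it assumes the xslt2 query binding.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:title>Sketch: diagnostics sharing an id must differ in xml:lang</sch:title>
  <sch:ns prefix="sch" uri="http://purl.oclc.org/dsdl/schematron"/>
  <sch:pattern>
    <sch:rule context="sch:diagnostic">
      <sch:let name="id" value="@id"/>
      <sch:let name="lang" value="@xml:lang"/>
      <!-- All diagnostics in the same container with the same id, including this one -->
      <sch:let name="twins" value="../sch:diagnostic[@id = $id]"/>
      <!-- As soon as an id is shared, every holder needs its own xml:lang -->
      <sch:assert test="count($twins) = 1 or @xml:lang">
        Diagnostics sharing the id "<sch:value-of select="$id"/>" must each have an xml:lang attribute.
      </sch:assert>
      <!-- No two diagnostics with the same id may declare the same language -->
      <sch:assert test="count($twins[@xml:lang = $lang]) &lt;= 1">
        More than one diagnostic with the id "<sch:value-of select="$id"/>" uses the language "<sch:value-of select="$lang"/>".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The comparison is a plain string comparison on the attribute value, so `de` and `DE` would count as different languages here; a stricter check would normalize case first.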
Solution 3a: Do it the Java way (hacky)
In Java you just reference the messages.properties file and the PropertyReader implementation takes care of resolving the current Locale. In a German environment, for example, Java would automatically look for messages_de.properties, although this file isn't referenced in the Java class.
Schematron could do this as follows:
dog.sch:
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:title>Example of Multi-Lingual Schema</sch:title>
  <sch:pattern>
    <sch:rule context="dog">
      <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:include href="messages.sch"/>
</sch:schema>
messages.sch:
<sch:diagnostics xml:lang="en">
  <sch:diagnostic id="d1">A dog should have a bone.</sch:diagnostic>
</sch:diagnostics>
messages_de.sch:
<sch:diagnostics xml:lang="de">
  <sch:diagnostic id="d1">Ein Hund sollte ein Bein haben.</sch:diagnostic>
</sch:diagnostics>
- The Skeleton would need to be changed to look for {include}_{locale}.sch every time it resolves an include.
- That's a bit hacky…
Current status: dog.sch would validate without errors, but some of our group had reservations because of the misuse of the include element and also because the German message file messages_de.sch isn't referenced anywhere within the SCH. Personally(!) I could live well with the latter, as it's Java style...
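For illustration, the {include}_{locale}.sch lookup could be sketched in XSLT 2.0 as below. This is a standalone include expander written for this discussion, not the Skeleton's actual include handling; the `locale` parameter is a made-up name, and the sketch assumes the schema is loaded from a location with a base URI so that relative hrefs can be resolved.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- Hypothetical parameter: the locale requested by the user -->
  <xsl:param name="locale" select="'de'"/>

  <!-- Copy everything except sch:include unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sch:include">
    <!-- messages.sch becomes messages_de.sch; a name without extension just gets _de appended -->
    <xsl:variable name="localized-href"
        select="replace(@href, '^(.+?)(\.[^./]+)?$', concat('$1_', $locale, '$2'))"/>
    <xsl:variable name="localized-uri"
        select="resolve-uri($localized-href, base-uri(.))"/>
    <xsl:choose>
      <!-- Prefer the localized file when it exists and parses -->
      <xsl:when test="doc-available($localized-uri)">
        <xsl:copy-of select="doc($localized-uri)/*"/>
      </xsl:when>
      <!-- Otherwise fall back to the file that is actually referenced -->
      <xsl:otherwise>
        <xsl:copy-of select="doc(resolve-uri(@href, base-uri(.)))/*"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

With `locale` set to `de`, the include in dog.sch would be replaced by the content of messages_de.sch when that file sits next to messages.sch, and by messages.sch otherwise. Region variants such as de_AT are not handled; a fuller version would try the most specific name first and then shorten it, as Java does.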
Solution 3b: Do it the Java way (properly)
To address the issue about misusing the include element from solution 3a, I'd like to introduce either a new element for message file references:
<sch:messages href="messages.sch"/>
which would require a diagnostics root element
… or at least an additional attribute on the include element:
<sch:include href="messages.sch" type="localization"/>
which would advise the Skeleton and any other implementation to look for localized files as well (in the Java form of {include}_{locale}.sch).
Solution 4: Work with business rules for the referenced ids
In my personal opinion this can't be more than a temporary hack, but it was heavily discussed in the group:
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
  <sch:title>Example of Multi-Lingual Schema</sch:title>
  <sch:pattern>
    <sch:rule context="dog">
      <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <sch:diagnostics>
    <sch:diagnostic id="d1">English message.</sch:diagnostic>
    <sch:diagnostic id="d1_de">German message.</sch:diagnostic>
  </sch:diagnostics>
</sch:schema>
- The Skeleton would need to be changed to look for an {id}_{locale} diagnostic element if the current locale does not match xml:lang on the root element.
- That's more than hacky…
Current status: The Schematron schema would validate without errors.
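The {id}_{locale} naming convention could be resolved along these lines. Again this is a standalone XSLT 2.0 sketch that only prints the chosen diagnostic text per assertion; it is not Skeleton code, and the `locale` parameter and the `diagnostic-by-id` key are names invented for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sch="http://purl.oclc.org/dsdl/schematron">

  <!-- Hypothetical parameter: the locale requested by the user -->
  <xsl:param name="locale" select="'de'"/>

  <xsl:output method="text"/>

  <xsl:key name="diagnostic-by-id" match="sch:diagnostic" use="@id"/>

  <xsl:template match="/">
    <xsl:for-each select="//sch:assert | //sch:report">
      <xsl:variable name="doc" select="root(.)"/>
      <!-- True when the schema's own xml:lang already matches the locale -->
      <xsl:variable name="schema-matches" select="lang($locale, $doc/*)"/>
      <xsl:for-each select="tokenize(normalize-space(@diagnostics), ' ')">
        <!-- Look for {id}_{locale} only when the schema language differs -->
        <xsl:variable name="localized"
            select="if ($schema-matches) then ()
                    else key('diagnostic-by-id', concat(., '_', $locale), $doc)"/>
        <!-- The plain id is the default-language message -->
        <xsl:variable name="default"
            select="key('diagnostic-by-id', ., $doc)"/>
        <xsl:value-of select="normalize-space(($localized, $default)[1])"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For the schema above and `locale` set to `de`, the reference to d1 would resolve to the d1_de diagnostic; with `en` it would resolve to d1 itself. The sketch also shows the weakness of the approach: the language is encoded in the id string rather than in xml:lang, so nothing stops d1_de from containing English text.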
I laid out the different solutions we discussed at our SQF meeting, and the more I think about it, the more I wish we had discussed this two days earlier at the Schematron Users Meetup... Anyway...
This is only meant as a basis for further discussion, and I hope I have made the case for why we need improvements to either the standard or the Skeleton.
Kind regards,
Tobias
on behalf of Octavian, Nico, Patrik and Vanessa