Localization concept needs improvement

Hey,

After attending the first-ever Schematron Users Meetup at XML Prague this year, I'm thrilled to see that Schematron is coming back to life. Thanks @rjelliffe, @AndrewSales and @tgraham-antenna for your work!

As a contributor to the EpubCheck project (EPUB validation) and the SQF Schematron QuickFix project, I'd like to open up this issue and start a discussion about improvements to the Schematron localization concepts — or at least for the Skeleton implementation.

The EpubCheck project uses Java properties files for localization, but it also has several Schematron checks that cannot be localized at the moment, because the official Skeleton implementation used by the Jing validator does not support this. This has been under discussion since October 2014 in https://github.com/IDPF/epubcheck/issues/474

And more recently, the SQF project struggled with this as well in https://github.com/schematron-quickfix/sqf/issues/1.

*Annex G* of the ISO Schematron specification defines the use of multilingual Schematron as follows:

> Diagnostics in multiple languages may be supported by using a different diagnostic element for each language, with the appropriate xml:lang language attribute, and referencing all the unique identifiers of the diagnostic elements in the diagnostics attribute of the assertion. 
> Annex G gives a simple example of a multi-lingual schema.
```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1 d2">A dog should have a bone.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1" xml:lang="en">A dog should have a bone.</sch:diagnostic>
        <sch:diagnostic id="d2" xml:lang="de">Ein Hund sollte ein Bein haben.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
```

However, this never worked in the original Skeleton implementation, as it would display **both** messages and not only the one from the current locale.

oXygen XML has implemented a workaround for this issue by tweaking the original Skeleton implementation so that it only shows the message for the current locale. Possibly they can contribute this change as a pull request.

However, there's another shortcoming of the diagnostic-based localization concept: the developer has to actively reference every language with a separate ID in the `diagnostics` attribute, which makes it hard to add new localizations.

At XML Prague, Octavian from oXygen XML (@octavianN), Nico from the SQF project (@nkutsche), Patrik (@PStellmann) & Vanessa (@vanessakastmann) from the DITA-SEMIA project and I sat together to discuss the SQF issue https://github.com/schematron-quickfix/sqf/issues/1, but we quickly came to the conclusion that the localization support in the Schematron standard or the Skeleton implementation needs to be improved in order to properly resolve issues like the EpubCheck or SQF one.

We discussed the following solutions, which I want to outline here as a basis for discussion. You should also know that we discussed this with the use case of externalizing the messages to separate files (e.g. for Translation Memory Systems) in mind.

### Solution 1: Fix the Skeleton

The Skeleton should be fixed to at least support the Annex G example properly: Only output the message in the current locale and not ALL `diagnostic` elements.
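To make the expected behaviour concrete, here is a minimal Python sketch of the selection logic. It is not the Skeleton's XSLT, just an illustration. The function name `diagnostics_for_locale` is made up for this example, and I'm assuming an exact match on `xml:lang` (no `de-DE` to `de` fallback) and that a `diagnostic` without its own `xml:lang` inherits only from the schema root.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The Annex G style example from above.
SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1 d2">A dog should have a bone.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1" xml:lang="en">A dog should have a bone.</sch:diagnostic>
        <sch:diagnostic id="d2" xml:lang="de">Ein Hund sollte ein Bein haben.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
"""


def diagnostics_for_locale(schema_xml, locale):
    """Return the text of every referenced diagnostic whose language equals `locale`.

    Hypothetical helper: exact language match only, and only the root
    xml:lang is considered as the inherited default (a simplification).
    """
    root = ET.fromstring(schema_xml)
    default_lang = root.get(XML_LANG)
    by_id = {d.get("id"): d for d in root.iter(f"{{{SCH}}}diagnostic")}
    selected = []
    for assertion in root.iter(f"{{{SCH}}}assert"):
        for ref in assertion.get("diagnostics", "").split():
            diag = by_id[ref]
            if diag.get(XML_LANG, default_lang) == locale:
                selected.append(diag.text)
    return selected


print(diagnostics_for_locale(SCHEMA, "de"))
```

With locale `de` only the German diagnostic is selected, instead of both `d1` and `d2` being reported.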

### Solution 2: Remove ID/IDREF constraint from Schematron schema

This is more like a long-term solution as the standardized schema would need to be changed.

What we would like to achieve is something like this:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1" xml:lang="en">English message.</sch:diagnostic>
        <sch:diagnostic id="d1" xml:lang="de">German message.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
```

1. Only reference the message ID (which isn't of datatype `ID` anymore) once and let the Skeleton or any other implementation choose the proper `diagnostic` element.
2. Schematron rule: Enforce the `xml:lang` attribute with different values when two or more `diagnostic` elements with the same `id` are present.

Current status: This does not validate because of the ID/IDREF datatypes.
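As a rough illustration of the two points above, an implementation could index diagnostics by the pair of `id` and `xml:lang` and reject duplicates of that pair. The following Python sketch is only an assumption about how that might look; `index_diagnostics` and `resolve` are hypothetical names, and the parser used here is non-validating, which is why the repeated `id` is accepted at all.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DIAGNOSTICS = """\
<sch:diagnostics xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:diagnostic id="d1" xml:lang="en">English message.</sch:diagnostic>
    <sch:diagnostic id="d1" xml:lang="de">German message.</sch:diagnostic>
</sch:diagnostics>
"""


def index_diagnostics(diagnostics_xml):
    """Index diagnostics by (id, xml:lang); the same pair may occur only once."""
    root = ET.fromstring(diagnostics_xml)
    index = {}
    for diag in root.iter(f"{{{SCH}}}diagnostic"):
        key = (diag.get("id"), diag.get(XML_LANG))
        if key in index:
            # Point 2: same id requires differing xml:lang values.
            raise ValueError(f"duplicate diagnostic for id/lang pair {key}")
        index[key] = diag.text
    return index


def resolve(index, diag_id, locale, fallback=None):
    """Point 1: the assertion references the id once; the locale picks the element."""
    return index.get((diag_id, locale), fallback)


index = index_diagnostics(DIAGNOSTICS)
print(resolve(index, "d1", "de"))
print(resolve(index, "d1", "fr", "(Optional) Fallback message."))
```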

### Solution 3a: Do it the Java way (hacky)

In Java you just reference the base `messages.properties` bundle, and the `ResourceBundle` lookup takes care of resolving the current locale. In a German environment, for example, Java would automatically look for `messages_de.properties`, although this file isn't referenced anywhere in the Java class.

Schematron could do this as follows:

`dog.sch`
```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:include href="messages.sch"/>
</sch:schema>
```

`messages.sch`:
```xml
<sch:diagnostics xml:lang="en">
    <sch:diagnostic id="d1">A dog should have a bone.</sch:diagnostic>
</sch:diagnostics>
```

`messages_de.sch`:
```xml
<sch:diagnostics xml:lang="de">
    <sch:diagnostic id="d1">Ein Hund sollte ein Bein haben.</sch:diagnostic>
</sch:diagnostics>
```

1. The Skeleton would need to be changed to look for `{include}_{locale}.sch` every time it resolves an include.
2. That's a bit *hacky*…

Current status: `dog.sch` would validate without errors, but some of our group had reservations because of the misuse of the `include` element and also because the German message file `messages_de.sch` isn't referenced anywhere within the SCH. *Personally(!)* I could live well with the last one, as it's Java style...
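The `{include}_{locale}.sch` lookup could be sketched roughly like this in Python. This is a simplified take on the Java-style fallback chain, not what Java or the Skeleton actually does in detail: the function names are hypothetical, I'm assuming relative `href` values that behave like file paths, and I'm ignoring the JVM default locale that Java would also try.

```python
from pathlib import PurePosixPath


def localized_candidates(href, locale):
    """List include targets to try, most specific first, ending with the plain href."""
    path = PurePosixPath(href)
    parts = locale.replace("-", "_").split("_") if locale else []
    candidates = []
    # de_DE -> messages_de_DE.sch, then messages_de.sch
    for end in range(len(parts), 0, -1):
        suffix = "_".join(parts[:end])
        candidates.append(str(path.with_name(f"{path.stem}_{suffix}{path.suffix}")))
    candidates.append(str(path))  # unlocalized file is the final fallback
    return candidates


def resolve_include(href, locale, exists):
    """Return the first candidate for which `exists(candidate)` is true, else None."""
    for candidate in localized_candidates(href, locale):
        if exists(candidate):
            return candidate
    return None


available = {"messages.sch", "messages_de.sch"}
print(localized_candidates("messages.sch", "de_DE"))
print(resolve_include("messages.sch", "de_DE", available.__contains__))
```

Passing `exists` in as a callable keeps the sketch independent of whether includes are resolved from the file system, a JAR or a catalog.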


### Solution 3b: Do it the Java way (properly)

To address the issue about misusing the `include` element from solution 3a, I'd like to introduce *either* a new element for message file references:

```xml
<sch:messages href="messages.sch"/>
```
which would require a `diagnostics` root element

… *or* at least an additional attribute on the `include` element:

```xml
<sch:include href="messages.sch" type="localization"/>
```
which would advise Skeleton and any other implementation to look for localized files as well (in the Java form of `{include}_{locale}.sch`).
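For the attribute variant, the point is that only includes explicitly marked for localization get the locale treatment, while ordinary includes are left alone. A small Python sketch under that assumption follows. The `type="localization"` attribute is the proposal from above, not existing Schematron, and `plan_includes` and the file name `common-rules.sch` are made up for illustration.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"

SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
    <sch:include href="common-rules.sch"/>
    <sch:include href="messages.sch" type="localization"/>
</sch:schema>
"""


def plan_includes(schema_xml, locale):
    """Map each include href to the file name an implementation would try first."""
    root = ET.fromstring(schema_xml)
    plan = {}
    for inc in root.iter(f"{{{SCH}}}include"):
        href = inc.get("href")
        if inc.get("type") == "localization" and locale:
            # Proposed behaviour: insert _{locale} before the extension.
            stem, dot, ext = href.rpartition(".")
            plan[href] = f"{stem}_{locale}{dot}{ext}" if dot else f"{href}_{locale}"
        else:
            # Ordinary includes are resolved exactly as written.
            plan[href] = href
    return plan


print(plan_includes(SCHEMA, "de"))
```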

### Solution 4: Work with business rules for the referenced `id`'s

In my personal opinion this can't be more than a temporary hack, but it was heavily discussed in the group:

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en" >
    <sch:title>Example of Multi-Lingual Schema</sch:title>
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1">English message.</sch:diagnostic>
        <sch:diagnostic id="d1_de">German message.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
```

1. The Skeleton would need to be changed to look for an ID `{id}_{locale}` diagnostic element if the current locale does not match `xml:lang` on the root element.
2. That's more than *hacky*…

Current status: The Schematron schema would validate.
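The naming convention from point 1 could be resolved roughly as follows. Again this is only a Python sketch of the lookup, with a hypothetical function name `resolve_diagnostic`, and it assumes an exact comparison between the current locale and the root `xml:lang`.

```python
import xml.etree.ElementTree as ET

SCH = "http://purl.oclc.org/dsdl/schematron"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SCHEMA = """\
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" xml:lang="en">
    <sch:pattern>
        <sch:rule context="dog">
            <sch:assert test="bone" diagnostics="d1">(Optional) Fallback message.</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:diagnostics>
        <sch:diagnostic id="d1">English message.</sch:diagnostic>
        <sch:diagnostic id="d1_de">German message.</sch:diagnostic>
    </sch:diagnostics>
</sch:schema>
"""


def resolve_diagnostic(schema_xml, diag_id, locale):
    """Prefer `{id}_{locale}` when the locale differs from the root xml:lang."""
    root = ET.fromstring(schema_xml)
    by_id = {d.get("id"): d.text for d in root.iter(f"{{{SCH}}}diagnostic")}
    if locale != root.get(XML_LANG):
        localized = by_id.get(f"{diag_id}_{locale}")
        if localized is not None:
            return localized
    # Same language as the schema, or no localized variant: use the plain id.
    return by_id.get(diag_id)


print(resolve_diagnostic(SCHEMA, "d1", "de"))
print(resolve_diagnostic(SCHEMA, "d1", "fr"))
```

A locale without a matching `{id}_{locale}` element falls back to the plain `d1` diagnostic, so adding a new translation does not require touching the `diagnostics` attribute on the assertion.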

----

I have laid out the different solutions we discussed at our SQF meeting, and the more I think about it, the more I wish we had discussed this two days earlier at the Schematron Users Meetup... Anyway...

This is only meant as a basis for further discussion, and I hope I have made the case for why we need improvements to either the standard or the Skeleton.

Kind regards,
Tobias

*on behalf of Octavian, Nico, Patrik and Vanessa*
