SIARD 2.1 / 2.1.1 metadata.xsd analysis of spec and tools implementations #55

@solfeggietto

Conclusion

metadata.xsd should be unique for every version of SIARD n.n[.n], with the possibility of making revisions within a version. Every revision should be documented in a comment at the top of the file, with a log of the changes. SIARD-producing tools should use the published metadata.xsd exactly as is, so that the version can also be verified with a checksum.

The bottom line behind these requirements is that SIARD extractions are used for long-term preservation of database systems, so every part of the specification and every use of the schemas should be stable, with no risk of differing interpretations. If a SIARD tool needs to deviate from the standard schema in a specific case, this should be documented in a comment at the top of the schema, with a change log that identifies the source metadata.xsd before the changes as well as the changes made.

Today there are no good unique identifiers in the header comment at the top of the XSD schema, which makes it hard to be sure which version of metadata.xsd is in use without an extensive diff. In addition, the standard and the tools that create this file in SIARD extractions use a mixture of UTF-8 with and without BOM, different line endings (Windows CR LF, Unix/Linux LF and classic Mac OS CR), and different indent characters and indent widths. The line endings and the UTF-8 BOM are not a problem in themselves, but the original metadata.xsd should be used as is, so that its checksum can be checked easily and automatically in the production line.
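The checksum control described above could be sketched roughly as follows in Python. The function names `md5_hex` and `matches_published` are hypothetical helpers, not part of any SIARD tool; the digest is the SIARD 2.1.1 value quoted in this issue. The loop at the end uses a made-up one-line fragment only to illustrate why a BOM or a different line ending changes the checksum even though the XML content is the same.

```python
import hashlib

# MD5 quoted in this issue for the published SIARD 2.1.1 metadata.xsd.
SIARD_2_1_1_MD5 = "ffcfb3243c8662663baf2f9b6ebc92e3"


def md5_hex(data: bytes) -> str:
    """Return the lowercase hex MD5 digest of raw bytes."""
    return hashlib.md5(data).hexdigest()


def matches_published(path: str, expected_md5: str = SIARD_2_1_1_MD5) -> bool:
    """Compare a file's MD5 with a published digest, ignoring hex letter case."""
    with open(path, "rb") as handle:
        return md5_hex(handle.read()) == expected_md5.lower()


# Illustration only: a made-up fragment, not the real schema.
# Same XML content, different bytes, therefore different checksums.
fragment = b'<xs:pattern value="DATE"/>\n'
variants = {
    "LF": fragment,
    "CRLF": fragment.replace(b"\n", b"\r\n"),
    "BOM + LF": b"\xef\xbb\xbf" + fragment,
}
for label, data in variants.items():
    print(label, md5_hex(data))
```

A production line could call `matches_published("header/metadata.xsd")` on an unpacked extraction; the path is an assumption about where the file has been unpacked.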

Sources

SIARD 2.1.1

Current/Latest implementation of SIARD specification.

"The present version 2.1.1 documents the current state of the SIARD file format. It has been developed by the eCH Fachgruppe on Digital Preservation but is no official standards by eCH. It is identical to version 2.1 save for a few precisions in the wording."

eCH
https://kost-ceco.ch/cms/index.php?siard_de

metadata.xsd, 33.1K, 20.12.18, MD5: ffcfb3243c8662663baf2f9b6ebc92e3
https://kost-ceco.ch/cms/dl/76920d54da523e8f97f5c441befdf0bc/metadata.xsd

DILCIS Board
metadata.xsd, MD5: ffcfb3243c8662663baf2f9b6ebc92e3
SIARD schema (= latest 2.1.1 2019-05-15): https://github.com/DILCISBoard/SIARD/blob/master/schema/metadata.xsd
SIARD 2.1.1 2019-05-15: https://github.com/DILCISBoard/SIARD/blob/master/SIARD%202.1.1/format/2019-05-15/metadata.xsd

Siard Suite v2.1.134
metadata.xsd, MD5: ffcfb3243c8662663baf2f9b6ebc92e3

The metadata.xsd from both SIARD 2.1.1 specification sources and the one used in a SIARD Suite v2.1.134 extraction match exactly.

SIARD 2.1

"In practical terms, the SIARD 2.1 metadata.xsd can be used as is, only minor changes up to SIARD 2.1.1 metadata.xsd."

eCH
https://kost-ceco.ch/cms/index.php?siard_de

  • The link for the SIARD 2.1 metadata.xsd is the same as for SIARD 2.1.1, so the SIARD 2.1 version is not available on that webpage.

DILCIS Board
metadata.xsd, MD5: 37383C3024D52693BABD1CD8F0AE3391
SIARD 2.1 2018-02-15: https://github.com/DILCISBoard/SIARD/blob/master/SIARD%202.1/format/2018-02-15/metadata.xsd

Spectral Core Full Convert v21.08.1653
metadata.xsd, MD5: 302884ECBD002F5E00835564FFF999FA
Identical to eCH SIARD 2.1, except for two comment differences, Unix LF line endings and a UTF-8 BOM.

SIARD Tools

DBPTK Desktop / Developer

DBPTK Desktop v2.5.9 = DBPTK Developer v2.9.9: Practical test extraction

metadata.xsd extends the SIARD 2.1.1 schema: the patterns in simpleType predefinedTypeType are expanded with ARRAY definitions.

https://database-preservation.com/
https://github.com/keeps/dbptk-desktop
https://github.com/keeps/dbptk-enterprise
https://github.com/keeps/dbptk-developer

KEEPS should document somewhere (maybe they already have) the reason for adding the ARRAY definitions and what effect this has (pros and cons) on general usage and on the interoperability of SIARD 2.1.1.

SIARD Suite / GUI 2.1

SIARD Gui v2.1.134: Practical test extraction

metadata.xsd has exactly the same checksum as the SIARD 2.1.1 schema published by eCH and the DILCIS Board.

https://github.com/sfa-siard
https://github.com/sfa-siard/SiardGui

Spectral Core Full Convert

Spectral Core Full Convert v21.08.1653: Practical test extraction

metadata.xsd is the same as the SIARD 2.1 schema published by the DILCIS Board, except for two comment differences, Unix LF line endings and a UTF-8 BOM.
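A comparison that ignores exactly these kinds of differences (BOM, line endings, indentation and comments) could be sketched as below. It is an assumption about how such a check might be done, not how the comparison in this issue was made. It relies on `xml.etree.ElementTree.canonicalize`, available from Python 3.8; `canonical_schema` and `same_schema_content` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def canonical_schema(raw: bytes) -> str:
    """Return a canonical (C14N 2.0) form of an XML file given as bytes.

    A UTF-8 BOM is dropped while decoding, the XML parser normalizes
    line endings, comments are omitted, and surrounding whitespace in
    text nodes (such as indentation) is stripped.
    """
    text = raw.decode("utf-8-sig")
    return ET.canonicalize(text, with_comments=False, strip_text=True)


def same_schema_content(a: bytes, b: bytes) -> bool:
    """True if two schema files differ only in BOM, line endings,
    indentation or comments."""
    return canonical_schema(a) == canonical_schema(b)
```

Two files for which `same_schema_content` returns true can still have different MD5 values, which is why the byte-exact published file remains the simpler thing to verify.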

https://www.spectralcore.com/fullconvert
https://www.spectralcore.com/fullconvert/changelog

Spectral Core should upgrade to the SIARD 2.1.1 metadata.xsd exactly as published (MD5 match).

Diff results

SIARD 2.1.1 versus SIARD 2.1

<xs:simpleType name="predefinedTypeType">

SIARD 2.1.1: <xs:pattern value="(CHARACTER\s+VARYING|CHAR\s+VARYING|VARCHAR)(\s*\(\s*[1-9]\d*\s*\))?"/>
SIARD 2.1: <xs:pattern value="(CHARACTER\s+VARYING|CHAR\s+VARYING|VARCHAR)(\s*\(\s*[1-9]\d*\s*\))"/>

SIARD 2.1.1: <xs:pattern value="(NATIONAL\s+CHARACTER\s+VARYING|NATIONAL\s+CHAR\s+VARYING|NCHAR VARYING)(\s*\(\s*[1-9]\d*\s*\))?"/>
SIARD 2.1: <xs:pattern value="(NATIONAL\s+CHARACTER\s+VARYING|NATIONAL\s+CHAR\s+VARYING|NCHAR VARYING)(\s*\(\s*[1-9]\d*\s*\))"/>

SIARD 2.1.1: <xs:pattern value="(BINARY\s+VARYING|VARBINARY)(\s*\(\s*[1-9]\d*\s*\))?"/>
SIARD 2.1: <xs:pattern value="(BINARY\s+VARYING|VARBINARY)(\s*\(\s*[1-9]\d*\s*\))"/>
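The practical effect of the trailing `?` can be illustrated with a small Python sketch. It assumes that the parentheses around the length are literal, backslash-escaped characters in the schema, and it uses Python's `re` module as an approximation of XSD regular expressions; `is_valid` is a hypothetical helper. XSD patterns must match the whole value, hence `re.fullmatch`.

```python
import re

# VARCHAR pattern in SIARD 2.1 (length required) and SIARD 2.1.1 (length optional),
# assuming escaped literal parentheses around the length.
VARCHAR_2_1 = r"(CHARACTER\s+VARYING|CHAR\s+VARYING|VARCHAR)(\s*\(\s*[1-9]\d*\s*\))"
VARCHAR_2_1_1 = VARCHAR_2_1 + "?"


def is_valid(pattern: str, type_name: str) -> bool:
    """Check a type name against a pattern; the whole value must match."""
    return re.fullmatch(pattern, type_name) is not None


for type_name in ("VARCHAR(255)", "CHARACTER VARYING (10)", "VARCHAR"):
    print(type_name, is_valid(VARCHAR_2_1, type_name), is_valid(VARCHAR_2_1_1, type_name))
```

Under these assumptions a bare `VARCHAR` is accepted only by the SIARD 2.1.1 pattern, while type names with a length are accepted by both.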

SIARD 2.1.1 versus DBPTK Desktop v2.5.9 / Developer v2.9.9

Compared line by line (statement by statement):
Upper line: SIARD 2.1.1
Lower line: DBPTK Desktop v2.5.9 / Developer v2.9.9

<xs:pattern value="INTEGER|INT|SMALLINT|BIGINT"/>
<xs:pattern value="(INTEGER|INT|SMALLINT|BIGINT)(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(NUMERIC|DECIMAL|DEC)(\s*\(\s*[1-9]\d*\s*(,\s*\d+\s*)?\))?"/>
<xs:pattern value="(NUMERIC|DECIMAL|DEC)(\s*\(\s*[1-9]\d*\s*(,\s*\d+\s*)?\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="REAL|DOUBLE PRECISION"/>
<xs:pattern value="REAL|DOUBLE PRECISION(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="FLOAT(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="FLOAT(\s*\(\s*[1-9]\d*\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(CHARACTER|CHAR)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(CHARACTER|CHAR)(\s*\(\s*[1-9]\d*\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(CHARACTER\s+VARYING|CHAR\s+VARYING|VARCHAR)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(CHARACTER\s+VARYING|CHAR\s+VARYING|VARCHAR)(\s*\(\s*[1-9]\d*\s*\))(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(CHARACTER\s+LARGE\s+OBJECT|CLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?"/>
<xs:pattern value="(CHARACTER\s+LARGE\s+OBJECT|CLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(NATIONAL\s+CHARACTER|NATIONAL\s+CHAR|NCHAR)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(NATIONAL\s+CHARACTER|NATIONAL\s+CHAR|NCHAR)(\s*\(\s*[1-9]\d*\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(NATIONAL\s+CHARACTER\s+VARYING|NATIONAL\s+CHAR\s+VARYING|NCHAR VARYING)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(NATIONAL\s+CHARACTER\s+VARYING|NATIONAL\s+CHAR\s+VARYING|NCHAR VARYING)(\s*\(\s*[1-9]\d*\s*\))(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(NATIONAL\s+CHARACTER\s+LARGE\s+OBJECT|NCHAR\s+LARGE\s+OBJECT|NCLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?"/>
<xs:pattern value="(NATIONAL\s+CHARACTER\s+LARGE\s+OBJECT|NCHAR\s+LARGE\s+OBJECT|NCLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="XML"/>
<xs:pattern value="XML(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="BINARY(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="BINARY(\s*\(\s*[1-9]\d*\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(BINARY\s+VARYING|VARBINARY)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(BINARY\s+VARYING|VARBINARY)(\s*\(\s*[1-9]\d*\s*\))(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(BINARY\s+LARGE\s+OBJECT|BLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?"/>
<xs:pattern value="(BINARY\s+LARGE\s+OBJECT|BLOB)(\s*\(\s*[1-9]\d*(\s*(K|M|G))?\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="DATE"/>
<xs:pattern value="DATE(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(TIME|TIME\s+WITH\s+TIME\s+ZONE)(\s*\(\s*[1-9]\d*\s*\))?"/>
<xs:pattern value="(TIME|TIME\s+WITH\s+TIME\s+ZONE)(\s*\(\s*[1-9]\d*\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="(TIMESTAMP|TIMESTAMP\s+WITH\s+TIME\s+ZONE)(\s*\(\s*(0|([1-9]\d*))\s*\))?"/>
<xs:pattern value="(TIMESTAMP|TIMESTAMP\s+WITH\s+TIME\s+ZONE)(\s*\(\s*(0|([1-9]\d*))\s*\))?(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="INTERVAL\s+(((YEAR|MONTH|DAY|HOUR|MINUTE)(\s*\(\s*[1-9]\d*\s*\))?(\s+TO\s+(MONTH|DAY|HOUR|MINUTE|SECOND)(\s*\(\s*[1-9]\d*\s*\))?)?)|(SECOND(\s*\(\s*[1-9]\d*\s*(,\s*\d+\s*)?\))?))"/>
<xs:pattern value="INTERVAL\s+(((YEAR|MONTH|DAY|HOUR|MINUTE)(\s*\(\s*[1-9]\d*\s*\))?(\s+TO\s+(MONTH|DAY|HOUR|MINUTE|SECOND)(\s*\(\s*[1-9]\d*\s*\))?)?)|(SECOND(\s*\(\s*[1-9]\d*\s*(,\s*\d+\s*)?\))?))(\s+ARRAY(\s+\[[0-9]+\])?)?"/>

<xs:pattern value="BOOLEAN"/>
<xs:pattern value="BOOLEAN(\s+ARRAY(\s+\[[0-9]+\])?)?"/>
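What the ARRAY suffix changes for interoperability can be sketched in Python. This assumes that the square brackets in the suffix are literal, backslash-escaped characters, takes the patterns as quoted in the comparison above, and uses `re` as an approximation of XSD regular expressions; `accepts` is a hypothetical helper.

```python
import re

# ARRAY suffix added by the DBPTK schema, assuming escaped literal brackets.
ARRAY_SUFFIX = r"(\s+ARRAY(\s+\[[0-9]+\])?)?"

INTEGER_2_1_1 = r"INTEGER|INT|SMALLINT|BIGINT"
INTEGER_DBPTK = r"(INTEGER|INT|SMALLINT|BIGINT)" + ARRAY_SUFFIX
# Quoted without a group around REAL|DOUBLE PRECISION.
REAL_DBPTK = r"REAL|DOUBLE PRECISION" + ARRAY_SUFFIX


def accepts(pattern: str, type_name: str) -> bool:
    """XSD patterns must match the whole value, so use fullmatch."""
    return re.fullmatch(pattern, type_name) is not None


for type_name in ("INTEGER", "INTEGER ARRAY", "INTEGER ARRAY [10]"):
    print(type_name, accepts(INTEGER_2_1_1, type_name), accepts(INTEGER_DBPTK, type_name))

for type_name in ("REAL ARRAY", "DOUBLE PRECISION ARRAY"):
    print(type_name, accepts(REAL_DBPTK, type_name))
```

Under these assumptions, a type name such as `INTEGER ARRAY` passes the DBPTK pattern but fails the published SIARD 2.1.1 pattern. With the REAL pattern as quoted, the suffix binds only to `DOUBLE PRECISION`, so `REAL ARRAY` is not accepted.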
