deutsche-nationalbibliothek/wash

Web Archive Shapes and Schema

Web archiving is the endeavor of preserving the web. The web is born-digital and multimedia, consists of hypertext, and is made up of distributed resources. As such, it is fundamentally different from other media types that are traditionally collected in archives and libraries. To govern the archival process, a data model is required that provides the flexibility to express and describe the properties of web archival material. The Resource Description Framework (RDF) provides such a data model, built on top of and into the web technology stack.

Background

The object of observation is the live web, specifically Web pages and Web sites (see also the Web Characterization Terminology & Definitions Sheet). A web crawl, or Harvest, is performed with a web crawler that is configured with seed URLs indicating the web pages to be collected. The crawler produces WARC files as output. The library's catalog serves as an index of the holdings and as a research tool for discovering the contents of the collection. To structure the overall collection, sets of archive contents are grouped into individually named thematic collections.

Aim

The aim of this repository is to provide tools for modeling web archive material in RDF. To create the data schema, we reuse existing RDF vocabularies, which builds on previous work and supports interlinking with the Web of Data.

Abstract Model

The abstract model is built around various concepts of the live web and web archives. The main concepts are: Crawl, Seed, WARC file, WARC record, Snapshot (related terms: memento, time slice), and Collection.
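To make the concepts more tangible, the following is a minimal sketch of them as plain Python data classes. The class names follow the concepts listed above, but all field names and the relations between the classes (for example, that a Snapshot points to a Seed and a list of WARC records) are assumptions made for illustration only. They are not taken from the repository's schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Seed:
    """A seed URL that tells the crawler which web page to collect."""
    url: str


@dataclass
class WarcRecord:
    """One record inside a WARC file (fields are illustrative assumptions)."""
    target_uri: str
    captured_at: datetime


@dataclass
class WarcFile:
    """A WARC file produced by the crawler."""
    filename: str
    records: list[WarcRecord] = field(default_factory=list)


@dataclass
class Crawl:
    """A crawl (harvest) configured with seeds, producing WARC files."""
    seeds: list[Seed]
    warc_files: list[WarcFile] = field(default_factory=list)


@dataclass
class Snapshot:
    """The archived state of a seed at one point in time (assumed relation)."""
    seed: Seed
    captured_at: datetime
    records: list[WarcRecord] = field(default_factory=list)


@dataclass
class Collection:
    """A named thematic collection grouping archive contents."""
    title: str
    snapshots: list[Snapshot] = field(default_factory=list)


# Hypothetical example data, not real holdings.
seed = Seed("https://example.org/")
record = WarcRecord(seed.url, datetime(2024, 1, 1, tzinfo=timezone.utc))
warc = WarcFile("crawl-0001.warc.gz", [record])
crawl = Crawl(seeds=[seed], warc_files=[warc])
snapshot = Snapshot(seed, record.captured_at, [record])
collection = Collection("Example thematic collection", [snapshot])
```

In an RDF expression of the model, each of these objects would become a resource and each field a property from a suitable vocabulary.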

Profiles

Warning

The terms used from the DOWARC vocabulary are not yet stable, so special caution is required. Be prepared to update the terms once the vocabulary stabilizes.

The abstract model is realized in profiles. For the description of collections, we have so far elaborated the following profiles to cover different usage scenarios.

DC, which uses only Simple Dublin Core (simple DC) properties. This is helpful, e.g., for implementing an OAI-PMH oai_dc interface.

DC+. Some simple DC properties put values into one pot that should be distinguished, which causes information loss. In this profile, those values are expressed redundantly alongside the simple DC properties, which makes it possible to derive the Linked Data profile's expression from the data.

Linked Data, which includes the most expressive properties along with the context of the Web of Data. (This may come in the future.)

Full, which includes all properties from DC and other Linked Data vocabularies; information is stored redundantly.
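The relationship between the DC and DC+ profiles can be illustrated with a small sketch. It assumes, purely for illustration, that a description uses the DCMI Terms properties `dcterms:created`, `dcterms:modified`, and `dcterms:abstract`. Which properties the profiles in this repository actually use is not stated here, and the function names are hypothetical. The sketch relies only on the fact that these DCMI terms are declared subproperties of the simple DC elements `dc:date` and `dc:description`.

```python
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Refined DCMI Terms property -> simple DC element it collapses into.
# The selection of properties is an assumption for this example.
REFINEMENTS = {
    DCTERMS + "created": DC + "date",
    DCTERMS + "modified": DC + "date",
    DCTERMS + "abstract": DC + "description",
}


def to_dc_plus(statements):
    """Add a redundant simple DC statement for every refined statement."""
    result = list(statements)
    for prop, value in statements:
        simple = REFINEMENTS.get(prop)
        if simple is not None and (simple, value) not in result:
            result.append((simple, value))
    return result


def to_simple_dc(statements):
    """Keep only the simple DC statements (the 'one pot' view)."""
    return [(p, v) for p, v in to_dc_plus(statements) if p.startswith(DC)]


# Hypothetical description of a collection as (property, value) pairs.
description = [
    (DC + "title", "Example thematic collection"),
    (DCTERMS + "created", "2020-01-01"),
    (DCTERMS + "modified", "2021-06-30"),
]

dc_plus = to_dc_plus(description)
simple_dc = to_simple_dc(description)
```

In `simple_dc`, both dates appear as `dc:date` and can no longer be told apart, which is the information loss described above. `dc_plus` keeps the distinguishing statements next to the simple ones, so a more expressive description can still be derived from it.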
