
Commit 16518cf

Merge pull request #7 from zazi/new_docu
new docu (readme)
2 parents a06f16e + b63b541

File tree: 1 file changed (+160 −61 lines)

README.md

![UB Dortmund Logo](http://www.ub.tu-dortmund.de/images/ub-schriftzug.jpg)

---

(in cooperation with [SLUB Dresden](http://slub-dresden.de) + [Avantgarde Labs](http://avantgarde-labs.de))

# Task Processing Unit for [D:SWARM](http://dswarm.org)

[![Join the chat at https://gitter.im/dswarm/dswarm](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dswarm/dswarm?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

The task processing unit (TPU) is intended to process large amounts of data via [tasks](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#task) that make use of [mappings](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#mapping) that you have prepared and tested with the [D:SWARM backoffice web GUI](https://github.com/dswarm/dswarm-documentation/wiki/Overview). It can thus act as the production unit for D:SWARM, while the backoffice acts as the development and/or testing unit (on smaller amounts of data).

The TPU acts as a client by calling the HTTP API of the D:SWARM backend.
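For orientation, here is a hedged sketch of the kind of HTTP calls this involves. The endpoint paths (`resources`, `datamodels`, `tasks?persist=...`) are the ones the TPU uses against the backend base URL (`engine.dswarm.api` in the configuration below); the concrete request details, e.g. the multipart field names and the task JSON, are abbreviated assumptions, not a complete API reference.

```
# upload a data resource (the TPU does this during ingest); field names are assumptions
curl -X POST "http://example.com/dmp/resources" \
  -F "name=my-resource" \
  -F "file=@data/sources/resources/records.xml"

# look up a data model by its uuid
curl "http://example.com/dmp/datamodels/bbd368e8-b75c-0e64-b96a-ab812a700b4f"

# execute a task; 'persist' controls whether the result is written to the data hub
# (task.json holds a job with the mappings plus the input and output data models)
curl -X POST "http://example.com/dmp/tasks?persist=false" \
  -H "Content-Type: application/json" \
  -d @task.json
```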

## TPU Task

A TPU task can consist of three parts, each of which is optional; each part maps to a configuration switch, as sketched after this list:
* ```ingest```: transforms data from a [data resource](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-resource) (of a certain data format, e.g. XML) with the help of a [configuration](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#configuration) into a [data model](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-model) that makes use of a [generic data format](https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model) (so that it can be consumed by the [transformation engine](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#transformation-engine) of D:SWARM)
* ```transform```: transforms data from an input data model via a task (which refers to a [job](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#job)) into an output data model
* ```export```: transforms data from a data model (mainly the output data model) into a certain data format, e.g. XML
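A minimal sketch of those three switches (they are part of the `config.properties` overview further below):

```
# enable/disable the three parts of a TPU task
ingest.do=true
transform.do=true
export.do=true
```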

## Processing Scenarios

The task processing unit can be configured for various scenarios, e.g.,
* ```ingest``` (only; persisted in the [data hub](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-hub); see the sketch after this list)
* ```export``` (only; from data in the data hub)
* ```ingest``` (persisted), ```transform```, ```export``` (from the persisted result)
* ```on-the-fly transform``` (input data is ingested/generated on the fly + export data is returned directly from the transformation result, without storing it in the data hub)
* any combination of the previous scenarios ;)
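As an illustration of the ingest-only scenario, a hedged sketch of the relevant switches (the exact combination may need adjusting for your setup; compare the configuration overview below):

```
# ingest only: create the resource + data model and persist the ingest in the data hub
init.do=true
ingest.do=true
transform.do=false
export.do=false
```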

The fastest scenario is ```on-the-fly transform```, since it doesn't store anything in the data hub and does only the pure data processing. So it's recommended for data transformation scenarios where only the output is important, but not the archiving of the data. Currently, this scenario only supports XML export. So if you would like to have an RDF export of your transformed data, you need to run the TPU with the parameter for persisting the task execution result in the data hub (since RDF export is only implemented from there at the moment). The ```on-the-fly transform``` scenario can easily be parallelized by splitting your input data resource into several parts; each part can then be processed in parallel.
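A hedged sketch of such a parallel run, assuming the input has already been split and that one properties file exists per part (the `part*.properties` names and folders are hypothetical; the command itself is the one from the Execution section below):

```
# launch one TPU run per part; each part*.properties points
# resource.watchfolder and results.folder at a different part (hypothetical names)
for conf in conf/part*.properties; do
  $JAVA_HOME/jre/bin/java -cp TaskProcessingUnit-1.0-SNAPSHOT-onejar.jar \
    de.tu_dortmund.ub.data.dswarm.TaskProcessingUnit -conf="$conf" &
done
wait
```

Alternatively, a single TPU run can process multiple parts from one watch folder in parallel via `engine.threads` (see the example configuration at the end of this document).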

## Requirements

For a (complete) TPU task execution you need to provide (at least):
* a [metadata repository](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#metadata-repository) that contains the projects with the mappings that you would like to include in your task
* the data resource(s) that should act as input for your task execution
* the output data model (schema) that the data should be mapped to (usually this can be the same one that is utilised in the projects of the mappings)

## Configuration

You can configure a TPU task with the help of a properties file (`config.properties`). You don't need to configure each property for each processing scenario (maybe the properties will be simplified a bit in the future ;) ). Here is an overview of the configuration properties:

````
# this can be an arbitrary name
project.name=My-TPU-project

##############
# References #
##############

# this folder will be utilised when processing the input data resources into input data models, i.e., put all data resources that should be processed in your TPU task in here
resource.watchfolder=data/sources/resources

# the configuration that should be utilised to process the input data resources into input data models
configuration.name=/home/user/conf/oai-pmh-marc-xml-configuration.json

# optional - only necessary if the init part is skipped
prototype.resourceID=Resource-f2b9e085-5b05-4853-ad82-06ce4fe1952d

# input data model id (optional - only necessary if the init part is skipped)
prototype.dataModelID=bbd368e8-b75c-0e64-b96a-ab812a700b4f

# optional - only necessary if the transform part is enabled
# (for legacy reasons) if one project delivers all mappings for the task
prototype.projectID=819f2f6e-98ed-90e2-372e-71a0a1eec786

# if multiple projects deliver the mappings for the task
prototype.projectIDs=9d6ec288-f1bf-4f96-78f6-5399e3050125,69664ba5-bbe5-6f35-7a77-47bacf9d3731

# the output data model refers to the output schema as well
prototype.outputDataModelID=DataModel-cf998267-392a-4d87-a33a-88dd1bffb016

##########
# Ingest #
##########

# enables the init part (i.e. resource + data model creation)
init.do=true

# if disabled, task.do_ingest_on_the_fly needs to be enabled
init.data_model.do_ingest=false

# if enabled, task.do_ingest_on_the_fly needs to be enabled
init.multiple_data_models=true

# enables ingest (i.e. upload of data resources + ingest into the given data model (in the data hub))
ingest.do=true

#############
# Transform #
#############

# enables task execution (on the given data model with the given mappings into the given output data model)
transform.do=true

# to do ingest on-the-fly at task execution time, you need to disable the init and ingest parts and provide a valid prototype dataModelID
task.do_ingest_on_the_fly=true

# to do export on-the-fly at task execution time, you need to disable results.persistInDMP (otherwise, the result would be written to the data hub)
# + you need to disable the export part (otherwise the data would be exported twice)
task.do_export_on_the_fly=true

##########
# Export #
##########

# enables XML export (from the given output data model)
export.do=true

###########
# Results #
###########

# optional - only necessary if the transform part is enabled; i.e., the task execution result will be stored in the data hub
# + if the export part is enabled, this needs to be enabled as well (otherwise there wouldn't be any data in the data hub for export)
results.persistInDMP=false

# needs to be enabled if data is exported to the file system (also necessary for export on-the-fly)
results.persistInFolder=true

# the folder where the transformation result or export should be stored
results.folder=data/target/results

# should be disabled, otherwise the task execution will return JSON
results.writeDMPJson=false

########################
# Task Processing Unit #
########################

# the number of threads that should be utilised for executing the TPU task in parallel
# currently, multi-threading can only be utilised for on-the-fly transform, i.e. init.do=true + init.data_model.do_ingest=false + init.multiple_data_models=true + ingest.do=false + transform.do=true + task.do_ingest_on_the_fly=true + task.do_export_on_the_fly=true + export.do=false + results.persistInDMP=false
engine.threads=1

# the base URL of the D:SWARM backend API
engine.dswarm.api=http://example.com/dmp/

# the base URL of the D:SWARM graph extension
engine.dswarm.graph.api=http://example.com/graph/
````

## Execution

You can build the TPU with the following command (only required once, or whenever the TPU code has been updated):

````
mvn clean package
````

You can execute your TPU task with the following command:

````
$JAVA_HOME/jre/bin/java -cp TaskProcessingUnit-1.0-SNAPSHOT-onejar.jar de.tu_dortmund.ub.data.dswarm.TaskProcessingUnit -conf=conf/config.properties
````

You need to ensure that (at least) the D:SWARM backend is running (+ (optionally) the data hub and the D:SWARM graph extension).
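A quick, hedged smoke test, assuming the base URLs from your configuration (what exactly the endpoints return depends on the deployment):

```
# both base URLs should respond before a TPU task is started
curl -I http://example.com/dmp/
curl -I http://example.com/graph/
```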

## Logging

You can find the logs of your TPU task executions in `[TPU HOME]/target/logs`.

## Example Configuration for On-The-Fly Transform Processing

The following configuration illustrates the property settings for a multi-threaded ```on-the-fly transform``` processing scenario (i.e. the input data ingest is done on-the-fly before the D:SWARM task execution + the result export is done immediately after the D:SWARM task execution):

```
service.name=deg-small-test-run
project.name=degsmalltest
resource.watchfolder=/data/source-data/DEG-small
configuration.name=/home/dmp/config/oai-pmh-marc-xml-configuration.json
prototype.projectIDs=9d6ec288-f1bf-4f96-78f6-5399e3050125,69664ba5-bbe5-6f35-7a77-47bacf9d3731
prototype.outputDataModelID=5fddf2c5-916b-49dc-a07d-af04020c17f7
init.do=true
init.data_model.do_ingest=false
init.multiple_data_models=true
ingest.do=false
transform.do=true
task.do_ingest_on_the_fly=true
task.do_export_on_the_fly=true
export.do=false
results.persistInDMP=false
results.persistInFolder=true
results.folder=/home/dmp/data/degsmalltest/results
engine.threads=10
engine.dswarm.api=http://localhost:8087/dmp/
engine.dswarm.graph.api=http://localhost:7474/graph/
```

For this scenario the input data resource needs to be divided into multiple parts. Each part is then executed as a separate TPU task (and produces a separate export file).
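How the input is split is up to you. As one hedged option, the `xml_split` tool from the Perl XML::Twig distribution can divide a large XML file into parts with a fixed number of records each (the input file name here is hypothetical; check the tool's options against your record structure):

```
# split a large XML file into parts of 1000 sub-elements each;
# the resulting part files go into the TPU watch folder
xml_split -g 1000 /data/source-data/DEG-small/records.xml
```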
