This repository was archived by the owner on May 24, 2022. It is now read-only.

Commit 07693a3

Author: Bo Ferri (committed)

Merge pull request #9 from janpolowinski/docu-typos-etc

Docu typos, etc.

2 parents 16518cf + aa25344

File tree: 1 file changed (+23 −23 lines)


README.md

Lines changed: 23 additions & 23 deletions
@@ -11,51 +11,51 @@

[![Join the chat at https://gitter.im/dswarm/dswarm](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/dswarm/dswarm?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

-The task processing unit (TPU) is intented to process large amounts of data via [tasks](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#task) that make use of [mappings](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#mapping) that you have prepared and tested with the [D:SWARM backoffice webgui](https://github.com/dswarm/dswarm-documentation/wiki/Overview). So it can act as the production unit for D:SWARM, whereby the backoffice acts as development and/or testing unit (on smaller amounts of data).
+The task processing unit (TPU) is intended to process large amounts of data via [tasks](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#task) that make use of [mappings](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#mapping) that you have prepared and tested with the [D:SWARM backoffice webgui](https://github.com/dswarm/dswarm-documentation/wiki/Overview). So it can act as the production unit for D:SWARM, while the backoffice acts as a development and/or testing unit (on smaller amounts of data).

The TPU acts as a client by calling the HTTP API of the D:SWARM backend.

## TPU Task

-A TPU task can consist of three parts, where by each part can be optional. These are:
-* ```ingest```: transforms data from a [data resource](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-resource) (of a certain data format, e.g. XML) with help of a [configuration](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#configuration) into a [data model](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-model) that makes use of a [generic data format](https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model) (so that it can be consumed by the [transformation engine](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#transformation-engine) of D:SWARM)
-* ```transform```: transforms data from an input data model via a task (refers to a [job](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#job)) into an output data model
-* ```export```: transforms data from a data model (mainly output data model) into a certain data format, e.g. XML
+A TPU task can consist of three parts, each of which is optional. These are:
+* ```ingest```: transforms data from a [data resource](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-resource) (of a certain data format, e.g., XML) with the help of a [configuration](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#configuration) into a [data model](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-model) that makes use of a [generic data format](https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model) (so that it can be consumed by the [transformation engine](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#transformation-engine) of D:SWARM)
+* ```transform```: transforms data from an input data model via a task (which refers to a [job](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#job)) into an output data model
+* ```export```: transforms data from a data model (usually an output data model) into a certain data format, e.g., XML
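
For orientation, each of these parts corresponds to a boolean switch in the TPU's `config.properties` (all of these switches appear in the configuration overview further down). A minimal sketch, assuming a run that executes all three parts:

```
# sketch only — switch names as in the configuration overview below
# init part (resource + data model creation)
init.do=true
# ingest part (upload of data resources + ingest into the given data model)
ingest.do=true
# transform part (task execution)
transform.do=true
# export part (e.g., XML export from the output data model)
export.do=true
```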

## Processing Scenarios

The task processing unit can be configured for various scenarios, e.g.,
* ```ingest``` (only; persistent in the [data hub](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#data-hub))
* ```export``` (only; from data in the data hub)
* ```ingest``` (persistent), ```transform```, ```export``` (from persistent result)
-* ```on-the-fly transform``` (input data will be ingested (/generated) on-the-fly + export data will be directly returned from the transformation result (without storing it in the data hub)
+* ```on-the-fly transform``` (input data will be ingested (/generated) on-the-fly + export data will be directly returned from the transformation result, without storing it in the data hub)
* any combination of the previous scenarios ;)
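
As an illustration of how such a scenario translates into configuration (a sketch only; the switch semantics are described in the configuration overview below), an ```ingest``` (only) run would presumably disable the transform and export parts:

```
# hypothetical sketch of an `ingest` (only) scenario
init.do=true
ingest.do=true
transform.do=false
export.do=false
```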

-The fastest scenario is ```on-the-fly transform```, since it doesn't store anything in the data hub and does only the pure data processing. So it's recommend for data transformation scenarios, where only the output is important, but not the archiving of the data. Currently, this scenario only supports XML export. So if you would like to have an RDF export of your transformed data, then you need to run the TPU with the parameter for persisting the task execution result in the data hub (since RDF export is only implement from there at the moment). The ```on-the-fly transform``` scenario can easily be parallelized via splitting your input data resource into serveral parts. Then each part can processed in parallel.
+The fastest scenario is ```on-the-fly transform```, since it doesn't store anything in the data hub and does only the pure data processing. So it's recommended for data transformation scenarios where only the output is important, but not the archiving of the data. Currently, this scenario only supports XML export. So, if you would like to have an RDF export of your transformed data, you need to run the TPU with the parameter for persisting the task execution result in the data hub (the current implementation of RDF export only works in combination with the data hub). The ```on-the-fly transform``` scenario can easily be parallelized by splitting your input data resource into several parts. Then each part can be processed in parallel.
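
Concretely, for the RDF-export case this means switching persistence on; a sketch, using the `results.persistInDMP` and `task.do_export_on_the_fly` properties from the configuration overview below:

```
# sketch: persist the task execution result in the data hub,
# so that RDF export (which only works from the data hub) becomes possible
results.persistInDMP=true
# on-the-fly export should then stay disabled (otherwise the result would be exported twice)
task.do_export_on_the_fly=false
```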

## Requirements

For a (complete) TPU task execution you need to provide (at least):
* a [metadata repository](https://github.com/dswarm/dswarm-documentation/wiki/Glossary#metadata-repository) that contains the projects with the mappings that you would like to include into your task
* the data resource(s) that should act as input for your task execution
-* the output data model (schema) where the data should be mapped to (usually this can be the same as it is utilised in the projects of the mappings)
+* the output data model (schema) to which the data should be mapped (usually this can be the same as the one utilized in the mapping projects)

## Configuration

You can configure a TPU task with the help of a properties file (`config.properties`). You don't need to configure each property for each processing scenario (maybe the properties will be simplified a bit in the future ;) ). Here is an overview of the configuration properties:

````
-# this can be an arbitary name
+# this can be an arbitrary name
project.name=My-TPU-project

##############
# References #
##############

-# this folder will be utilised when processing the input data resource into an input data model, i.e., put in here all data resources that should processed in your TPU task
+# this folder will be utilized when processing the input data resource into an input data model, i.e., put in here all data resources that should be processed in your TPU task
resource.watchfolder=data/sources/resources

-# the configuration that should be utilised to process the input data resource into an input data model
+# the configuration that should be utilized to process the input data resource into an input data model
configuration.name=/home/user/conf/oai-pmh-marc-xml-configuration.json

# optional - only necessary if the init part is skipped
@@ -78,7 +78,7 @@ prototype.outputDataModelID=DataModel-cf998267-392a-4d87-a33a-88dd1bffb016
# Ingest #
##########

-# enables init part (i.e. resource + data model creation)
+# enables init part (i.e., resource + data model creation)
init.do=true

# if disabled, task.do_ingest_on_the_fly needs to be enabled
@@ -87,7 +87,7 @@ init.data_model.do_ingest=false
# if enabled, task.do_ingest_on_the_fly needs to be enabled
init.multiple_data_models=true

-# enables ingest (i.e. upload of data resources + ingest into given data model (in the data hub)
+# enables ingest (i.e., upload of data resources + ingest into the given data model (in the data hub))
ingest.do=true

#############
@@ -97,26 +97,26 @@ ingest.do=true
# enables task execution (on the given data model with the given mappings into the given output data model)
transform.do=true

-# to do ingest on-the-fly at task execution time, you need to disable init and ingest part and provide a valid prototype dataModelID
+# to do `ingest on-the-fly` at task execution time, you need to disable the init and ingest part and provide a valid prototype dataModelID
task.do_ingest_on_the_fly=true

-# to do export on-the-fly at task execution time, you need to disable results.persistInDMP (otherwise, it would be written to the data hub)
+# to do `export on-the-fly` at task execution time, you need to disable results.persistInDMP (otherwise, it would be written to the data hub)
# + you need to disable the export part (otherwise it would be exported twice)
task.do_export_on_the_fly=true

##########
# Export #
##########

-# enables xml export (from the given output data model)
+# enables XML export (from the given output data model)
export.do=true

###########
# Results #
###########

# (optional) - only necessary if the transform part is enabled (i.e., the task execution result will be stored in the data hub)
-# + if export part is enabled, this needs to be enabled as well (otherwise it wouldn't find any data in the data hubb for export)
+# + if the export part is enabled, this needs to be enabled as well (otherwise it wouldn't find any data in the data hub for export)
results.persistInDMP=false

# needs to be enabled if data is exported to the file system (also necessary for export on-the-fly)
@@ -132,8 +132,8 @@ results.writeDMPJson=false
# Task Processing Unit #
########################

-# the number of threads that should be utilised for execution the TPU task in parallel
-# currently, multi-threading can only be utilised for on-the-fly transform, i.e. init.do=true + init.data_model.do_ingest=false + init.multiple_data_models=true + ingest.do=false + transform.do=true + task.do_ingest_on_the_fly=true + task.do_export_on_the_fly=true + export.do=false + results.persistInDMP=false
+# the number of threads that should be utilized for executing the TPU task in parallel
+# currently, multi-threading can only be utilized for on-the-fly transform, i.e., init.do=true + init.data_model.do_ingest=false + init.multiple_data_models=true + ingest.do=false + transform.do=true + task.do_ingest_on_the_fly=true + task.do_export_on_the_fly=true + export.do=false + results.persistInDMP=false
engine.threads=1

# the base URL of the D:SWARM backend API
@@ -155,15 +155,15 @@ mvn clean package
You can execute your TPU task with the following command:

    $JAVA_HOME/jre/bin/java -cp TaskProcessingUnit-1.0-SNAPSHOT-onejar.jar de.tu_dortmund.ub.data.dswarm.TaskProcessingUnit -conf=conf/config.properties
-You need to ensure that (at least) the D:SWARM backend is running (+ (optionally) the data hub and D:SWARM graph exetension).
+You need to ensure that at least the D:SWARM backend is running (+ optionally, the data hub and D:SWARM graph extension).
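
A small shell sketch of such a pre-flight check (it merely probes the backend's base URL, here the `engine.dswarm.api` value from the example configuration below; whether that URL answers with a success status is an assumption):

```
# sketch: probe the D:SWARM backend before starting the TPU
if curl -sf -o /dev/null http://localhost:8087/dmp/; then
    $JAVA_HOME/jre/bin/java -cp TaskProcessingUnit-1.0-SNAPSHOT-onejar.jar de.tu_dortmund.ub.data.dswarm.TaskProcessingUnit -conf=conf/config.properties
else
    echo "D:SWARM backend is not reachable" >&2
fi
```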

## Logging

-You can find logs of your TPU task exectutions in `[TPU HOME]/target/logs`.
+You can find logs of your TPU task executions in `[TPU HOME]/target/logs`.
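
To follow a running execution, something like the following should work (the exact log file names are an assumption):

```
# sketch: follow the TPU log output (file naming is an assumption)
tail -f "$TPU_HOME"/target/logs/*.log
```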

## Example Configuration for On-The-Fly Transform Processing

-The following configuration illustrates the property settings for a multi-threading ```on-the-fly transform``` processing scenario (i.e. input data ingest will be done on-the-fly before D:SWARM task execution + result export will be done immediately after the D:SWARM task execution):
+The following configuration illustrates the property settings for a multi-threading ```on-the-fly transform``` processing scenario (i.e., input data ingest will be done on-the-fly before D:SWARM task execution + result export will be done immediately after the D:SWARM task execution):

```
service.name=deg-small-test-run
@@ -188,4 +188,4 @@ engine.dswarm.api=http://localhost:8087/dmp/
engine.dswarm.graph.api=http://localhost:7474/graph/
```

-For this scenario the input data resource needs to be divided into multiple parts. Then each part will be executed as separate TPU task (and produce a separate export file).
+For this scenario, the input data resource needs to be divided into multiple parts. Then each part will be executed as a separate TPU task (and produce a separate export file).
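
A shell sketch of such a parallelized run (`split-input.sh` is a hypothetical helper — use whatever tool fits your data format; the watch folder and thread count correspond to `resource.watchfolder` and `engine.threads` from the configuration above):

```
# sketch: split the input data resource into parts and drop them into the watch folder
./split-input.sh my-large-input.xml data/sources/resources/
# with engine.threads set accordingly, a single TPU run then executes a separate
# TPU task per part and produces a separate export file for each of them
$JAVA_HOME/jre/bin/java -cp TaskProcessingUnit-1.0-SNAPSHOT-onejar.jar de.tu_dortmund.ub.data.dswarm.TaskProcessingUnit -conf=conf/config.properties
```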
