Workflow TGDP

This repo was created in preparation for the second release of the TGDP ZuMult platform at UT Austin. It rebundles the code for preparing the TGDP data from speechislands.org for inclusion in the platform. The code takes care of:

Conversion and Annotation

Converting EAF to ISO/TEI conformant XML
Tokenisation
Language tagging
Orthographic normalisation
Normalisation
Lemmatisation
Part-of-Speech tagging (according to STTS 2.0 and Universal Dependencies)
Phonetic annotation (using the G2P web service from BAS Munich)
Speech rate annotation

Indexing

Lucene index for MTAS (for query in ZuMult)
Indexing of the COMA file (for quicker access in ZuMult)
Stats for the COMA file (for quicker access in ZuMult)

Usage

This Windows batch file (or a Linux equivalent) bundles all commands:

ConvertAnnotateIndex

Additionally, a ZuMult configuration file with suitable values has to be present on the system, for example:

<configuration>
  <backend classPath="org.zumult.backend.implementations.COMAFileSystem">              
    <tree-tagger-directory>C:\Users\bernd\Dropbox\TreeTagger</tree-tagger-directory>         
    <tree-tagger-parameter-file-german>C:\linguisticbits_nb\2021-04-16_ParameterFile_ORIGINAL_ALL_FINAL.par</tree-tagger-parameter-file-german>       
    <tree-tagger-parameter-file-english>C:\linguisticbits_nb\english.par</tree-tagger-parameter-file-english>

    <!-- Phonetic lexicons, see section 1.9 -->
    <phonetic-lexicon-german>C:\linguisticbits_nb\Lexicon_German.xml</phonetic-lexicon-german>        
    <phonetic-lexicon-english>C:\linguisticbits_nb\Lexicon_English.xml</phonetic-lexicon-english>        
    <phonetic-lexicon-other>C:\linguisticbits_nb\Lexicon_Other.xml</phonetic-lexicon-other>
  </backend>
</configuration>

A ZuMult configuration is set by:

Saving an XML file like the above in a suitable place in the system
Specifying the path to that file in an environment variable ZUMULT_CONFIG_PATH
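
Before running the workflow, it can help to confirm that the environment variable is set and that the paths in the configuration file exist. The following is a minimal sketch of such a check, not part of this repository: the class name ZuMultConfigCheck and the method readBackendSettings are hypothetical, and it only assumes the XML layout shown above (a backend element whose child elements contain file system paths).

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the Workflow_TGDP code base.
public class ZuMultConfigCheck {

    // Collects the child elements of the first <backend> element
    // as element name -> trimmed text content (comments are skipped).
    public static Map<String, String> readBackendSettings(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, String> settings = new LinkedHashMap<>();
        NodeList backends = doc.getElementsByTagName("backend");
        if (backends.getLength() == 0) {
            return settings;
        }
        NodeList children = backends.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                settings.put(child.getNodeName(), child.getTextContent().trim());
            }
        }
        return settings;
    }

    public static void main(String[] args) throws Exception {
        String configPath = System.getenv("ZUMULT_CONFIG_PATH");
        if (configPath == null || configPath.isBlank()) {
            System.err.println("ZUMULT_CONFIG_PATH is not set");
            return;
        }
        Path configFile = Path.of(configPath);
        if (!Files.isRegularFile(configFile)) {
            System.err.println("Configuration file not found: " + configFile);
            return;
        }
        try (InputStream in = Files.newInputStream(configFile)) {
            // Report for every configured path whether it exists on this machine.
            for (Map.Entry<String, String> entry : readBackendSettings(in).entrySet()) {
                boolean exists = Files.exists(Path.of(entry.getValue()));
                System.out.println((exists ? "OK       " : "MISSING  ")
                        + entry.getKey() + " = " + entry.getValue());
            }
        }
    }
}
```

Running the class prints one line per configured path, so a missing TreeTagger directory, parameter file or lexicon shows up before the batch file is called.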

The TreeTagger binary must be downloaded from https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

The two TreeTagger parameter files are part of this repository: tagger package

The three phonetic lexicons are also part of this repository: normalizer package

In the batch file, you need to adapt the variables WORKFLOW_JAR and LIB_DIRECTORY to your local setup.

When calling the batch file, you need to specify four parameters:

The path to the COMA file ([...]/TGDP.coma)
The path to the MTAS configuration file - part of this repository: MTAS config
The path of the directory to which the MTAS/Lucene index will be written
The name of the MTAS/Lucene index (SB_TGDP)
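
The order and meaning of the four parameters listed above can be sketched as a small pre-flight check. This is an illustration only: the class WorkflowArgsCheck and its validate method are hypothetical and do not exist in this repository, and the checks (files must exist, the index path must not be a regular file, the index name must not be blank) are assumptions about what a sensible call looks like.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Workflow_TGDP code base.
public class WorkflowArgsCheck {

    // Returns a list of problems found in the four parameters; empty if none.
    public static List<String> validate(String[] args) {
        List<String> problems = new ArrayList<>();
        if (args.length != 4) {
            problems.add("expected 4 parameters, got " + args.length);
            return problems;
        }
        // 1: path to the COMA file
        if (!Files.isRegularFile(Path.of(args[0]))) {
            problems.add("COMA file not found: " + args[0]);
        }
        // 2: path to the MTAS configuration file
        if (!Files.isRegularFile(Path.of(args[1]))) {
            problems.add("MTAS configuration file not found: " + args[1]);
        }
        // 3: directory to which the MTAS/Lucene index will be written
        Path indexDirectory = Path.of(args[2]);
        if (Files.exists(indexDirectory) && !Files.isDirectory(indexDirectory)) {
            problems.add("index path exists but is not a directory: " + args[2]);
        }
        // 4: name of the MTAS/Lucene index, e.g. SB_TGDP
        if (args[3].isBlank()) {
            problems.add("index name must not be empty");
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = validate(args);
        if (problems.isEmpty()) {
            System.out.println("All four parameters look plausible.");
        } else {
            problems.forEach(System.err::println);
        }
    }
}
```

The check deliberately accepts an index directory that does not exist yet, on the assumption that the indexing step may create it.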
