preprocessing before triggering 'preprocess.sh' for ontonotes #27

@marc88

Hello,
Can anyone suggest what processing needs to be done on the CoNLL-2012 data before calling the following?

./bin/preprocess.sh conf/ontonotes/dilated-cnn.conf

Currently, simply calling preprocess.sh as above writes nothing to the file below and appears to hang (an infinite loop, I suppose):

data/vocabs/ontonotes_cutoff_4.txt

I've downloaded the train v4, dev v4 and test v9 tarballs from
http://conll.cemantix.org/2012/data.html

Edit:
I was able to convert the OntoNotes files to CoNLL format, but I am not sure what directory structure the preprocessing script expects. Can you help?
The following is my directory structure:

$DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0

Structure of $DILATED_CNN_NER_ROOT/data/conll-formatted-ontonotes-5.0 (this directory holds all the _gold_conll files; here is one file path as an example:
/home/ss06886910/Strubel_IDCNN/data/conll-formatted-ontonotes-5.0/data/train/data/english/annotations/wb/c2e/00/c2e_0028.v4_gold_conll)

conll-formatted-ontonotes-5.0
├── data
│   ├── development
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   ├── test
│   │   └── data
│   │       ├── arabic
│   │       │   └── annotations
│   │       ├── chinese
│   │       │   └── annotations
│   │       └── english
│   │           └── annotations
│   └── train
│       └── data
│           ├── arabic
│           │   └── annotations
│           ├── chinese
│           │   └── annotations
│           └── english
│               └── annotations
└── scripts
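
One way to sanity-check the layout above is to count how many converted files sit under each split directory. The following is a minimal sketch and not part of the repository: the helper name count_gold_conll and the default root path are assumptions made for illustration, and only the _gold_conll suffix comes from the files described above.

```python
import os
import sys


def count_gold_conll(root, suffix="_gold_conll"):
    """Return {split_name: file_count} for each subdirectory of root.

    Counts files whose names end with `suffix` anywhere below each split.
    """
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # skip stray files next to the split directories
        total = 0
        for _dirpath, _dirnames, filenames in os.walk(split_dir):
            total += sum(1 for name in filenames if name.endswith(suffix))
        counts[split] = total
    return counts


if __name__ == "__main__":
    # Assumed default: the raw_data_dir value, relative to the repository root.
    root = sys.argv[1] if len(sys.argv) > 1 else "data/conll-formatted-ontonotes-5.0/data"
    if os.path.isdir(root):
        for split, total in sorted(count_gold_conll(root).items()):
            print("%s: %d files" % (split, total))
    else:
        print("Not a directory: %s" % root)
```

A split that reports 0 files would suggest that the converted output is not where raw_data_dir points.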

I tried running with the following setting in ontonotes.conf:
export raw_data_dir="$DATA_DIR/conll-formatted-ontonotes-5.0/data"
(where $DATA_DIR = $DILATED_CNN_NER_ROOT/data)

I get the following error:

Processing file: data/conll-formatted-ontonotes-5.0/data/development
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/development --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/development --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Embeddings coverage: 98.67%
Processing file: data/conll-formatted-ontonotes-5.0/data/test
python /home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py --in_file data/conll-formatted-ontonotes-5.0/data/test --out_dir /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/test --window_size 3 --update_maps False --dataset ontonotes --update_vocab /home/ss06886910/Strubel_IDCNN/data/vocabs/ontonotes_cutoff_4.txt --vocab /home/ss06886910/Strubel_IDCNN/data/embeddings/lample-embeddings-pre.txt --labels /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/label.txt --shapes /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/shape.txt --chars /home/ss06886910/Strubel_IDCNN/data/ontonotes-w3-lample/train/char.txt
Traceback (most recent call last):
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 498, in <module>
    tf.app.run()
  File "/home/ss06886910/IDCNN/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 494, in main
    tsv_to_examples()
  File "/home/ss06886910/Strubel_IDCNN/src/tsv_to_tfrecords.py", line 487, in tsv_to_examples
    print("Embeddings coverage: %2.2f%%" % ((1-(num_oov/num_tokens)) * 100))
ZeroDivisionError: division by zero
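
The traceback shows the division by num_tokens failing at line 487, which means num_tokens was 0 when the coverage line ran for the test split. A guarded version of that calculation could look like the sketch below. The function name embeddings_coverage and the None return value are hypothetical; this is not the repository's code, and it only makes the failure easier to read rather than fixing the empty input.

```python
def embeddings_coverage(num_oov, num_tokens):
    """Return the percentage of tokens covered by the embeddings.

    Returns None when no tokens were read, instead of dividing by zero.
    """
    if num_tokens == 0:
        return None
    # float() keeps the division exact under Python 2 as well as Python 3.
    return (1.0 - float(num_oov) / num_tokens) * 100.0


def report_coverage(num_oov, num_tokens):
    """Build the message that the coverage print statement would show."""
    coverage = embeddings_coverage(num_oov, num_tokens)
    if coverage is None:
        return "Embeddings coverage: n/a (no tokens were read from the input)"
    return "Embeddings coverage: %2.2f%%" % coverage


if __name__ == "__main__":
    print(report_coverage(0, 0))
```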

Regards
