Closed
Description
The generic file-level tokenizer (tokenizers/file-level) fails on project folders that contain nested subfolders.
Let's say the input dataset of files to tokenize is in "project-folder" (PATH_proj_paths=project-folder) and looks like this:
$ tree project-folder
project-folder
|-- sub
|   |-- subsub
|   |   `-- index.js
|   `-- util.js
`-- test2.js
2 directories, 3 files
When I run python tokenizer.py folder, the tokenizer does find all the files in the subfolders. However, it then tries to open each of them as if it sat directly in the root directory, so every file that lives in a subfolder fails:
[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>
I am submitting a PR with a fix. (cc @pedromartins4)
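For readers who want to see the failure mode in isolation: the log above is consistent with each file name being joined to the project root rather than to the directory in which it was found. The following is a minimal sketch under that assumption. It is not the tokenizer's actual code and not the code from the PR; the function names list_files_buggy, list_files_fixed, and make_example_tree are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def list_files_buggy(project_dir):
    """Hypothetical reproduction: every file name is joined to the project root."""
    paths = []
    for _root, _dirs, files in os.walk(project_dir):
        for name in files:
            # The directory in which the file was found is ignored here.
            paths.append(os.path.join(project_dir, name))
    return sorted(paths)


def list_files_fixed(project_dir):
    """Join each file name to the directory os.walk reported it in."""
    paths = []
    for root, _dirs, files in os.walk(project_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)


def make_example_tree(base):
    """Recreate the three-file layout from the issue under a scratch directory."""
    project = os.path.join(base, "project-folder")
    os.makedirs(os.path.join(project, "sub", "subsub"))
    relative_files = (
        "test2.js",
        os.path.join("sub", "util.js"),
        os.path.join("sub", "subsub", "index.js"),
    )
    for rel in relative_files:
        with open(os.path.join(project, rel), "w") as handle:
            handle.write("// placeholder\n")
    return project


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        project = make_example_tree(base)
        for label, lister in (("buggy", list_files_buggy), ("fixed", list_files_fixed)):
            for path in lister(project):
                status = "ok" if os.path.isfile(path) else "File not found"
                print(label, os.path.relpath(path, base), status)
```

With the buggy variant only test2.js resolves to an existing file, which matches the pattern in the log; with the fixed variant all three paths exist.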