Bug with traversing subfolders in file-level tokenizer #8

@jakubzitny

The generic file-level tokenizer (tokenizers/file-level) fails on project folders that contain nested subfolders.

Let's say I have an input dataset of files to tokenize in "project-folder" (PATH_proj_paths=project-folder), and it looks like this:

$ tree project-folder
project-folder
|-- sub
|   |-- subsub
|   |   `-- index.js
|   `-- util.js
`-- test2.js

2 directories, 3 files

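When I run python tokenizer.py folder, it does find all the files in the subfolders. However, it then tries to tokenize each found filename as if it were located directly in the project's root directory, so every file that lives in a subfolder is reported as not found: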

[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>
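The log above is consistent with file names being joined to the project root instead of to the directory where they were found. Here is a minimal sketch of that pattern and one way to correct it, assuming the traversal uses os.walk. The function names list_files_buggy and list_files_fixed are hypothetical and are not taken from tokenizer.py, whose actual code may differ:

```python
import os


def list_files_buggy(project_dir):
    """Hypothetical reconstruction of the bug: every file name is joined
    with the project root, so files in subfolders get a wrong path."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(project_dir):
        for name in filenames:
            # Wrong: ignores dirpath, the folder the file was found in.
            paths.append(os.path.join(project_dir, name))
    return paths


def list_files_fixed(project_dir):
    """Join each file name with the directory os.walk reported it in."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(project_dir):
        for name in filenames:
            # Correct: dirpath already includes the nested subfolders.
            paths.append(os.path.join(dirpath, name))
    return paths
```

With the example tree, the buggy variant yields project-folder/util.js and project-folder/index.js, which do not exist, while the fixed variant yields project-folder/sub/util.js and project-folder/sub/subsub/index.js.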

I am submitting a PR with a fix. (cc @pedromartins4)
