Add option to allow newlines in captions #283
Open
achalddave wants to merge 3 commits into rom1504:main
The YFCC-15M descriptions can contain newlines in the caption, which causes pyarrow's csv module to error by default. This commit allows passing --newlines-in-captions True to img2dataset, which tells pyarrow to allow newlines in CSV values.
Owner
Could you add an example of a dataset for which this is needed, please?
Author
I needed this for YFCC 100M. Did you want that in the README or somewhere in the repo?
Owner
Yes, if you could add it in https://github.com/rom1504/img2dataset/tree/main/dataset_examples it would be great.
Contributor
I also need this. (I have a crawler that gives me many raw web image-text pairs with newlines in the text title.)
Owner
Could you please rebase on head and resolve the conflicts?
Some datasets (e.g., YFCC) have newlines in captions, which causes pyarrow's csv module to error by default. This PR allows passing --newlines-in-captions True to img2dataset, which in turn tells pyarrow to allow newlines in CSV values.