Small updates for compatibility with WebDataset pipelines #171
base: main
Conversation
✅ Result of Pytest Coverage: platform linux, python 3.10.0-final-0
Force-pushed from `6af7f96` to `cc55aa4`
MaxiBoether left a comment:
Thanks for the work! Let's clean this up a bit.
```python
import os
import argparse
import requests
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
```
Unnecessary changes, no?
| "CC12MDataset", | ||
| "MSCOCODataset", | ||
| "LAION400MDataset", | ||
| "COYO700MDataset", |
As discussed, this must change
```python
class MixteraDataPipeline(wds.DataPipeline):
    """
    Supports building arbitrary webdataset pipelines with Mixtera's `MixteraTorchDataset` as the data source.
    """

    def __init__(
        self,
        client: MixteraClient,
        query: Query,
        query_execution_args: QueryExecutionArgs,
        result_streaming_args: ResultStreamingArgs,
        pipeline: Iterable[Any],
    ):
        super().__init__(*pipeline)
        self.client = client
        self.query = query
        self.query_execution_args = query_execution_args
        self.result_streaming_args = result_streaming_args

        torch_dataset = MixteraTorchDataset(
            client=client,
            query=query,
```
I don't understand what this is necessary for. The MixteraTorchDataset is a user-facing abstraction.
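To illustrate the reviewer's point with a hedged sketch: if `MixteraTorchDataset` is already an iterable of samples, it may be usable directly as the source stage of a pipeline, with no wrapper subclass needed. The stand-in classes and the stage-composition helper below are illustrative assumptions, not Mixtera's or webdataset's actual API.

```python
from typing import Any, Iterable, Iterator

class FakeTorchDataset:
    """Stand-in for MixteraTorchDataset: just yields dict samples."""

    def __iter__(self) -> Iterator[dict[str, Any]]:
        for i in range(3):
            yield {"__key__": f"{i:06d}", "txt": f"caption {i}"}

def run_pipeline(source: Iterable[dict], *stages) -> list[dict]:
    """Minimal DataPipeline-style composition: each stage transforms an iterator."""
    it: Iterable = source
    for stage in stages:
        it = stage(it)
    return list(it)

def uppercase_txt(it: Iterable[dict]) -> Iterator[dict]:
    # An example downstream stage, analogous to a webdataset map step.
    for sample in it:
        yield {**sample, "txt": sample["txt"].upper()}

# The iterable dataset plugs straight into the pipeline as its first stage.
samples = run_pipeline(FakeTorchDataset(), uppercase_txt)
```

The design point: composing an existing iterable with pipeline stages avoids duplicating the dataset's constructor arguments in a second user-facing class.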
```python
def decode(sample: dict[str, Any], decode_image: bool = True) -> dict[str, Any]:
def decode_sample(sample: dict[str, Any]) -> dict[str, Any]:
```
It's odd to have a function called `decode_sample` which you can optionally tell not to decode. Can you find a cleaner way to disable image decoding?
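One cleaner design, sketched here as an assumption rather than Mixtera's actual API: build the decoder from composable per-field handlers, so image decoding is disabled by simply omitting the image handler instead of threading a boolean flag through `decode_sample`. All names below are illustrative.

```python
from typing import Any, Callable

def make_decoder(handlers: dict[str, Callable[[bytes], Any]]) -> Callable[[dict], dict]:
    """Build a decode function from per-field handlers; unhandled fields pass through."""

    def decode(sample: dict[str, Any]) -> dict[str, Any]:
        out = {}
        for key, value in sample.items():
            handler = handlers.get(key)
            out[key] = handler(value) if handler else value
        return out

    return decode

# Text-only decoding: no "jpg" handler registered, so image bytes stay raw.
text_only = make_decoder({"txt": lambda b: b.decode("utf-8")})

sample = {"__key__": "000001", "txt": b"a caption", "jpg": b"\xff\xd8"}
decoded = text_only(sample)
```

The flag disappears entirely: callers pick behavior by choosing which handlers to register.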
```python
class MMIndexedTarRawBytes(MMIndexedTar):
    """
    A subclass of `MMIndexedTar` that returns the raw bytes instead of an IOBytes object.
    """

    def get_file(self, i: int) -> tuple[str, bytes]:
        filename, data = self.get_at_index(i)
        return filename, data
```
Why is that necessary?
```python
return self.decoder(sample)
raise ValueError("Co")
sample[ext[1:]] = data
sample["__key__"] = key  # type: ignore
```
Why the `type: ignore`?
```python
self.add_metadata(sample_id=line_number, dataset=dataset_name)


class DomainNetMetadataParser(MetadataParser):
```
We need to clean up the datasets vs. metadata parsers here. I'm not sure why, in addition to the datasets you added, you also added this parser. Let's find a cleaner abstraction for this. You could have something like a "LlavaMetadataParser" that infers, e.g. from the path, which dataset the sample comes from. Or you could have one MetadataParser per dataset. You can actually register multiple datasets within the same data collection, so that should not be an issue either.
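The "infer the dataset from the path" idea could be sketched as follows. The path patterns and the `infer_dataset` helper are illustrative assumptions for this review, not Mixtera's actual API; only the dataset class names come from the diff above.

```python
import re

# Hypothetical mapping from path fragments to the dataset names seen in this PR.
DATASET_PATTERNS = {
    r"/cc12m/": "CC12MDataset",
    r"/mscoco/": "MSCOCODataset",
    r"/laion400m/": "LAION400MDataset",
    r"/coyo700m/": "COYO700MDataset",
}

def infer_dataset(path: str) -> str:
    """Return the dataset name whose path pattern matches, or raise if none does."""
    for pattern, name in DATASET_PATTERNS.items():
        if re.search(pattern, path):
            return name
    raise ValueError(f"cannot infer dataset from path: {path}")
```

A single parser built around a lookup like this would then call `add_metadata` with the inferred dataset name, instead of requiring one hard-coded parser class per dataset.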