Skip to content

Conversation

phoeenniixx
Copy link
Member

@phoeenniixx phoeenniixx commented Apr 6, 2025

Description

This PR adds example notebook for the new v2 data pipeline vignette, having the basic implementation of the tft model using this version. For more info see #1812 , #1811

Colab link: https://colab.research.google.com/drive/148MyhcNfYEh4CZ6vBXLqQNsUBF0n6_0v?usp=sharing

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@phoeenniixx
Copy link
Member Author

Hi @fkiraly, I am getting this error:
image

I just downloaded the notebook from colab and pasted it in the repo, is there anything else I should do to avoid this? Really have no idea 😅

Copy link

codecov bot commented Apr 11, 2025

Codecov Report

❌ Patch coverage is 9.09091% with 10 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@4a34931). Learn more about missing BASE report.

Files with missing lines Patch % Lines
pytorch_forecasting/data/examples.py 9.09% 10 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1813   +/-   ##
=======================================
  Coverage        ?   85.59%           
=======================================
  Files           ?       68           
  Lines           ?     6597           
  Branches        ?        0           
=======================================
  Hits            ?     5647           
  Misses          ?      950           
  Partials        ?        0           
Flag Coverage Δ
cpu 85.59% <9.09%> (?)
pytest 85.59% <9.09%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@phoeenniixx
Copy link
Member Author

Thank you for the review @xandie985!

EncoderDecoderTimeSeriesDataModule assumes that the data would fit int he memory. How would your approach scale to datasets that do not fit into RAM? Are there plans to incorporate memory-efficient loading strategies like chunking or on-demand loading from disk?

So right now we assume that data could fit in the memory, but yes, in future we plan to add features like chunking, on-demand loading etc

  1. Does your module have any mechanisms to detect or correct irregular time series within its scope?
  2. Your _create_windows method checks for sufficient sequence length.. how does your module handle missing values within a window, and what are the potential consequences for model training if windows contain significant NaNs?
  3. Adding random seed for reproducibility, especially in setup() where random shuffling takes place.

These are some open questions, we still need to work on - We will tackle these questions in future iterations once an end-to-end prototype is ready and we get some reviews from the users of the package on this prototype.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More detailed review.

  • please remove the install from the start of the notebook
  • we should test that this is running, while we are working on v2. One way is to move the content to docs/examples/tutorials, the contents of which are automatically run an tested.
  • the data generation cell is useful, but not too illustrative. Can you move the code to a function load_toydata or similar, in pytorch_forecasting.data, new module, e.g., toydata? Then we can also use this in testing later!
  • can you add basic markdown cells that explain what the notebook is showing, and what each steps are? E.g., a summary at the top of the multiple steps, and then again small headers for the steps with minimal explanations.

@fkiraly fkiraly added the documentation Improvements or additions to documentation label May 29, 2025
@phoeenniixx
Copy link
Member Author

Thanks! I would make the changes accordingly, Just one doubt:

the data generation cell is useful, but not too illustrative. Can you move the code to a function load_toydata or similar, in pytorch_forecasting.data, new module, e.g., toydata? Then we can also use this in testing later!

I think we can add it to pytorch_forecasting.data.examples? Like right now people import get_stallion_data from there, so they can import toydata from there as well. Just think that this would follow an already available mapping of "test data" to examples...

@fkiraly
Copy link
Collaborator

fkiraly commented May 30, 2025

Like right now people import get_stallion_data from there, so they can import toydata from there as well.

Makes sense, to add it to the established location with data loaders.

Would it make sense to split the file up and have on loader per file? Need not be done in this PR.

@phoeenniixx
Copy link
Member Author

Would it make sense to split the file up and have on loader per file?

Then I think we need to should create a new folder called loaders or datasets and have these files there, and we can add more loaders to that folder in future

@fkiraly fkiraly moved this from PR under review to PR in progress in May - Sep 2025 mentee projects May 30, 2025
@fkiraly
Copy link
Collaborator

fkiraly commented May 31, 2025

Then I think we need to should create a new folder called loaders or datasets and have these files there, and we can add more loaders to that folder in future

Maybe a separate PR though.

@fkiraly fkiraly moved this from PR in progress to PR under review in May - Sep 2025 mentee projects Jun 4, 2025
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Some minor change requests only.

  • in the header "data pipeline" - should it not be training and inference?
  • please remove unused imports
  • I would also suggest to move imports to the cells where they are used
  • at the start, can you summarize what is shown over the entire notebook in a few bullet points? Mostly just a list of the headers (table of contents and signposting)
  • can you explain usage of the important objects? Use markdown or in-line comments
    • most important arguments of objects such as of TimeSeries
    • explain and show the types and structure of important returns such as y_pred
  • I would split the last cell into multiple parts, too much is happening there
  • rule of thumb, cells should be max 15 lines, and printouts max 10 lines. There should be descriptive content, even if very minimal, at the start or before the cell in a markdown.

@phoeenniixx
Copy link
Member Author

phoeenniixx commented Jun 5, 2025

in the header "data pipeline" - should it not be training and inference?

I mean it is a basic vignette of "data pipeline", how data flow might look like in v2? Should I add words "training" and "inference" as well there?

Model training is just to "complete" the process

@fkiraly
Copy link
Collaborator

fkiraly commented Jun 5, 2025

I mean it is a basic vignette of "data pipeline", how data flow might look like in v2? Should I add words "training" and "inference" as well there?

Data pipeline is not accurate imo - people expect pre-processing or ETL if they hear that.

But in fact this is the full basic workflow for actually using the neural networks for forecasting.

@fkiraly fkiraly moved this from PR under review to PR in progress in May - Sep 2025 mentee projects Jun 6, 2025
@phoeenniixx phoeenniixx requested a review from fkiraly June 6, 2025 20:47
@phoeenniixx
Copy link
Member Author

Hi @fkiraly, will this work?

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, great!

Made some minor changes to the header to make this clearer.

@fkiraly fkiraly merged commit 4ba4ff5 into sktime:main Jun 7, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from PR in progress to Done in May - Sep 2025 mentee projects Jun 7, 2025
@github-project-automation github-project-automation bot moved this from PR in progress to Done in Dec 2024 - Mar 2025 mentee projects Jun 7, 2025
@phoeenniixx phoeenniixx deleted the refactor-notebook branch June 8, 2025 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

3 participants