-
Notifications
You must be signed in to change notification settings - Fork 13
Make sure TensorFlow also works on offline machines #275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: laraPPr <[email protected]>
Signed-off-by: laraPPr <[email protected]>
Signed-off-by: laraPPr <[email protected]>
Signed-off-by: laraPPr <[email protected]>
|
@laraPPr why not first download the dataset in a running a test locally should be doable with the |
|
Mare Nostrum does not have login nodes with internet access. Which is a real pita. |
|
And tensorflow redowloads everytime even when it is cashed. |
did you try adding the |
No I did not try that. Will try it after the meeting. |
Signed-off-by: laraPPr <[email protected]>
Signed-off-by: laraPPr <[email protected]>
…t in configs Signed-off-by: laraPPr <[email protected]>
|
I think we need to update that CI because it is not testing the TensorFlow tests not sure why |
|
Converted to draft because I've now switched the other prs to use features instead of extras |
Signed-off-by: laraPPr <[email protected]>
| if 'offline' in self.current_partition.features: | ||
| resourcesdir = self.current_system.resourcesdir | ||
| data = os.path.join(resourcesdir, self.module_name, 'datasets/mnist.npz') | ||
| if os.path.exists(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if we can move this check to the after_init state. If so, and if this path does not exist, add -offline to the valid_systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now split the funtion up in two. The first runs after init to check if the data is their and will set the necessary environment variable if it is. If not valid_systems gets edited with the new hook from #279. The other function is still required to run after setup because we need self.current_partition.features.
…hout the data present.
| txt += 'You can download the file running tf.keras.datasets.mnist.load_data() ' | ||
| txt += f'with {self.module_name} on a system with internet access.' | ||
| utils.log(txt) | ||
| hooks.filter_valid_systems_for_offline_partitions(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't exist in this PR (maybe you forgot to push it?), but I assume it should do something like:
valid_systems = valid_systems + ['-offline']
So that if the data path does not exist, the test declares that 'this can not run on system with the offline feature', essentially
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added
FEATURES.OFFLINE,
to our local config. Then, the test got skipped. From the logs:
[2025-10-23T21:45:07] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.
Then, I downloaded the dataset. Since I don't set a resourcesdir explicitly in my ReFrame config, it's using the current directory as default. Unfortunately, in load_data(path=...) the path is relative to the .keras/datasets folder according to the documentation https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data#args. Fortunately, a full path is respected, so I could download it using:
mkdir -p TensorFlow/2.13.0-foss-2023a/datasets
python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data(path='/home/casparl/EESSI/test-suite/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"
Note the mkdir -p: the directory needs to exist, otherwise you get a FileNotFoundError.
I think we should improve the instruction so that it prints these exact commands. ReFrame knows what it's own resourcesdir is, we can print that as part of the path.
Now the bad news: rerunning still gives me:
debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.
Reason is that the current working dir for a test is the location of the test file:
[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Current working dir: /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow
[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
That's another reason why it's useful if it gets printed from the test itself: if the resourcesdir is ., it can use os.getcwd() instead.
…e warning visible on stdout, not just in the log - because there it is too hidden. Finally, make sure to _only_ print the warning if there is _at least_ one offline partition
|
After 1019f21 I now ran again and got: Then, I copy-pasted the instructions: and ran again: I also checked that the warning does not get printed if none of the partitions in the current system have I do wonder two things:
|
|
I'm also not sure why I get this redefinition issue in flake8. The line number seems wrong, and I really don't see a redefinition? |
Depends on the new hook in this pr