
Conversation

laraPPr commented Jul 10, 2025

laraPPr marked this pull request as draft July 10, 2025 14:04
laraPPr added 2 commits July 14, 2025 11:06
Signed-off-by: laraPPr <[email protected]>
laraPPr changed the title from "[WIP] Make sure TensorFlow also works on offline machines" to "Make sure TensorFlow also works on offline machines" Jul 14, 2025
laraPPr marked this pull request as ready for review July 14, 2025 09:51

smoors commented Jul 31, 2025

@laraPPr why not first download the dataset in a local test, i.e. not on the compute nodes but on the node where ReFrame is launched?
The only requirement is that this node has internet access, which is usually the case, as most people launch ReFrame on a login node.
An added benefit is that the dataset is downloaded only once for all the concrete test cases.

Running a test locally should be doable with the local parameter, see:
https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.local
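
A sketch of what that could look like, as a separate fetch test with local = True (the class name, module and sanity check below are illustrative, not code from this PR):

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class EESSI_TensorFlow_fetch_data(rfm.RunOnlyRegressionTest):
    """Download the MNIST dataset once, on the node where ReFrame itself runs."""
    valid_systems = ['*']
    valid_prog_environs = ['*']
    local = True  # bypass the scheduler and run on the ReFrame host (usually a login node)
    modules = ['TensorFlow/2.13.0-foss-2023a']  # illustrative; the real test parameterises the module
    executable = 'python'
    executable_opts = [
        '-c', '"import tensorflow as tf; tf.keras.datasets.mnist.load_data()"',
    ]

    @sanity_function
    def assert_no_error(self):
        return sn.assert_not_found(r'Traceback', self.stderr)

The compute-node test cases could then depend on this test (or use it as a fixture), so the dataset is already in place when they run.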

laraPPr commented Jul 31, 2025

Mare Nostrum does not have login nodes with internet access, which is a real pain.

laraPPr commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

smoors commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

Did you try adding the path parameter to tf.keras.datasets.mnist.load_data?
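
For reference, a minimal sketch of that call; a relative path is resolved under ~/.keras/datasets, while an absolute path is used as given (the path below is illustrative):

import os

import tensorflow as tf

# With an absolute path pointing at an existing mnist.npz, the cached file is
# reused instead of being downloaded again.
data_path = os.path.abspath('TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(path=data_path)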

laraPPr commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

Did you try adding the path parameter to tf.keras.datasets.mnist.load_data?

No, I did not try that. I will try it after the meeting.

laraPPr commented Jul 31, 2025

I think we need to update that CI, because it is not running the TensorFlow tests and I'm not sure why: "Tests for EESSI test suite, using EESSI production repo / test_with_eessi_pilot (2023.06)".

laraPPr marked this pull request as draft August 14, 2025 10:42

laraPPr commented Aug 14, 2025

Converted to draft because I've now switched the other PRs to use features instead of extras.

laraPPr marked this pull request as ready for review August 14, 2025 10:55
if 'offline' in self.current_partition.features:
    resourcesdir = self.current_system.resourcesdir
    data = os.path.join(resourcesdir, self.module_name, 'datasets/mnist.npz')
    if os.path.exists(data):

A collaborator commented on this diff:

See if we can move this check to an after-init hook. If so, and if this path does not exist, add -offline to the valid_systems.

laraPPr replied:

I've now split the function in two. The first runs after init to check whether the data is there, and sets the necessary environment variable if it is; if not, valid_systems gets edited with the new hook from #279. The other function still has to run after setup, because we need self.current_partition.features.
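
Roughly, a sketch of that split (hook names, the attribute holding the path and the environment variable are illustrative; filter_valid_systems_for_offline_partitions is the hook discussed below):

import os

import reframe as rfm

from eessi.testsuite import hooks


class EESSI_TensorFlow(rfm.RunOnlyRegressionTest):  # existing test class, abbreviated
    @run_after('init')
    def check_files_for_offline_run(self):
        """Look for a pre-downloaded mnist.npz under the system's resourcesdir."""
        resourcesdir = self.current_system.resourcesdir
        self.mnist_path = os.path.join(resourcesdir, self.module_name, 'datasets', 'mnist.npz')
        if os.path.exists(self.mnist_path):
            self.env_vars['MNIST_DATA_PATH'] = self.mnist_path  # illustrative variable name
        else:
            # data is missing: declare the test incompatible with offline partitions
            hooks.filter_valid_systems_for_offline_partitions(self)

    @run_after('setup')
    def check_offline_partition(self):
        """Needs self.current_partition.features, which is only known after setup."""
        self.is_offline_partition = 'offline' in self.current_partition.features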

txt += 'You can download the file running tf.keras.datasets.mnist.load_data() '
txt += f'with {self.module_name} on a system with internet access.'
utils.log(txt)
hooks.filter_valid_systems_for_offline_partitions(self)

casparvl commented Oct 23, 2025:

This function doesn't exist in this PR (maybe you forgot to push it?), but I assume it should do something like:

valid_systems = valid_systems + ['-offline']

So that if the data path does not exist, the test essentially declares 'this cannot run on a system with the offline feature'.
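
A minimal sketch of such a hook, under that assumption (FEATURES.OFFLINE is the constant added in this PR; the exact valid_systems handling below is a guess, not existing code):

from eessi.testsuite.constants import FEATURES  # assumed to expose FEATURES.OFFLINE == 'offline'


def filter_valid_systems_for_offline_partitions(test):
    """Exclude partitions that carry the 'offline' feature.

    ReFrame ORs the entries of valid_systems and ANDs the space-separated
    feature constraints within a single entry, so the exclusion is folded
    into each existing entry ('*' is simply replaced, since it matches
    everything anyway).
    """
    test.valid_systems = [
        f'-{FEATURES.OFFLINE}' if vs == '*' else f'{vs} -{FEATURES.OFFLINE}'
        for vs in test.valid_systems
    ]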

casparvl left a comment:

I added FEATURES.OFFLINE to our local config. Then, the test got skipped. From the logs:

[2025-10-23T21:45:07] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.
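
For context, a sketch of the relevant part of such a site config: the feature goes into a partition's features list (system and partition names below are made up; FEATURES.OFFLINE is assumed to resolve to the string 'offline'):

from eessi.testsuite.constants import FEATURES

site_configuration = {
    'systems': [
        {
            'name': 'example_system',
            'descr': 'System whose compute nodes have no internet access',
            'hostnames': ['login.*', 'node.*'],
            'partitions': [
                {
                    'name': 'compute',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                    'features': [
                        FEATURES.OFFLINE,  # compute nodes cannot reach the internet
                        # ... other features ...
                    ],
                },
            ],
        },
    ],
}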

Then, I downloaded the dataset. Since I don't set a resourcesdir explicitly in my ReFrame config, it uses the current directory as the default. Unfortunately, in load_data(path=...) the path is relative to the ~/.keras/datasets folder according to the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data#args). Fortunately, a full path is respected, so I could download it using:

mkdir -p TensorFlow/2.13.0-foss-2023a/datasets
python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data(path='/home/casparl/EESSI/test-suite/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"

Note the mkdir -p: the directory needs to exist, otherwise you get a FileNotFoundError.

I think we should improve the instruction so that it prints these exact commands. ReFrame knows what its own resourcesdir is; we can print that as part of the path.
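
A sketch of how the test could assemble that message, reusing the path it already computed (variable names are the illustrative ones from the sketches above):

download_cmd = (
    f'mkdir -p {os.path.dirname(self.mnist_path)} && '
    f'module load {self.module_name} && '
    f'python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data(\'{self.mnist_path}\')"'
)
utils.log(
    f'Could not find {self.mnist_path}. '
    f'To download the dataset, run the following on a system with internet access:\n{download_cmd}'
)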

Now the bad news: rerunning still gives me:

debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.

The reason is that the current working dir for a test is the location of the test file:

[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Current working dir: /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow
[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.

That's another reason why it's useful if it gets printed from the test itself: if the resourcesdir is ., it can use os.getcwd() instead.
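
A sketch of that fallback:

resourcesdir = self.current_system.resourcesdir
# The default resourcesdir is '.', which inside a test resolves relative to the
# directory of the test file; make it absolute so the printed instructions can
# be copy-pasted from anywhere.
if not os.path.isabs(resourcesdir):
    resourcesdir = os.path.abspath(os.path.join(os.getcwd(), resourcesdir))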

…e warning visible on stdout, not just in the log - because there it is too hidden. Finally, make sure to _only_ print the warning if there is _at least_ one offline partition

casparvl commented:

After 1019f21 I now ran again and got:

WARNING: Will exclude EESSI_TensorFlow %scale=16_nodes %module_name=TensorFlow/2.13.0-foss-2023a %device_type=gpu tests on offline partitions.
Because reframe could not find /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
The download the data, please run:.
mkdir -p /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets && module load TensorFlow/2.13.0-foss-2023a && python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data('/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"
on a system with internet access.

Then, I copy-pasted the instructions:

(eessi_test_venv) [casparl@int6 test-suite]$ mkdir -p /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets && module load TensorFlow/2.13.0-foss-2023a && python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data('/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step

and ran again:

(eessi_test_venv) [casparl@int6 test-suite]$ reframe -c eessi/testsuite/tests/apps/tensorflow/ -t 1_node -t CI --run --system snellius:rome
...
[ RUN      ] EESSI_TensorFlow %scale=1_node %module_name=TensorFlow/2.13.0-foss-2023a %device_type=cpu /9864d0f5 @snellius:rome+default
[       OK ] (1/1) EESSI_TensorFlow %scale=1_node %module_name=TensorFlow/2.13.0-foss-2023a %device_type=cpu /9864d0f5 @snellius:rome+default
P: perf: 59760.135314296705 img/s (r:0, l:None, u:None)

I also checked that the warning does not get printed if none of the partitions in the current system have FEATURES.OFFLINE.

I do wonder two things:

  1. It would be nice if we could print the warning only once. Now it gets printed for every test instance, which is not really needed and very verbose, and it doesn't scale well if we add more tests like this. Maybe we can make a utils function that keeps some global state recording that 'this error' has already been printed; of course, that means 'this error' needs some kind of unique signature, and I'm not immediately sure how to do that. But if we have something like that, we could make (and call) a utils or hooks function to print the error, and implement the logic there so that it only happens once (see the sketch after this list).
  2. We now print the warning in the init stage and use a feature to make sure the test doesn't get generated. Should we push it to the setup stage instead, so that we can query whether current_partition has an offline feature and use a skip_if? The test instance then gets generated, but in the skip_if we can easily print the instruction, and it only gets printed when we actually try to schedule the test on an offline partition. The current logic will also print it if the system contains one offline partition while we are trying to schedule on another partition, which may be slightly confusing.
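
A sketch of both ideas (the helper, its module-level state and the hook name are illustrative, not existing code in the test suite):

import os

import reframe as rfm


# (1) a utils-style helper that prints a given warning only once per session
_emitted_warnings = set()


def warn_once(msg):
    """Print msg on stdout, but only the first time this exact message is seen."""
    if msg not in _emitted_warnings:
        _emitted_warnings.add(msg)
        print(f'WARNING: {msg}')


# (2) alternatively, defer the check to the setup stage and skip per test case,
# so the instruction only shows up when a test actually lands on an offline partition
class EESSI_TensorFlow(rfm.RunOnlyRegressionTest):  # existing test class, abbreviated
    @run_after('setup')
    def skip_offline_without_dataset(self):
        if 'offline' in self.current_partition.features:
            # self.mnist_path as computed in the init hook sketched earlier
            self.skip_if(
                not os.path.exists(self.mnist_path),
                f'{self.mnist_path} not found; download it first on a system with internet access',
            )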

casparvl commented:

I'm also not sure why I get this redefinition issue in flake8. The line number seems wrong, and I really don't see a redefinition?
