
Conversation

laraPPr commented Jul 10, 2025

laraPPr marked this pull request as draft July 10, 2025 14:04
laraPPr added 2 commits July 14, 2025 11:06
Signed-off-by: laraPPr <[email protected]>
laraPPr changed the title from "[WIP] Make sure TensorFlow also works on offline machines" to "Make sure TensorFlow also works on offline machines" Jul 14, 2025
laraPPr marked this pull request as ready for review July 14, 2025 09:51

smoors commented Jul 31, 2025

@laraPPr why not first download the dataset in a local test, i.e. not on the compute nodes but on the node where ReFrame is launched?
The only requirement is that this node has internet access, which is usually the case, as most people launch ReFrame on a login node.
An added benefit is that the dataset is downloaded only once for all the concrete test cases.

Running a test locally should be doable with the local parameter, see:
https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.local
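
A sketch of what that could look like, as a separate fetch test with local = True (the class name, module and sanity check below are illustrative, not code from this PR):

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class EESSI_TensorFlow_fetch_data(rfm.RunOnlyRegressionTest):
    """Download the MNIST dataset once, on the node where ReFrame itself runs."""
    valid_systems = ['*']
    valid_prog_environs = ['*']
    local = True  # bypass the scheduler and run on the ReFrame host (usually a login node)
    modules = ['TensorFlow/2.13.0-foss-2023a']  # illustrative; the real test parameterises the module
    executable = 'python'
    executable_opts = [
        '-c', '"import tensorflow as tf; tf.keras.datasets.mnist.load_data()"',
    ]

    @sanity_function
    def assert_no_error(self):
        return sn.assert_not_found(r'Traceback', self.stderr)

The compute-node test cases could then depend on this test (or use it as a fixture), so the dataset is already in place when they run.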

laraPPr commented Jul 31, 2025

Mare Nostrum does not have login nodes with internet access, which is a real pain.

laraPPr commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

smoors commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

Did you try adding the path parameter to tf.keras.datasets.mnist.load_data?
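
For reference, a minimal sketch of that call; a relative path is resolved under ~/.keras/datasets, while an absolute path is used as given (the path below is illustrative):

import os

import tensorflow as tf

# With an absolute path pointing at an existing mnist.npz, the cached file is
# reused instead of being downloaded again.
data_path = os.path.abspath('TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(path=data_path)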

laraPPr commented Jul 31, 2025

And TensorFlow re-downloads the dataset every time, even when it is cached.

Did you try adding the path parameter to tf.keras.datasets.mnist.load_data?

No, I did not try that. I will try it after the meeting.

laraPPr commented Jul 31, 2025

I think we need to update that CI, because it is not running the TensorFlow tests and I'm not sure why: "Tests for EESSI test suite, using EESSI production repo / test_with_eessi_pilot (2023.06)".

laraPPr marked this pull request as draft August 14, 2025 10:42

laraPPr commented Aug 14, 2025

Converted to draft because I've now switched the other PRs to use features instead of extras.

laraPPr marked this pull request as ready for review August 14, 2025 10:55
if 'offline' in self.current_partition.features:
    resourcesdir = self.current_system.resourcesdir
    data = os.path.join(resourcesdir, self.module_name, 'datasets/mnist.npz')
    if os.path.exists(data):

A collaborator commented on this diff:

See if we can move this check to an after-init hook. If so, and if this path does not exist, add -offline to the valid_systems.

laraPPr replied:

I've now split the function in two. The first runs after init to check whether the data is there, and sets the necessary environment variable if it is; if not, valid_systems gets edited with the new hook from #279. The other function still has to run after setup, because we need self.current_partition.features.
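
Roughly, a sketch of that split (hook names, the attribute holding the path and the environment variable are illustrative; filter_valid_systems_for_offline_partitions is the hook discussed below):

import os

import reframe as rfm

from eessi.testsuite import hooks


class EESSI_TensorFlow(rfm.RunOnlyRegressionTest):  # existing test class, abbreviated
    @run_after('init')
    def check_files_for_offline_run(self):
        """Look for a pre-downloaded mnist.npz under the system's resourcesdir."""
        resourcesdir = self.current_system.resourcesdir
        self.mnist_path = os.path.join(resourcesdir, self.module_name, 'datasets', 'mnist.npz')
        if os.path.exists(self.mnist_path):
            self.env_vars['MNIST_DATA_PATH'] = self.mnist_path  # illustrative variable name
        else:
            # data is missing: declare the test incompatible with offline partitions
            hooks.filter_valid_systems_for_offline_partitions(self)

    @run_after('setup')
    def check_offline_partition(self):
        """Needs self.current_partition.features, which is only known after setup."""
        self.is_offline_partition = 'offline' in self.current_partition.features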

txt += 'You can download the file running tf.keras.datasets.mnist.load_data() '
txt += f'with {self.module_name} on a system with internet access.'
utils.log(txt)
hooks.filter_valid_systems_for_offline_partitions(self)

casparvl commented Oct 23, 2025:

This function doesn't exist in this PR (maybe you forgot to push it?), but I assume it should do something like:

valid_systems = valid_systems + ['-offline']

So that if the data path does not exist, the test essentially declares 'this cannot run on a system with the offline feature'.
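
A minimal sketch of such a hook, under that assumption (FEATURES.OFFLINE is the constant added in this PR; the exact valid_systems handling below is a guess, not existing code):

from eessi.testsuite.constants import FEATURES  # assumed to expose FEATURES.OFFLINE == 'offline'


def filter_valid_systems_for_offline_partitions(test):
    """Exclude partitions that carry the 'offline' feature.

    ReFrame ORs the entries of valid_systems and ANDs the space-separated
    feature constraints within a single entry, so the exclusion is folded
    into each existing entry ('*' is simply replaced, since it matches
    everything anyway).
    """
    test.valid_systems = [
        f'-{FEATURES.OFFLINE}' if vs == '*' else f'{vs} -{FEATURES.OFFLINE}'
        for vs in test.valid_systems
    ]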

casparvl left a comment:

I added FEATURES.OFFLINE to our local config. Then, the test got skipped. From the logs:

[2025-10-23T21:45:07] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.
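
For context, a sketch of the relevant part of such a site config: the feature goes into a partition's features list (system and partition names below are made up; FEATURES.OFFLINE is assumed to resolve to the string 'offline'):

from eessi.testsuite.constants import FEATURES

site_configuration = {
    'systems': [
        {
            'name': 'example_system',
            'descr': 'System whose compute nodes have no internet access',
            'hostnames': ['login.*', 'node.*'],
            'partitions': [
                {
                    'name': 'compute',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'environs': ['default'],
                    'features': [
                        FEATURES.OFFLINE,  # compute nodes cannot reach the internet
                        # ... other features ...
                    ],
                },
            ],
        },
    ],
}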

Then, I downloaded the dataset. Since I don't set a resourcesdir explicitly in my ReFrame config, it uses the current directory as the default. Unfortunately, in load_data(path=...) the path is relative to the ~/.keras/datasets folder according to the documentation (https://www.tensorflow.org/api_docs/python/tf/keras/datasets/mnist/load_data#args). Fortunately, a full path is respected, so I could download it using:

mkdir -p TensorFlow/2.13.0-foss-2023a/datasets
python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data(path='/home/casparl/EESSI/test-suite/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"

Note the mkdir -p: the directory needs to exist, otherwise you get a FileNotFoundError.

I think we should improve the instruction so that it prints these exact commands. ReFrame knows what its own resourcesdir is; we can print that as part of the path.
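
A sketch of how the test could assemble that message, reusing the path it already computed (variable names are the illustrative ones from the sketches above):

download_cmd = (
    f'mkdir -p {os.path.dirname(self.mnist_path)} && '
    f'module load {self.module_name} && '
    f'python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data(\'{self.mnist_path}\')"'
)
utils.log(
    f'Could not find {self.mnist_path}. '
    f'To download the dataset, run the following on a system with internet access:\n{download_cmd}'
)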

Now the bad news: rerunning still gives me:

debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.
Because reframe could not find ./TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
You can download the file running tf.keras.datasets.mnist.load_data() with TensorFlow/2.13.0-foss-2023a on a system with internet access.

The reason is that the current working dir for a test is the location of the test file:

[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Current working dir: /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow
[2025-10-23T22:13:16] debug: reframe: [check_files_for_offline_run]: Warning: will exclude TensorFlow/2.13.0-foss-2023a tests on offline partitions.

That's another reason why it's useful if it gets printed from the test itself: if the resourcesdir is ., it can use os.getcwd() instead.
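
A sketch of that fallback:

resourcesdir = self.current_system.resourcesdir
# The default resourcesdir is '.', which inside a test resolves relative to the
# directory of the test file; make it absolute so the printed instructions can
# be copy-pasted from anywhere.
if not os.path.isabs(resourcesdir):
    resourcesdir = os.path.abspath(os.path.join(os.getcwd(), resourcesdir))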

…e warning visible on stdout, not just in the log - because there it is too hidden. Finally, make sure to _only_ print the warning if there is _at least_ one offline partition

casparvl commented:

After 1019f21 I now ran again and got:

WARNING: Will exclude EESSI_TensorFlow %scale=16_nodes %module_name=TensorFlow/2.13.0-foss-2023a %device_type=gpu tests on offline partitions.
Because reframe could not find /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz.
The download the data, please run:.
mkdir -p /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets && module load TensorFlow/2.13.0-foss-2023a && python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data('/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"
on a system with internet access.

Then, I copy-pasted the instructions:

(eessi_test_venv) [casparl@int6 test-suite]$ mkdir -p /gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets && module load TensorFlow/2.13.0-foss-2023a && python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data('/gpfs/home4/casparl/EESSI/test-suite/eessi/testsuite/tests/apps/tensorflow/TensorFlow/2.13.0-foss-2023a/datasets/mnist.npz')"
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11490434/11490434 [==============================] - 1s 0us/step

and ran again:

(eessi_test_venv) [casparl@int6 test-suite]$ reframe -c eessi/testsuite/tests/apps/tensorflow/ -t 1_node -t CI --run --system snellius:rome
...
[ RUN      ] EESSI_TensorFlow %scale=1_node %module_name=TensorFlow/2.13.0-foss-2023a %device_type=cpu /9864d0f5 @snellius:rome+default
[       OK ] (1/1) EESSI_TensorFlow %scale=1_node %module_name=TensorFlow/2.13.0-foss-2023a %device_type=cpu /9864d0f5 @snellius:rome+default
P: perf: 59760.135314296705 img/s (r:0, l:None, u:None)

I also checked that the warning does not get printed if none of the partitions in the current system have FEATURES.OFFLINE.

I do wonder two things:

  1. It would be nice if we could print the warning only once. Now it gets printed for every test instance, which is not really needed and very verbose, and it doesn't scale well if we add more tests like this. Maybe we can make a utils function that keeps some global state recording that 'this error' has already been printed; of course, that means 'this error' needs some kind of unique signature, and I'm not immediately sure how to do that. But if we have something like that, we could make (and call) a utils or hooks function to print the error, and implement the logic there so that it only happens once (see the sketch after this list).
  2. We now print the warning in the init stage and use a feature to make sure the test doesn't get generated. Should we push it to the setup stage instead, so that we can query whether current_partition has an offline feature and use a skip_if? The test instance then gets generated, but in the skip_if we can easily print the instruction, and it only gets printed when we actually try to schedule the test on an offline partition. The current logic will also print it if the system contains one offline partition while we are trying to schedule on another partition, which may be slightly confusing.
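
A sketch of both ideas (the helper, its module-level state and the hook name are illustrative, not existing code in the test suite):

import os

import reframe as rfm


# (1) a utils-style helper that prints a given warning only once per session
_emitted_warnings = set()


def warn_once(msg):
    """Print msg on stdout, but only the first time this exact message is seen."""
    if msg not in _emitted_warnings:
        _emitted_warnings.add(msg)
        print(f'WARNING: {msg}')


# (2) alternatively, defer the check to the setup stage and skip per test case,
# so the instruction only shows up when a test actually lands on an offline partition
class EESSI_TensorFlow(rfm.RunOnlyRegressionTest):  # existing test class, abbreviated
    @run_after('setup')
    def skip_offline_without_dataset(self):
        if 'offline' in self.current_partition.features:
            # self.mnist_path as computed in the init hook sketched earlier
            self.skip_if(
                not os.path.exists(self.mnist_path),
                f'{self.mnist_path} not found; download it first on a system with internet access',
            )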

casparvl commented:

I'm also not sure why I get this redefinition issue in flake8. The line number seems wrong, and I really don't see a redefinition?
