Conversation

@myandpr commented Oct 11, 2025

Problem

We encountered an issue when submitting a job using the following command:
raydp-submit --ray-conf /root/ray.conf --py-files file.zip main.py

The file-distribution options (such as --py-files) are not propagated to the executors. As a result, the executors cannot access or import the code from the files specified in --py-files.

Below is the error stack trace:

File "/usr/local/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/spark-current/python/lib/pyspark.zip/pyspark/worker.py", line 71, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-current/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'XXX'
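
A quick way to confirm the symptom from the driver (a minimal diagnostic sketch; print_distribution_confs is a hypothetical helper, and it reads the standard Spark conf keys that back --files and --py-files):

def print_distribution_confs(spark):
    # With the bug present, the artifacts passed via --files / --py-files
    # never land in these conf keys, so executors cannot import them.
    conf = spark.sparkContext.getConf()
    for key in ("spark.files", "spark.submit.pyFiles"):
        print(key, "=", conf.get(key, "<unset>"))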

How to solve it

Previously, the Ray cluster master was grouped under OTHERS, so the option assigner skipped writing --files/--archives into spark.files and spark.archives, which left executors without the distributed files.

This PR explicitly recognizes ray:// masters as RAY and includes RAY in the relevant distribution logic, so those options now take effect.
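
To make the mechanism concrete, here is a minimal Python sketch of the idea (the real logic lives in the customized SparkSubmit.scala and uses Scala's OptionAssigner; the constants, masks, and function below are illustrative assumptions, not the actual code):

# Toy model of option assignment: each CLI option carries a bitmask of
# cluster managers and is written into the Spark conf only when the
# detected manager is in that mask. All names and values are illustrative.
YARN, STANDALONE, OTHERS, RAY = 0x1, 0x2, 0x4, 0x8

def assign_option(value, manager_mask, detected_manager, conf, conf_key):
    if value is not None and (detected_manager & manager_mask) != 0:
        conf[conf_key] = value

conf = {}
# Before the fix: a ray:// master was classified as OTHERS, and OTHERS was
# missing from the mask for spark.files, so the option was silently dropped.
assign_option("file.zip", YARN | STANDALONE, OTHERS, conf, "spark.files")
print(conf)  # {} -- executors never receive file.zip

# After the fix: ray:// is recognized as RAY, and RAY is added to the mask.
assign_option("file.zip", YARN | STANDALONE | RAY, RAY, conf, "spark.files")
print(conf)  # {'spark.files': 'file.zip'}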

myandpr force-pushed the support-distribute-files branch from 41b08b0 to b6ac995 on October 11, 2025 10:04
@myandpr (Author) commented Oct 11, 2025

@carsonwang @pang-wu Could you please help review this PR? Thank you very much!

@pang-wu (Collaborator) left a comment

@myandpr Thanks for the contribution; can you add a unit test?

myandpr force-pushed the support-distribute-files branch from acd2d01 to 72cd4e6 on October 12, 2025 16:14
module_path.write_text("VALUE = 'pyfiles works'\n")

py_files_path = tmp_path / "extra_module.zip"
with zipfile.ZipFile(py_files_path, "w") as zip_file:
    # assumed continuation of the truncated excerpt: add the module to the zip
    zip_file.write(module_path, arcname=module_path.name)
A collaborator commented on this diff:

You don't have to zip it; Spark should support submitting .py files directly.
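
For illustration, the test setup could then drop the zip step entirely (a sketch under the reviewer's suggestion; make_py_files_arg is a hypothetical helper name):

from pathlib import Path

def make_py_files_arg(tmp_path: Path) -> str:
    # Per the review suggestion: no need to zip a single module, since
    # --py-files accepts plain .py files (as well as .zip and .egg).
    module_path = tmp_path / "extra_module.py"
    module_path.write_text("VALUE = 'pyfiles works'\n")
    return str(module_path)  # pass this value straight to --py-files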

@carsonwang (Collaborator) commented

@myandpr Thank you for the PR. The cluster manager name "OTHERS" is something we added to this customized SparkSubmit.scala for Ray. We chose the general name "OTHERS" instead of "RAY" because in the early days we tried to upstream the changes to Spark. I feel there is no need to have both "OTHERS" and "RAY" in the file. A few options such as args.jars were already added for "OTHERS", but a few others are missing, as you have seen. I think you can just continue to use "OTHERS" and add the missing options.
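
In the same toy model as the sketch above, that alternative would look like this (again an illustrative assumption, not the actual SparkSubmit.scala code):

# carsonwang's alternative: keep classifying ray:// masters as OTHERS, and
# add OTHERS to the masks of the options that were missing for it.
YARN, STANDALONE, OTHERS = 0x1, 0x2, 0x4

def assign_option(value, manager_mask, detected_manager, conf, conf_key):
    if value is not None and (detected_manager & manager_mask) != 0:
        conf[conf_key] = value

conf = {}
assign_option("file.zip", YARN | STANDALONE | OTHERS, OTHERS, conf, "spark.files")
print(conf)  # {'spark.files': 'file.zip'} -- no separate RAY constant needed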

@pang-wu (Collaborator) commented Nov 4, 2025

I created another PR: #441
