[SPARK-44155] Adding a dev utility to improve error messages based on LLM #41711
Conversation
cc @gengliangwang @allisonwang-db @MaxGekk @cloud-fan could you take a look at this PR when you find some time?
BTW, I set the category as
    If not provided, the default version("gpt-3.5-turbo") will be used.

    Note:
    - Ensure that the necessary dependencies are installed before running the script.
What are the dependencies?
We need openai (https://github.com/openai/openai-python) for running this script. Maybe we should add this to the dev-requirements.txt?
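For context, a minimal sketch of the kind of call the script needs the openai package for, assuming the pre-1.0 openai-python client (openai.ChatCompletion) and the OPENAI_API_KEY environment variable; the helper name and the prompt text below are illustrative, not the actual script's:

```python
# Illustrative sketch only, not the PR's actual code: assumes the pre-1.0
# openai-python client and that OPENAI_API_KEY is set in the environment.
import openai


def refine_error_message(error_class: str, message: str, source: str) -> str:
    """Hypothetical helper: ask ChatGPT for a guideline-compliant rewrite."""
    prompt = (
        "Improve this Apache Spark error message so it explains What, Why, and How, "
        "following https://spark.apache.org/error-message-guidelines.html\n\n"
        f"Error class: {error_class}\n"
        f"Current message: {message}\n"
        f"Source of the function that raises it:\n{source}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # default model mentioned in the diff above
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```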
    #

    """
    Utility for refining error messages based on LLM.
We can also use the same approach to convert a temp error class into a meaningful class name.
Yeah, I have a separate script for converting temp error classes. I will post a follow-up PR, with the same comments reflected, once the review of the current PR is completed.
This doesn't belong in Spark IMHO
We published the error message guidelines a few years ago, but not all contributors adhered to them, resulting in variable quality of error messages. Since ChatGPT-4 has demonstrated a solid understanding of Spark in just a few attempts, I believe we should advocate for its use within the community to enhance Spark. This script is designed to simplify the process and provide an effective prompt, which is crucial for ChatGPT to generate high-quality error messages. Rather than depending on the community to learn how to write the prompt, we should take the initiative and do it for everyone.
Do we need this tool, or do we just need to run it?
It is unclear to me what the purpose of this PR is ...
Sorry for the confusion caused by the insufficient background explanation.
I believe we need this tool to provide a streamlined and consistent approach to refining error messages in Apache Spark (also considering the comment from @gatorsmile). The tool itself is a utility script that interacts with ChatGPT to improve error messages based on predefined guidelines. By using it, developers can easily generate refined error messages, and reviewers can evaluate and provide feedback on those messages.
As I just mentioned above, the primary purpose of this script is to improve error messages in Apache Spark. The motivation behind it is to address the inconsistent quality of error messages in Spark. Considering the impressive understanding of Spark demonstrated by ChatGPT-4 in just a few attempts, we believe it is valuable to leverage it to enhance Spark within the community.
This script is primarily intended for Spark developers. Developers can utilize it to receive consistent guidelines and assistance in generating improved error messages. Additionally, reviewers can use the script to review and validate the refined error messages.
In my opinion, both are necessary. By including the script in the Spark developer tools, it can provide assistance to developers in improving error messages. Additionally, documenting the tool is also important to facilitate understanding and usage.
I have supplemented the description in the "Why are the changes needed?" section to provide a clearer explanation of the purpose of the PR. The main goal of this PR is to address the lack of adherence to the error message guidelines and the inconsistency of the current Apache Spark error messages. By introducing a script, we aim to enforce compliance with the guidelines and facilitate the improvement of error messages. The script will enable developers to enhance error messages, ultimately leading to more consistent and higher-quality error messages.
Any additional comments on this?
dev/error_message_refiner.py (outdated)
    from sparktestsupport import SPARK_HOME

    PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"
Can we also support PySpark error classes? pyspark/errors/error_classes.py
Also, can we make SPARK_HOME configurable, for example via --spark_home?
Thanks for the comment!
Yes, we can support PySpark error classes as well, but I believe we need a separate script since the internal structure and usage of error classes is a bit different from the JVM side. Let me make a follow-up for a Python script if this PR gets approved.
For SPARK_HOME, since the script already uses the SPARK_HOME environment variable internally, I believe it is more appropriate to set the Spark home through SPARK_HOME rather than configuring it when executing the script.
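To make that concrete, here is a minimal sketch of the approach, assuming the JVM-side error classes live in the error-classes.json path shown in the diff above; the environment-variable fallback and the load_error_classes helper are illustrative only, and the actual script imports SPARK_HOME from sparktestsupport:

```python
# Illustrative sketch; the actual script imports SPARK_HOME from sparktestsupport.
import json
import os

# Resolve Spark home from the SPARK_HOME environment variable rather than a
# dedicated --spark_home flag; the fallback to the current directory is hypothetical.
SPARK_HOME = os.environ.get("SPARK_HOME", os.getcwd())
PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"


def load_error_classes() -> dict:
    """Hypothetical helper: load the JVM-side error classes from the JSON file."""
    with open(PATH_TO_ERROR_CLASS) as f:
        return json.load(f)
```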
Resolved the comments and CI passed. Any additional comments?
    def _git_grep_files(search_string: str, exclude: str = None) -> str:
        """
        Executes 'git grep' command to search for files containing the given search string.
        Returns the file path where the search string is found.
    Suggested change:
    -Returns the file path where the search string is found.
    +Returns a file path where the search string is found.
I would suggest this wording since it returns one random file from a sample of files. =)
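Since only the signature and docstring of this helper appear in the excerpt above, here is a hypothetical completion, assuming it simply shells out to git grep from the repository root; the PR's actual implementation may differ:

```python
# Hypothetical completion of the helper shown in the excerpt above; assumed to be
# run from the repository root. The PR's actual implementation may differ.
import subprocess
from typing import Optional


def _git_grep_files(search_string: str, exclude: Optional[str] = None) -> str:
    """
    Executes 'git grep' to search for files containing the given search string.
    Returns a file path where the search string is found.
    """
    result = subprocess.run(
        ["git", "grep", "-l", search_string],
        capture_output=True,
        text=True,
        check=True,
    )
    paths = [p for p in result.stdout.splitlines() if not exclude or exclude not in p]
    # As the reviewer notes, this is effectively one file picked from the matches.
    return paths[0]
```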
@itholic I've already raised this issue on the dev list, so just for the record, the ASF Generative Tooling Guidance states the following
"This is because the error messages are not consistent enough and do not follow the Error Message Guidelines." The link is wrong for me; it goes to #37113. This is the right link: https://spark.apache.org/error-message-guidelines.html
Thanks all for the review. According to the ASF Generative Tooling Guidance, as mentioned by @zero323, I think we shouldn't include ChatGPT-related code directly in our code base. Let me close this PR for now.
May I ask whether there is currently any LLM (large language model) support for Spark?
What changes were proposed in this pull request?
This PR proposes adding a utility script that can help increase productivity in error message improvement tasks.
When the error class name is passed to the script, detailed information corresponding to the error class (e.g. the error message and the source code of the function where the error actually occurs) is sent to ChatGPT to request an improvement of the error message.
Usage example:
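A hypothetical sketch of the invocation and the kind of Before/After output the script prints; the real command-line arguments and the exact suggested wording may differ:

```python
# Hypothetical usage (the real arguments and output of the script may differ):
#
#   $ python dev/error_message_refiner.py CANNOT_PARSE_DECIMAL
#
# The script would then print something along these lines for review:
BEFORE = "[CANNOT_PARSE_DECIMAL] Cannot parse decimal."
AFTER = (
    "[CANNOT_PARSE_DECIMAL] Cannot parse the value <value> as a decimal. "
    "Ensure the input is a valid decimal number and retry the query."
)
print("Before:", BEFORE)
print("After: ", AFTER)
```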
Developers only need to review the printed Before/After and apply it to the actual code if the content is appropriate. (Of course, additional refinement should be done if necessary.)
Why are the changes needed?
This is because the error messages are not consistent enough and do not follow the Error Message Guidelines.
Although we provide guidelines for consistent and high-quality error messages, they are not consistently followed in practice. For example, the guidelines state that error messages should explicitly include "What, Why, and How". However, there are cases where only the "What" aspect is described, such as "Cannot parse decimal," which may not provide sufficient help to users. Therefore, providing a script that improves error messages in compliance with the Error Message Guidelines of Apache Spark would greatly facilitate developers in enhancing error messages and encourage adherence to the guidelines.
For example, the error message below only answers the "What", but not the "Why" and "How".
By using the script, developers can improve error messages based on the guidelines, enabling us to enforce adherence to the guidelines and ultimately enhance error messages consistently and with high quality as below:
This error message follows the guidelines by including the What (column aliases are not allowed), the Why (the error is caused by the use of these aliases), and the How (the user can solve the error by removing the column aliases). It also follows the best practice of using the present tense to describe the error and providing suggestions.
Here is the full script output suggested by ChatGPT below:
Does this PR introduce any user-facing change?
No, this is an effort to improve productivity for developers.
How was this patch tested?
Manually tested.
I actually used this script for #41455, so it can serve as a practical example that demonstrates its use.