Skip to content

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Jun 23, 2023

What changes were proposed in this pull request?

This PR proposes adding a utility script that can help increase productivity in error message improvement tasks.

When passing the error class name to the script, detailed information(e.g. error message, source code of the function where the error actually occurs) corresponding to the error class is sent to ChatGPT to request improvement of the error message.

Usage example:

haejoon.lee@spark dev % python error_message_refiner.py AMBIGUOUS_LATERAL_COLUMN_ALIAS
[AMBIGUOUS_LATERAL_COLUMN_ALIAS]
Before: Lateral column alias <name> is ambiguous and has <n> matches.
After: Lateral column alias <name> is ambiguous and has <n> matches. Please ensure that the referenced lateral column alias is uniquely identified, or provide a more specific alias in your query.

Developers only need to review the printed Before/After and apply it to the actual code if the content is appropriate. (Of course, extra refining should be done if necessary)

Why are the changes needed?

This is because the error messages are not consistent enough and do not follow the Error Message Guidelines.

Although we provide guidelines for consistent and high-quality error messages, they are not consistently followed in practice. For example, the guidelines state that error messages should explicitly include "What, Why, and How". However, there are cases where only the "What" aspect is described, such as "Cannot parse decimal," which may not provide sufficient help to users. Therefore, providing a script that improves error messages in compliance with the Error Message Guidelines of Apache Spark would greatly facilitate developers in enhancing error messages and encourage adherence to the guidelines.

For example, the error message below only answers for "What", but not for "Why" and "How".

Lateral column alias <name> is ambiguous and has <n> matches.

By using the script, developers can improve error messages based on the guidelines, enabling us to enforce adherence to the guidelines and ultimately enhance error messages consistently and with high quality as below:

Cannot resolve lateral column alias <name> in qualifying expression <qualifying expression> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity.

This error message follows the guidelines by including What (column aliases are not allowed), Why (the error is caused by the use of these aliases), and How (the user can solve the error by removing the column aliases). And also this follows the best practice of using the present tense to describe the error and providing suggestions.

Here is the full script output suggested from ChatGPT below:

[AMBIGUOUS_LATERAL_COLUMN_ALIAS]
Before: Lateral column alias <name> is ambiguous and has <n> matches.
After: To improve the error message "Lateral column alias <name> is ambiguous and has <n> matches", I suggest the following:

"Cannot resolve lateral column alias <name> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity."

This error message follows the guidelines as it includes What, Why, and How, addresses the issue of multiple matches, provides clear language, avoids time-based statements, uses present tense, and avoids sounding accusatory or judgmental.

Furthermore, to help users understand the context of the error message, we can provide additional detail in the error message by modifying the source code to include the qualifying expression in the error message. For example:

"Cannot resolve lateral column alias <name> in qualifying expression <qualifying expression> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity."

This error message not only helps users to understand the issue but also guides them towards resolving it. It is crucial to provide clear and helpful error messages to make it easy for users to diagnose and fix issues.

Does this PR introduce any user-facing change?

No, this is an effort to improve productivity for developers.

How was this patch tested?

Manually tested.

I actually used this script for #41455 , so this can be an in-practice example that demonstrates the utilization.

@github-actions github-actions bot added the BUILD label Jun 23, 2023
@itholic
Copy link
Contributor Author

itholic commented Jun 23, 2023

cc @gengliangwang @allisonwang-db @MaxGekk @cloud-fan could you take a look at this PR when you find some time?

@itholic
Copy link
Contributor Author

itholic commented Jun 23, 2023

BTW, I set the category as SQL because it works based on the SQL error class, but I'm not sure if this is correct.

If not provided, the default version("gpt-3.5-turbo") will be used.

Note:
- Ensure that the necessary dependencies are installed before running the script.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the dependencies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need openai (https://github.com/openai/openai-python) for running this script. Maybe should we add this to the dev-requirements.txt ??

@HyukjinKwon HyukjinKwon changed the title [SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM [SPARK-44155] Adding a dev utility to improve error messages based on LLM Jun 24, 2023
#

"""
Utility for refining error messages based on LLM.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also can use the same way to convert a temp error class to a meaningful class name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have a separate script to convert temp error class. I will post a PR right away with the same comments reflected when the review of current PR is completed.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't belong in Spark IMHO

@gatorsmile
Copy link
Member

We published the error guideline a few years ago, but not all contributors adhered to it, resulting in variable quality in error messages. Since ChatGPT-4 has demonstrated a solid understanding of Spark from just a few attempts, I believe we should advocate for its use within the community to enhance Spark.

This script is designed to simplify the process and provide an effective prompt, which is crucial for ChatGPT's generation of high-quality errors. Rather than depending on the community to learn how to write the prompt, we should take the initiative and do it for everyone

@srowen
Copy link
Member

srowen commented Jun 24, 2023

Do we need this tool? or just need to run it?

@mridulm
Copy link
Contributor

mridulm commented Jun 24, 2023

It is unclear to me what the purpose of this PR is ...

  • Why do we need this ? What problem is it solving ? Is it common enough to require this ?
  • Who is going to use this ? Is it developers ? Reviewers ?
  • Does it need to be in spark ? Or can it be a documented in developer tools instead ?

@itholic
Copy link
Contributor Author

itholic commented Jun 26, 2023

Sorry for confusion with insufficient background explanation.

@srowen

Do we need this tool? or just need to run it?

I believe we need this tool to provide a streamlined and consistent approach to refining error messages in Apache Spark (Also considering the comment from @gatorsmile). The tool itself serves as a utility script that interacts with Chat GPT to improve error messages based on predefined guidelines. By using this tool, developers can easily generate refined error messages and reviewers can evaluate and provide feedback on those messages.

@mridulm

Why do we need this? What problem is it solving? Is it common enough to require this?

As I just mentioned above, the primarily purpose of this script is to improve error messages in Apache Spark. The motivation behind it is to address the inconsistent quality of error messages in Spark. Considering the impressive understanding of Spark demonstrated by ChatGPT-4 in just a few attempts, we believe it is valuable to leverage it to enhance Spark within the community.

Who is going to use this? Is it developers? Reviewers?

This script is primarily intended for Spark developers. Developers can utilize it to receive consistent guidelines and assistance in generating improved error messages. Additionally, reviewers can use the script to review and validate the refined error messages.

Does it need to be in Spark? Or can it be documented in developer tools instead?

In my opinion, both are necessary. By including the script in the Spark developer tools, it can provide assistance to developers in improving error messages. Additionally, documenting the tool is also important to facilitate understanding and usage.

@itholic
Copy link
Contributor Author

itholic commented Jun 27, 2023

I have supplemented the description in the "Why are the changes needed?" section to provide a clearer explanation of the purpose of the PR.

So, the main goal of this PR is to address the lack of adherence to the error message guidelines and inconsistency in the current Apache Spark error messages. By introducing a script, we aim to enforce compliance with the guidelines and facilitate the improvement of error messages.

The script will enable developers to enhance the error messages, ultimately leading to more consistent and higher-quality error messages.

@itholic
Copy link
Contributor Author

itholic commented Jul 5, 2023

Any additional comments for this??


from sparktestsupport import SPARK_HOME

PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also support PySpark error classes? pyspark/errors/error_classes.py
Also, can we make SPARK_HOME configurable? for example --spark_home

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment!

Yes, we can support PySpark error classes as well, but I believe we need separate script since the internal structure & usage of error classes is a bit different from JVM side. Let me made a follow-up for Python script if this PR get approved.

For SPARK_HOME, since the script is already using the SPARK_HOME environment variable internally, I believe it would be more appropriate to set the value of Spark home through the SPARK_HOME rather than configuring it here when executing the script.

@itholic
Copy link
Contributor Author

itholic commented Jul 18, 2023

Resolved comments and CI passed. Any additional comments?

def _git_grep_files(search_string: str, exclude: str = None) -> str:
"""
Executes 'git grep' command to search for files containing the given search string.
Returns the file path where the search string is found.
Copy link

@anjakefala anjakefala Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Returns the file path where the search string is found.
Returns a file path where the search string is found.

I would suggest this wording since it returns one random file from a sample of files. =)

@zero323
Copy link
Member

zero323 commented Aug 4, 2023

Resolved comments and CI passed. Any additional comments?

@itholic I've already raised this issue on the dev list, so just for the record, the ASF Generative Tooling Guidance states the following

(...) code generated in whole or in part using AI can be contributed if the contributor ensures that:

  1. The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).

...

The above text should apply for documentation as well. However, the most popular tooling for documentation, ChatGPT, has restrictive licensing, so caution should be applied.

@bjornjorgensen
Copy link
Contributor

"This is because the error messages are not consistent enough and do not follow the Error Message Guidelines. "

The link is wrong for me.. it goes to #37113

This is the right link https://spark.apache.org/error-message-guidelines.html

@itholic
Copy link
Contributor Author

itholic commented Aug 25, 2023

Thanks all for the review. According to the ASF Generative Tooling Guidance as mentioned from @zero323 , I think we shouldn't include the ChatGPT related stuffs directly into our code base. Let me close this PR for now.

@itholic itholic closed this Aug 25, 2023
@itholic itholic deleted the llm_error branch November 20, 2023 01:35
@smileyboy2019
Copy link

May I ask if there is currently support for LLM large models to support Spark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.