[SPARK-44155] Adding a dev utility to improve error messages based on LLM #41711

itholic · 2023-06-23T09:14:58Z

What changes were proposed in this pull request?

This PR proposes adding a utility script that can help increase productivity in error message improvement tasks.

When passing the error class name to the script, detailed information(e.g. error message, source code of the function where the error actually occurs) corresponding to the error class is sent to ChatGPT to request improvement of the error message.

Usage example:

haejoon.lee@spark dev % python error_message_refiner.py AMBIGUOUS_LATERAL_COLUMN_ALIAS
[AMBIGUOUS_LATERAL_COLUMN_ALIAS]
Before: Lateral column alias <name> is ambiguous and has <n> matches.
After: Lateral column alias <name> is ambiguous and has <n> matches. Please ensure that the referenced lateral column alias is uniquely identified, or provide a more specific alias in your query.

Developers only need to review the printed Before/After and apply it to the actual code if the content is appropriate. (Of course, extra refining should be done if necessary)

Why are the changes needed?

This is because the error messages are not consistent enough and do not follow the Error Message Guidelines.

Although we provide guidelines for consistent and high-quality error messages, they are not consistently followed in practice. For example, the guidelines state that error messages should explicitly include "What, Why, and How". However, there are cases where only the "What" aspect is described, such as "Cannot parse decimal," which may not provide sufficient help to users. Therefore, providing a script that improves error messages in compliance with the Error Message Guidelines of Apache Spark would greatly facilitate developers in enhancing error messages and encourage adherence to the guidelines.

For example, the error message below only answers for "What", but not for "Why" and "How".

Lateral column alias <name> is ambiguous and has <n> matches.

By using the script, developers can improve error messages based on the guidelines, enabling us to enforce adherence to the guidelines and ultimately enhance error messages consistently and with high quality as below:

Cannot resolve lateral column alias <name> in qualifying expression <qualifying expression> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity.

This error message follows the guidelines by including What (column aliases are not allowed), Why (the error is caused by the use of these aliases), and How (the user can solve the error by removing the column aliases). And also this follows the best practice of using the present tense to describe the error and providing suggestions.

Here is the full script output suggested from ChatGPT below:

[AMBIGUOUS_LATERAL_COLUMN_ALIAS]
Before: Lateral column alias <name> is ambiguous and has <n> matches.
After: To improve the error message "Lateral column alias <name> is ambiguous and has <n> matches", I suggest the following:

"Cannot resolve lateral column alias <name> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity."

This error message follows the guidelines as it includes What, Why, and How, addresses the issue of multiple matches, provides clear language, avoids time-based statements, uses present tense, and avoids sounding accusatory or judgmental.

Furthermore, to help users understand the context of the error message, we can provide additional detail in the error message by modifying the source code to include the qualifying expression in the error message. For example:

"Cannot resolve lateral column alias <name> in qualifying expression <qualifying expression> as it has multiple matches. Check the name and qualifying expression to ensure uniqueness and specificity."

This error message not only helps users to understand the issue but also guides them towards resolving it. It is crucial to provide clear and helpful error messages to make it easy for users to diagnose and fix issues.

Does this PR introduce any user-facing change?

No, this is an effort to improve productivity for developers.

How was this patch tested?

Manually tested.

I actually used this script for #41455 , so this can be an in-practice example that demonstrates the utilization.

itholic · 2023-06-23T09:16:12Z

cc @gengliangwang @allisonwang-db @MaxGekk @cloud-fan could you take a look at this PR when you find some time?

itholic · 2023-06-23T09:29:08Z

BTW, I set the category as SQL because it works based on the SQL error class, but I'm not sure if this is correct.

dev/error_message_refiner.py

HyukjinKwon · 2023-06-24T05:24:21Z

dev/error_message_refiner.py

+                       If not provided, the default version("gpt-3.5-turbo") will be used.
+
+    Note:
+    - Ensure that the necessary dependencies are installed before running the script.


What are the dependencies?

We need openai (https://github.com/openai/openai-python) for running this script. Maybe should we add this to the dev-requirements.txt ??

dev/api_key.txt

dev/error_message_refiner.py

gatorsmile · 2023-06-24T06:16:10Z

dev/error_message_refiner.py

+#
+
+"""
+Utility for refining error messages based on LLM.


We also can use the same way to convert a temp error class to a meaningful class name.

Yeah, I have a separate script to convert temp error class. I will post a PR right away with the same comments reflected when the review of current PR is completed.

srowen

This doesn't belong in Spark IMHO

dev/error_message_refiner.py

gatorsmile · 2023-06-24T15:58:06Z

We published the error guideline a few years ago, but not all contributors adhered to it, resulting in variable quality in error messages. Since ChatGPT-4 has demonstrated a solid understanding of Spark from just a few attempts, I believe we should advocate for its use within the community to enhance Spark.

This script is designed to simplify the process and provide an effective prompt, which is crucial for ChatGPT's generation of high-quality errors. Rather than depending on the community to learn how to write the prompt, we should take the initiative and do it for everyone

srowen · 2023-06-24T16:17:10Z

Do we need this tool? or just need to run it?

mridulm · 2023-06-24T16:19:09Z

It is unclear to me what the purpose of this PR is ...

Why do we need this ? What problem is it solving ? Is it common enough to require this ?
Who is going to use this ? Is it developers ? Reviewers ?
Does it need to be in spark ? Or can it be a documented in developer tools instead ?

itholic · 2023-06-26T17:22:52Z

Sorry for confusion with insufficient background explanation.

@srowen

Do we need this tool? or just need to run it?

I believe we need this tool to provide a streamlined and consistent approach to refining error messages in Apache Spark (Also considering the comment from @gatorsmile). The tool itself serves as a utility script that interacts with Chat GPT to improve error messages based on predefined guidelines. By using this tool, developers can easily generate refined error messages and reviewers can evaluate and provide feedback on those messages.

@mridulm

Why do we need this? What problem is it solving? Is it common enough to require this?

As I just mentioned above, the primarily purpose of this script is to improve error messages in Apache Spark. The motivation behind it is to address the inconsistent quality of error messages in Spark. Considering the impressive understanding of Spark demonstrated by ChatGPT-4 in just a few attempts, we believe it is valuable to leverage it to enhance Spark within the community.

Who is going to use this? Is it developers? Reviewers?

This script is primarily intended for Spark developers. Developers can utilize it to receive consistent guidelines and assistance in generating improved error messages. Additionally, reviewers can use the script to review and validate the refined error messages.

Does it need to be in Spark? Or can it be documented in developer tools instead?

In my opinion, both are necessary. By including the script in the Spark developer tools, it can provide assistance to developers in improving error messages. Additionally, documenting the tool is also important to facilitate understanding and usage.

itholic · 2023-06-27T00:13:37Z

I have supplemented the description in the "Why are the changes needed?" section to provide a clearer explanation of the purpose of the PR.

So, the main goal of this PR is to address the lack of adherence to the error message guidelines and inconsistency in the current Apache Spark error messages. By introducing a script, we aim to enforce compliance with the guidelines and facilitate the improvement of error messages.

The script will enable developers to enhance the error messages, ultimately leading to more consistent and higher-quality error messages.

…ed on LLM

itholic · 2023-07-05T00:06:34Z

Any additional comments for this??

allisonwang-db · 2023-07-05T22:59:07Z

dev/error_message_refiner.py

+
+from sparktestsupport import SPARK_HOME
+
+PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"


Can we also support PySpark error classes? pyspark/errors/error_classes.py
Also, can we make SPARK_HOME configurable? for example --spark_home

Thanks for the comment!

Yes, we can support PySpark error classes as well, but I believe we need separate script since the internal structure & usage of error classes is a bit different from JVM side. Let me made a follow-up for Python script if this PR get approved.

For SPARK_HOME, since the script is already using the SPARK_HOME environment variable internally, I believe it would be more appropriate to set the value of Spark home through the SPARK_HOME rather than configuring it here when executing the script.

dev/error_message_refiner.py

itholic · 2023-07-18T23:05:38Z

Resolved comments and CI passed. Any additional comments?

dev/error_message_refiner.py

Co-authored-by: Amanda Liu <[email protected]>

anjakefala · 2023-08-03T05:35:30Z

dev/error_message_refiner.py

+def _git_grep_files(search_string: str, exclude: str = None) -> str:
+    """
+    Executes 'git grep' command to search for files containing the given search string.
+    Returns the file path where the search string is found.


Suggested change

Returns the file path where the search string is found.

Returns a file path where the search string is found.

I would suggest this wording since it returns one random file from a sample of files. =)

zero323 · 2023-08-04T12:52:30Z

Resolved comments and CI passed. Any additional comments?

@itholic I've already raised this issue on the dev list, so just for the record, the ASF Generative Tooling Guidance states the following

(...) code generated in whole or in part using AI can be contributed if the contributor ensures that:

The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition (e.g., ChatGPT’s terms are inconsistent).

...

The above text should apply for documentation as well. However, the most popular tooling for documentation, ChatGPT, has restrictive licensing, so caution should be applied.

bjornjorgensen · 2023-08-04T17:22:21Z

"This is because the error messages are not consistent enough and do not follow the Error Message Guidelines. "

The link is wrong for me.. it goes to #37113

This is the right link https://spark.apache.org/error-message-guidelines.html

itholic · 2023-08-25T06:24:04Z

Thanks all for the review. According to the ASF Generative Tooling Guidance as mentioned from @zero323 , I think we shouldn't include the ChatGPT related stuffs directly into our code base. Let me close this PR for now.

smileyboy2019 · 2025-04-22T08:58:07Z

May I ask if there is currently support for LLM large models to support Spark

github-actions bot added the BUILD label Jun 23, 2023

gatorsmile reviewed Jun 23, 2023

View reviewed changes

dev/error_message_refiner.py Show resolved Hide resolved

HyukjinKwon reviewed Jun 24, 2023

View reviewed changes

dev/api_key.txt Outdated Show resolved Hide resolved

HyukjinKwon reviewed Jun 24, 2023

View reviewed changes

dev/error_message_refiner.py Show resolved Hide resolved

HyukjinKwon changed the title ~~[SPARK-44155][SQL] Adding a dev utility to improve error messages based on LLM~~ [SPARK-44155] Adding a dev utility to improve error messages based on LLM Jun 24, 2023

gatorsmile reviewed Jun 24, 2023

View reviewed changes

srowen reviewed Jun 24, 2023

View reviewed changes

gatorsmile reviewed Jun 24, 2023

View reviewed changes

dev/error_message_refiner.py Outdated Show resolved Hide resolved

[SPARK-44155][SQL] Adding a dev utility to improve error messages bas…

ece0bcc

…ed on LLM

itholic force-pushed the llm_error branch from 865cb23 to ece0bcc Compare June 27, 2023 21:48

itholic added 3 commits June 29, 2023 15:27

Merge branch 'master' of https://github.com/apache/spark into llm_error

3500407

Merge branch 'master' of https://github.com/apache/spark into llm_error

0f40c4a

Merge branch 'master' of https://github.com/apache/spark into llm_error

0e5d3d9

allisonwang-db reviewed Jul 5, 2023

View reviewed changes

itholic added 2 commits July 12, 2023 11:15

adjusted comments

0ee2372

Merge branch 'master' of https://github.com/apache/spark into llm_error

77ddfaf

asl3 reviewed Jul 19, 2023

View reviewed changes

dev/error_message_refiner.py Outdated Show resolved Hide resolved

itholic and others added 3 commits July 20, 2023 10:45

Update dev/error_message_refiner.py

accb87a

Co-authored-by: Amanda Liu <[email protected]>

Merge branch 'master' of https://github.com/apache/spark into llm_error

c5a4138

Merge branch 'llm_error' of github.com:itholic/spark into llm_error

f941b4e

allisonwang-db mentioned this pull request Jul 27, 2023

[SPARK-44059][SQL] Add better error messages for SQL named argumnts #42177

Closed

Merge branch 'master' of https://github.com/apache/spark into llm_error

5095f9f

anjakefala reviewed Aug 3, 2023

View reviewed changes

itholic closed this Aug 25, 2023

itholic deleted the llm_error branch November 20, 2023 01:35


		from sparktestsupport import SPARK_HOME

		PATH_TO_ERROR_CLASS = f"{SPARK_HOME}/core/src/main/resources/error/error-classes.json"

	Returns the file path where the search string is found.
	Returns a file path where the search string is found.

[SPARK-44155] Adding a dev utility to improve error messages based on LLM #41711

[SPARK-44155] Adding a dev utility to improve error messages based on LLM #41711

Uh oh!

Conversation

itholic commented Jun 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

itholic commented Jun 23, 2023

Uh oh!

itholic commented Jun 23, 2023

Uh oh!

Uh oh!

HyukjinKwon Jun 24, 2023

Choose a reason for hiding this comment

Uh oh!

itholic Jun 26, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gatorsmile Jun 24, 2023

Choose a reason for hiding this comment

Uh oh!

itholic Jun 24, 2023

Choose a reason for hiding this comment

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gatorsmile commented Jun 24, 2023

Uh oh!

srowen commented Jun 24, 2023

Uh oh!

mridulm commented Jun 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

itholic commented Jun 26, 2023

Uh oh!

itholic commented Jun 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

itholic commented Jul 5, 2023

Uh oh!

allisonwang-db Jul 5, 2023

Choose a reason for hiding this comment

Uh oh!

itholic Jul 12, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

itholic commented Jul 18, 2023

Uh oh!

Uh oh!

anjakefala Aug 3, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zero323 commented Aug 4, 2023

Uh oh!

bjornjorgensen commented Aug 4, 2023

Uh oh!

itholic commented Aug 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smileyboy2019 commented Apr 22, 2025

Uh oh!

Uh oh!

itholic commented Jun 23, 2023 •

edited

Loading

mridulm commented Jun 24, 2023 •

edited

Loading

itholic commented Jun 27, 2023 •

edited

Loading

anjakefala Aug 3, 2023 •

edited

Loading

itholic commented Aug 25, 2023 •

edited

Loading