Member

@manasvinibs manasvinibs commented Sep 8, 2025

Description

Implement the replace command in PPL to replace text patterns in specified fields. This PR includes the grammar implementation and basic replacement functionality, and it reuses the existing replace function. Regex support will be added in a separate PR.

Syntax

<source> | replace '<pattern>' WITH '<replacement>' IN field_list

pattern: The text pattern to search for (case-sensitive)
replacement: The text to replace matches with
field_list: Comma-separated list of fields to perform replacement in
Replace creates new fields with prefix 'new_' containing the replaced text, while preserving original fields.

Semantics

Expected Behavior:

  • Action: Creates new fields with replaced text values
  • Scope: Operates only on specified fields
  • Data Preservation: Original fields remain unchanged
  • Case Sensitivity: Text literal matching is case-sensitive
  • Pattern Type: Only supports literal string patterns (wildcards or regex will be added in a separate PR)

Implementation Approach:

  • Creates new fields with 'new_' prefix for each specified field
  • Performs literal string replacement in the specified fields
  • Maintains original field values
  • Validates field existence and pattern/replacement values

Example Queries

-- Replace in single field
source=logs | replace 'error' WITH 'ERROR' IN message

-- Replace in multiple fields 
source=logs | replace 'USA' WITH 'United States' IN country, state

-- Replace with other commands
source=logs | where level='error' | replace 'error' WITH 'ERROR' IN message | sort @timestamp

-- Replace and select specific fields
source=logs | replace 'error' WITH 'ERROR' IN message | fields message, new_message

Output Schema

For each field specified in IN clause, a new field is created:

  • Original field: remains unchanged
  • New field: prefixed with 'new_' containing replaced text

Example:

Input fields: message="error occurred", level="error"
After replace 'error' WITH 'ERROR' IN message, level:

  • message: "error occurred"
  • new_message: "ERROR occurred"
  • level: "error"
  • new_level: "ERROR"
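
The example above can be modeled on a single row. This is a rough sketch of the described semantics only, not the PR's Calcite-based implementation; `ReplaceSketch` and its `replace` helper are hypothetical names, and a row is assumed to be a plain map of string fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReplaceSketch {
  // Hypothetical helper: applies the described replace semantics to one row.
  static Map<String, String> replace(
      Map<String, String> row, String pattern, String replacement, List<String> fields) {
    // Copy the row first so that every original field is preserved unchanged.
    Map<String, String> out = new LinkedHashMap<>(row);
    for (String field : fields) {
      if (!row.containsKey(field)) {
        throw new IllegalArgumentException("Unknown field: " + field);
      }
      String value = row.get(field);
      // String.replace is a literal, case-sensitive replacement (no regex).
      out.put("new_" + field, value == null ? null : value.replace(pattern, replacement));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> row = new LinkedHashMap<>();
    row.put("message", "error occurred");
    row.put("level", "error");
    System.out.println(replace(row, "error", "ERROR", List.of("message", "level")));
  }
}
```

Because the match is literal and case-sensitive, a value such as "Error occurred" would pass through to new_message unchanged in this sketch.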

Related Issues

#3975

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Override
public LogicalPlan visitReplace(Replace node, AnalysisContext context) {
  throw new UnsupportedOperationException(
      "Replace is supported only when " + CALCITE_ENGINE_ENABLED.getKeyValue() + "=true");
Collaborator

There's a util for this nowadays:

public static UnsupportedOperationException getOnlyForCalciteException(String feature) {

Comment on lines 43 to 49
String patternStr = pattern.toString();
if (patternStr.contains("+")
    || patternStr.contains("-")
    || patternStr.contains("*")
    || patternStr.contains("/")) {
  throw new IllegalArgumentException(
      "Expression is not allowed in replace pattern. Only string literals are supported.");
Collaborator

@Swiddis Swiddis Sep 8, 2025

I'm a little confused by this logic: do we not allow strings that contain these characters? Or, if we're trying to forbid any expressions at all, I'm not sure the four arithmetic operators alone will be sufficient (e.g., if someone uses a boolean AND or something).

If we don't want an expression, the grammar should specify a stringLiteral or something along those lines, instead of expression.
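
To make the concern concrete, here is a minimal sketch of the character-based check. It assumes the pattern's string form is simply its text (the real `toString()` output may differ), and `PatternCheckSketch` and its method are hypothetical names used only for illustration.

```java
final class PatternCheckSketch {
  // Mirrors the operator check from the diff under review.
  static boolean rejectedByOperatorCheck(String patternText) {
    return patternText.contains("+")
        || patternText.contains("-")
        || patternText.contains("*")
        || patternText.contains("/");
  }

  public static void main(String[] args) {
    // A legitimate literal containing '-' is rejected.
    System.out.println(rejectedByOperatorCheck("co-op"));
    // Text that looks like a boolean expression slips through.
    System.out.println(rejectedByOperatorCheck("a AND b"));
  }
}
```

Under that assumption the check is both too strict and too loose, which is why constraining the grammar to a string literal is the more robust option.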

Collaborator

@RyanL1997 RyanL1997 Sep 8, 2025

+1 for @Swiddis 's question, and I also left a comment in this validation method in general.

Member Author

Thanks for pointing this out! Yes, the idea was to disallow anything other than a string literal as the pattern text. I have updated the grammar token itself to a string literal to handle this case.

Collaborator

@RyanL1997 RyanL1997 left a comment

Hi @manasvinibs, thanks for taking this on. I just left some comments.

}

private void validate() {
if (pattern == null) {
Collaborator

The validation logic for the pattern expression hardcodes checks for mathematical operators (+, -, *, /) on L44-47. In my opinion, this may miss other invalid expressions. Maybe consider validating that the pattern is a string literal by checking its type (e.g., Literal with DataType.STRING)?

Something like this:

if (!(pattern instanceof Literal && ((Literal) pattern).getType() == DataType.STRING)) {
    throw new IllegalArgumentException("Replace pattern must be a string literal.");
}

Member Author

Good call! Updated in the next revision.

Collaborator

@ykmr1224 ykmr1224 left a comment

Wild card support for rename command is implemented in #4019
Let's utilize the logic when we implement wild card support for replace.


Description
============
| Using ``replace`` command to replace text in one or more fields in the search result. The replaced text appears in new fields with prefix *new_*.
Collaborator

Is it our intended behavior to introduce a new attribute?

Member Author

With this implementation I'm reusing the existing eval replace function, which also exposes the replaced values in new fields, and I'm mimicking that behavior in the direct command. If we think the command should replace the original field in place instead of adding a new one, I'm open to exploring that option. Let me know your thoughts.

============
replace '<pattern>' WITH '<replacement>' IN <field-name>[, <field-name>]...

* pattern: mandatory. The text pattern you want to replace.
Collaborator

Let's clarify that we currently support only plain-text patterns. (The word "pattern" could suggest wildcard, regex, etc.)

fetched rows / total rows = 1/1
+-------------------------------+-------------------------------+
| message | new_message |
|------------------------+--------------------------------------|
Collaborator

nit: I see broken table format.

newFieldNames.add(fieldName);
}

// Then add new fields with replaced content using new_ prefix
Collaborator

What happens if new_xxx already exists?

Member Author

I tested this behavior and see that the system automatically handles the field name collision by appending a number ("new_country0"), resolving the conflict gracefully. I have documented this behavior.

@manasvinibs manasvinibs added PPL Piped processing language calcite calcite migration releated labels Sep 11, 2025
@manasvinibs manasvinibs force-pushed the replace-ppl branch 6 times, most recently from a20ff96 to 966b939 Compare September 11, 2025 20:32
    context.relBuilder.call(
        SqlStdOperatorTable.REPLACE, fieldRef, patternNode, replacementNode);
projectList.add(replaceCall);
newFieldNames.add(NEW_FIELD_PREFIX + fieldName);
Collaborator

It looks like this does not check for an existing field and add a numeric suffix, as described in the doc.

Member Author

Yes, every replace command adds a new_ column, and it does not conflict with existing column names. Let me know if you think we should change the behavior.

Collaborator

I was wondering whether the following doc description is right, since this logic doesn't check whether the field already exists. (Is that done automatically?)

* If a field with *new_* prefix already exists (e.g., 'new_country'), a number will be appended to create a unique field name (e.g., 'new_country0')
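
For reference, the behavior that doc line describes would amount to something like the sketch below. It is an illustration only: `UniqueNameSketch` and `uniquify` are hypothetical names, and whether the planner actually performs such a step automatically is the open question in this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueNameSketch {
  // Hypothetical helper: appends 0, 1, 2, ... to a name until it is unused.
  static List<String> uniquify(List<String> names) {
    Set<String> used = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String name : names) {
      String candidate = name;
      int suffix = 0;
      // Set.add returns false when the candidate is already taken.
      while (!used.add(candidate)) {
        candidate = name + suffix++;
      }
      result.add(candidate);
    }
    return result;
  }

  public static void main(String[] args) {
    // An existing new_country column followed by the one replace would add.
    System.out.println(uniquify(List.of("country", "new_country", "new_country")));
  }
}
```

If nothing like this runs after the projection is built, the doc line should be dropped or the collision handled explicitly where the new_ names are added.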

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 2 weeks with no activity.

Labels

calcite calcite migration releated enhancement New feature or request PPL Piped processing language stalled
