Skip to content

Conversation

jordan-powers
Copy link
Contributor

Currently, when a synthetic source index has a keyword field with a normalizer, the original, non-normalized value of the field is stored in _ignored_source so that the original source can be reconstructed. However, this can create significant storage overhead as we are essentially double-storing the value.

This PR adds a new boolean keyword mapper parameter normalizer_skip_store_original_value. When this value is set, the original value is not stored in _ignored_source and is instead discarded. The source will be reconstructed using the normalized value.

For custom normalizers, this parameter will default to false and the original value will be stored. However, for the built-in lowercase normalizer, the parameter will default to true and the original value will not be stored.

This is a breaking change as previously keyword field mappers with the lowercase normalizer would default to storing the original value.

Relates to #124369.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @jordan-powers, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Copy link
Contributor

@Kubik42 Kubik42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks good to me for the most part.

One concern I have is around naming- "skip_store" combined with false, results in a double negative, which might be a bit unclear to customers. Perhaps we can drop "skip" and instead flip the default around like so:

normalizer_store_original_value = true (default)
normalizer_store_original_value = false

keyword:
type: keyword
normalizer: lowercase
normalizer_skip_store_original_value: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be configured if it defaults to false anyways?

Copy link
Contributor Author

@jordan-powers jordan-powers Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It defaults to true for the lowercase normalizer, so I have to explicitly configure it to false here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed this test is defining a custom normalizer called "lowercase". Currently the default logic will trigger for any normalizer called "lowercase", no matter if it's the built-in one or a custom one shadowing the built-in one. I'll look into what it would take to only set the default for the built-in one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this should be addressed as of 6c7f6bf

"normalizer_skip_store_original_value",
false,
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
() -> "lowercase".equals(normalizer.getValue())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] do you think it makes sense to extract "lowercase" into an enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll try and extract it into some form of constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this, and the built-in lowercase normalizer is registered with a string literal "lowercase" (code). I could extract that out into a constant, then reference the constant here, but I'm reluctant to touch the AnalysisModule in this PR.

I think it's probably fine to just leave the string literal here, but if we do want to extract out the constant, I think it'd make sense as a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizer mapping attribute can contain any value, because it is allowed to define custom normalizers via index settings. So I think it is best to keep this as a string.

assertEquals(new BytesRef("foo"), doc.rootDoc().getField("field2").binaryValue());
}

public void testNormalizerSyntheticSource() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] name could be more descriptive, like the other tests below it. Maybe testNormalizerSyntheticSourceWhenSkipStoreOriginalValueDisabled?

this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
"normalizer_skip_store_original_value",
false,
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow customers to set this in the first place when synthetic source is not enabled? They might find it confusing if they enable it but it doesn't work for whatever reason. In reality, they just don't have synthetic source enabled.

Just a question. I'm not sure how we normally deal with parameters unique to synthetic source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point, although I'm not sure I would want to completely disallow it and make setting it on a non-synthetic index a fatal invalid mapping exception. Maybe just a warning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add validations or warnings. A cluster can start with synthetic source, but then fall back to basic and then use stored source, this shouldn't cause any warning or failures.

I think this mapping should just be a no-op in case source mode isn't synthetic.

).acceptsNull();
this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
"normalizer_skip_store_original_value",
false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it ok to go from false -> true here? Although probably not ok the other way around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would work. Going from false -> true would cause previously indexed documents to start returning their normalized values, and the original values stored in _ignored_source would be unused.

I'm reluctant to allow it though, as I think it might create some unnecessary confusion. It's simpler to reason about and to debug if the values are all stored the same way throughout the life of the index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's keep this mapping attribute immutable.

@jordan-powers
Copy link
Contributor Author

Yeah, I went back and forth on whether to name it normalizer_skip_store_original_value with a default of false, or normalizer_store_original_value with a default of true. I settled on normalizer_skip_store_original_value because I felt that this feature is additional functionality the user is opting-in to, and when opting-in it makes more sense to be setting a value to true.

But really I could justify either name for the parameter, so if calling it normalizer_store_original_value makes more sense to people I'm happy to switch it.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one small testing comment, LGTM otherwise.

).acceptsNull();
this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
"normalizer_skip_store_original_value",
false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's keep this mapping attribute immutable.

"normalizer_skip_store_original_value",
false,
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
() -> "lowercase".equals(normalizer.getValue())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalizer mapping attribute can contain any value, because it is allowed to define custom normalizers via index settings. So I think it is best to keep this as a string.

this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
"normalizer_skip_store_original_value",
false,
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add validations or warnings. A cluster can start with synthetic source, but then fall back to basic and then use stored source, this shouldn't cause any warning or failures.

I think this mapping should just be a no-op in case source mode isn't synthetic.

keyword: [ "do or do not, there is no try", "may the force be with you!" ]
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add a test here that uses a custom normalizer that does something else than lowercasing? (e.g. asciifolding or uppercase) And check that we retain original value for keyword field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants