Implement native synthetic source for normalized keywords #136915

jordan-powers · 2025-10-21T21:29:09Z

Currently, when a synthetic source index has a keyword field with a normalizer, the original, non-normalized value of the field is stored in _ignored_source so that the original source can be reconstructed. However, this can create significant storage overhead as we are essentially double-storing the value.

This PR adds a new boolean keyword mapper parameter normalizer_skip_store_original_value. When this value is set, the original value is not stored in _ignored_source and is instead discarded. The source will be reconstructed using the normalized value.

For custom normalizers, this parameter will default to false and the original value will be stored. However, for the built-in lowercase normalizer, the parameter will default to true and the original value will not be stored.

This is a breaking change, as keyword field mappers with the lowercase normalizer previously defaulted to storing the original value.
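The behaviour described above can be sketched in plain Java. This is only an illustration and does not use any Elasticsearch classes: `NormalizedKeywordSketch`, `defaultSkipStoreOriginal`, `reconstruct` and `lowercase` are hypothetical names, and the default is decided purely by normalizer name, as in the first version of the diff.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the described behaviour; not the actual KeywordFieldMapper code.
final class NormalizedKeywordSketch {

    // Default for normalizer_skip_store_original_value:
    // true for the lowercase normalizer, false for any other normalizer.
    static boolean defaultSkipStoreOriginal(String normalizerName) {
        return "lowercase".equals(normalizerName);
    }

    // Value that a synthetic _source would return for a single indexed value.
    static String reconstruct(String original, UnaryOperator<String> normalizer, boolean skipStoreOriginal) {
        // skip == true: the original was discarded, so only the normalized value can be returned.
        // skip == false: the original was kept (in _ignored_source) and is returned unchanged.
        return skipStoreOriginal ? normalizer.apply(original) : original;
    }

    // Simple lowercase normalizer used for the illustration.
    static String lowercase(String value) {
        return value.toLowerCase(Locale.ROOT);
    }
}
```

With the lowercase default, `reconstruct("May the FORCE", NormalizedKeywordSketch::lowercase, true)` yields the lowercased string, which is the user-visible change that makes this PR breaking.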

Relates to #124369.

elasticsearchmachine · 2025-10-21T21:29:33Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-10-21T21:29:34Z

Hi @jordan-powers, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Kubik42

Nice! Looks good to me for the most part.

One concern I have is around naming: "skip_store" combined with false results in a double negative, which might be a bit unclear to customers. Perhaps we can drop "skip" and flip the default instead, like so:

normalizer_store_original_value = true (default)
normalizer_store_original_value = false

Kubik42 · 2025-10-22T19:25:38Z

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/mget/90_synthetic_source.yml

              keyword:
                type: keyword
                normalizer: lowercase
+                normalizer_skip_store_original_value: false


Should this be configured if it defaults to false anyway?

It defaults to true for the lowercase normalizer, so I have to explicitly configure it to false here.

I just noticed this test is defining a custom normalizer called "lowercase". Currently the default logic will trigger for any normalizer called "lowercase", no matter if it's the built-in one or a custom one shadowing the built-in one. I'll look into what it would take to only set the default for the built-in one.

Ok, this should be addressed as of 6c7f6bf
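One way to picture the shadowing problem, as a sketch only: the default applies when the name is "lowercase" and the index settings do not define a custom normalizer under that name. `BuiltInNormalizerCheck` and the `customNormalizerNames` parameter are hypothetical, and this is not necessarily how 6c7f6bf implements the check.

```java
import java.util.Set;

// Hypothetical illustration of telling the built-in lowercase normalizer apart
// from a custom normalizer that happens to use the same name.
final class BuiltInNormalizerCheck {

    // customNormalizerNames: names of normalizers defined in the index settings (assumed input).
    static boolean defaultSkipStoreOriginal(String normalizerName, Set<String> customNormalizerNames) {
        boolean shadowedByCustom = customNormalizerNames.contains(normalizerName);
        // Only the built-in lowercase normalizer is known to be safe to skip by default.
        return "lowercase".equals(normalizerName) && shadowedByCustom == false;
    }
}
```

In the mget test above, the index defines its own "lowercase" normalizer, so under this sketch the default would be false there.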

Kubik42 · 2025-10-22T19:30:08Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+                "normalizer_skip_store_original_value",
+                false,
+                m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
+                () -> "lowercase".equals(normalizer.getValue())


[nit] Do you think it makes sense to extract "lowercase" into an enum?

Yeah, I'll try and extract it into some form of constant.

I looked into this, and the built-in lowercase normalizer is registered with a string literal "lowercase" (code). I could extract that out into a constant, then reference the constant here, but I'm reluctant to touch the AnalysisModule in this PR.

I think it's probably fine to just leave the string literal here, but if we do want to extract out the constant, I think it'd make sense as a follow-up PR.

The normalizer mapping attribute can contain any value, because custom normalizers can be defined via index settings. So I think it is best to keep this as a string.

Kubik42 · 2025-10-22T19:40:15Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

        assertEquals(new BytesRef("foo"), doc.rootDoc().getField("field2").binaryValue());
    }

+    public void testNormalizerSyntheticSource() throws IOException {


[nit] name could be more descriptive, like the other tests below it. Maybe testNormalizerSyntheticSourceWhenSkipStoreOriginalValueDisabled?

Kubik42 · 2025-10-22T19:47:53Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
+                "normalizer_skip_store_original_value",
+                false,
+                m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),


Should we allow customers to set this in the first place when synthetic source is not enabled? They might find it confusing if they enable it but it doesn't work for whatever reason. In reality, they just don't have synthetic source enabled.

Just a question. I'm not sure how we normally deal with parameters unique to synthetic source.

It's a good point, although I'm not sure I would want to completely disallow it and make setting it on a non-synthetic index a fatal invalid mapping exception. Maybe just a warning?

I don't think we should add validations or warnings. A cluster can start with synthetic source, then fall back to basic and use stored source; this shouldn't cause any warnings or failures.

I think this mapping should just be a no-op in case source mode isn't synthetic.
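The no-op suggestion can be sketched like this. It is a simplified model, not mapper code: `SourceModeSketch`, `SourceMode` and `keepOriginalInIgnoredSource` are hypothetical names, and the model assumes the original value only ever needs separate storage when the source mode is synthetic.

```java
// Hypothetical model of "the parameter is a no-op unless source mode is synthetic".
final class SourceModeSketch {

    enum SourceMode { STORED, SYNTHETIC }

    // Whether the original (non-normalized) value needs to be kept separately.
    static boolean keepOriginalInIgnoredSource(SourceMode mode, boolean hasNormalizer, boolean skipStoreOriginal) {
        if (mode != SourceMode.SYNTHETIC) {
            // With stored source the full _source is kept anyway, so the parameter has no effect
            // and no warning or failure is raised.
            return false;
        }
        return hasNormalizer && skipStoreOriginal == false;
    }
}
```

The parameter value is accepted in both modes; it simply never changes the outcome for stored source.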

Kubik42 · 2025-10-22T19:49:30Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

            ).acceptsNull();
+            this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
+                "normalizer_skip_store_original_value",
+                false,


Isn't it OK to go from false -> true here? Although probably not OK the other way around.

Yes, that would work. Going from false -> true would cause previously indexed documents to start returning their normalized values, and the original values stored in _ignored_source would be unused.

I'm reluctant to allow it though, as I think it might create some unnecessary confusion. It's simpler to reason about and to debug if the values are all stored the same way throughout the life of the index.

Agreed, let's keep this mapping attribute immutable.
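As a rough sketch of what keeping the attribute immutable means for a mapping update: merging is only allowed when the incoming value equals the existing one. `ImmutableParamSketch` and `merge` are hypothetical names, and the error message is illustrative rather than the exact text Elasticsearch produces.

```java
// Hypothetical model of an immutable (non-updateable) boolean mapping parameter.
final class ImmutableParamSketch {

    static final String NAME = "normalizer_skip_store_original_value";

    // Returns the merged value, or rejects the update if the value would change.
    static boolean merge(boolean existing, boolean incoming) {
        if (existing != incoming) {
            // Both false -> true and true -> false are rejected, so every document
            // in the index is stored the same way for the life of the index.
            throw new IllegalArgumentException(
                "Cannot update parameter [" + NAME + "] from [" + existing + "] to [" + incoming + "]"
            );
        }
        return existing;
    }
}
```

Rejecting false -> true as well is stricter than necessary, but it avoids an index where older documents still carry unused original values while newer ones do not.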

jordan-powers · 2025-10-22T20:20:44Z

Yeah, I went back and forth on whether to name it normalizer_skip_store_original_value with a default of false, or normalizer_store_original_value with a default of true. I settled on normalizer_skip_store_original_value because this feature is additional functionality the user is opting in to, and when opting in it makes more sense to set a value to true.

But really I could justify either name for the parameter, so if calling it normalizer_store_original_value makes more sense to people I'm happy to switch it.

martijnvg

Left one small testing comment, LGTM otherwise.

martijnvg · 2025-10-23T09:42:17Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

            ).acceptsNull();
+            this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
+                "normalizer_skip_store_original_value",
+                false,


Agreed, let's keep this mapping attribute immutable.

martijnvg · 2025-10-23T09:43:54Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+                "normalizer_skip_store_original_value",
+                false,
+                m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),
+                () -> "lowercase".equals(normalizer.getValue())


The normalizer mapping attribute can contain any value, because custom normalizers can be defined via index settings. So I think it is best to keep this as a string.

martijnvg · 2025-10-23T09:48:17Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            this.normalizerSkipStoreOriginalValue = Parameter.boolParam(
+                "normalizer_skip_store_original_value",
+                false,
+                m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(),


I don't think we should add validations or warnings. A cluster can start with synthetic source, then fall back to basic and use stored source; this shouldn't cause any warnings or failures.

I think this mapping should just be a no-op in case source mode isn't synthetic.

martijnvg · 2025-10-23T09:52:34Z

rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/mget/90_synthetic_source.yml

+        keyword: [ "do or do not, there is no try", "may the force be with you!" ]
+        keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
+        keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
+


Maybe also add a test here that uses a custom normalizer that does something else than lowercasing? (e.g. asciifolding or uppercase) And check that we retain original value for keyword field?
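The expectation behind such a test can be modelled in plain Java. This is a sketch, not a YAML REST test: `CustomNormalizerRetentionSketch` and `syntheticValues` are hypothetical names. It assumes that when the original is skipped, values come back normalized, sorted and de-duplicated (as in the expected `keyword` values in the diff above), and that otherwise the originals come back in their indexed order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical model of what a synthetic _source returns for a multi-valued normalized keyword.
final class CustomNormalizerRetentionSketch {

    static String lowercase(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Stand-in for a custom normalizer that does something other than lowercasing.
    static String uppercase(String value) {
        return value.toUpperCase(Locale.ROOT);
    }

    static List<String> syntheticValues(List<String> originals, UnaryOperator<String> normalizer, boolean skipStoreOriginal) {
        if (skipStoreOriginal == false) {
            // Originals were retained, so they are returned exactly as indexed.
            return List.copyOf(originals);
        }
        // Only normalized values exist; model them as a sorted, de-duplicated set.
        TreeSet<String> normalized = new TreeSet<>();
        for (String value : originals) {
            normalized.add(normalizer.apply(value));
        }
        return new ArrayList<>(normalized);
    }
}
```

For the two quotes used in the test file, the lowercase normalizer with the skip enabled returns the lowercased values in sorted order, while an uppercase normalizer with the skip left at its custom-normalizer default of false returns the original mixed-case values unchanged.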

jordan-powers added 2 commits October 20, 2025 10:58

Add keyword parameter normalizer_skip_store_original_value

144d289

Enable normalizer_skip_store_original_value by default for lowercase

411e24a

jordan-powers requested a review from martijnvg October 21, 2025 21:29

jordan-powers self-assigned this Oct 21, 2025

jordan-powers added the >breaking, Team:StorageEngine, :StorageEngine/Mapping (The storage related side of mappings), and v9.3.0 labels Oct 21, 2025

Update docs/changelog/136915.yaml

35aa832

jordan-powers added 2 commits October 21, 2025 14:46

Update changelog

6f43a00

Typo

60fe9cd

Kubik42 reviewed Oct 22, 2025

jordan-powers added 3 commits October 22, 2025 14:07

Only set skip_store_original_value by default for built-in normalizer

6c7f6bf

Rename test in KeywordFieldMapperTests

d95cf95

Merge remote-tracking branch 'upstream/main' into normalized-keyword-…

357f02d

…native-synthetic-source

martijnvg approved these changes Oct 23, 2025
