-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Implement native synthetic source for normalized keywords #136915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Implement native synthetic source for normalized keywords #136915
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
Hi @jordan-powers, I've created a changelog YAML for you. Note that since this PR is labelled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Looks good to me for the most part.
One concern I have is around naming- "skip_store" combined with false, results in a double negative, which might be a bit unclear to customers. Perhaps we can drop "skip" and instead flip the default around like so:
normalizer_store_original_value = true (default)
normalizer_store_original_value = false
keyword: | ||
type: keyword | ||
normalizer: lowercase | ||
normalizer_skip_store_original_value: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be configured if it defaults to false anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It defaults to true for the lowercase
normalizer, so I have to explicitly configure it to false
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed this test is defining a custom normalizer called "lowercase". Currently the default logic will trigger for any normalizer called "lowercase", no matter if it's the built-in one or a custom one shadowing the built-in one. I'll look into what it would take to only set the default for the built-in one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this should be addressed as of 6c7f6bf
"normalizer_skip_store_original_value", | ||
false, | ||
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(), | ||
() -> "lowercase".equals(normalizer.getValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] do you think it makes sense to extract "lowercase" into an enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll try and extract it into some form of constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, and the built-in lowercase normalizer is registered with a string literal "lowercase"
(code). I could extract that out into a constant, then reference the constant here, but I'm reluctant to touch the AnalysisModule in this PR.
I think it's probably fine to just leave the string literal here, but if we do want to extract out the constant, I think it'd make sense as a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normalizer mapping attribute can contain any value, because it is allowed to define custom normalizers via index settings. So I think it is best to keep this as a string.
assertEquals(new BytesRef("foo"), doc.rootDoc().getField("field2").binaryValue()); | ||
} | ||
|
||
public void testNormalizerSyntheticSource() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] name could be more descriptive, like the other tests below it. Maybe testNormalizerSyntheticSourceWhenSkipStoreOriginalValueDisabled
?
this.normalizerSkipStoreOriginalValue = Parameter.boolParam( | ||
"normalizer_skip_store_original_value", | ||
false, | ||
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we allow customers to set this in the first place when synthetic source is not enabled? They might find it confusing if they enable it but it doesn't work for whatever reason. In reality, they just don't have synthetic source enabled.
Just a question. I'm not sure how we normally deal with parameters unique to synthetic source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point, although I'm not sure I would want to completely disallow it and make setting it on a non-synthetic index a fatal invalid mapping exception. Maybe just a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should add validations or warnings. A cluster can start with synthetic source, but then fall back to basic and then use stored source, this shouldn't cause any warning or failures.
I think this mapping should just be a no-op in case source mode isn't synthetic.
).acceptsNull(); | ||
this.normalizerSkipStoreOriginalValue = Parameter.boolParam( | ||
"normalizer_skip_store_original_value", | ||
false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it ok to go from false -> true here? Although probably not ok the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would work. Going from false -> true would cause previously indexed documents to start returning their normalized values, and the original values stored in _ignored_source would be unused.
I'm reluctant to allow it though, as I think it might create some unnecessary confusion. It's simpler to reason about and to debug if the values are all stored the same way throughout the life of the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's keep this mapping attribute immutable.
Yeah, I went back and forth on whether to name it But really I could justify either name for the parameter, so if calling it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one small testing comment, LGTM otherwise.
).acceptsNull(); | ||
this.normalizerSkipStoreOriginalValue = Parameter.boolParam( | ||
"normalizer_skip_store_original_value", | ||
false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's keep this mapping attribute immutable.
"normalizer_skip_store_original_value", | ||
false, | ||
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(), | ||
() -> "lowercase".equals(normalizer.getValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normalizer mapping attribute can contain any value, because it is allowed to define custom normalizers via index settings. So I think it is best to keep this as a string.
this.normalizerSkipStoreOriginalValue = Parameter.boolParam( | ||
"normalizer_skip_store_original_value", | ||
false, | ||
m -> ((KeywordFieldMapper) m).isNormalizerSkipStoreOriginalValue(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should add validations or warnings. A cluster can start with synthetic source, but then fall back to basic and then use stored source, this shouldn't cause any warning or failures.
I think this mapping should just be a no-op in case source mode isn't synthetic.
keyword: [ "do or do not, there is no try", "may the force be with you!" ] | ||
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ] | ||
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add a test here that uses a custom normalizer that does something else than lowercasing? (e.g. asciifolding
or uppercase
) And check that we retain original value for keyword
field?
Currently, when a synthetic source index has a keyword field with a normalizer, the original, non-normalized value of the field is stored in _ignored_source so that the original source can be reconstructed. However, this can create significant storage overhead as we are essentially double-storing the value.
This PR adds a new boolean keyword mapper parameter
normalizer_skip_store_original_value
. When this value is set, the original value is not stored in _ignored_source and is instead discarded. The source will be reconstructed using the normalized value.For custom normalizers, this parameter will default to
false
and the original value will be stored. However, for the built-inlowercase
normalizer, the parameter will default totrue
and the original value will not be stored.This is a breaking change as previously keyword field mappers with the lowercase normalizer would default to storing the original value.
Relates to #124369.