Description
It took me some time to figure out where the problem came from and how to fix it. I let AI generate the detailed explanation below for more context, but in a nutshell:
I have an Umbraco 16 project with Examine 3.7.1.
I have an index with analysis records. These records are displayed in a grid in the Umbraco backoffice, and the data comes from a Lucene index. Displaying the records in the grid always works without issues.
But the dashboard also has filters for those records, and the filters don't work reliably. We traced it to the generation of the Lucene query:
When the database is new (no TEMP folder on boot) or when you have just rebuilt the index, the filters work perfectly fine. We can see in the query that the status has the correct casing:
Generated Lucene Query: { Category: , LuceneQuery: *:* +(AnalysisStatus:*UpToDate* ) }
However, when you shut down Umbraco and start it again, the filters no longer work. We see in the query that the analysis status is now suddenly lowercase:
Generated Lucene Query: { Category: , LuceneQuery: *:* +(AnalysisStatus:*uptodate*) }
The filters keep failing until you rebuild the index in the backoffice again. Now I let my AI agent explain what the issue is:
AI explanation
Summary
GroupedOr() and GroupedAnd() methods ignore field-specific analyzer configurations (e.g., FieldDefinitionTypes.Raw with KeywordAnalyzer) and instead use the default analyzer with LowercaseExpandedTerms=true, causing case-sensitive fields to be queried with lowercased terms.
Environment
- Examine Version: 3.x (support/3.x branch)
- Lucene.NET Version: 4.8.0-beta00016
- Target Framework: .NET 6 / .NET 8
- Context: Umbraco CMS project with custom Examine indexes for content search
Steps to Reproduce
1. Configure an Umbraco index with a Raw field
- Create an IConfigureNamedOptions<LuceneDirectoryIndexOptions> implementation
- Configure the index with CultureInvariantWhitespaceAnalyzer as the default analyzer
- Add a field definition: new FieldDefinition("AnalysisStatus", FieldDefinitionTypes.Raw)
- Register the configuration in the DI container
2. Index Umbraco content with case-sensitive values
- Create content items with an "AnalysisStatus" property
- Set values with specific casing: "UpToDate" and "OutOfDate"
- Trigger indexing (either through Umbraco backoffice rebuild or programmatically)
3. Query using GroupedOr
- Create a query using the Examine searcher: searcher.CreateQuery("content")
- Use GroupedOr() with the field and expected values: GroupedOr(new[] { "AnalysisStatus" }, new[] { "UpToDate", "OutOfDate" })
- Execute the query and observe results
4. Observe the inconsistent behavior
- First scenario: Delete the entire Umbraco temp folder (including index files) and restart the application
- Second scenario: Restart the application without deleting the temp folder
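The reproduction steps above can be sketched roughly as follows. This is a minimal sketch, assuming Examine 3.x inside Umbraco; the index name "AnalysisIndex" and the class names `ConfigureAnalysisIndexOptions` and `AnalysisStatusRepro` are hypothetical and not taken from the affected project.

```csharp
using Examine;
using Examine.Lucene;
using Examine.Lucene.Analyzers;
using Microsoft.Extensions.Options;

// Hypothetical options class; "AnalysisIndex" is an assumed index name.
public sealed class ConfigureAnalysisIndexOptions
    : IConfigureNamedOptions<LuceneDirectoryIndexOptions>
{
    public void Configure(string? name, LuceneDirectoryIndexOptions options)
    {
        if (name != "AnalysisIndex")
        {
            return;
        }

        // Step 1: default analyzer for the index.
        options.Analyzer = new CultureInvariantWhitespaceAnalyzer();

        // Step 1: Raw field, expected to keep the value's casing as indexed.
        options.FieldDefinitions.AddOrUpdate(
            new FieldDefinition("AnalysisStatus", FieldDefinitionTypes.Raw));
    }

    public void Configure(LuceneDirectoryIndexOptions options)
        => Configure(string.Empty, options);
}

public static class AnalysisStatusRepro
{
    // Step 3: the grouped query reported to return 0 results after a restart.
    public static ISearchResults Run(ISearcher searcher)
        => searcher.CreateQuery("content")
            .GroupedOr(new[] { "AnalysisStatus" }, new[] { "UpToDate", "OutOfDate" })
            .Execute();
}
```

In an Umbraco composer or startup class, the options class would typically be registered with `services.ConfigureOptions<ConfigureAnalysisIndexOptions>()`; the exact registration point depends on the project.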
Expected Behavior
- Query should preserve casing: +(AnalysisStatus:UpToDate AnalysisStatus:OutOfDate)
- Should return all matching documents (2 results in the example)
- Should work consistently regardless of whether index is newly created or reopened from disk
Actual Behavior
- Query terms are lowercased: +(AnalysisStatus:uptodate AnalysisStatus:outofdate)
- Returns 0 results because indexed values are "UpToDate" and "OutOfDate" (case-sensitive)
- Non-deterministic behavior:
- Sometimes works after deleting temp folder and creating fresh index
- Consistently fails on subsequent application restarts when reopening existing index from disk
Root Cause Analysis
Problem 1: Query Parser Uses Default Analyzer
- File: Examine.Lucene\Search\LuceneSearchQuery.cs (~Line 30, CreateQueryParser method)
- The query parser is initialized with the default analyzer (e.g., CultureInvariantWhitespaceAnalyzer)
- It does NOT use the PerFieldAnalyzerWrapper that contains field-specific analyzers (like KeywordAnalyzer for Raw fields)
- The LowercaseExpandedTerms property only gets set if LuceneSearchOptions is explicitly provided
- Without explicit options, it defaults to Lucene.NET's default: true
Problem 2: Grouped Methods Force Query Parser Usage
- File: Examine.Lucene\Search\LuceneSearchQueryBase.cs (~Line 485, GetMultiFieldQuery method)
- GroupedOr() and GroupedAnd() call GetMultiFieldQuery() internally
- This method hardcodes useQueryParser: true when calling GetFieldInternalQuery()
- This forces all grouped operations through the query parser path
Problem 3: Query Parser Applies Default Analyzer Lowercasing
- File: Examine.Lucene\Search\LuceneSearchQueryBase.cs (~Line 330, GetFieldInternalQuery method)
- When useQueryParser=true and Examineness.Explicit, the code calls _queryParser.GetFieldQueryInternal()
- This uses the default analyzer (not the field-specific KeywordAnalyzer)
- Lucene's query parser applies lowercasing when LowercaseExpandedTerms=true
- Result: Raw field terms are lowercased despite field configuration
Current Workarounds
Option 1: Explicitly disable lowercasing via LuceneSearchOptions
- Create LuceneSearchOptions with LowercaseExpandedTerms = false
- Pass it when creating the query
- Apply GroupedOr() as normal
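Option 1 might look like the sketch below. It assumes the searcher is Lucene-backed and that `BaseLuceneSearcher` exposes a `CreateQuery` overload accepting an analyzer and `LuceneSearchOptions`, plus a `LuceneAnalyzer` property; both should be verified against the installed Examine version. The class name `Workaround1` is hypothetical.

```csharp
using System;
using Examine;
using Examine.Lucene.Providers;
using Examine.Lucene.Search;
using Examine.Search;

public static class Workaround1
{
    public static ISearchResults Search(ISearcher searcher)
    {
        // Assumption: the plain ISearcher interface does not accept
        // LuceneSearchOptions, so a Lucene-backed searcher is required.
        if (searcher is not BaseLuceneSearcher luceneSearcher)
        {
            throw new InvalidOperationException("Expected a Lucene-backed searcher.");
        }

        // Ask the query parser not to lowercase expanded terms.
        var options = new LuceneSearchOptions { LowercaseExpandedTerms = false };

        // Assumed overload: (category, default operation, analyzer, options).
        return luceneSearcher
            .CreateQuery("content", BooleanOperation.And, luceneSearcher.LuceneAnalyzer, options)
            .GroupedOr(new[] { "AnalysisStatus" }, new[] { "UpToDate", "OutOfDate" })
            .Execute();
    }
}
```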
Option 2: Use Escape() extension method
- Apply .Escape() to each search term value
- This bypasses the query parser and creates a PhraseQuery directly
- Example: new[] { "UpToDate".Escape(), "OutOfDate".Escape() }
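As a rough illustration of Option 2, the grouped query can be given escaped values instead of plain strings. This is a sketch only; the class name `Workaround2` is hypothetical, and whether `Escape()` fully avoids the lowercasing should be confirmed against the affected index.

```csharp
using Examine;
using Examine.Search;

public static class Workaround2
{
    public static ISearchResults Search(ISearcher searcher)
        => searcher.CreateQuery("content")
            // Escape() wraps each term as an IExamineValue so it is
            // treated as a literal value rather than parsed text.
            .GroupedOr(
                new[] { "AnalysisStatus" },
                "UpToDate".Escape(),
                "OutOfDate".Escape())
            .Execute();
}
```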
Option 3: Chain individual Field() calls instead of GroupedOr
- Use individual Field() calls with .Escape()
- Connect them with .Or() operator
- More verbose but reliable for case-sensitive fields
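Option 3 could be written along these lines, chaining one `Field()` call per value. Again a sketch under the same assumptions as above; the class name `Workaround3` is hypothetical.

```csharp
using Examine;
using Examine.Search;

public static class Workaround3
{
    public static ISearchResults Search(ISearcher searcher)
        => searcher.CreateQuery("content")
            // One explicit Field() call per status value, joined with Or().
            .Field("AnalysisStatus", "UpToDate".Escape())
            .Or()
            .Field("AnalysisStatus", "OutOfDate".Escape())
            .Execute();
}
```

This form grows with the number of values, so for longer value lists Option 2 is likely the more compact choice.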
Additional Context
Non-Deterministic Behavior
The inconsistency between fresh index creation and reopening suggests:
- Potential state initialization issues in query parser or analyzer caching
- Thread safety concerns (code comments indicate "Query parsers are not thread safe")
- Different initialization paths when creating vs. opening existing Lucene directory
Impact on Umbraco Projects
- Affects any Umbraco site using Examine with case-sensitive fields
- Common scenarios: status enums, category codes, custom identifiers
- Particularly problematic for production environments where index rebuilds are infrequent
- Developers may not notice the issue during development (fresh index) but encounter it in production (persistent index)