GroupedOr/GroupedAnd Ignore Field-Specific Analyzers and Lowercase Raw Field Terms #419

@Luuk1983

It took me some time to figure out where the problem came from and how to fix it. I let AI generate the detailed explanation below for more context, but in a nutshell:

I have an Umbraco 16 project with Examine 3.7.1.
I have an index with analysis records. These records are displayed in a grid in the Umbraco backoffice. The data comes from a Lucene index. Displaying those records in the grid always works without issues.

But the dashboard also has filters for those records, and those filters don't work reliably. We traced the problem to how the Lucene query is generated:

When the database is new (no TEMP folder on boot) or when you just rebuild the index, the filters work perfectly fine. We can see in the query that the status has the correct casing:
Generated Lucene Query: { Category: , LuceneQuery: *:* +(AnalysisStatus:*UpToDate* ) }

However, when you shut down Umbraco and start it again, the filters stop working. The query now shows the analysis status in lowercase:
Generated Lucene Query: { Category: , LuceneQuery: *:* +(AnalysisStatus:*uptodate*) }

The filters stay broken until you rebuild the index in the backoffice again. Below is my AI agent's explanation of the issue:


AI explanation

Summary

GroupedOr() and GroupedAnd() methods ignore field-specific analyzer configurations (e.g., FieldDefinitionTypes.Raw with KeywordAnalyzer) and instead use the default analyzer with LowercaseExpandedTerms=true, causing case-sensitive fields to be queried with lowercased terms.

Environment

  • Examine Version: 3.x (support/3.x branch)
  • Lucene.NET Version: 4.8.0-beta00016
  • Target Framework: .NET 6 / .NET 8
  • Context: Umbraco CMS project with custom Examine indexes for content search

Steps to Reproduce

1. Configure an Umbraco index with a Raw field

  • Create an IConfigureNamedOptions<LuceneDirectoryIndexOptions> implementation
  • Configure the index with CultureInvariantWhitespaceAnalyzer as the default analyzer
  • Add a field definition: new FieldDefinition("AnalysisStatus", FieldDefinitionTypes.Raw)
  • Register the configuration in the DI container

2. Index Umbraco content with case-sensitive values

  • Create content items with an "AnalysisStatus" property
  • Set values with specific casing: "UpToDate" and "OutOfDate"
  • Trigger indexing (either through Umbraco backoffice rebuild or programmatically)

3. Query using GroupedOr

  • Create a query using the Examine searcher: searcher.CreateQuery("content")
  • Use GroupedOr() with the field and expected values: GroupedOr(new[] { "AnalysisStatus" }, new[] { "UpToDate", "OutOfDate" })
  • Execute the query and observe results

4. Observe the inconsistent behavior

  • First scenario: Delete the entire Umbraco temp folder (including index files) and restart the application
  • Second scenario: Restart the application without deleting the temp folder
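
Steps 1 to 3 can be sketched roughly as follows. This is a minimal sketch and not the project's actual code: the index name "AnalysisIndex" and both class names are hypothetical, and it assumes the Examine 3.x types named in the steps above are available under the namespaces shown.

```csharp
using System;
using Examine;
using Examine.Lucene;
using Examine.Lucene.Analyzers;
using Microsoft.Extensions.Options;

// Step 1: configure the index. "AnalysisIndex" is a hypothetical index name.
public class ConfigureAnalysisIndexOptions : IConfigureNamedOptions<LuceneDirectoryIndexOptions>
{
    public void Configure(string? name, LuceneDirectoryIndexOptions options)
    {
        if (name != "AnalysisIndex")
        {
            return; // leave all other indexes untouched
        }

        // Default analyzer for the index.
        options.Analyzer = new CultureInvariantWhitespaceAnalyzer();

        // AnalysisStatus is declared as a Raw (case-sensitive) field.
        options.FieldDefinitions = new FieldDefinitionCollection(
            new FieldDefinition("AnalysisStatus", FieldDefinitionTypes.Raw));
    }

    public void Configure(LuceneDirectoryIndexOptions options)
        => throw new NotSupportedException("Use the named overload.");
}

// Step 3: query the Raw field with GroupedOr and count the hits.
public static class AnalysisStatusRepro
{
    public static long CountByStatus(ISearcher searcher)
    {
        ISearchResults results = searcher
            .CreateQuery("content")
            .GroupedOr(new[] { "AnalysisStatus" }, new[] { "UpToDate", "OutOfDate" })
            .Execute();

        return results.TotalItemCount;
    }
}
```

The configuration class would typically be registered with something like services.ConfigureOptions<ConfigureAnalysisIndexOptions>(). Calling CountByStatus once after a fresh rebuild and once after a plain restart should show whether the two scenarios in step 4 differ.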

Expected Behavior

  • Query should preserve casing: +(AnalysisStatus:UpToDate AnalysisStatus:OutOfDate)
  • Should return all matching documents (2 results in the example)
  • Should work consistently regardless of whether index is newly created or reopened from disk

Actual Behavior

  • Query terms are lowercased: +(AnalysisStatus:uptodate AnalysisStatus:outofdate)
  • Returns 0 results because indexed values are "UpToDate" and "OutOfDate" (case-sensitive)
  • Non-deterministic behavior:
    • Sometimes works after deleting temp folder and creating fresh index
    • Consistently fails on subsequent application restarts when reopening existing index from disk

Root Cause Analysis

Problem 1: Query Parser Uses Default Analyzer

  • File: Examine.Lucene\Search\LuceneSearchQuery.cs (~Line 30, CreateQueryParser method)
  • The query parser is initialized with the default analyzer (e.g., CultureInvariantWhitespaceAnalyzer)
  • It does NOT use the PerFieldAnalyzerWrapper that contains field-specific analyzers (like KeywordAnalyzer for Raw fields)
  • The LowercaseExpandedTerms property only gets set if LuceneSearchOptions is explicitly provided
  • Without explicit options, it defaults to Lucene.NET's default: true

Problem 2: Grouped Methods Force Query Parser Usage

  • File: Examine.Lucene\Search\LuceneSearchQueryBase.cs (~Line 485, GetMultiFieldQuery method)
  • GroupedOr() and GroupedAnd() call GetMultiFieldQuery() internally
  • This method hardcodes useQueryParser: true when calling GetFieldInternalQuery()
  • This forces all grouped operations through the query parser path

Problem 3: Query Parser Applies Default Analyzer Lowercasing

  • File: Examine.Lucene\Search\LuceneSearchQueryBase.cs (~Line 330, GetFieldInternalQuery method)
  • When useQueryParser=true and Examineness.Explicit, the code calls _queryParser.GetFieldQueryInternal()
  • This uses the default analyzer (not the field-specific KeywordAnalyzer)
  • Lucene's query parser lowercases expanded terms (wildcard, prefix, fuzzy, and range terms) when LowercaseExpandedTerms=true
  • Result: Raw field terms are lowercased despite field configuration
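
The lowercasing itself can be illustrated outside Examine with the plain Lucene.NET classic query parser. This is only a sketch of the mechanism, assuming the Lucene.Net.QueryParser and Lucene.Net.Analysis.Common 4.8 packages; it does not use Examine's own parser subclass, and the class name is hypothetical. It parses the same wildcard pattern that appears in the generated query at the top of this issue.

```csharp
using System;
using Lucene.Net.Analysis.Core;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

public static class LowercaseExpandedTermsDemo
{
    public static void Main()
    {
        // KeywordAnalyzer keeps the whole value as one case-preserving token,
        // which is what a Raw field is expected to use.
        var parser = new QueryParser(
            LuceneVersion.LUCENE_48, "AnalysisStatus", new KeywordAnalyzer())
        {
            AllowLeadingWildcard = true // required for the *UpToDate* pattern
        };

        // LowercaseExpandedTerms defaults to true, so the wildcard term
        // is lowercased by the parser even though the analyzer preserves case.
        Query lowered = parser.Parse("AnalysisStatus:*UpToDate*");
        Console.WriteLine(lowered);

        // With the flag off, the parser leaves the wildcard term's casing alone.
        parser.LowercaseExpandedTerms = false;
        Query preserved = parser.Parse("AnalysisStatus:*UpToDate*");
        Console.WriteLine(preserved);
    }
}
```

Comparing the two printed queries shows that, for wildcard terms, the flag on the parser decides the casing rather than the analyzer passed to it.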

Current Workarounds

Option 1: Explicitly disable lowercasing via LuceneSearchOptions

  • Create LuceneSearchOptions with LowercaseExpandedTerms = false
  • Pass it when creating the query
  • Apply GroupedOr() as normal
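
A minimal sketch of Option 1, assuming the searcher is Lucene-based and that the CreateQuery overload accepting LuceneSearchOptions is available in the installed Examine 3.x version. The class and method names are hypothetical.

```csharp
using System;
using Examine;
using Examine.Lucene.Providers;
using Examine.Lucene.Search;
using Examine.Search;

public static class AnalysisStatusSearchWithOptions
{
    public static ISearchResults FindByStatus(ISearcher searcher, params string[] statuses)
    {
        // The options overload lives on the Lucene searcher, not on ISearcher.
        if (searcher is not BaseLuceneSearcher luceneSearcher)
        {
            throw new InvalidOperationException("Expected a Lucene-based searcher.");
        }

        // Ask the query parser to leave expanded terms in their original casing.
        var options = new LuceneSearchOptions { LowercaseExpandedTerms = false };

        return luceneSearcher
            .CreateQuery("content", BooleanOperation.And, luceneSearcher.LuceneAnalyzer, options)
            .GroupedOr(new[] { "AnalysisStatus" }, statuses)
            .Execute();
    }
}
```

For example, FindByStatus(index.Searcher, "UpToDate", "OutOfDate") would run the same grouped query as in the reproduction steps, with only the parser option changed.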

Option 2: Use Escape() extension method

  • Apply .Escape() to each search term value
  • This bypasses the query parser and creates a PhraseQuery directly
  • Example: new[] { "UpToDate".Escape(), "OutOfDate".Escape() }
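
A minimal sketch of Option 2, assuming the Escape() string extension from Examine.Search and the GroupedOr overload that accepts IExamineValue arguments. The class and method names are hypothetical.

```csharp
using Examine;
using Examine.Search;

public static class AnalysisStatusSearchEscaped
{
    public static ISearchResults FindUpToDateOrOutOfDate(ISearcher searcher)
    {
        return searcher
            .CreateQuery("content")
            // Each value is escaped so it is treated as an exact phrase.
            .GroupedOr(
                new[] { "AnalysisStatus" },
                new[] { "UpToDate".Escape(), "OutOfDate".Escape() })
            .Execute();
    }
}
```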

Option 3: Chain individual Field() calls instead of GroupedOr

  • Use individual Field() calls with .Escape()
  • Connect them with .Or() operator
  • More verbose but reliable for case-sensitive fields
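
A minimal sketch of Option 3, under the same assumptions as Option 2. The default operation is set to BooleanOperation.Or here so that both clauses are optional; that choice is an assumption of this sketch and is not specified in the issue. The class and method names are hypothetical.

```csharp
using Examine;
using Examine.Search;

public static class AnalysisStatusSearchChained
{
    public static ISearchResults FindUpToDateOrOutOfDate(ISearcher searcher)
    {
        return searcher
            // Or as the default operation, so the first Field() clause is not mandatory.
            .CreateQuery("content", BooleanOperation.Or)
            .Field("AnalysisStatus", "UpToDate".Escape())
            .Or()
            .Field("AnalysisStatus", "OutOfDate".Escape())
            .Execute();
    }
}
```

With the default And operation, the first clause would likely be required and the second optional, which is not equivalent to GroupedOr, so it is worth checking the generated Lucene query before relying on this form.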

Additional Context

Non-Deterministic Behavior

The inconsistency between fresh index creation and reopening suggests:

  • Potential state initialization issues in query parser or analyzer caching
  • Thread safety concerns (code comments indicate "Query parsers are not thread safe")
  • Different initialization paths when creating vs. opening existing Lucene directory

Impact on Umbraco Projects

  • Affects any Umbraco site using Examine with case-sensitive fields
  • Common scenarios: status enums, category codes, custom identifiers
  • Particularly problematic for production environments where index rebuilds are infrequent
  • Developers may not notice the issue during development (fresh index) but encounter it in production (persistent index)
