Migrate TextValue data to enum values for 6 MIxS slots #2772

@turbomam

Summary

During the MIxS migration to GSC commit 0368da846b197bef1c0dd27a9cf337a8aeea17f2, we identified 6 slots where GSC changed the range from string/TextValue to enum types. NMDC production data currently uses TextValue objects for these slots, so we override the range back to TextValue for backward compatibility.

This issue tracks the future migration of this data to use the proper enum values.

Affected Slots

| Slot | GSC Enum Range | NMDC Current Range | Biosamples with Data |
|---|---|---|---|
| crop_rotation | CropRotationEnum | TextValue | TBD |
| cult_root_med | CultRootMedEnum | TextValue | 140 |
| gravidity | GravidityEnum | TextValue | TBD |
| perturbation | PerturbationEnum | TextValue | TBD |
| soil_type | FaoClassEnum | TextValue | TBD |
| store_cond | StoreCondEnum | TextValue | 3,910 |

Current State

The yq transformations in assets/yq-for-mixs_subset_modified.txt override these slots:

# Restore TextValue range for slots where NMDC MongoDB has TextValue data
'.slots.crop_rotation.range |= "TextValue"'
'.slots.cult_root_med.range |= "TextValue"'
'.slots.gravidity.range |= "TextValue"'
'.slots.perturbation.range |= "TextValue"'
'.slots.soil_type.range |= "TextValue"'
'.slots.store_cond.range |= "TextValue"'
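
For readers less familiar with yq, the effect of these six expressions can be sketched in Python against an in-memory schema dictionary. This is only an illustration: the function name and the sample schema are hypothetical, and the sketch skips slots that are missing from the schema, which may differ from how yq treats a missing path.

```python
# Slots whose range is overridden back to TextValue (taken from the list above).
OVERRIDDEN_SLOTS = [
    "crop_rotation",
    "cult_root_med",
    "gravidity",
    "perturbation",
    "soil_type",
    "store_cond",
]


def restore_text_value_ranges(schema: dict) -> dict:
    """Set range to "TextValue" for each overridden slot present in the schema.

    Hypothetical helper: mirrors the intent of the yq expressions, modifying
    the schema dictionary in place and returning it.
    """
    slots = schema.get("slots", {})
    for name in OVERRIDDEN_SLOTS:
        slot = slots.get(name)
        if slot is not None:
            slot["range"] = "TextValue"
    return schema


# Tiny made-up schema fragment, not the real MIxS subset.
example_schema = {
    "slots": {
        "store_cond": {"range": "StoreCondEnum"},
        "depth": {"range": "QuantityValue"},
    }
}
restore_text_value_ranges(example_schema)
```

Only the listed slots are touched; any other slot keeps the range it received from GSC.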

Migration Approach

For each slot:

  1. Analyze existing data

    • Query MongoDB for all unique has_raw_value values
    • Document the value distribution
  2. Map to enum values

    • Compare existing values to GSC enum permissible values
    • Identify exact matches, partial matches, and unmappable values
    • Decide on mapping strategy (exact match, normalization, or custom enum extension)
  3. Create migration script

    • Transform {type: "nmdc:TextValue", has_raw_value: "..."} → "enum_value"
    • Handle edge cases and unmappable values
  4. Update schema

    • Remove TextValue range override from yq file
    • Accept GSC enum range (or extend enum if needed)
  5. Execute migration

    • Run migration on MongoDB
    • Validate all biosamples against updated schema
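
Steps 1 and 2 above could be prototyped roughly as follows. This is a minimal sketch that works on a list of already-fetched biosample documents rather than querying MongoDB directly. The function names, the sample documents, and the permissible-value set are all hypothetical placeholders; the real StoreCondEnum values would need to be read from the GSC schema.

```python
from collections import Counter


def raw_value_distribution(biosamples, slot):
    """Count each distinct has_raw_value found in the given slot."""
    counts = Counter()
    for doc in biosamples:
        value = doc.get(slot)
        if isinstance(value, dict) and "has_raw_value" in value:
            counts[value["has_raw_value"]] += 1
    return counts


def classify_values(raw_values, permissible):
    """Split raw values into exact matches, normalized matches, and unmappable.

    Normalization here is only strip + lowercase; anything more elaborate
    (synonyms, custom enum extensions) would be a separate decision.
    """
    by_normalized = {p.strip().lower(): p for p in permissible}
    exact, normalized, unmappable = {}, {}, []
    for raw in raw_values:
        key = raw.strip().lower()
        if raw in permissible:
            exact[raw] = raw
        elif key in by_normalized:
            normalized[raw] = by_normalized[key]
        else:
            unmappable.append(raw)
    return exact, normalized, unmappable


# Made-up documents for illustration only.
sample_docs = [
    {"id": "ex:1", "store_cond": {"type": "nmdc:TextValue", "has_raw_value": "frozen"}},
    {"id": "ex:2", "store_cond": {"type": "nmdc:TextValue", "has_raw_value": "Frozen "}},
    {"id": "ex:3", "store_cond": {"type": "nmdc:TextValue", "has_raw_value": "-80 C freezer"}},
    {"id": "ex:4"},  # biosample without the slot
]

# ASSUMPTION: placeholder values, not the verified StoreCondEnum contents.
HYPOTHETICAL_PERMISSIBLE = {"fresh", "frozen", "other"}

distribution = raw_value_distribution(sample_docs, "store_cond")
exact, normalized, unmappable = classify_values(distribution, HYPOTHETICAL_PERMISSIBLE)
```

The `exact` and `normalized` dictionaries together form a candidate mapping for the migration script, while `unmappable` lists the values that need a manual decision.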

Example: store_cond

Current data format:

{"type": "nmdc:TextValue", "has_raw_value": "frozen"}

GSC StoreCondEnum permissible values need to be checked for compatibility.

Target format:

"frozen"  // or appropriate enum value
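
The per-document transformation from the current format to the target format might look like the sketch below. It assumes a mapping from raw values to enum values has already been agreed on. The function name and the mapping contents are hypothetical, and documents whose value is unmapped are left untouched so they can be reviewed separately.

```python
def migrate_slot(doc, slot, mapping):
    """Return (document, changed) with a TextValue slot replaced by an enum value.

    The input document is not modified; a shallow copy is returned when a
    change is made. Unmapped or non-TextValue values are returned as-is.
    """
    value = doc.get(slot)
    if not (isinstance(value, dict) and value.get("type") == "nmdc:TextValue"):
        return doc, False
    raw = value.get("has_raw_value")
    if raw not in mapping:
        return doc, False
    new_doc = dict(doc)
    new_doc[slot] = mapping[raw]
    return new_doc, True


# ASSUMPTION: illustrative mapping only; real enum values must come from GSC.
hypothetical_mapping = {"frozen": "frozen"}

before = {"id": "ex:1", "store_cond": {"type": "nmdc:TextValue", "has_raw_value": "frozen"}}
after, changed = migrate_slot(before, "store_cond", hypothetical_mapping)

unmapped = {"id": "ex:2", "store_cond": {"type": "nmdc:TextValue", "has_raw_value": "-80 C freezer"}}
same, unchanged_flag = migrate_slot(unmapped, "store_cond", hypothetical_mapping)
```

Returning a flag alongside the document makes it straightforward to count how many biosamples were migrated and how many still need attention.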
