@haram082

ASPC Course Review System: Instructor CxID Migration

Overview

This document outlines the database migration process for transitioning the ASPC (Associated Students of Pomona College) course review system from legacy instructor IDs to API-based instructor CxIDs. The migration was necessary to align the system with the Pomona API's data structure and improve data accuracy. The system originally used legacy internal instructor IDs inherited from a previous PostgreSQL database. However, the external Pomona API uses CxIDs (arbitrary numerical identifiers) for instructors.

Database Structure

Three main MongoDB collections:

  1. Instructors - Professor information
  2. Courses - Course offerings with instructor associations
  3. CourseReviews - Student reviews linked to specific courses and instructors

Initial Data State

  • 887 unique instructor names in the system
  • 26 instructors have multiple CxIDs (taught at different schools)
  • Courses stored with codes like CSCI005 HM and slugs like CSCI005-HM
  • Average instructors per course: 48.26 (clearly erroneous)

The Core Problem

The same professor can have multiple CxIDs if they taught at different schools during their career, so several CxIDs can map to a single person (a many-to-one mapping). The legacy system's internal IDs did not align with the API's data source, and we needed to preserve historical accuracy for existing course reviews while migrating to the new system.

Migration Process

Phase 1: Code Slug Refactoring

The Challenge: Course codes contained complex formatting with department codes, numbers, and sometimes letter suffixes (e.g., "BIOL052R", "CSCI005 HM"). School codes are strictly 2-letter identifiers, and the code_slug field needed to be consistent for proper matching with API data.

Objective: Standardize the code_slug format across all courses in the database

Process:

  • Parse course codes to extract department and number components
  • Handle edge cases with letter suffixes
  • Ensure school codes match 2-letter format
  • Standardize code_slug format (e.g., CSCI005-HM)
  • Validate consistency between course codes and slugs

Result: Clean, consistent course identifiers ready for API matching
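
The parsing and standardization steps above can be sketched as follows. This is a minimal illustration, not the parser used in the migration: the regex, the 2-5 letter department range, the zero-padding to three digits, and the function name to_code_slug are all assumptions made for the example.

```python
import re

# Assumed shape: department letters, a 1-3 digit number, an optional letter
# suffix, and an optional 2-letter school code after a space or hyphen.
COURSE_CODE_RE = re.compile(
    r"^\s*([A-Za-z]{2,5})\s*(\d{1,3})([A-Za-z]{0,2})(?:[\s-]+([A-Za-z]{2}))?\s*$"
)


def to_code_slug(code: str) -> str:
    """Return a normalized slug such as 'CSCI005-HM' for a raw course code."""
    match = COURSE_CODE_RE.match(code)
    if match is None:
        raise ValueError(f"unrecognized course code: {code!r}")
    dept, number, suffix, school = match.groups()
    # Zero-pad the number so 'CSCI 5' and 'CSCI005' produce the same slug.
    slug = f"{dept.upper()}{number.zfill(3)}{suffix.upper()}"
    if school:
        slug += f"-{school.upper()}"
    return slug


if __name__ == "__main__":
    for raw in ["CSCI005 HM", "BIOL052R", "csci 5-hm"]:
        print(raw, "->", to_code_slug(raw))
```

Codes that do not fit the assumed pattern raise ValueError rather than being guessed at, so they can be collected and handled by hand.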

Phase 2: Historical Course Data Collection

Objective: Build complete CxID mappings from API historical data

The Breakthrough: Instead of trying to map legacy IDs directly, we realized that the API's historical data records which specific CxID was used for each course section. This solved the critical problem of determining instructor identity for past offerings.

Process:

  • Fetch historical course data from 2002-2025 using GET api/Courses/{termKey}
  • Extract instructor CxIDs associated with each course section
  • Map instructor names → CxIDs → courses → terms
  • Build accurate historical associations by school and term

Implementation:

  • Iterate through all term keys from Fall 2002 to Spring 2025
  • Store mappings in a structured format for later phases
  • Use MongoDB aggregation pipelines to process large datasets
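
As a rough sketch of the mapping step, the snippet below folds already-fetched section records into two lookups: instructor name to CxIDs, and (code_slug, term) to CxIDs. The record shape (code_slug, instructors, name, cxid), the placeholder term keys, and the function name are hypothetical; the API's actual response fields are not described here.

```python
from collections import defaultdict


def build_cxid_mappings(sections_by_term):
    """Build name -> CxIDs and (code_slug, term) -> CxIDs lookups.

    sections_by_term maps a term key to a list of section records. The
    record fields used here are assumed for illustration only.
    """
    name_to_cxids = defaultdict(set)
    offering_to_cxids = defaultdict(set)
    for term_key, sections in sections_by_term.items():
        for section in sections:
            slug = section["code_slug"]
            for instructor in section.get("instructors", []):
                name_to_cxids[instructor["name"]].add(instructor["cxid"])
                offering_to_cxids[(slug, term_key)].add(instructor["cxid"])
    return name_to_cxids, offering_to_cxids


# Placeholder data: the term keys are not the API's real termKey format.
sample = {
    "term-a": [
        {"code_slug": "CSCI005-HM",
         "instructors": [{"name": "Professor Smith", "cxid": 12345}]},
    ],
    "term-b": [
        {"code_slug": "CSCI005-PO",
         "instructors": [{"name": "Professor Smith", "cxid": 67890}]},
    ],
}
name_to_cxids, offering_to_cxids = build_cxid_mappings(sample)
```

Using sets means an instructor who appears in many sections of the same course contributes each CxID only once.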

Phase 3: Update Instructors Collection

Objective: Add CxID arrays to existing instructor documents

Changes:

  • Add cxids: [] array field to each instructor document
  • Populate from Phase 2 mappings
  • Maintain existing name field for compatibility
  • Keep legacy IDs temporarily during transition

Example Structure:

{
  _id: ObjectId(...),
  name: "Professor Smith",
  cxids: [12345, 67890],  // Multiple if taught at different schools
  // legacy fields preserved for transition period
}

Result: Instructor documents now contain both old and new identifier systems
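
One way to express this change is as a filter/update pair per instructor, where $set touches only the cxids field and leaves the name and legacy fields alone. The sketch below builds those pairs as plain dictionaries; matching on name and the helper name instructor_update are assumptions for illustration, and matching by name alone would need extra care if two instructors share a name.

```python
def instructor_update(name, name_to_cxids):
    """Return a filter/update pair that sets cxids for one instructor.

    Only the cxids field is written, so existing fields on the document
    (name, legacy IDs) are preserved.
    """
    cxids = sorted(name_to_cxids.get(name, ()))
    return {
        "filter": {"name": name},
        "update": {"$set": {"cxids": cxids}},
    }


# Hypothetical mapping of instructor names to the CxIDs seen in API data.
mapping = {"Professor Smith": {67890, 12345}}
op = instructor_update("Professor Smith", mapping)
unknown = instructor_update("Professor Nobody", mapping)
```

An instructor with no known CxIDs gets an empty array rather than a missing field, which matches the cxids: [] default described above.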

Phase 4: Update Courses Collection

Objective: Replace instructor IDs with CxID-based references

Changes:

  • Replace all_instructor_ids: [] with all_instructor_cxids: []
  • Populate using mappings from Phase 2
  • Match courses by code_slug and term to find correct CxIDs

Key Insight: Courses like "CSCI005 HM" and "CSCI005 PO" are the same course at different schools, but should have school-specific instructor lists based on who actually taught each section.

Result:

  • 98% course match rate between database and API
  • Average instructors per course: 48.26 → 2.81 (dramatic improvement)
  • Accurate historical instructor associations maintained
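
To make the field swap and the two reported metrics concrete, here is a small in-memory sketch. It assumes a lookup keyed by (code_slug, term) holding sets of CxIDs; the helper names and sample values are hypothetical, and the numbers it produces are for the toy data only, not the migration's results.

```python
from statistics import mean


def migrate_course(course, offering_to_cxids):
    """Return a copy of a course with all_instructor_cxids replacing all_instructor_ids."""
    slug = course["code_slug"]
    # Union the CxIDs from every term in which this exact slug was offered.
    cxids = {
        cxid
        for (offered_slug, _term), ids in offering_to_cxids.items()
        if offered_slug == slug
        for cxid in ids
    }
    migrated = {k: v for k, v in course.items() if k != "all_instructor_ids"}
    migrated["all_instructor_cxids"] = sorted(cxids)
    return migrated


def match_rate(courses):
    """Fraction of courses that received at least one CxID."""
    return sum(1 for c in courses if c["all_instructor_cxids"]) / len(courses)


def average_instructors(courses):
    return mean(len(c["all_instructor_cxids"]) for c in courses)


offerings = {
    ("CSCI005-HM", "term-a"): {111},
    ("CSCI005-HM", "term-b"): {111, 222},
    ("CSCI005-PO", "term-a"): {333},
}
courses = [
    {"code_slug": "CSCI005-HM", "all_instructor_ids": [1, 2, 3]},
    {"code_slug": "CSCI005-PO", "all_instructor_ids": [1, 2, 3]},
    {"code_slug": "BIOL052R", "all_instructor_ids": [4]},
]
migrated = [migrate_course(c, offerings) for c in courses]
```

Because the lookup is keyed by the full slug, CSCI005-HM and CSCI005-PO end up with separate instructor lists, which is the school-specific behavior described in the key insight.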

Phase 5: Update CourseReviews Collection

Objective: Link reviews to specific instructor CxIDs

Process:

  • Add instructor_cxid field to each review
  • Use Phase 2 mappings to determine correct CxID based on:
    • Course code and school
    • Term/year of the review
    • Instructor name
  • Maintain data integrity for historical reviews

Status: In progress. Some reviews are still missing an instructor_cxid because the historical mappings could not determine the instructor with certainty.
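
A minimal sketch of the lookup, assuming two mappings are available: instructor name to CxIDs, and (code_slug, term) to the CxIDs that taught that offering. The review field names and the function name are hypothetical. The idea is to intersect the two candidate sets and accept the result only when exactly one CxID remains.

```python
def resolve_review_cxid(review, name_to_cxids, offering_to_cxids):
    """Return the single CxID consistent with a review, or None if unclear."""
    by_name = name_to_cxids.get(review["instructor_name"], set())
    by_offering = offering_to_cxids.get(
        (review["code_slug"], review["term"]), set()
    )
    candidates = by_name & by_offering
    if len(candidates) == 1:
        return next(iter(candidates))
    # Zero or several candidates: leave the review unresolved.
    return None


names = {"Professor Smith": {12345, 67890}}
offerings = {("CSCI005-HM", "term-a"): {12345, 555}}

resolved = resolve_review_cxid(
    {"instructor_name": "Professor Smith", "code_slug": "CSCI005-HM", "term": "term-a"},
    names, offerings,
)
unresolved = resolve_review_cxid(
    {"instructor_name": "Professor Smith", "code_slug": "CSCI005-PO", "term": "term-a"},
    names, offerings,
)
```

Returning None instead of guessing keeps uncertain reviews visibly unmigrated, which is consistent with the missing data noted in the status above.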

Technical Implementation

Challenge: Complex Course Code Formatting

Issue: Course codes contained inconsistent formatting with:

  • Department codes of varying lengths
  • Course numbers with optional letter suffixes (e.g., "BIOL052R")
  • School codes that needed to be exactly 2 letters
  • Various spacing and delimiter patterns

Solution:

  • Built robust parsing logic using regex patterns
  • Standardized code_slug format across all courses
  • Validated school codes strictly as 2-letter identifiers
  • Handled edge cases through iterative testing

API Integration

Primary Endpoint:

  • GET api/Courses/{termKey} - Historical course data by term
  • Term keys from Fall 2002 through Spring 2025
  • Systematic iteration through all historical terms
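
The iteration over terms can be sketched as a generator of (season, year) pairs. This assumes only Fall and Spring terms are of interest, and it deliberately stops short of producing the API's termKey strings, whose exact format is not given here; each pair would still need to be converted to a termKey before calling the endpoint.

```python
def academic_terms(first_fall_year=2002, last_spring_year=2025):
    """Yield (season, year) pairs from Fall of the first year to Spring of the last."""
    for year in range(first_fall_year, last_spring_year):
        yield ("Fall", year)
        yield ("Spring", year + 1)


terms = list(academic_terms())
```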

Results

Migration Outcomes

  • 98% course match rate between database and API data
  • Average instructors per course: 48.26 → 2.81 (significant improvement)
  • Data quality: Instructor lists now reflect who actually taught each course section, rather than the inflated legacy associations
  • Historical accuracy: Maintained for existing course reviews where mappings were available

System State

Completed:

  • Phase 1: Code slug standardization ✓
  • Phase 2: Historical CxID mapping collection ✓
  • Phase 3: Instructors collection updated ✓
  • Phase 4: Courses collection migrated to CxIDs ✓

In Progress:

  • Phase 5: CourseReviews collection (some missing data for older reviews)

Before vs. After

Before:

  • Legacy internal IDs disconnected from data source
  • No clear mapping to API data
  • Inflated instructor counts due to mapping errors
  • Inconsistent code_slug formatting

After:

  • Direct CxID alignment with Pomona API
  • Accurate historical instructor associations (where data available)
  • Clean, consistent course identifiers
  • Realistic instructor counts per course

@vercel

vercel bot commented Nov 19, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
pomonastudents Ready Ready Preview Comment Nov 19, 2025 1:29am
