Skip to content

Conversation

@PredictiveManish
Copy link
Contributor

Description
This PR converts current_release_date and latest_release_date from string type to Date format.

This PR fixes #2663

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Signed-off-by: PredictiveManish <[email protected]>
Signed-off-by: PredictiveManish <[email protected]>
Copy link
Contributor

@MoralCode MoralCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start overall! It seems like you were very mindful of existing data, you just happened to implement the migration in a very old migration file rather than creating a new one with a higher number such that existing databases know they need to update to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct file to update. When making schema changes, a new file should be created (i.e. create a new migration file with a higer number and use the upgrade() and downgrade() functions to apply/undo the schema change (especially being mindful to handle any data that may exist in someones DB at the time of this change)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, working in that direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unnecessary changes

Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some assumptions in this PR that i do not think hold up against the nature of the data sources for LibYear.

latest_version = Column(String)
current_release_date = Column(String)
latest_release_date = Column(String)
current_release_date = Column(TIMESTAMP(precision=0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it is logical to convert dates to timestamps or another date type, the data is stored, in this instance, as a VARCHAR in the table. This almost surely arises from the inconsistency of package managers and how we obtain this data. I don't think this change is going to work without breaking some of the logic embedded in the source data; because the source data is heterogeneous. This change is simply likely to be incompatible, as illogical as that seems.

Copy link
Contributor

@MoralCode MoralCode Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgoggins as the operator of the largest augur instance I know of, can you do an audit of these fields to see if you can actually observe any such date inconsistencies that would be problematic? or at least grab a list of samples of the variety of date values in this column so we can understand the kind of incompatibility you suspect/are thinking about?

Dates are one of the more structured/precisely defined types and python has some pretty good date parsing libraries. If Augur downstreams can at least somewhat use the data in this column, i feel like we can also help take on some of that data cleaning effort to make things easier on downstream projects. And if there are serious compatibility issues, we could introduce this change by creating a new text field where we populate the values that cant be processed into a valid date.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MoralCode : After looking more closely at the data we are saving, I think some of the existing values have timezones, and others do not. They do all appear to be "valid" dates. I think the challenge then becomes ensuring the proper conversion of already in place data, which will require some experimentation, I think.

@PredictiveManish :

I Ran this on my instance with ~150k repos and got no rows that were "not convertable" to a timestamped TimeZoned datatype:

DROP TABLE IF EXISTS _rdl_bad;

CREATE TEMP TABLE _rdl_bad AS
WITH s AS (
  SELECT
    t.ctid AS row_id,
    t.current_release_date,
    t.latest_release_date,

    COALESCE(
      CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
           THEN t.current_release_date::timestamptz END,
      CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
           THEN (t.current_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}$'
           THEN (t.current_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~* '^\d{8}$'
           THEN to_timestamp(t.current_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
           THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
           THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
           THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.current_release_date ~ '^\d+(\.\d+)?$'
           THEN to_timestamp(t.current_release_date::double precision) END
    ) AS parsed_current,

    COALESCE(
      CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
           THEN t.latest_release_date::timestamptz END,
      CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
           THEN (t.latest_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}$'
           THEN (t.latest_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~* '^\d{8}$'
           THEN to_timestamp(t.latest_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
           THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
           THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
           THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
      CASE WHEN t.latest_release_date ~ '^\d+(\.\d+)?$'
           THEN to_timestamp(t.latest_release_date::double precision) END
    ) AS parsed_latest
  FROM repo_deps_libyear t
)
SELECT row_id, current_release_date, latest_release_date
FROM s
WHERE (current_release_date IS NOT NULL AND parsed_current IS NULL)
   OR (latest_release_date   IS NOT NULL AND parsed_latest  IS NULL);

select * from _rdl_bad; 

========

THEN, I think the upgrade script needs to do a bit more than change the datatype. I think best practice would be to check if any rows in an augur db are not conversion-ready ... but it also appears that we are doing a pretty good job storing the text in a timestamp convertable format... so, i think something like this in the update/SQLAlchemy script is needed ... I do not entirely trust straight up conversion, especially since there are a range of distinct, but valid formats to convert.

The new datatype should be a TIMEZONE'd TIMESTAMP if that's not it already (I think its just a timestamp now).

ALTER TABLE repo_deps_libyear
  ALTER COLUMN current_release_date TYPE timestamptz
    USING (
      COALESCE(
        CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
             THEN current_release_date::timestamptz END,
        CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
             THEN (current_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}$'
             THEN (current_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~* '^\d{8}$'
             THEN to_timestamp(current_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
             THEN to_timestamp(current_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
             THEN to_timestamp(current_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
             THEN to_timestamp(current_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN current_release_date ~ '^\d+(\.\d+)?$'
             THEN to_timestamp(current_release_date::double precision) END
      )
    ),
  ALTER COLUMN latest_release_date TYPE timestamptz
    USING (
      COALESCE(
        CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
             THEN latest_release_date::timestamptz END,
        CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
             THEN (latest_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}$'
             THEN (latest_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~* '^\d{8}$'
             THEN to_timestamp(latest_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
             THEN to_timestamp(latest_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
             THEN to_timestamp(latest_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
             THEN to_timestamp(latest_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
        CASE WHEN latest_release_date ~ '^\d+(\.\d+)?$'
             THEN to_timestamp(latest_release_date::double precision) END
      )
    );

add_user_repo_table_12()
add_materialized_views_13()
set_repo_name_path_null_14()
convert_repo_deps_libyear_dates_15()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a serialization and deserialization of all schema changes over time. Current instances will break or be inconsistent and broken if we add things mid-construction like this; It'll be ok on any new instances, but problematic for existing ones.

@sgoggins sgoggins self-assigned this Nov 11, 2025
@sgoggins sgoggins added the database Related to Augur's unifed data model label Nov 11, 2025
@MoralCode MoralCode self-assigned this Nov 11, 2025
@PredictiveManish
Copy link
Contributor Author

@MoralCode are you working on this now? or I can continue? as I got to know that now changes need new files and this is 35th and I will update- remove the suggestions and update accordingly.

@MoralCode
Copy link
Contributor

This change is going to require more than just a schema change. We discussed it briefly at the end of today's Augur call and Sean mentioned that this is likely to require a fair degree of logic changes as well, especially to make sure that various error cases get handled when converting date values (which come from various sources and may all be in different formats).

Also since this is a breaking schema change, I think its going to get through review faster if an existing maintainer can handle it.

I can't stop you from working on it, but i will likely be adopting this change and working with the maintainers to figure out the best way to handle it (along with lots of other database changes) - especially given similar/related issues like #1370

@PredictiveManish
Copy link
Contributor Author

This change is going to require more than just a schema change. We discussed it briefly at the end of today's Augur call and Sean mentioned that this is likely to require a fair degree of logic changes as well, especially to make sure that various error cases get handled when converting date values (which come from various sources and may all be in different formats).

Also since this is a breaking schema change, I think its going to get through review faster if an existing maintainer can handle it.

I can't stop you from working on it, but i will likely be adopting this change and working with the maintainers to figure out the best way to handle it (along with lots of other database changes) - especially given similar/related issues like #1370

Ok. got your point will work on this so that can understand database system in the augur deeply and manage to solve other issues related to database.

@sgoggins
Copy link
Member

sgoggins commented Dec 9, 2025

@MoralCode : We probably should discuss this one and decide how it intersects with the other db updates you've been working on.

@sgoggins sgoggins added the discussion Seeking active feedback, usually for items under active development label Dec 9, 2025
@MoralCode
Copy link
Contributor

MoralCode commented Dec 11, 2025

We probably should discuss this one and decide how it intersects with the other db updates you've been working on.

This is a pretty major fix, possibly a breaking change that is very likely to require some soecial handling/special consideration for how to migrate existing data, which can be in just about any date format. I dont really know how to resolve this cleanly and its probably going to be a pretty big effort that may not be worth it for a while.

@MoralCode MoralCode modified the milestone: v0.93.0 Dec 11, 2025
@MoralCode
Copy link
Contributor

Closing this PR as I dont see it getting merged anytime soon due to the aforementioned complexities surrounding handling peoples existing data from Augur instances of various ages

@MoralCode MoralCode closed this Dec 19, 2025
@PredictiveManish PredictiveManish deleted the string-to-date branch December 19, 2025 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

database Related to Augur's unifed data model discussion Seeking active feedback, usually for items under active development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

repo_deps_libyear current_release_date and latest_release_date data type

3 participants