-
Notifications
You must be signed in to change notification settings - Fork 964
Fix: String to Date #2662 #3384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: PredictiveManish <[email protected]>
Signed-off-by: PredictiveManish <[email protected]>
MoralCode
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start overall! It seems like you were very mindful of existing data, you just happened to implement the migration in a very old migration file rather than creating a new one with a higher number such that existing databases know they need to update to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct file to update. When making schema changes, a new file should be created (i.e. create a new migration file with a higer number and use the upgrade() and downgrade() functions to apply/undo the schema change (especially being mindful to handle any data that may exist in someones DB at the time of this change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, working in that direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the unnecessary changes
sgoggins
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some assumptions in this PR that i do not think hold up against the nature of the data sources for LibYear.
| latest_version = Column(String) | ||
| current_release_date = Column(String) | ||
| latest_release_date = Column(String) | ||
| current_release_date = Column(TIMESTAMP(precision=0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is logical to convert dates to timestamps or another date type, the data is stored, in this instance, as a VARCHAR in the table. This almost surely arises from the inconsistency of package managers and how we obtain this data. I don't think this change is going to work without breaking some of the logic embedded in the source data; because the source data is heterogeneous. This change is simply likely to be incompatible, as illogical as that seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgoggins as the operator of the largest augur instance I know of, can you do an audit of these fields to see if you can actually observe any such date inconsistencies that would be problematic? or at least grab a list of samples of the variety of date values in this column so we can understand the kind of incompatibility you suspect/are thinking about?
Dates are one of the more structured/precisely defined types and python has some pretty good date parsing libraries. If Augur downstreams can at least somewhat use the data in this column, i feel like we can also help take on some of that data cleaning effort to make things easier on downstream projects. And if there are serious compatibility issues, we could introduce this change by creating a new text field where we populate the values that cant be processed into a valid date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MoralCode : After looking more closely at the data we are saving, I think some of the existing values have timezones, and others do not. They do all appear to be "valid" dates. I think the challenge then becomes ensuring the proper conversion of already in place data, which will require some experimentation, I think.
I Ran this on my instance with ~150k repos and got no rows that were "not convertable" to a timestamped TimeZoned datatype:
DROP TABLE IF EXISTS _rdl_bad;
CREATE TEMP TABLE _rdl_bad AS
WITH s AS (
SELECT
t.ctid AS row_id,
t.current_release_date,
t.latest_release_date,
COALESCE(
CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
THEN t.current_release_date::timestamptz END,
CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
THEN (t.current_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~* '^\d{4}-\d{2}-\d{2}$'
THEN (t.current_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~* '^\d{8}$'
THEN to_timestamp(t.current_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
THEN to_timestamp(t.current_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.current_release_date ~ '^\d+(\.\d+)?$'
THEN to_timestamp(t.current_release_date::double precision) END
) AS parsed_current,
COALESCE(
CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
THEN t.latest_release_date::timestamptz END,
CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
THEN (t.latest_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~* '^\d{4}-\d{2}-\d{2}$'
THEN (t.latest_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~* '^\d{8}$'
THEN to_timestamp(t.latest_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
THEN to_timestamp(t.latest_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN t.latest_release_date ~ '^\d+(\.\d+)?$'
THEN to_timestamp(t.latest_release_date::double precision) END
) AS parsed_latest
FROM repo_deps_libyear t
)
SELECT row_id, current_release_date, latest_release_date
FROM s
WHERE (current_release_date IS NOT NULL AND parsed_current IS NULL)
OR (latest_release_date IS NOT NULL AND parsed_latest IS NULL);
select * from _rdl_bad; ========
THEN, I think the upgrade script needs to do a bit more than change the datatype. I think best practice would be to check if any rows in an augur db are not conversion-ready ... but it also appears that we are doing a pretty good job storing the text in a timestamp convertable format... so, i think something like this in the update/SQLAlchemy script is needed ... I do not entirely trust straight up conversion, especially since there are a range of distinct, but valid formats to convert.
The new datatype should be a TIMEZONE'd TIMESTAMP if that's not it already (I think its just a timestamp now).
ALTER TABLE repo_deps_libyear
ALTER COLUMN current_release_date TYPE timestamptz
USING (
COALESCE(
CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
THEN current_release_date::timestamptz END,
CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
THEN (current_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~* '^\d{4}-\d{2}-\d{2}$'
THEN (current_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~* '^\d{8}$'
THEN to_timestamp(current_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
THEN to_timestamp(current_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
THEN to_timestamp(current_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
THEN to_timestamp(current_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN current_release_date ~ '^\d+(\.\d+)?$'
THEN to_timestamp(current_release_date::double precision) END
)
),
ALTER COLUMN latest_release_date TYPE timestamptz
USING (
COALESCE(
CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})$'
THEN latest_release_date::timestamptz END,
CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2}(\.\d+)?)?$'
THEN (latest_release_date::timestamp) AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~* '^\d{4}-\d{2}-\d{2}$'
THEN (latest_release_date::date)::timestamp AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~* '^\d{8}$'
THEN to_timestamp(latest_release_date, 'YYYYMMDD') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}:\d{2}$'
THEN to_timestamp(latest_release_date, 'MM/DD/YYYY HH24:MI:SS') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4} \d{2}:\d{2}$'
THEN to_timestamp(latest_release_date, 'MM/DD/YYYY HH24:MI') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~* '^\d{1,2}/\d{1,2}/\d{4}$'
THEN to_timestamp(latest_release_date, 'MM/DD/YYYY') AT TIME ZONE current_setting('TimeZone') END,
CASE WHEN latest_release_date ~ '^\d+(\.\d+)?$'
THEN to_timestamp(latest_release_date::double precision) END
)
);| add_user_repo_table_12() | ||
| add_materialized_views_13() | ||
| set_repo_name_path_null_14() | ||
| convert_repo_deps_libyear_dates_15() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a serialization and deserialization of all schema changes over time. Current instances will break or be inconsistent and broken if we add things mid-construction like this; It'll be ok on any new instances, but problematic for existing ones.
|
@MoralCode are you working on this now? or I can continue? as I got to know that now changes need new files and this is 35th and I will update- remove the suggestions and update accordingly. |
|
This change is going to require more than just a schema change. We discussed it briefly at the end of today's Augur call and Sean mentioned that this is likely to require a fair degree of logic changes as well, especially to make sure that various error cases get handled when converting date values (which come from various sources and may all be in different formats). Also since this is a breaking schema change, I think its going to get through review faster if an existing maintainer can handle it. I can't stop you from working on it, but i will likely be adopting this change and working with the maintainers to figure out the best way to handle it (along with lots of other database changes) - especially given similar/related issues like #1370 |
Ok. got your point will work on this so that can understand database system in the augur deeply and manage to solve other issues related to database. |
Signed-off-by: PredictiveManish <[email protected]>
Signed-off-by: PredictiveManish <[email protected]>
|
@MoralCode : We probably should discuss this one and decide how it intersects with the other db updates you've been working on. |
This is a pretty major fix, possibly a breaking change that is very likely to require some soecial handling/special consideration for how to migrate existing data, which can be in just about any date format. I dont really know how to resolve this cleanly and its probably going to be a pretty big effort that may not be worth it for a while. |
|
Closing this PR as I dont see it getting merged anytime soon due to the aforementioned complexities surrounding handling peoples existing data from Augur instances of various ages |
Description
This PR converts
current_release_dateandlatest_release_datefrom string type to Date format.This PR fixes #2663
Notes for Reviewers
Signed commits