-
Notifications
You must be signed in to change notification settings - Fork 2
IN 1333 - limit article feed by date via env var #162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- dependencies updated - some logging related linting rules ignored Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1333
Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1333
Why these changes are being introduced: Previously the articles feed logic was producing an "all time" XML file with no date filtering applied. It was shared with us that, as this file grew each run, it was presenting problems for Symplectic. How this addresses that need: The articles feed will now respect a new env var 'ARTICLES_PUBLISH_DAYS_PAST' that, when applied, will limit the articles included in the output to only those with a PUBLISH_DATE on or after that many days ago. So if today is 2025-07-01, and this env var is set for 90 days, it will limit to rows where PUBLISH_DATE is >= 2025-04-01, roughly speaking. This will allow our weekly and/or monthly runs of Carbon to set a threshold like 90 days. Each run will be significantly smaller, likely around 5-10mb instead of 1gb. There will be some overlap in the XML file between runs, but that is just fine and was the case prior. Side effects of this change: * The Articles XML output will be significantly smaller when the env var is applied. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1333
| @property | ||
| def query(self) -> Select: # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this into a @property to be consistent with ArticlesXmlFeed.
| @property | ||
| def query(self) -> Select: # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving into a @property allows for two things:
- a bit of logic during construction
- instantiation of a
Config()object after the testing harness and env vars are setup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what do you mean by "testing harness"? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, could have been more clear.
We set env vars in conftest.py as part of the _test_env fixture, but these aren't set until after imports take place in our files.
Originally, I had a:
config = Config()at the top of feed.py, but this failed because the "testing harness" -- which includes the fixtures and generally anything else you'd expect to be "ready" for testing -- was not fully ready, and so the required env vars weren't set.
It worked locally when I had env vars set, and it would have worked in prod where they are also set, but not for testing.
| Note: the use of SQLite for testing makes it quite difficult to test date filtering | ||
| of the 'm/d/yyyy' format for PUBLISH_DATE found in the data warehouse. This test | ||
| confirms that the Oracle SQL looks as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note the "Note"!
I could have expanded more in the test note, but this is because SQLite -- which is used for testing -- doesn't have a TO_DATE() function or SYSDATE constant, both of which are used in the updated Oracle query.
Suffice to say I spent some cycles on ways around this.... and from my POV it just wasn't worth the complexity to simulate the query when we can test that the query object is updated per the presence of this env var.
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! The proposed solution addresses the request by Symplectic to reduce the size of the Articles feed XML file with minimally invasive changes to the Carbon application. Ran the instructions and worked as described!
Purpose and background context
Previously the articles feed logic was producing an "all time" XML file with no date filtering applied. It was shared with us
that, as this file grew each run, it was presenting problems for Symplectic.
The articles feed will now respect a new env var
ARTICLES_PUBLISH_DAYS_PASTthat when applied will limit the articles included in the output to only those with aPUBLISH_DATEon or after that many days ago. So if today is 2025-07-01, and this env var is set for 90 days, it will limit to rows wherePUBLISH_DATEis >= 2025-04-01 (3 months roughly speaking).This will allow our weekly and/or monthly runs of Carbon to set a threshold like 90 days. Each run will be significantly smaller, likely around 5-10mb instead of 1gb. There will be some overlap in the XML file between runs, but that is just fine and was the case prior.
A Jira ticket has been created to set the env var for 90 days in Stage and Prod.
Lastly, probably worth reviewing per commit, as there were a couple of little tidying ones at the beginning that can be largely ignored.
Double lastly, I've opted for a path of little resistance with passing along env vars and configs, while trying to stay mostly consistent with previous conventions. It's not perfect, but my hope is that it's sufficient for this update.
How can a reviewer manually see the effects of these changes?
Though it is possible to run Carbon locally on Apple Silicon without a Docker container by following these instructions to setup an
i386environment, it can also be performed via a Docker container.1- Build docker image:
For the next step, the placeholder
<CONNECTION_STRING_PLACEHOLDER>has been shared in a secure channel. It should be inserted between the single quotes in the command below.2- Run docker container with env vars and mounts in command:
3- Review outputs
The process should have logging similar to:
And in the local
outputfolder you should see a new filearticles_date_limited.xmlthat has 2,132 records. This is much smaller than the previous all-time runs which clocked in around 300k records. These 2k records result in a file around 5-10mb, and should work just fine for Sympmlectic.Fiddling with the passed env var
ARTICLES_PUBLISH_DAYS_PAST=180will change the time window for the run.Includes new or updated dependencies?
YES: all dependencies updated
Changes expectations for external applications?
YES: if the env var
ARTICLES_PUBLISH_DAYS_PASTis set in Stage/Prod, the articles XML feed will be date limited by this number of days.What are the relevant tickets?