@ghukill ghukill commented Jul 2, 2025

Purpose and background context

Previously the articles feed logic was producing an "all time" XML file with no date filtering applied. We were told
that, as this file grew with each run, it was causing problems for Symplectic.

The articles feed will now respect a new env var ARTICLES_PUBLISH_DAYS_PAST that, when applied, will limit the articles included in the output to only those with a PUBLISH_DATE on or after that many days ago. So if today is 2025-07-01, and this env var is set for 90 days, it will limit to rows where PUBLISH_DATE is >= 2025-04-02 (roughly three months).
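As a rough illustration of the date arithmetic described above, here is a minimal Python sketch. The function name `publish_date_cutoff` and its `env` parameter are hypothetical and not taken from the Carbon codebase, and treating an unset variable as "no filtering" is an assumption. The real filtering happens in the Oracle query, not in Python.

```python
import os
from datetime import date, timedelta
from typing import Mapping, Optional


def publish_date_cutoff(
    today: date, env: Optional[Mapping[str, str]] = None
) -> Optional[date]:
    """Return the earliest PUBLISH_DATE to include, or None when no limit applies."""
    env = os.environ if env is None else env
    raw = env.get("ARTICLES_PUBLISH_DAYS_PAST")
    if not raw:
        # Assumption: an unset or empty value means no date filtering.
        return None
    return today - timedelta(days=int(raw))


# 90 days before 2025-07-01
print(publish_date_cutoff(date(2025, 7, 1), {"ARTICLES_PUBLISH_DAYS_PAST": "90"}))
# Env var absent: no cutoff
print(publish_date_cutoff(date(2025, 7, 1), {}))
```

Exact arithmetic puts the 90-day cutoff for 2025-07-01 at 2025-04-02, so "three months" is only an approximation of the window.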

This will allow our weekly and/or monthly runs of Carbon to set a threshold like 90 days. Each run will be significantly smaller, likely around 5-10 MB instead of 1 GB. There will be some overlap in the XML file between runs, but that is just fine and was the case prior.

A Jira ticket has been created to set the env var for 90 days in Stage and Prod.

Lastly, this is probably worth reviewing commit by commit, as a couple of small tidying commits at the beginning can be largely ignored.

Double lastly, I've opted for a path of little resistance with passing along env vars and configs, while trying to stay mostly consistent with previous conventions. It's not perfect, but my hope is that it's sufficient for this update.

How can a reviewer manually see the effects of these changes?

Though it is possible to run Carbon locally on Apple Silicon without a Docker container by following these instructions to set up an i386 environment, it can also be run via a Docker container.

1- Build docker image:

make dist-dev

For the next step, the value for the placeholder <CONNECTION_STRING_PLACEHOLDER> has been shared in a secure channel. Insert it between the single quotes in the command below.

2- Run docker container with env vars and mounts in command:

docker run \
-e WORKSPACE=dev \
-e LOG_LEVEL=debug \
-e SENTRY_DSN=None \
-e SNS_TOPIC_ARN="will-not-use" \
-e SYMPLECTIC_FTP_JSON='{}' \
-e SYMPLECTIC_FTP_PATH="will-not-use" \
-e DATAWAREHOUSE_CLOUDCONNECTOR_JSON='<CONNECTION_STRING_PLACEHOLDER>' \
-e FEED_TYPE=articles \
-e ARTICLES_PUBLISH_DAYS_PAST=180 \
-v ./output:/tmp \
carbon-dev \
-o /tmp/articles_date_limited.xml \
--ignore_sns_logging

3- Review outputs

The process should have logging similar to:

2025-07-02 13:55:19,599 INFO carbon.database.run_connection_test() line 113: Testing connection to the Data Warehouse
2025-07-02 13:55:46,013 INFO carbon.database.run_connection_test() line 129: Successfully connected to the Data Warehouse: 10.2.0.5.0
2025-07-02 13:55:46,087 INFO carbon.cli.main() line 83: Carbon run for the 'articles' feed has started.
2025-07-02 13:55:46,098 INFO root.configure_logger() line 62: Logger 'root' configured with level=INFO
2025-07-02 13:55:46,098 INFO root.configure_sentry() line 78: No Sentry DSN found, exceptions will not be sent to Sentry
2025-07-02 13:56:45,663 INFO carbon.app.write() line 82: The 'articles' feed has processed 2132 records.
2025-07-02 13:56:45,665 INFO carbon.cli.main() line 93: Carbon run has successfully completed.

And in the local output folder you should see a new file articles_date_limited.xml that has 2,132 records. This is much smaller than the previous all-time runs, which clocked in around 300k records. These 2k records result in a file of around 5-10 MB, which should work just fine for Symplectic.
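If you want to compare runs with different windows, a small sketch like the one below can pull the record count out of the log output. It assumes the log line format shown above stays the same; the helper name `processed_records` is hypothetical and not part of Carbon.

```python
import re
from typing import Optional, Tuple

# Matches the "has processed N records" line from the sample log above.
RECORD_COUNT = re.compile(
    r"The '(?P<feed>\w+)' feed has processed (?P<count>\d+) records\."
)


def processed_records(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (feed name, record count) from Carbon log text, or None if absent."""
    match = RECORD_COUNT.search(log_text)
    if match is None:
        return None
    return match["feed"], int(match["count"])


sample = (
    "2025-07-02 13:56:45,663 INFO carbon.app.write() line 82: "
    "The 'articles' feed has processed 2132 records."
)
print(processed_records(sample))
```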

Fiddling with the passed env var ARTICLES_PUBLISH_DAYS_PAST=180 will change the time window for the run.

Includes new or updated dependencies?

YES: all dependencies updated

Changes expectations for external applications?

YES: if the env var ARTICLES_PUBLISH_DAYS_PAST is set in Stage/Prod, the articles XML feed will be date limited by this number of days.

What are the relevant tickets?

* https://mitlibraries.atlassian.net/browse/IN-1333

ghukill added 5 commits July 1, 2025 14:12
- dependencies updated
- some logging related linting rules ignored

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1333

Why these changes are being introduced:

Previously the articles feed logic was producing an "all time"
XML file with no date filtering applied.  It was shared with us
that, as this file grew each run, it was presenting problems
for Symplectic.

How this addresses that need:

The articles feed will now respect a new env var
'ARTICLES_PUBLISH_DAYS_PAST' that, when applied, will limit the
articles included in the output to only those with a PUBLISH_DATE
on or after that many days ago.  So if today is 2025-07-01, and
this env var is set for 90 days, it will limit to rows where
PUBLISH_DATE is >= 2025-04-02, roughly three months back.

This will allow our weekly and/or monthly runs of Carbon to set
a threshold like 90 days.  Each run will be significantly smaller,
likely around 5-10 MB instead of 1 GB.  There will be some overlap
in the XML file between runs, but that is just fine and was the
case prior.

Side effects of this change:
* The Articles XML output will be significantly smaller when
the env var is applied.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1333
@ghukill ghukill requested a review from a team July 2, 2025 14:04
@ghukill ghukill marked this pull request as ready for review July 2, 2025 14:04
@jonavellecuerdo jonavellecuerdo self-assigned this Jul 2, 2025
Comment on lines +194 to +195
@property
def query(self) -> Select: # type: ignore[override]
Contributor Author

Moved this into a @property to be consistent with ArticlesXmlFeed.

Comment on lines +115 to +116
@property
def query(self) -> Select: # type: ignore[override]
Contributor Author

@ghukill ghukill Jul 2, 2025

Moving into a @property allows for two things:

  1. a bit of logic during construction
  2. instantiation of a Config() object after the testing harness and env vars are set up

Contributor

Hmm, what do you mean by "testing harness"? 🤔

Contributor Author

Sorry, could have been more clear.

We set env vars in conftest.py as part of the _test_env fixture, but these aren't set until after imports take place in our files.

Originally, I had a:

config = Config()

at the top of feed.py, but this failed because the "testing harness" -- which includes the fixtures and generally anything else you'd expect to be "ready" for testing -- was not fully ready, and so the required env vars weren't set.

It worked locally when I had env vars set, and it would have worked in prod where they are also set, but not for testing.
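To make the import-time versus access-time difference concrete, here is a minimal sketch of the pattern. The `Config` and `FeedSketch` classes below are hypothetical stand-ins, not the real Carbon classes, and `WORKSPACE` is used only as an example of a required env var.

```python
import os


class Config:
    """Hypothetical stand-in for a config object that reads env vars when created."""

    def __init__(self) -> None:
        # Raises KeyError if the required env var has not been set yet.
        self.workspace = os.environ["WORKSPACE"]


class FeedSketch:
    """Hypothetical feed class; only the lazy Config() pattern matters here."""

    @property
    def query(self) -> str:
        # Config() is created on access, not at import time, so env vars
        # set later (for example by a pytest fixture) are picked up.
        config = Config()
        return f"articles query for workspace={config.workspace}"


# Simulate import time: the required env var is not set yet.
os.environ.pop("WORKSPACE", None)
feed = FeedSketch()  # fine, nothing has read the environment so far

# Simulate the fixture running after imports.
os.environ["WORKSPACE"] = "test"
print(feed.query)
```

A module-level `config = Config()` in the same file would have raised at the point where `WORKSPACE` was still unset, which is the failure described above.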

Comment on lines +167 to +169
Note: the use of SQLite for testing makes it quite difficult to test date filtering
of the 'm/d/yyyy' format for PUBLISH_DATE found in the data warehouse. This test
confirms that the Oracle SQL looks as expected.
Contributor Author

@ghukill ghukill Jul 2, 2025

Please note the "Note"!

I could have expanded more in the test note, but the reason is that SQLite -- which is used for testing -- has neither a TO_DATE() function nor SYSDATE, both of which are used in the updated Oracle query.

Suffice to say I spent some cycles on ways around this, and from my POV it just wasn't worth the complexity to simulate the query when we can test that the query object is updated when this env var is present.
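For anyone curious, the SQLite limitation is easy to reproduce with the standard library alone. This is only a sketch of the problem, not of the test in this PR; the helper name `sqlite_supports` is hypothetical.

```python
import sqlite3


def sqlite_supports(expression: str) -> bool:
    """Return True if an in-memory SQLite database can evaluate the SQL expression."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"SELECT {expression}")
    except sqlite3.OperationalError:
        # SQLite reports unknown functions and unknown identifiers this way.
        return False
    else:
        return True
    finally:
        conn.close()


# Oracle-specific pieces used by the date-limited query
print(sqlite_supports("TO_DATE('7/1/2025', 'MM/DD/YYYY')"))
print(sqlite_supports("SYSDATE"))
# SQLite's own date function, for contrast
print(sqlite_supports("date('2025-07-01')"))
```

Both Oracle expressions fail under SQLite while its built-in `date()` works, which is why asserting on the generated query is the lower-complexity option here.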

Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Nice work! The proposed solution addresses the request by Symplectic to reduce the size of the Articles feed XML file with minimally invasive changes to the Carbon application. Ran the instructions and worked as described!

@ghukill ghukill merged commit 1f62af0 into main Jul 2, 2025
4 checks passed