Add 'skip_last_n_seconds' config parameter for incremental replication #1
very nice work looking into the python code and adding the changes! I have some comments 🙇
tests/unit/test_incremental.py
'limit': None,
'offset': None
is the offset testable with a unit test?
my understanding after checking the unit test code is that it's not testable without basically rewriting the test suite, and i don't really feel the energy to learn python unit testing and create a whole new test suite for this feature 😆
got it, I'm guessing we don't run unit tests in CI/CD for this repo either, right? in that case even running them would not give much benefit. But we probably have to be more careful when QA testing our stuff then
elif replication_key_sql_datatype.startswith('timestamp'):
    offset = f' - interval \'{params["offset"]} seconds\''
else:
    offset = f' - {params["offset"]}'
I'm not really clear on this:
"It can be used by integer ID based replication as well, in that case it will import the last offset number of rows before the latest replicated id."
do you have an example? meaning, if we use an incrementing serial primary key and put a value like "1000", and the last ID is 9999, we will only replicate up to ID 8999?
If the above is correct, I think maybe the term offset can be avoided, because it's very similar to pagination terminology (limit and offset) yet behaves differently, especially the interval one.
There are also 2 units here (interval uses seconds, and incrementing ID uses count) and they are both just expressed as a number (offset: N). This could be confusing, maybe we can separate them into two different configs?
skip_last_n_rows: 1000
skip_last_n_seconds: 900
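To make the two-unit split concrete, here is a minimal sketch of how separate config keys could map to the SQL fragment built in the diff above. The function name `skip_suffix` is hypothetical, and both config keys are only the names proposed in this comment, not an existing PipelineWise API. Note that for an integer key the value is subtracted from the ID, so it counts ID values rather than rows when the sequence has gaps.

```python
def skip_suffix(replication_key_sql_datatype, config):
    """Return the SQL fragment to subtract from the replication key's upper bound.

    Hypothetical helper: `skip_last_n_seconds` and `skip_last_n_rows` are the
    config names suggested in this thread, nothing more.
    """
    if replication_key_sql_datatype.startswith('timestamp'):
        seconds = config.get('skip_last_n_seconds')
        # int() keeps a stray string in the config out of the SQL text.
        return f" - interval '{int(seconds)} seconds'" if seconds else ''
    rows = config.get('skip_last_n_rows')
    return f' - {int(rows)}' if rows else ''


config = {'skip_last_n_rows': 1000, 'skip_last_n_seconds': 900}
print(skip_suffix('timestamp without time zone', config))  #  - interval '900 seconds'
print(skip_suffix('integer', config))                      #  - 1000
```

With one key per unit, a reader of the config never has to know the column type to tell what the number means.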
i think i'll skip implementing skip_last_n_rows in that case, as we don't really need it right now, let's have only one new config value.
sounds good! 👍
LGTM!
@solteszad while reviewing the Missing raw txn RFC I realise this PR is still open haha, should we merge it?
@solteszad We have a new pull request from Quoc, #2, would you mind merging this pull request to
right 👍
Problem
Describe the problem your PR is trying to solve
updated_at in Hanami models is generated by Ruby code (i.e. Time.now).
There is a delay between when Time.now is executed and when the UPDATE/INSERT SQL is executed.
When the elt-run job is replicating rows while multiple concurrent processes from the source applications are busy updating/inserting rows, there is a high chance that some rows are actually inserted/updated slightly later but with earlier timestamps, or inserted/updated slightly earlier but with a later timestamp.
When the EL job replicates these newly inserted/updated rows, there is a high chance that rows inserted/updated slightly later but with an earlier timestamp will be ignored.
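The race can be illustrated with a small simulation. This is only a sketch under stated assumptions: the row data, the `replicate` and `run_twice` helpers, and the strict `bookmark < updated_at` comparison are all hypothetical and do not reproduce the tap's actual query. Row 1 gets its timestamp first but becomes visible last, so a run that happens in between moves the bookmark past it.

```python
from datetime import datetime, timedelta

T0 = datetime(2024, 1, 1, 12, 0, 0)

# (id, updated_at assigned by the app, time the row actually became visible)
ROWS = [
    (1, T0 + timedelta(seconds=1), T0 + timedelta(seconds=4)),  # slow writer
    (2, T0 + timedelta(seconds=2), T0 + timedelta(seconds=2)),  # fast writer
]


def replicate(bookmark, now, skip_last_n_seconds=0):
    """One incremental run: return (ids picked up, new bookmark)."""
    upper = now - timedelta(seconds=skip_last_n_seconds)
    picked = [
        (row_id, updated_at)
        for row_id, updated_at, visible_at in ROWS
        if visible_at <= now and bookmark < updated_at <= upper
    ]
    new_bookmark = max((ts for _, ts in picked), default=bookmark)
    return [row_id for row_id, _ in picked], new_bookmark


def run_twice(skip_last_n_seconds):
    """Run at T0+3s and T0+60s; return every id that was replicated."""
    seen, bookmark = set(), T0
    for now in (T0 + timedelta(seconds=3), T0 + timedelta(seconds=60)):
        ids, bookmark = replicate(bookmark, now, skip_last_n_seconds)
        seen.update(ids)
    return seen


print(run_twice(0))   # {2}: row 1 is never replicated
print(run_twice(10))  # {1, 2}
```

Holding back the most recent seconds means the first run picks up nothing yet, so the bookmark does not advance past a row that is still in flight.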
Proposed changes
Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
If it fixes a bug or resolves a feature request, be sure to link to that issue.
Added a new config value (skip_last_n_seconds) that is used by incremental replication. If it's set, timestamp-based incremental replication will skip the rows updated in the last n seconds.
Types of changes
What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply
Checklist
setup.py is an individual PR and not mixed with feature or bugfix PRs
[AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
AP-NNN (if applicable. AP-NNN = JIRA ID)