Skip to content

Conversation

solteszad
Copy link

@solteszad solteszad commented Aug 1, 2023

Problem

Describe the problem your PR is trying to solve

  • updated_at in Hanami models are generated by Ruby code (i.e. Time.now)
  • There is a gap between when Time.now is executed and the UPDATE/INSERT SQL is executed.
  • When multiple Ruby processes for different linux servers actively insert/update rows, the clock in each server may run out-of-sync.

When elt-run job is replicating the rows, and multiple concurrent processes from source applications are busy update/insert the rows.
There is a high chance that, some rows will be actually inserted/updated slightly later, but with earlier timestamps. OR inserted/updated slightly earlier, but with a future timestamp.
When the EL job is replicating these newly insert/updated rows, there is a high chance that some rows that inserted/updated slightly later, but with earlier timestamp will get ignored.

Proposed changes

Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
If it fixes a bug or resolves a feature request, be sure to link to that issue.

Added a new config value (skip_last_n_seconds) that is uses by incremental replication. If it's set, timestamp based incremental replication will skip the rows updated in the last n second.

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

  • Description above provides context of the change
  • I have added tests that prove my fix is effective or that my feature works
  • Unit tests for changes (not needed for documentation changes)
  • CI checks pass with my changes
  • Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
  • Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
  • Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions

@solteszad solteszad self-assigned this Aug 1, 2023
@solteszad solteszad marked this pull request as ready for review August 4, 2023 08:56
@solteszad solteszad requested review from khoaanguyenn and teohm August 8, 2023 09:34
@solteszad solteszad requested a review from sswander August 8, 2023 16:23
Copy link

@sswander sswander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice work looking into the python code and adding the changes! I have some comments 🙇

Comment on lines 33 to 34
'limit': None,
'offset': None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the offset testable with unit test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding after checking the unit test code is that without basically rewriting the test suite it's not testable, and i don't really feel the energy to learn python unit testing and creating a whole new test suite for this feature 😆

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, I'm guessing we don't run unit test in CI/CD for this repo as well, right? in that case even running it does not give a good benefit. But we probably have to be more careful in QA testing our stuff then

elif replication_key_sql_datatype.startswith('timestamp'):
offset = f' - interval \'{params["offset"]} seconds\''
else:
offset = f' - {params["offset"]}'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really clear on this

It can be used by integer ID based replication as well, in that case it will import the last offset number of rows before the latest replicated id.

do you have an example? meaning if we do incrementing serial primary key, we put a value like "1000" and if the last ID is 9999 we will only replicate up to ID 8999?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above is correct, I think maybe the term offset can be avoided, because it's very similar to pagination terminologies (limit and offset) yet behaves differently, especially the interval one.

There are also 2 units here (interval uses seconds, and Incrementing ID uses count) and they are both just expressed as number (offset: N). This could be confusing, maybe we can separate to two different config?

skip_last_n_rows: 1000
skip_last_n_seconds: 900

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'll skip implementing skip_last_n_rows in that case, as we don't really need it right now, let's have only one new config value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! 👍

@sswander sswander added question Further information is requested and removed ready for review labels Aug 9, 2023
@solteszad solteszad changed the title Add 'offset' config parameter for incremental replication Add 'skip_last_n_seconds' config parameter for incremental replication Aug 10, 2023
@sswander sswander removed the question Further information is requested label Aug 10, 2023
Copy link

@sswander sswander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sswander
Copy link

@solteszad while reviewing the Missing raw txn RFC I realise this PR is still open haha, should we merge it?

@khoaanguyenn
Copy link

@solteszad We have new pull-request from Quoc #2, would you mind merging this pull-request to main whilst Quoc's PR is under reviewing ?

@solteszad
Copy link
Author

@solteszad We have new pull-request from Quoc #2, would you mind merging this pull-request to main whilst Quoc's PR is under reviewing ?

right 👍

@solteszad solteszad merged commit 2aec86b into master Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants