Add 'skip_last_n_seconds' config parameter for incremental replication #1

solteszad · 2023-08-01T14:38:21Z

Problem

Describe the problem your PR is trying to solve

updated_at in Hanami models are generated by Ruby code (i.e. Time.now)
There is a gap between when Time.now is executed and the UPDATE/INSERT SQL is executed.
When multiple Ruby processes for different linux servers actively insert/update rows, the clock in each server may run out-of-sync.

When elt-run job is replicating the rows, and multiple concurrent processes from source applications are busy update/insert the rows.
There is a high chance that, some rows will be actually inserted/updated slightly later, but with earlier timestamps. OR inserted/updated slightly earlier, but with a future timestamp.
When the EL job is replicating these newly insert/updated rows, there is a high chance that some rows that inserted/updated slightly later, but with earlier timestamp will get ignored.

Proposed changes

Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
If it fixes a bug or resolves a feature request, be sure to link to that issue.

Added a new config value (skip_last_n_seconds) that is uses by incremental replication. If it's set, timestamp based incremental replication will skip the rows updated in the last n second.

Types of changes

What types of changes does your code introduce to PipelineWise?
Put an x in the boxes that apply

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation Update (if none of the other choices apply)

Checklist

Description above provides context of the change
I have added tests that prove my fix is effective or that my feature works
Unit tests for changes (not needed for documentation changes)
CI checks pass with my changes
Bumping version in setup.py is an individual PR and not mixed with feature or bugfix PRs
Commit message/PR title starts with [AP-NNNN] (if applicable. AP-NNNN = JIRA ID)
Branch name starts with AP-NNN (if applicable. AP-NNN = JIRA ID)
Commits follow "How to write a good git commit message"
Relevant documentation is updated including usage instructions

sswander

very nice work looking into the python code and adding the changes! I have some comments 🙇

sswander · 2023-08-09T13:00:52Z

tests/unit/test_incremental.py

+            'limit': None,
+            'offset': None


is the offset testable with unit test?

my understanding after checking the unit test code is that without basically rewriting the test suite it's not testable, and i don't really feel the energy to learn python unit testing and creating a whole new test suite for this feature 😆

got it, I'm guessing we don't run unit test in CI/CD for this repo as well, right? in that case even running it does not give a good benefit. But we probably have to be more careful in QA testing our stuff then

sswander · 2023-08-09T13:03:47Z

tap_postgres/sync_strategies/incremental.py

+    elif replication_key_sql_datatype.startswith('timestamp'):
+        offset = f' - interval \'{params["offset"]} seconds\''
+    else:
+        offset = f' - {params["offset"]}'


I'm not really clear on this

It can be used by integer ID based replication as well, in that case it will import the last offset number of rows before the latest replicated id.

do you have an example? meaning if we do incrementing serial primary key, we put a value like "1000" and if the last ID is 9999 we will only replicate up to ID 8999?

If the above is correct, I think maybe the term offset can be avoided, because it's very similar to pagination terminologies (limit and offset) yet behaves differently, especially the interval one.

There are also 2 units here (interval uses seconds, and Incrementing ID uses count) and they are both just expressed as number (offset: N). This could be confusing, maybe we can separate to two different config?

skip_last_n_rows: 1000 skip_last_n_seconds: 900

i think i'll skip implementing skip_last_n_rows in that case, as we don't really need it right now, let's have only one new config value.

sounds good! 👍

sswander

LGTM!

sswander · 2023-10-20T03:53:26Z

@solteszad while reviewing the Missing raw txn RFC I realise this PR is still open haha, should we merge it?

khoaanguyenn · 2024-04-10T04:39:09Z

@solteszad We have new pull-request from Quoc #2, would you mind merging this pull-request to main whilst Quoc's PR is under reviewing ?

solteszad · 2024-04-10T08:18:21Z

@solteszad We have new pull-request from Quoc #2, would you mind merging this pull-request to main whilst Quoc's PR is under reviewing ?

right 👍

added offset config parameter for incremental replication

1e7a489

solteszad self-assigned this Aug 1, 2023

fix bug

ebee6ac

solteszad marked this pull request as ready for review August 4, 2023 08:56

solteszad requested review from khoaanguyenn and teohm August 8, 2023 09:34

solteszad added the ready for review label Aug 8, 2023

solteszad requested a review from sswander August 8, 2023 16:23

sswander reviewed Aug 9, 2023

View reviewed changes

sswander added question Further information is requested and removed ready for review labels Aug 9, 2023

skip last n

83c62a6

solteszad changed the title ~~Add 'offset' config parameter for incremental replication~~ Add 'skip_last_n_seconds' config parameter for incremental replication Aug 10, 2023

solteszad added 4 commits August 10, 2023 13:51

bugfix

5a612d4

bugfix2

1ffe0d4

bugfix2

0a248fc

bugfix

edf960b

sswander removed the question Further information is requested label Aug 10, 2023

sswander approved these changes Aug 10, 2023

View reviewed changes

solteszad merged commit 2aec86b into master Apr 10, 2024

Add 'skip_last_n_seconds' config parameter for incremental replication #1

Add 'skip_last_n_seconds' config parameter for incremental replication #1

Uh oh!

Conversation

solteszad commented Aug 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Proposed changes

Types of changes

Checklist

Uh oh!

sswander left a comment

Choose a reason for hiding this comment

Uh oh!

sswander Aug 9, 2023

Choose a reason for hiding this comment

Uh oh!

solteszad Aug 10, 2023

Choose a reason for hiding this comment

Uh oh!

sswander Aug 10, 2023

Choose a reason for hiding this comment

Uh oh!

sswander Aug 9, 2023

Choose a reason for hiding this comment

Uh oh!

sswander Aug 9, 2023

Choose a reason for hiding this comment

Uh oh!

solteszad Aug 10, 2023

Choose a reason for hiding this comment

Uh oh!

sswander Aug 10, 2023

Choose a reason for hiding this comment

Uh oh!

sswander left a comment

Choose a reason for hiding this comment

Uh oh!

sswander commented Oct 20, 2023

Uh oh!

khoaanguyenn commented Apr 10, 2024

Uh oh!

solteszad commented Apr 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

solteszad commented Aug 1, 2023 •

edited

Loading