Conversation

@quic-meetkuma quic-meetkuma commented Nov 21, 2025

  • Added a logger that logs to both console and file. This code is similar to the existing QEff finetuning logger code.
  • Added dist_utils, which serves as utility code for distributed training.
  • Added logger test cases for sanity checks.

TODO: Enable test cases via Jenkins infra.

assert "Rank zero message" in caplog.text

@patch("QEfficient.finetune.experimental.core.logger.get_rank")
def test_log_rank_zero_not_zero(self, mock_get_rank, caplog):
Contributor

Is there a typo in the function name?

Contributor Author

I will update it to test_log_rank_zero_negative_case.


@patch("QEfficient.finetune.experimental.core.logger.get_rank")
def test_log_rank_zero_not_zero(self, mock_get_rank, caplog):
"""Test that non-rank zero messages are not logged"""
Contributor

Same question here: should this say non-zero-rank messages?

Contributor Author

I'll update the description.

@quic-akuruvil
Contributor

Can we also add a small example script, maybe as part of the documentation, with usage examples for the logger?

return dist.is_available() and dist.is_initialized()


def get_rank() -> int:
Contributor

Will this work fine in case of PP + DDP? Currently, we use os.getenv("LOCAL_RANK", 0) to retrieve the rank in QEff.

Contributor Author

When training on multiple machines, each with multiple devices, dist.get_rank() returns os.environ["RANK"], which is a global rank across all nodes and devices. That won't be a problem for us as long as we don't do multi-machine training, because for single-machine training LOCAL_RANK == RANK.

For the sake of clarity, I implemented get_local_rank() as well, and we will use it internally wherever we intend to refer to local rank 0.

The change will be reflected in the latest revision.
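Based on this discussion, the two helpers could take roughly the following shape. This is a sketch, not the PR's exact code: the `dist.is_available() and dist.is_initialized()` check is quoted from the diff above, and the env-var fallbacks mirror QEff's current `os.getenv("LOCAL_RANK", 0)` usage; the guarded torch import is an added assumption so the sketch also runs where torch is absent.

```python
import os

try:
    import torch.distributed as dist
except ImportError:  # torch may be absent outside training environments
    dist = None


def is_dist_initialized() -> bool:
    """True when torch.distributed is importable and a process group exists."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def get_rank() -> int:
    """Global rank across all nodes and devices; 0 for non-distributed runs."""
    if is_dist_initialized():
        return dist.get_rank()
    return int(os.getenv("RANK", 0))


def get_local_rank() -> int:
    """Rank within the current node; equals the global rank on one machine."""
    return int(os.getenv("LOCAL_RANK", 0))
```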

@quic-meetkuma
Contributor Author

> Can we also add a small example script as part of documentation may be, which helps with usage examples for the logger.

I will add some sample commented text in this PR that gives the user a hint on how to use the logger. Later, I will add the same in extended form to the documentation as well.

@quic-akuruvil
Contributor

> Can we also add a small example script as part of documentation may be, which helps with usage examples for the logger.
>
> I will add some sample commented text in this PR which will give user hint on how to use the logger. Later on add the same in an extended manner to the documentation as well.

Yes that would be helpful, thanks.

…s utility code when dealing with distributed training.

Signed-off-by: meetkuma <[email protected]>
Signed-off-by: meetkuma <[email protected]>
Signed-off-by: meetkuma <[email protected]>
@quic-meetkuma
Contributor Author

Discarding this PR in favor of #644

