Fix tornado server (request) duration metric calculation #3679


Open · wants to merge 5 commits into main

Conversation

@devmonkey22 (Contributor) commented Aug 5, 2025

Description

Fix the tornado instrumentation to track a request's elapsed time in an async-safe way, so that concurrent requests each calculate their own elapsed time for the HTTP_SERVER_DURATION metric correctly. The instrumentation now tracks state (such as the request start time used to calculate the request duration) on the per-request handler instance, instead of on the server_histogram object, which is shared between requests.

Fixes #3486
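The race this fixes can be illustrated with a stdlib-only sketch (hypothetical names; this is not the actual instrumentation code): when the start time lives on an object shared between concurrent requests, a later request overwrites an earlier one's start time, shrinking the earlier request's measured duration.

```python
import asyncio
import time

# Hypothetical stand-in for state stored on a shared object (like the
# shared server_histogram): visible to every in-flight request.
shared_state = {}

async def handle_shared(delay, work, results, name):
    """Racy: records the start time on shared state."""
    await asyncio.sleep(delay)                # request arrives after `delay`
    shared_state["start"] = time.monotonic()  # overwrites any earlier request
    await asyncio.sleep(work)                 # simulated request handling
    results[name] = time.monotonic() - shared_state["start"]

async def handle_local(delay, work, results, name):
    """Async-safe: records the start time per request."""
    await asyncio.sleep(delay)
    start = time.monotonic()                  # local to this request
    await asyncio.sleep(work)
    results[name] = time.monotonic() - start

async def main():
    racy, safe = {}, {}
    # A slow request starts first (0.3 s of work); a fast request arrives
    # 0.1 s later and, in the racy version, resets the shared start time.
    await asyncio.gather(
        handle_shared(0.0, 0.3, racy, "slow"),
        handle_shared(0.1, 0.05, racy, "fast"),
    )
    await asyncio.gather(
        handle_local(0.0, 0.3, safe, "slow"),
        handle_local(0.1, 0.05, safe, "fast"),
    )
    return racy, safe

racy, safe = asyncio.run(main())
# racy["slow"] is ~0.2 s (truncated by the overwrite); safe["slow"] is ~0.3 s.
```

The per-request variant is what the fix does conceptually: the start time becomes local to each handler instance, so concurrent requests cannot clobber each other's measurements.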

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added a tornado metrics test, test_metrics_concurrent_requests

Without the fix, the test run errors with:
[screenshot of the test failure]

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@devmonkey22 devmonkey22 changed the title [WIP] Fix tornado server (request) duration metric calculation Fix tornado server (request) duration metric calculation Aug 5, 2025
@tammy-baylis-swi (Contributor) commented:

Thank you for this fix. Please could you also add/update the tests?: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/74536f1a92a357c9a25fccba057ce5766d5a8f27/instrumentation/opentelemetry-instrumentation-tornado/tests/test_metrics_instrumentation.py

@devmonkey22 (Contributor, Author) commented:

> Thank you for this fix. Please could you also add/update the tests?: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/74536f1a92a357c9a25fccba057ce5766d5a8f27/instrumentation/opentelemetry-instrumentation-tornado/tests/test_metrics_instrumentation.py

@tammy-baylis-swi Can you think of a good way to test this race condition? I think we'd have to be able to start two concurrent fetches, where one hits prepare and stays running (in, say, a 1-second sleep), the 2nd request hits (updating start_time), and then both requests complete. Then we could test that the duration for the 1st request is similar to the client duration (i.e., not super short like the 2nd request's should be). Do you know of an example already? Otherwise, I was just going to rely on the existing tests to make sure duration is still calculated and similar to client duration (they just can't check that the race condition is fully fixed).

@tammy-baylis-swi (Contributor) commented:

> @tammy-baylis-swi Can you think of a good way to test this race condition? I think we'd have to be able start two concurrent fetches, where one hits prepare, stays running in a (maybe 1 second sleep), the 2nd request hits (updates start_time), then both requests can complete. Then we could test that the duration for the 1st request is similar to client duration (ie: not super short like 2nd request should be). Do you know of an example already? Otherwise, I was just going to rely on the existing tests to make sure duration is still calculated and similar to client duration (just can't check race condition is fully fixed).

I'm not actually familiar with Tornado but generally the checks for values resulting from concurrent fetches with predictable times would be helpful, if it's possible. Hmm. Would it make sense to make_app with any new routes as needed, then also use tornado.httpclient.AsyncHTTPClient to do two concurrent fetches and check resulting metrics in memory?

@devmonkey22 (Contributor, Author) commented:

> > @tammy-baylis-swi Can you think of a good way to test this race condition? I think we'd have to be able start two concurrent fetches, where one hits prepare, stays running in a (maybe 1 second sleep), the 2nd request hits (updates start_time), then both requests can complete. Then we could test that the duration for the 1st request is similar to client duration (ie: not super short like 2nd request should be). Do you know of an example already? Otherwise, I was just going to rely on the existing tests to make sure duration is still calculated and similar to client duration (just can't check race condition is fully fixed).
>
> I'm not actually familiar with Tornado but generally the checks for values resulting from concurrent fetches with predictable times would be helpful, if it's possible. Hmm. Would it make sense to make_app with any new routes as needed, then also use tornado.httpclient.AsyncHTTPClient to do two concurrent fetches and check resulting metrics in memory?

Thanks @tammy-baylis-swi I think I found the test approach. Just need to iron out CI test failures that didn't happen locally.
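The shape of such a concurrency test (illustrative names and timings; this is not the actual test code) can be sketched with stdlib asyncio: start a slow fetch, wait until it is inside its handler, start a fast fetch, then assert that the slow request's measured duration was not truncated by the fast one arriving mid-flight.

```python
import asyncio
import time

async def handler(path):
    # The "/slow" route sleeps long enough for a second request to arrive
    # while it is still in flight.
    await asyncio.sleep(0.3 if path == "/slow" else 0.05)

async def fetch(path, durations):
    """Simulated client fetch that records elapsed wall-clock time."""
    start = time.monotonic()
    await handler(path)
    durations[path] = time.monotonic() - start

async def run_scenario():
    durations = {}
    slow = asyncio.create_task(fetch("/slow", durations))
    await asyncio.sleep(0.1)  # ensure /slow is mid-handler before /fast starts
    fast = asyncio.create_task(fetch("/fast", durations))
    await asyncio.gather(slow, fast)
    return durations

durations = asyncio.run(run_scenario())
# With the fix, /slow's recorded duration should reflect its own elapsed
# time (~0.3 s) even though /fast completed in the middle of it.
```

In the real test the two fetches would go through tornado's client against an instrumented app, and the assertions would read the HTTP_SERVER_DURATION data points from the in-memory metric reader instead of the `durations` dict used here.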

@devmonkey22 (Contributor, Author) commented:

Should be ready for review again. Tests are passing.

@tammy-baylis-swi (Contributor) left a review comment:

Nice, thank you for the commented new tests and updated OP. This lgtm; the Maintainers will also have to have a look.

Development

Successfully merging this pull request may close these issues.

Tornado instrumentation HTTP_SERVER_DURATION metric is inaccurate and not async-safe
3 participants