
small memory leak for new connections/subscriptions #732

@oliver-sanders

Description


A small amount of memory (~1kb) is allocated with each new connection or subscription (haven't ascertained which yet) which doesn't appear to be released when the connection / subscription terminates.

The amount of memory is very small, so it is unlikely to be noticed in real-world situations. However, if for some reason a client were to time out and reconnect with a high frequency, this can lead to a slow memory leak (at one reconnection every few seconds, ~1kb per cycle works out to roughly 1MB per hour).

Important

Recommend investigating with cylc/cylc-flow#6985 merged in.

Steps To Reproduce

Configure a frequent websocket ping with a completely unreasonable timeout:

c.ServerApp.websocket_ping_interval = 3
c.ServerApp.websocket_ping_timeout = 0.000001

You will need to hack jupyter_server to allow float values for the ping timeout. Here's the patch (I will raise it as a PR to jupyter_server in due course):

diff --git a/jupyter_server/serverapp.py b/jupyter_server/serverapp.py
index 1afbef4d0..53ecc04a7 100644
--- a/jupyter_server/serverapp.py
+++ b/jupyter_server/serverapp.py
@@ -1649,7 +1649,7 @@ class ServerApp(JupyterApp):
             return "jupyter_server.gateway.connections.GatewayWebSocketConnection"
         return ZMQChannelsWebsocketConnection
 
-    websocket_ping_interval = Integer(
+    websocket_ping_interval = Float(
         config=True,
         help="""
             Configure the websocket ping interval in seconds.
@@ -1666,12 +1666,16 @@ class ServerApp(JupyterApp):
             the connection will be closed from the server side.
         """,
     )
-    websocket_ping_timeout = Integer(
+    websocket_ping_timeout = Float(
         config=True,
         help="""
             Configure the websocket ping timeout in seconds.
 
-            See ``websocket_ping_interval`` for details.
+            The default is the same as the ping interval. Set to ``0`` to turn
+            the ping timeout off.
+
+            Note that the websocket ping timeout requires the ping interval to
+            be set, see ``websocket_ping_interval`` for details.
         """,
     )

This will cause the client to reconnect every ~3 seconds.

# launch the server
cylc gui --new

# connect the client in a new browser tab

# wait for the client to connect, then attach your profiler of choice
mprof attach <pid>

# wait for a few connect/disconnect cycles to pass, then collect your results
mprof plot <pid>
[Image: mprof memory profile plot]
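
If mprof isn't available, a rough alternative is to poll the server's resident set size yourself. The sketch below is not from the issue; it assumes the third-party psutil package and a hypothetical rss_log.py script name:

# rss_log.py -- crude stand-in for mprof: poll a process's RSS once a second
# so the long-term trend can be eyeballed, or redirected to a file and plotted.
# Usage: python rss_log.py <pid>
import sys
import time

import psutil  # third-party dependency (assumed to be installed)


def main(pid: int, interval: float = 1.0) -> None:
    proc = psutil.Process(pid)
    start = time.monotonic()
    while True:
        rss_mib = proc.memory_info().rss / 1024 ** 2
        print(f'{time.monotonic() - start:8.1f}s  {rss_mib:8.2f} MiB', flush=True)
        time.sleep(interval)


if __name__ == '__main__':
    main(int(sys.argv[1]))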

Profiling

Because the amount of memory leaked is so small, it is quite hard to trace.

The best approach I have come up with so far is to compare memory snapshots, the diff of which should contain the answer.

There are two tools I'm aware of which can do this (a minimal standalone sketch of both follows this list):

  • tracemalloc.take_snapshot().compare_to(previous, 'lineno')
  • pympler.tracker.SummaryTracker().print_diff()
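
For reference, here's a minimal standalone sketch of both (not from the issue; workload is a hypothetical placeholder for whatever you want to measure, and pympler is a third-party install):

import tracemalloc

from pympler import tracker  # third-party (assumed: pip install pympler)


def workload():
    # hypothetical placeholder for the code under investigation,
    # e.g. one connect / subscribe / disconnect cycle
    return [object() for _ in range(1000)]


# tracemalloc: compare two snapshots, grouped by source line
tracemalloc.start()
before = tracemalloc.take_snapshot()
leaked = workload()
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)

# pympler: print a summary diff of live objects since the tracker was created
tr = tracker.SummaryTracker()
leaked = workload()
tr.print_diff()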

Here's a patch which uses tracemalloc:

diff --git a/cylc/uiserver/graphql/tornado_ws.py b/cylc/uiserver/graphql/tornado_ws.py
index fd58032..de24895 100644
--- a/cylc/uiserver/graphql/tornado_ws.py
+++ b/cylc/uiserver/graphql/tornado_ws.py
@@ -23,6 +23,7 @@
 #
 # It has been evolved to suit and ported to graphql-core v3.
 
+import tracemalloc
 import asyncio
 from asyncio.queues import QueueEmpty
 from contextlib import suppress
@@ -46,6 +47,8 @@ from cylc.flow.network.graphql_subscribe import subscribe
 from cylc.uiserver.authorise import AuthorizationMiddleware
 from cylc.uiserver.schema import SUB_RESOLVER_MAPPING
 
+tracemalloc.start()
+SNAP = tracemalloc.take_snapshot()
 
 NO_MSG_DELAY = 1.0
 
@@ -166,7 +169,7 @@ class TornadoSubscriptionServer:
     ):
         self.schema = schema
         self.loop = loop
-        self.middleware = middleware
+        self.middleware = middleware or []
         self.execution_context_class = execution_context_class
         self.auth = auth
 
@@ -310,6 +313,25 @@ class TornadoSubscriptionServer:
     async def on_start(self, connection_context, op_id, params):
         # Attempt to unsubscribe first in case we already have a subscription
         # with this id.
+        global SNAP
+        snapshot = tracemalloc.take_snapshot()
+        cmp = [
+            s
+            for s in snapshot.compare_to(SNAP, 'lineno')
+            if 'cylc/' in str(s)
+            # or 'tornado/' in str(s)
+            # or 'graphql/' in str(s)
+        ]
+        print()
+        print('+/-', sum(stat.size_diff for stat in cmp))
+        # with open('log', 'a') as log_file:
+        #     log_file.write(str(sum(stat.size_diff for stat in cmp)) + '\n')
+        #     log_file.flush()
+        for stat in sorted(cmp, key=lambda x: x.size_diff, reverse=True):
+            if stat.size_diff != 0:
+                print(f'  {stat}')
+        print()
+        SNAP = snapshot
         await connection_context.unsubscribe(op_id)
 
         params['kwargs']['root_value'] = op_id

This diff filters for memory allocated from within the Cylc codebase.

Note that memory diffs can be a bit jumpy; sometimes memory doesn't get deallocated for a while, but it does eventually.
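
Some of that jumpiness may simply be Python's cyclic garbage collector running at its own pace; forcing a collection before each snapshot (an assumption, not something the patch above does) can steady the numbers:

import gc
import tracemalloc

tracemalloc.start()
# ... run the workload being measured ...
gc.collect()  # collect unreachable reference cycles so they aren't counted as "leaked"
snapshot = tracemalloc.take_snapshot()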

Remove the filter and uncomment the with open('log', 'a') block to get the full memory diffs and log the net change to a file. Plot the result and you'll get something like this:

[Image: plot of the logged net memory changes]

It jumps around, and the leak is only barely visible in the long-term trend.
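
For reference, one way to plot the values written by the uncommented with open('log', 'a') block (matplotlib is an assumption here, not part of the issue):

# plot_log.py -- plot the per-snapshot net memory change recorded in 'log'
# (one byte count per line) plus its running total; a slow leak shows up as
# a gradual upward drift in the running total.
import matplotlib.pyplot as plt  # third-party (assumed to be installed)

with open('log') as log_file:
    diffs = [int(line) for line in log_file if line.strip()]

running = []
total = 0
for diff in diffs:
    total += diff
    running.append(total)

fig, (ax1, ax2) = plt.subplots(2, sharex=True)
ax1.plot(diffs)
ax1.set_ylabel('net change (bytes)')
ax2.plot(running)
ax2.set_ylabel('running total (bytes)')
ax2.set_xlabel('snapshot')
plt.show()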

Leaks

The leak may not be in the Cylc codebase at all.

So far, I have found one minor leak originating from Cylc; however, this doesn't seem to make a big difference: cylc/cylc-flow#6985
