Does the AWS Clickstream solution take into account session_id stickiness? #1041
Replies: 1 comment 5 replies
-
| Clickstream does not support session stickiness out of the box. This does not align with the design philosophy of the ingestion module, which is a high-throughput, reliable ingestion server that handles hundreds of thousands of requests per second in our benchmark. It is SDK-agnostic and supports Clickstream SDKs, GTM, and other third-party SDKs. It's important to note that the session might not always be available in the data. By default, Clickstream SDKs compress multiple events and then base64-encode them before sending the data. Parsing the attributes of each event on the ingestion server would therefore consume a lot of compute resources and increase latency. To process events per session in real time, you can have a consumer process the events in KDS/MSK. This involves decoding and decompressing the events and extracting their session information. Once this is done, the events can be put into a downstream stream (KDS stream or MSK topic) for consumption by your business consumers. If you have a large volume of requests, MSK might be a cost-efficient option. If you want to customize the ingestion server, there is an option to forward events with the session serving as the partition key. If you are using Clickstream SDKs, you will also need to customize them so that each batch they send contains events from the same session. The complete source code, including the container image of the ingestion server, is available in this repository, and you can modify it to meet your specific requirements. | 
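The decode, decompress, and extract steps of such a consumer could be sketched roughly as below. This is only an illustration under several assumptions that are not confirmed in this thread: that the compression is gzip, that the decoded payload is a JSON array of events, and that the session ID sits at `attributes._session_id`. The function names and the `no-session` fallback key are hypothetical, and how the encoded payload is pulled out of each stream record is left out.

```python
import base64
import gzip
import json
from collections import defaultdict


def decode_batch(payload: str) -> list:
    """Base64-decode, gunzip, and JSON-parse one event batch.

    Assumes gzip compression and a JSON array of events (not confirmed).
    """
    raw = gzip.decompress(base64.b64decode(payload))
    return json.loads(raw)


def group_by_session(events: list) -> dict:
    """Group events by session ID; events without one go under the key None."""
    groups = defaultdict(list)
    for event in events:
        # Assumed location of the session ID inside each event.
        session_id = event.get("attributes", {}).get("_session_id")
        groups[session_id].append(event)
    return dict(groups)


def to_stream_records(groups: dict, fallback_key: str = "no-session") -> list:
    """Build one record per event, keyed by session ID.

    The {"Data": bytes, "PartitionKey": str} shape matches what the
    Kinesis PutRecords API takes for each entry in its Records list.
    """
    records = []
    for session_id, events in groups.items():
        for event in events:
            records.append({
                "Data": json.dumps(event).encode("utf-8"),
                "PartitionKey": session_id or fallback_key,
            })
    return records
```

The resulting list could then be written to the downstream stream, for example with boto3's Kinesis `put_records`, so that all events of a session share one partition key. A real consumer would probably want to spread events that have no session ID over several keys instead of a single fallback key, to avoid a hot shard.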
-
I have followed the implementation guide for the AWS Clickstream solution and am currently using Kinesis on-demand, sinking to an S3 bucket (just following the steps in the setup). I'm not bothered with setting up data processing and analytics dashboards. I simply want to use the data ingestion module and consume from the stream (MSK or Kinesis) using custom-built processing consumers.
Requirements: I need to spin up a fleet of consumers, as I will potentially be dealing with huge volumes of data that require parallelised consumption and processing of the data in the stream. All events that belong to a single session (for example, the same '_session_id') MUST end up in the same consumer.
If the events get randomly passed into different partitions or shards, then the session becomes split across multiple consumers, which completely breaks my processing of the data stream. I know Kinesis and Kafka support partition keys, so that data arriving into the streams with the same partition key (ideally a session ID of some kind) is guaranteed to end up in the same partition and therefore in the same consumer. The documentation is not clear about how incoming events sent from the client application within a single "_session_id" (or some other identifier of a user's events) are split up across partitions within the data stream.
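To illustrate the partition key behaviour described above: Kinesis maps each partition key to a 128-bit integer using MD5 and routes the record to the shard whose hash key range contains that value. The sketch below assumes the shards split the hash space into equal-width ranges, which need not hold for a real stream (for example after resharding); `shard_index` is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # Kinesis hash keys are 128-bit integers


def shard_index(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, assuming equal-width hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * num_shards // HASH_SPACE
```

Because the mapping is deterministic, every record that uses the same session ID as its partition key lands on the same shard for as long as the shard layout does not change.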
The Clickstream solution seems designed to dump the data into an S3 bucket, which would only be fine if I weren't doing real-time processing of my data. I could simply sort my S3 data into sessions and do batch processing then, but this is not what I want.
Are there any experts on the AWS Clickstream solution who can tell me whether partition keys are taken into account in order to sort a session's events into partitions/shards? Or is the ingestion module designed to just dump all incoming data randomly into whatever partition/shard it wants? Session stickiness across partitions is an absolute requirement of my project.
EDIT: Maybe another related question is: if it does not partition the data by a session_id or other similar ID, is it possible to modify the solution to customise it? I'm guessing it would be: clone the AWS Clickstream analytics repo -> modify the source code for the vector server (the configuration toml files, I'm assuming) to partition by session_id -> re-bootstrap the CDK -> re-deploy the stack with the modified vector server configuration. Is this possible? Or are there components that the solution pulls from image repositories that are not created by the CDK and are therefore non-modifiable? In other words, is the entire solution completely customisable by forking and modifying the GitHub repo?
Thanks.