| 1 | +--- |
| 2 | +title: Undelivered messages |
| 3 | +--- |
| 4 | + |
| 5 | +When using FireFly in multiparty mode to deliver broadcast or private messages, one potential problem is that of |
| 6 | +undelivered messages. In general FireFly's message delivery service should be extremely reliable, but understanding |
| 7 | +when something has gone wrong (and how to recover) can be important for maintaining system health. |
| 8 | + |
| 9 | +## Background |
| 10 | + |
| 11 | +This guide assumes some familiarity with how |
| 12 | +[multiparty event sequencing](../architecture/multiparty_event_sequencing.md) works. |
| 13 | +In general, FireFly messages come in three varieties: |
| 14 | + |
| 15 | + |
| 16 | + |
| 17 | +1. **Unpinned private messages:** private messages delivered directly via data exchange |
| 18 | +2. **Pinned private messages:** private messages delivered via data exchange, with a hash of the message recorded on the blockchain ledger |
| 19 | +3. **Pinned broadcast messages:** messages stored in IPFS, with a hash of and reference to the message recorded on the blockchain ledger
| 20 | + |
| 21 | +All messages are batched for efficiency, but in cases of low throughput, you may frequently see batches |
| 22 | +containing exactly one message. |
| 23 | + |
| 24 | +"Pinned" messages are those that use the blockchain ledger for reliable timestamping and ordering. These messages have |
| 25 | +two pieces which must be received before the message can be processed: the **batch** is the actual contents of |
| 26 | +the message(s), and the **pin** is the lightweight blockchain transaction that records the existence and ordering of |
| 27 | +that batch. We frequently refer to this combination as a **batch-pin**. |
| 28 | + |
| 29 | +> Note: there is a fourth type of message denoted with the type "definition", used for things such as identity claims
| 30 | +> and advertisement of contract APIs. For most troubleshooting purposes these can be treated the same as pinned
| 31 | +> broadcast messages, as they follow the same pattern (with only a few additional processing steps inside FireFly).
| 32 | +
| 33 | +## Symptoms |
| 34 | + |
| 35 | +When some part of the multiparty messaging infrastructure requires troubleshooting, common symptoms include: |
| 36 | + |
| 37 | +- a message was sent, but is not present on some other node where it should have been received |
| 38 | +- a message is stuck indefinitely in "sent" or "pending" state |
| 39 | + |
| 40 | +## Troubleshooting steps |
| 41 | + |
| 42 | +When troubleshooting one of the symptoms above, the main goal is to identify the specific piece of the infrastructure that is |
| 43 | +experiencing an issue. This can lead you to diagnose specific issues such as misconfiguration, network problems, database |
| 44 | +integrity problems, or potential code bugs. |
| 45 | + |
| 46 | +In all cases, the **batch ID** is the most critical piece of data for determining the nature of the issue. You can usually |
| 47 | +retrieve the batch for a particular message by querying `/messages/<message-id>` and looking for the `batch` field in the returned |
| 48 | +response. In rare cases, if this is not populated, you can also retrieve the message transaction via `/messages/<message-id>/transaction`, |
| 49 | +and then you can use the transaction ID to query `/batches?tx.id=<transaction-id>`. |
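
The lookup described above can be sketched as a small script. This is a minimal illustration, not an official client: the base URL (`http://localhost:5000/api/v1/namespaces/default`), the helper names, and the assumption that the `/batches` query returns a JSON list of objects with an `id` field are all hypothetical and should be adjusted to your deployment.

```python
import json
import urllib.request
from typing import Any, Callable, Optional

# Assumption: namespace-scoped API root of one FireFly node.
BASE_URL = "http://localhost:5000/api/v1/namespaces/default"


def http_get_json(url: str) -> Any:
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def find_batch_id(
    message_id: str,
    base_url: str = BASE_URL,
    get_json: Callable[[str], Any] = http_get_json,
) -> Optional[str]:
    """Return the batch ID for a message, or None if it cannot be found."""
    # First choice: the `batch` field on the message itself.
    message = get_json(f"{base_url}/messages/{message_id}")
    batch_id = message.get("batch")
    if batch_id:
        return batch_id

    # Fallback: look up the message transaction, then query batches by tx.id.
    tx = get_json(f"{base_url}/messages/{message_id}/transaction")
    batches = get_json(f"{base_url}/batches?tx.id={tx['id']}")
    return batches[0]["id"] if batches else None
```

Passing `get_json` as a parameter keeps the lookup logic separate from the HTTP transport, so the same function can be pointed at a stub when testing or at a client that adds authentication headers.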
| 50 | + |
| 51 | +The batch ID will be the same on all nodes involved in the messaging flow. Therefore, the following two steps can be |
| 52 | +easily performed to check for the existence of the expected items: |
| 53 | + |
| 54 | +- query `/batches/<batch-id>` on each node that should have the message |
| 55 | +- query `/pins?batch=<batch-id>` on each node that should have the message (for pinned messages only) |
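
The two checks above can be run against every node in one pass. The sketch below is illustrative only: the node names and URLs are placeholders, and it assumes that a missing batch is reported as HTTP 404 and that `/pins?batch=<batch-id>` returns a JSON list. Verify both assumptions against your FireFly version.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Callable, Dict


def http_get_json(url: str) -> Any:
    """Fetch JSON from a URL; return None on HTTP 404 (assumed 'not found')."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def check_nodes(
    batch_id: str,
    node_urls: Dict[str, str],
    get_json: Callable[[str], Any] = http_get_json,
) -> Dict[str, Dict[str, Any]]:
    """Report, per node, whether the batch exists and how many pins reference it."""
    report = {}
    for name, base_url in node_urls.items():
        batch = get_json(f"{base_url}/batches/{batch_id}")
        pins = get_json(f"{base_url}/pins?batch={batch_id}") or []
        report[name] = {"batch_found": batch is not None, "pin_count": len(pins)}
    return report
```

A call might look like `check_nodes(batch_id, {"org1": "http://org1:5000/api/v1/namespaces/default", "org2": "http://org2:5000/api/v1/namespaces/default"})`, where both URLs are hypothetical.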
| 56 | + |
| 57 | +Then choose one of these scenarios to focus in on an area of interest: |
| 58 | + |
| 59 | +#### 1) Is the batch missing on a node that should have received it? |
| 60 | + |
| 61 | +For private messages, this indicates a potential problem with **data exchange**. Check the sending node to see if the FireFly |
| 62 | +operations succeeded when sending the batch via data exchange, and check the data exchange logs for any issues processing it |
| 63 | +(the FireFly operation ID can be used to trace the operation through data exchange as well). |
| 64 | +If an operation failed on the sending node, you may need to retry it with `/operations/<op-id>/retry`. |
| 65 | + |
| 66 | +For broadcast messages, this indicates a potential problem with **IPFS**. Check the sending node to see if the FireFly |
| 67 | +operations succeeded when uploading the batch to IPFS, and the receiving node to see if the operations succeeded when |
| 68 | +downloading the batch from IPFS. If an operation failed, you may need to retry it with `/operations/<op-id>/retry`. |
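
A retry can be issued from a script as well. The following is a rough sketch under stated assumptions: that the retry endpoint is invoked with an HTTP POST and accepts an empty JSON object as its body, and that the node's API root is the hypothetical `BASE_URL` shown. Confirm the exact request shape in your node's API reference before relying on it.

```python
import json
import urllib.request
from typing import Any

# Assumption: namespace-scoped API root of the node that owns the operation.
BASE_URL = "http://localhost:5000/api/v1/namespaces/default"


def build_retry_request(op_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the retry request for a failed operation."""
    return urllib.request.Request(
        f"{base_url}/operations/{op_id}/retry",
        data=b"{}",  # assumed: an empty JSON body is sufficient
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method
    )


def retry_operation(op_id: str, base_url: str = BASE_URL) -> Any:
    """Send the retry request and return the decoded JSON response."""
    with urllib.request.urlopen(build_retry_request(op_id, base_url)) as resp:
        return json.load(resp)
```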
| 69 | + |
| 70 | +#### 2) Is the batch present, but the pin is missing? |
| 71 | + |
| 72 | +This indicates a potential problem with the **blockchain connector**. Check if the underlying blockchain node is |
| 73 | +healthy and mining blocks. Check the sending FireFly node to see if the operation succeeded when pinning the batch via the |
| 74 | +blockchain. Check the blockchain connector logs (such as evmconnect or fabconnect) to see if it is |
| 75 | +successfully processing events from the blockchain, or if it is encountering any errors before forwarding those events |
| 76 | +on to FireFly. |
| 77 | + |
| 78 | +#### 3) Are the batch and pin both present, but the messages from the batch are still stuck in "sent" or "pending"? |
| 79 | + |
| 80 | +Check the pin details to see if it contains a field `"dispatched": true`. If this field is false or missing, it means |
| 81 | +that the pin was received but couldn't be matched successfully with the off-chain batch contents. Check the FireFly |
| 82 | +logs and search for the batch ID; the issue is likely in FireFly itself, which will have logged some problem while
| 83 | +aggregating the batch-pin. In some cases, the FireFly logs may indicate that the pin could not be dispatched because |
| 84 | +it was "stuck" behind another pin on the same context - so you may need to follow the trail to a batch-pin for a |
| 85 | +different batch and determine why that earlier one was not processed (by starting over on this rubric |
| 86 | +and troubleshooting that batch). |
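
The three scenarios above can be condensed into a small decision helper. This is only a sketch of the rubric as described in this guide. The function name and the returned strings are illustrative, and it assumes each pin is a JSON object that may carry a boolean `dispatched` field.

```python
from typing import Any, Dict, List


def diagnose(batch_found: bool, pins: List[Dict[str, Any]], pinned: bool = True) -> str:
    """Map what was found on a receiving node to the area worth investigating."""
    # Scenario 1: the batch never arrived.
    if not batch_found:
        return "batch missing: check data exchange (private) or IPFS (broadcast) operations"
    # Scenario 2: the batch arrived, but no pin was recorded (pinned messages only).
    if pinned and not pins:
        return "pin missing: check the blockchain node and blockchain connector"
    # Scenario 3: a pin exists but has not been dispatched (field false or absent).
    if any(not pin.get("dispatched", False) for pin in pins):
        return "pin not dispatched: search the FireFly logs for the batch ID"
    return "batch and pin present and dispatched"
```

For example, `diagnose(True, [{"dispatched": False}])` points to the FireFly logs, while `diagnose(False, [])` points to data exchange or IPFS.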
| 87 | + |
| 88 | +## Opening an issue |
| 89 | + |
| 90 | +It's possible that the above steps may lead to an obvious solution (such as recovering a crashed service or retrying a |
| 91 | +failed operation). If they do not, you can open an issue. The more detail you can include from the troubleshooting above |
| 92 | +(including the type of message, the nodes involved, and the details on the batch and pin found when examining each node), |
| 93 | +the more likely it is that someone can help to suggest additional troubleshooting. Full logs from FireFly, and (as |
| 94 | +deemed relevant from the troubleshooting above) full logs from the data exchange or blockchain connector runtimes, will |
| 95 | +also make it easier to offer additional insight. |