
Recent Network Performance Degradation Report


On April 13, The Root Network (TRN) core development team began receiving reports of peer connections dropping across the network, which affected block-to-finalization time, network responsiveness, connectivity, and overall performance. The team first tried standard remedies, such as clearing the Distributed Hash Table (DHT) to help restore network peers, but these proved ineffective.

Further investigation identified the root cause: ethy-gadget was not handling outdated messages correctly. Implementation of the solution began on April 18, and the final fix (version 7.52.0) rolled out on April 23, resolving the issue and in fact improving network performance.

In the process of resolving this issue, users experienced intermittent failures in their transactions. Approximately 100 transactions may have been affected.

Incident Description

The network performance degradation was caused by an inefficient pruning mechanism in ethy-gadget, a core component of The Root Network’s cross-chain bridge protocol.

When there is a withdrawal request to bridge an asset from The Root Network to either the Ethereum or XRPL network, ethy-gadget propagates a message across all nodes to seek consensus between validators. The result of that consensus process is a multi-sig proof that can be cryptographically verified on the other side of the bridge. If consensus is not reached, the message should be marked as outdated and discarded.

This did not happen. Outdated, incomplete messages kept circulating within the network, leading to congestion and instability.

The issue was further exacerbated by the mesh-like structure inherent in any blockchain network. When a node received a single problematic message, it broadcast this message to all its peers, regardless of how many it had. These peers, in turn, broadcast the message further, creating a rapid and widespread propagation of the faulty data across the network. Nodes came to see other nodes as bad actors for repeating the same messages and, as a result, refused to connect with each other, causing peer connections to drop.
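To illustrate why messages that are never discarded are so costly in a mesh, here is a toy simulation. It is a sketch under stated assumptions, not The Root Network's gossip code: the ring-shaped mesh, the function names, and the round-based model are all hypothetical. It compares a message that every holder keeps re-sending each round with one that each node forwards only once.

```python
from collections import deque


def build_mesh(n, degree):
    """Hypothetical mesh: node i is linked to the next degree // 2 nodes on a ring."""
    peers = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, degree // 2 + 1):
            j = (i + step) % n
            peers[i].add(j)
            peers[j].add(i)
    return peers


def naive_flood(peers, origin, rounds):
    """Every node holding the message re-sends it to all peers in every round.

    Models a message that is never marked outdated. Returns total transmissions.
    """
    holders = {origin}
    sent = 0
    for _ in range(rounds):
        new_holders = set(holders)
        for node in holders:
            sent += len(peers[node])
            new_holders |= peers[node]
        holders = new_holders
    return sent


def dedup_flood(peers, origin):
    """Each node forwards the message once, the first time it sees it."""
    seen = {origin}
    queue = deque([origin])
    sent = 0
    while queue:
        node = queue.popleft()
        for peer in peers[node]:
            sent += 1
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return sent


if __name__ == "__main__":
    mesh = build_mesh(n=20, degree=4)
    print("never discarded, 10 rounds:", naive_flood(mesh, 0, rounds=10))
    print("forwarded once per node:   ", dedup_flood(mesh, 0))
```

In the first case the traffic grows with every round for as long as the message stays alive, while in the second it is bounded by the number of links in the mesh.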

Solution

To solve these issues, The Root Network core development team introduced an improved pruning mechanism for ethy-gadget in client version 7.52.0. Each ethy-gadget message now has a time-to-live value of 6 minutes, after which a message is considered completed and discarded. Details of the changelog can be found here.
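A time-to-live rule of this kind can be sketched as follows. This is a minimal illustration, not the actual 7.52.0 implementation: the `MessagePool` class, its method names, and the use of wall-clock seconds are assumptions made for the example. Only the 6-minute value comes from the report.

```python
TTL_SECONDS = 6 * 60  # 6-minute time-to-live, as stated in the report


class MessagePool:
    """Hypothetical in-memory pool of bridge messages awaiting consensus."""

    def __init__(self, ttl=TTL_SECONDS):
        self.ttl = ttl
        self._first_seen = {}  # message id -> time the message was first seen

    def add(self, msg_id, now):
        # Keep the earliest timestamp so re-received copies do not extend the TTL.
        self._first_seen.setdefault(msg_id, now)

    def prune(self, now):
        """Discard every message whose TTL has elapsed; return the discarded ids."""
        expired = [m for m, t in self._first_seen.items() if now - t >= self.ttl]
        for m in expired:
            del self._first_seen[m]
        return expired

    def live(self):
        return set(self._first_seen)


if __name__ == "__main__":
    pool = MessagePool()
    pool.add("withdrawal-1", now=0)
    pool.add("withdrawal-2", now=300)
    pool.add("withdrawal-1", now=350)  # duplicate: does not reset the timer
    print(pool.prune(now=360))
    print(pool.live())
```

Recording only the first-seen time matters here: if a re-received copy reset the timer, a message that peers keep repeating would never expire.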

This solution has been applied to all nodes on the network. ethy-gadget messages are now propagated and discarded properly. A positive side effect of this update is that node bandwidth usage has dropped noticeably, increasing efficiency and reducing costs for those running a node.

Timeline of Events

April 13, 02:30 UTC

The number of peer connections across The Root Network nodes fell from over 40 to single digits, impacting the network's ability to reach finality.

April 13, 03:30 UTC

Clearing the Distributed Hash Table (DHT) to help restore network peers proved ineffective. The core development team was alerted and started to investigate the cause.

April 14, 03:00 UTC

The network was experiencing gaps between the Best Block and Finalized Block, sometimes larger than 100 blocks.

April 14, 05:30 UTC

A new set of bootnodes was deployed to clear reputation scores and DHTs and to force all nodes to reserve healthier network peers. This stabilized the network temporarily, giving developers sufficient time to diagnose and address the underlying issues.

April 16, 07:00 UTC

Root cause identified: the core issue stemmed from ethy-gadget not handling outdated messages correctly.

April 16, 08:00 UTC

Transmission of ethy-gadget messages through four key bootnodes was disabled to regain network stability, at the cost of a higher failure rate for withdrawal transactions on the bridge. Concurrently, work began on modifying how ethy-gadget handles messages.

April 18, 00:00 UTC

Version one of the hotfix (fix/ethy-sol1) was released, with a change in ethy-gadget to stop broadcasting outdated and duplicate messages. This adjustment significantly improved peer reputations and subsequently restored network stability.

Issues related to the bridge remained unresolved at this point.

April 23, 22:50 UTC

The final fix for ethy-gadget (version 7.52.0) was released. It included crucial updates to the pruning mechanism: incorporating block numbers in ethy-gadget messages, setting a 6-minute proof generation window, and enhancing message handling with updates to GossipValidators.

The update also optimizes data relevance by discarding outdated messages and expanding event cache capacity.
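As a rough illustration of how a block number carried in each message can drive the discard decision, consider the sketch below. It is not the GossipValidators code from the release: the class, method, and verdict names are hypothetical, and the 4-second block time used to convert the 6-minute window into a block count is an assumption that the report does not state.

```python
from enum import Enum

ASSUMED_BLOCK_TIME_SECONDS = 4  # assumption for this sketch, not from the report
PROOF_WINDOW_SECONDS = 6 * 60   # 6-minute proof generation window from the report
WINDOW_BLOCKS = PROOF_WINDOW_SECONDS // ASSUMED_BLOCK_TIME_SECONDS


class Verdict(Enum):
    ACCEPT = "accept"
    DUPLICATE = "duplicate"
    EXPIRED = "expired"


class WindowedValidator:
    """Hypothetical validator that rejects stale and repeated messages."""

    def __init__(self, window_blocks=WINDOW_BLOCKS):
        self.window_blocks = window_blocks
        self.seen = {}  # message id -> block number carried by the message

    def validate(self, msg_id, msg_block, best_block):
        # A message older than the window is dropped instead of re-broadcast.
        if best_block - msg_block > self.window_blocks:
            return Verdict.EXPIRED
        if msg_id in self.seen:
            return Verdict.DUPLICATE
        self.seen[msg_id] = msg_block
        return Verdict.ACCEPT

    def prune(self, best_block):
        """Forget messages that fell out of the window; return how many were removed."""
        stale = [m for m, b in self.seen.items()
                 if best_block - b > self.window_blocks]
        for m in stale:
            del self.seen[m]
        return len(stale)


if __name__ == "__main__":
    v = WindowedValidator()
    print(v.validate("proof-1", msg_block=100, best_block=100))
    print(v.validate("proof-1", msg_block=100, best_block=120))
    print(v.validate("proof-2", msg_block=100, best_block=191))
```

Because the age is computed from block numbers rather than local clocks, every node that shares the same best block reaches the same verdict for a given message.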

April 29, 23:00 UTC

All nodes updated to version 7.52.0.

April 30, 04:30 UTC

Bridge regression testing was carried out. All components were working as expected, and network performance was back to normal.