Data on active nodes is collected by crawling the entire Bitcoin P2P network. The process begins with the crawler obtaining an initial set of nodes from Bitcoin’s DNS seeds. The figures below detail the steps involved.
Figure (left): DNS seed node request. Crawler requesting nodes from DNS seeds.
Figure (right): DNS seed node reply. Crawler receiving nodes from DNS seeds.
On the left, the crawler is shown sending DNS requests to Bitcoin’s DNS seeds. The right figure shows the seeds replying with DNS messages that contain addresses of Bitcoin nodes. For details on this bootstrapping process, see DNS seed methodology.
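The bootstrapping step can be illustrated with a minimal sketch that resolves each seed hostname through an ordinary DNS lookup and collects the returned addresses. The seed list, the `resolve_seed` and `bootstrap` names, and the injectable `resolver` parameter are assumptions made for illustration, not the crawler's actual code.

```python
import socket
from typing import Callable, Iterable, Set, Tuple

Node = Tuple[str, int]

# Example seed hostnames; the list a real crawler queries is an assumption here.
DNS_SEEDS = ["seed.bitcoin.sipa.be", "dnsseed.bluematt.me"]
DEFAULT_PORT = 8333  # Bitcoin mainnet P2P port


def resolve_seed(hostname: str, port: int = DEFAULT_PORT) -> Set[Node]:
    """Resolve one DNS seed to a set of (ip, port) pairs."""
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return set()  # this seed did not answer; other seeds may still reply
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr starts with (host, port) for both IPv4 and IPv6.
    return {(info[4][0], info[4][1]) for info in infos}


def bootstrap(
    seeds: Iterable[str] = DNS_SEEDS,
    resolver: Callable[[str], Set[Node]] = resolve_seed,
) -> Set[Node]:
    """Build the initial pending node set from all seeds."""
    pending: Set[Node] = set()
    for seed in seeds:
        pending |= resolver(seed)
    return pending
```

Passing the resolver as a parameter keeps the sketch testable without network access; calling `bootstrap()` with the defaults performs real DNS lookups.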
The bulk of the data collection consists of processing the pending node set: nodes are removed from the set at random for evaluation, and newly discovered nodes are added to it.
Once a node has been selected for evaluation, the crawler attempts to open a connection to it. If the connection attempt fails, the node is marked as unreachable and evaluation ends; if it succeeds, the node is marked as reachable and the crawler carries out a node handshake with the node as illustrated in the figures below.
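The selection and connection step described above might look roughly like the following sketch. The timeout value and the helper names `pick_random` and `try_connect` are assumptions; the source does not state how the crawler implements this step.

```python
import random
import socket
import time
from typing import Callable, Optional, Set, Tuple

Node = Tuple[str, int]
CONNECT_TIMEOUT = 5.0  # seconds; an assumed value, not taken from the source


def pick_random(pending: Set[Node]) -> Node:
    """Remove and return a uniformly random node from the pending set."""
    node = random.choice(tuple(pending))
    pending.remove(node)
    return node


def try_connect(
    node: Node,
    timeout: float = CONNECT_TIMEOUT,
    connect: Callable = socket.create_connection,
) -> Optional[tuple]:
    """Return (connection, seconds_to_connect), or None if unreachable."""
    start = time.monotonic()
    try:
        conn = connect(node, timeout)
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return None
    return conn, time.monotonic() - start
```

A caller would mark the node as unreachable when `try_connect` returns `None`, and otherwise mark it as reachable and proceed to the handshake.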
Figure (left): Node handshake (Part 1). Crawler initiating handshake.
Figure (right): Node handshake (Part 2). Node completing handshake.
The left figure shows the crawler initiating the handshake by sending a version message to the node being evaluated. The right figure shows the node completing the handshake by replying with a version message of its own. The crawler records all pertinent information included in the version message sent by the node, including the services supported by the node (e.g., full vs. pruned node, non-SegWit vs. SegWit node), user agent, latest block observed, and relay status. Additionally, the crawler records metadata such as the node’s network address, the time when the connection to the node was created, and how long it took to establish the connection.
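To make the recorded fields concrete, the sketch below builds and parses a version message payload following the publicly documented Bitcoin P2P wire format. The function and class names are hypothetical, the network address fields are left zeroed for brevity, and the parser assumes a user agent shorter than 253 bytes. It is an illustration, not the crawler's implementation.

```python
import hashlib
import struct
import time
from dataclasses import dataclass

MAGIC = bytes.fromhex("f9beb4d9")   # mainnet message start bytes
NODE_NETWORK = 1 << 0               # serves the full block chain
NODE_WITNESS = 1 << 3               # supports SegWit
NODE_NETWORK_LIMITED = 1 << 10      # pruned node serving only recent blocks
EMPTY_NET_ADDR = bytes(26)          # services (8) + IP (16) + port (2), zeroed


@dataclass
class VersionInfo:
    protocol_version: int
    services: int
    timestamp: int
    user_agent: str
    start_height: int
    relay: bool


def build_version(services, user_agent, start_height, relay=True,
                  protocol_version=70016, nonce=0, timestamp=None) -> bytes:
    """Serialize a version payload (addresses zeroed in this sketch)."""
    ts = int(time.time()) if timestamp is None else timestamp
    ua = user_agent.encode("ascii")
    if len(ua) >= 0xFD:
        raise ValueError("sketch only handles single-byte length prefixes")
    return (
        struct.pack("<iQq", protocol_version, services, ts)
        + EMPTY_NET_ADDR + EMPTY_NET_ADDR       # addr_recv, addr_from
        + struct.pack("<Q", nonce)
        + bytes([len(ua)]) + ua                 # var_str user agent
        + struct.pack("<i?", start_height, relay)
    )


def parse_version(payload: bytes) -> VersionInfo:
    """Extract the fields a crawler would typically record."""
    version, services, ts = struct.unpack_from("<iQq", payload, 0)
    offset = 20 + 26 + 26 + 8                   # skip addresses and nonce
    ua_len = payload[offset]                    # assumes length < 253
    offset += 1
    user_agent = payload[offset:offset + ua_len].decode("ascii", "replace")
    offset += ua_len
    (start_height,) = struct.unpack_from("<i", payload, offset)
    offset += 4
    # The relay flag is optional; when absent it is treated as true.
    relay = payload[offset] != 0 if len(payload) > offset else True
    return VersionInfo(version, services, ts, user_agent, start_height, relay)


def frame(command: str, payload: bytes) -> bytes:
    """Prepend the 24-byte message header (magic, command, length, checksum)."""
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    header = (MAGIC + command.encode("ascii").ljust(12, b"\x00")
              + struct.pack("<I", len(payload)) + checksum)
    return header + payload
```

Testing `services & NODE_WITNESS` or `services & NODE_NETWORK_LIMITED` on the parsed result is one way to derive the SegWit and pruned-node classifications mentioned above.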
If the handshake succeeds, the crawler sends a node advertisement request to the evaluated node and waits for a reply. The figures below detail the steps involved.
Figure (left): Node advertisement request. getaddr message to request node advertisement from node.
Figure (right): Node advertisement reply. addr message(s) containing node addresses to crawler.
In the left figure, the crawler sends a getaddr message to the evaluated node to request a node advertisement. The right figure shows the node replying with an addr message that contains the node advertisement, and the crawler adding the advertised nodes to its set of pending nodes. In practice, nodes typically reply with multiple addr messages, which together advertise around one thousand nodes. Each addr message contains a list of node addresses, along with a last seen timestamp for each address, which indicates when the advertising node was last connected to the advertised node.
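As an illustration of the addr payload layout, the sketch below decodes the legacy addr format, in which each entry carries a 4-byte timestamp, an 8-byte services field, a 16-byte IP address, and a 2-byte big-endian port. The `AdvertisedNode` type and function names are hypothetical, and the newer addrv2 format is not handled.

```python
import ipaddress
import struct
from typing import List, NamedTuple, Tuple


class AdvertisedNode(NamedTuple):
    last_seen: int   # Unix timestamp from the addr entry
    services: int
    ip: str
    port: int


def read_varint(buf: bytes, offset: int) -> Tuple[int, int]:
    """Decode a Bitcoin variable-length integer; return (value, new_offset)."""
    first = buf[offset]
    if first < 0xFD:
        return first, offset + 1
    if first == 0xFD:
        return struct.unpack_from("<H", buf, offset + 1)[0], offset + 3
    if first == 0xFE:
        return struct.unpack_from("<I", buf, offset + 1)[0], offset + 5
    return struct.unpack_from("<Q", buf, offset + 1)[0], offset + 9


def parse_addr(payload: bytes) -> List[AdvertisedNode]:
    """Decode all entries of a legacy addr message payload."""
    count, offset = read_varint(payload, 0)
    nodes = []
    for _ in range(count):
        last_seen, services = struct.unpack_from("<IQ", payload, offset)
        raw_ip = payload[offset + 12:offset + 28]
        (port,) = struct.unpack_from(">H", payload, offset + 28)
        addr = ipaddress.IPv6Address(raw_ip)
        # IPv4 addresses are carried as IPv4-mapped IPv6 addresses.
        ip = str(addr.ipv4_mapped or addr)
        nodes.append(AdvertisedNode(last_seen, services, ip, port))
        offset += 30  # 4 + 8 + 16 + 2 bytes per entry
    return nodes
```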
Each advertised node is processed by the crawler as follows. Nodes already marked as reachable or unreachable are discarded. So are nodes whose age (i.e., the difference between the current time and the node’s last seen timestamp) exceeds 48 hours. This age threshold is motivated by empirical data, which indicates that the probability of a node being reachable drops exponentially with its age (see here for details). Filtering old nodes thus avoids inflating the crawler’s run time with connection attempts to unreachable nodes that only end in timeouts. The remaining advertised nodes are added to the pending node set, and statistics on the advertisement (such as the number of advertised peers, their network types, and last seen timestamps) are recorded, which concludes the node’s evaluation. The evaluation process then starts anew with another node from the pending node set and continues until the set is empty.
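The filtering rules above can be summarized in a short sketch. Representing nodes as `(ip, port)` tuples, passing the three node sets explicitly, and the function name `filter_advertised` are assumptions made for illustration.

```python
import time
from typing import Iterable, Optional, Set, Tuple

Node = Tuple[str, int]
MAX_AGE_SECONDS = 48 * 60 * 60  # the 48-hour age threshold


def filter_advertised(
    advertised: Iterable[Tuple[Node, int]],
    pending: Set[Node],
    reachable: Set[Node],
    unreachable: Set[Node],
    now: Optional[float] = None,
) -> int:
    """Add fresh, not-yet-evaluated nodes to `pending`; return how many were added.

    `advertised` yields (node, last_seen) pairs, with last_seen as a Unix timestamp.
    """
    now = time.time() if now is None else now
    added = 0
    for node, last_seen in advertised:
        if node in reachable or node in unreachable:
            continue  # already evaluated
        if now - last_seen > MAX_AGE_SECONDS:
            continue  # too old; unlikely to be reachable
        if node not in pending:
            pending.add(node)
            added += 1
    return added
```

The crawl then repeats selection, connection, handshake, and advertisement processing until `pending` is empty.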