
Bitcoin P2P network data

Overview

This website provides various empirical data on Bitcoin’s P2P network in the hope it might be useful in areas such as:

  • Quantification of P2P network health
  • Anomaly detection
  • Gaining insights into the P2P network
  • Enabling data-driven P2P protocol improvements

Questions or feedback? Reach out via email: virtu at cryptic dot to

Open source

All data-collection tools and the collected data itself are released under an open-source license to maximize community benefits such as:

  • Opening data-collection methodology up to scrutiny
  • Robust and reliable data collection through peer review
  • Effortless increase in the scope of data collection
  • Generation of new insights through independent analysis

Comprehensive

Apart from open sourcing methodology, tools and data, this project’s key value is the broad scope of data collection.

In addition to data volunteered by nodes during the handshake, metadata (e.g., timestamps and connection times) and data obtained through additional communication (e.g., getaddr replies) are retained as well.

This additional data can be interesting in itself, but it can also improve the accuracy of other results. Consider, for example, the two figures below.

[Figures: node age histogram (left); Tor connection time histogram (right)]

In the histogram on the left, nodes advertised in addr messages are sorted into buckets based on their age and categorized by whether they were reachable or not. In the context of node advertisements, node age is defined as the difference between the time a node advertisement was received and the timestamp included in the node advertisement, which corresponds to the time the node was last seen by the peer sending the advertisement. The data is useful as such because it could be used to tune Bitcoin Core’s addrman horizon to minimize the probability of advertising unreachable nodes. In addition, the data highlights how small deviations in an input variable, in this instance the node age threshold, can lead to large deviations in observed data, which explains how the number of reachable nodes reported by different websites can differ by hundreds or even thousands of nodes, even if the data cover identical network types. Moreover, in the context of data collection, the data helps select an appropriate node age threshold by exposing the variable’s influence on result accuracy and stability.
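The node age definition and the bucketing by reachability can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual tooling: the function names, the `(received_at, advertised_timestamp, reachable)` tuple layout, and the one-hour bucket width are all assumptions made for the example.

```python
from collections import Counter


def node_age(received_at, advertised_timestamp):
    """Node age in seconds: time the advertisement was received minus
    the timestamp carried inside the advertisement."""
    return received_at - advertised_timestamp


def age_histogram(adverts, bucket_seconds=3600):
    """Count advertisements per (age bucket, reachable) pair.

    `adverts` is an iterable of (received_at, advertised_timestamp, reachable)
    tuples; this layout is hypothetical. Negative ages (timestamps in the
    future) are clamped to bucket 0.
    """
    counts = Counter()
    for received_at, advertised_timestamp, reachable in adverts:
        age = max(node_age(received_at, advertised_timestamp), 0)
        counts[(age // bucket_seconds, reachable)] += 1
    return counts


def reachable_share(counts, max_age_buckets):
    """Fraction of advertised nodes younger than the threshold
    (expressed in buckets) that turned out to be reachable."""
    total = sum(n for (bucket, _), n in counts.items() if bucket < max_age_buckets)
    reachable = sum(
        n for (bucket, ok), n in counts.items() if bucket < max_age_buckets and ok
    )
    return reachable / total if total else 0.0
```

Evaluating `reachable_share` for several thresholds on the same histogram shows how strongly the chosen node age threshold influences the resulting share of reachable nodes.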

The chart on the right shows a histogram of the times required to establish connections to each of the Tor nodes discovered during data collection. The data is inherently useful because it puts a quantitative price tag on reduced connection performance, the toll that must be paid in exchange for Tor’s privacy benefits. In the context of data collection, connection time data is indispensable for selecting appropriate timeouts for each network type. The lack of such data up until now might explain the low default timeout values chosen by some other open-source data collection tools, which skew their results to significantly underestimate the number of Tor nodes.
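One way connection time data could inform a timeout is to pick the smallest value that covers a target fraction of the observed successful connections. The sketch below uses a nearest-rank percentile; the function names and the coverage parameter are hypothetical, and the approach is only an illustration of the idea, not the method used by this project.

```python
import math


def timeout_for_coverage(connection_times, coverage=0.99):
    """Smallest observed connection time (same unit as the input) that
    covers at least `coverage` of the samples (nearest-rank percentile)."""
    if not connection_times:
        raise ValueError("no connection time samples")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(connection_times)
    rank = math.ceil(coverage * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def missed_fraction(connection_times, timeout):
    """Fraction of observed connections that a given timeout would cut off."""
    if not connection_times:
        raise ValueError("no connection time samples")
    missed = sum(1 for t in connection_times if t > timeout)
    return missed / len(connection_times)
```

Note that the samples themselves are limited by whatever timeout was in effect while they were collected, so a generous collection timeout is needed for the percentile to be meaningful.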

Acknowledgment

Special thanks to Spiral, whose grant made this work possible.