Address Clustering

What you'll learn on this page: how analysts group thousands of Bitcoin addresses into one "wallet", what makes the process fragile, and how to read the clusters txray builds without taking them too seriously.

The intuition

One Bitcoin address is almost useless as an identity. Real wallets generate a fresh address every time they receive money, so a single person might own hundreds or thousands of addresses over a year.

Clustering is the process of stitching those addresses back together. If you can group addresses by likely owner, you stop looking at individual transactions and start looking at whole wallets moving through time. That is where the interesting patterns live.

The catch is that clustering is fundamentally a chain of guesses. Each guess is probabilistic, and guesses compound. A cluster built on ten strong links is solid. A cluster built on ten weak links is fiction.
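To see how quickly guesses compound, here is a minimal sketch. It assumes every link is independent and equally reliable, which real heuristics are not, and the confidence figures (0.99 and 0.60) are made up for illustration.

```python
# Assumption: each link is an independent guess with the same confidence.
def chain_confidence(link_confidence: float, links: int) -> float:
    """Probability that every link in a chain of guesses is correct."""
    return link_confidence ** links

strong = chain_confidence(0.99, 10)  # ten strong links
weak = chain_confidence(0.60, 10)    # ten weak links

print(f"ten strong links: {strong:.3f}")
print(f"ten weak links:   {weak:.3f}")
```

Under these assumptions the strong chain stays around 0.9, while the weak chain falls below 0.01.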

A worked example

Imagine three addresses that look unrelated at first glance:

bc1q...aaa   seen in 2023
bc1q...bbb   seen in 2024
bc1q...ccc   seen in 2024

Then we find two transactions:

TX1:  INPUTS = [aaa, bbb]    OUTPUTS = [ddd, change_1]
TX2:  INPUTS = [bbb, ccc]    OUTPUTS = [eee, change_2]

In TX1, the common-input-ownership heuristic (CIOH) links aaa and bbb, because inputs spent together in one transaction are assumed to be controlled by the same owner. In TX2, it links bbb and ccc. Because bbb appears in both pairs, all three addresses collapse into one group.

Now add change detection. If change_1 and change_2 are confidently identified as change outputs, they get absorbed into the same cluster. Just two transactions have pulled five addresses into a single ownership group, while the payment outputs ddd and eee stay outside it.

Scale that up across millions of transactions and you get an address-clustering graph.
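The worked example can be reproduced with a small union-find (disjoint-set) structure. This is a sketch, not txray's implementation: the UnionFind class and the transactions dictionary are hypothetical, and it assumes change detection has already marked change_1 and change_2 with full confidence.

```python
class UnionFind:
    """Minimal disjoint-set structure keyed by address string."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


# Hypothetical representation of TX1 and TX2 from the example above.
transactions = {
    "TX1": {"inputs": ["aaa", "bbb"], "change": "change_1"},
    "TX2": {"inputs": ["bbb", "ccc"], "change": "change_2"},
}

uf = UnionFind()
for tx in transactions.values():
    first, *rest = tx["inputs"]
    for addr in rest:
        uf.union(first, addr)      # common-input-ownership link
    uf.union(first, tx["change"])  # change output joins the spender's cluster


def clusters(uf):
    """Group every known address by its root representative."""
    groups = {}
    for addr in uf.parent:
        groups.setdefault(uf.find(addr), set()).add(addr)
    return list(groups.values())


for group in clusters(uf):
    print(sorted(group))
```

The payment outputs ddd and eee never enter the structure, so they stay outside the cluster.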

How graph clustering works

At a high level, the algorithm is simple.

1. Treat every address as a node in a graph.
2. For each transaction, run the per-transaction heuristics (CIOH, change detection, fingerprinting).
3. Add an edge between addresses whenever a heuristic links them, and attach a weight based on the confidence of the link.
4. Walk the graph and collapse connected components into clusters.
5. Attach a provenance record to every cluster: why each edge exists, not just that it exists.

txray cares a lot about step 5. Most commercial clustering tools give you the cluster without the reasoning, and the reasoning is exactly what you need if you ever want to argue about whether a cluster is right.
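The five steps can be sketched in a few dozen lines. Everything here is illustrative: the Edge record, the build_clusters function, and the min_confidence cutoff are assumptions made for this example and are not txray's actual data model. The sketch treats links as undirected and keeps the edges behind each cluster as its provenance.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One heuristic link between two addresses (hypothetical schema)."""
    a: str
    b: str
    heuristic: str     # e.g. "cioh", "change", "fingerprint"
    txid: str
    confidence: float  # 0.0 to 1.0


def build_clusters(edges, min_confidence=0.5):
    """Collapse connected components, keeping the edges that justify them."""
    adjacency = defaultdict(list)
    for e in edges:
        if e.confidence >= min_confidence:  # assumed cutoff for weak links
            adjacency[e.a].append((e.b, e))
            adjacency[e.b].append((e.a, e))

    seen, result = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        members, provenance, stack = set(), set(), [start]
        seen.add(start)
        while stack:  # depth-first walk of one component
            node = stack.pop()
            members.add(node)
            for neighbour, edge in adjacency[node]:
                provenance.add(edge)
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append({"addresses": members, "provenance": provenance})
    return result


edges = [
    Edge("aaa", "bbb", "cioh", "TX1", 0.9),
    Edge("bbb", "ccc", "cioh", "TX2", 0.9),
    Edge("aaa", "change_1", "change", "TX1", 0.8),
    Edge("ccc", "zzz", "fingerprint", "TX9", 0.2),  # too weak, dropped
]

for cluster in build_clusters(edges):
    print(sorted(cluster["addresses"]))
    for e in sorted(cluster["provenance"], key=lambda e: e.txid):
        print(f"  {e.a} <-> {e.b} via {e.heuristic} in {e.txid} ({e.confidence})")
```

Keeping the Edge records alongside the address set is what makes a cluster arguable: each membership can be traced to a specific transaction and heuristic.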

How to read a cluster

Treat a cluster the same way you would treat a police sketch: useful direction, not identification.

Start from the strongest edges. A single-source, two-input transaction with a high-confidence change output is a much better anchor than a 40-input CoinJoin-adjacent blob.
Spread outwards one hop at a time. Re-evaluate the confidence after every hop. Clusters get softer the further you travel from the anchor.
Look for contradiction. If a cluster contains addresses whose wallet fingerprints disagree (one looks like Bitcoin Core, another looks like a Coinbase custody wallet), something is wrong. txray will flag this.
Prefer small confident clusters over big ambitious ones. A 6-address group you trust is more useful than a 600-address group you do not.
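One way to make "softer with every hop" concrete is to score each address by the best product of edge confidences along any path from the anchor. The sketch below assumes edge confidences are independent values in (0, 1]; the function names, the graph, and the fingerprint labels are hypothetical and not part of txray.

```python
import heapq


def confidence_from_anchor(graph, anchor):
    """Best path confidence (product of edge weights) from the anchor."""
    best = {anchor: 1.0}
    heap = [(-1.0, anchor)]  # max-heap via negated confidence
    while heap:
        neg_conf, node = heapq.heappop(heap)
        conf = -neg_conf
        if conf < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, {}).items():
            candidate = conf * weight
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return best


def conflicting_fingerprints(fingerprints):
    """Return the distinct wallet fingerprints if there is more than one."""
    distinct = {fp for fp in fingerprints.values() if fp is not None}
    return distinct if len(distinct) > 1 else set()


# Undirected graph written as symmetric adjacency maps (made-up weights).
graph = {
    "aaa": {"bbb": 0.9},
    "bbb": {"aaa": 0.9, "ccc": 0.8},
    "ccc": {"bbb": 0.8, "ddd": 0.5},
    "ddd": {"ccc": 0.5},
}

scores = confidence_from_anchor(graph, "aaa")
for addr, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{addr}: {score:.2f}")

print(conflicting_fingerprints({"aaa": "bitcoin-core", "bbb": "bitcoin-core", "ddd": "custody"}))
```

With these weights the score drops at each hop away from aaa, and ddd, three hops out, ends up well below one half. A cutoff on this score is one possible way to prefer the small confident cluster over the large ambitious one.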

Why clustering is fragile

Several real-world patterns create false edges.

CoinJoin rounds. CoinJoin exists to defeat the common-input-ownership heuristic: unrelated users contribute inputs to one transaction, so the heuristic fires wrongly. Any cluster that crosses a CoinJoin boundary without a caveat is suspect.
Exchange custody. Exchanges run huge shared wallets. A single exchange address can collapse thousands of unrelated users into one cluster if you are not careful.
Payment processors. Services that batch many customer payments into one transaction produce shapes that look like consolidation but are not.
Multisig custody. Collaborative spending from a shared vault is a real link, but to a vault, not a person.

txray keeps an explicit ambiguity_reasons list on every cluster so you always know what the algorithm is unsure about.
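As a rough illustration of how such a list might be populated, the sketch below flags a few of the patterns described above. Only the name ambiguity_reasons comes from txray; the transaction layout, the reason strings, and the thresholds are assumptions chosen for this example, and real detection rules are considerably more involved.

```python
from collections import Counter


def ambiguity_reasons(tx, min_equal_outputs=3, max_inputs=20):
    """Collect reasons why links derived from this transaction are doubtful."""
    reasons = []

    # Many outputs of identical value is a common CoinJoin-like shape.
    counts = Counter(tx["output_values"])
    if counts and max(counts.values()) >= min_equal_outputs:
        reasons.append("coinjoin_like_equal_outputs")

    # Very wide transactions often come from custody or batching services.
    if len(tx["inputs"]) > max_inputs:
        reasons.append("many_inputs_possible_batching_or_custody")

    # A multisig spend links to a shared vault, not necessarily one person.
    if tx.get("multisig"):
        reasons.append("multisig_shared_custody")

    return reasons


mixed = {
    "inputs": [f"in{i}" for i in range(5)],
    "output_values": [100_000] * 5 + [23_456, 7_890],
    "multisig": False,
}
plain = {"inputs": ["aaa", "bbb"], "output_values": [50_000, 1_234]}

print(ambiguity_reasons(mixed))
print(ambiguity_reasons(plain))
```

An empty list does not mean a link is safe; it only means none of these particular checks fired.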