Account Entropy

The following is a report of account entropy as of Saturday February 02, 2019 15:30 UTC.

1497352 XRP Ledger accounts were analyzed for character and substring occurrences. The results are detailed below.

The purpose of this study is to examine the current XRP Ledger account distribution by public text representation. We hope these results will help in understanding the ledger as it currently stands and in devising new schemes and mechanisms with which to robustly and conveniently reference and use accounts.

Note: this is an analysis of the Base58 encoding of XRP account addresses, not of the public key from which each address is derived. Because of the encoding process, this study makes no statements about the distribution of generated XRP keys. It is an analysis of patterns found in the human-readable representation, for use in tools and utility systems down the road.

The list of accounts analyzed can be retrieved here (37MB). The code used to analyze accounts is made available here.

The following is the XRP alphabet:
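For reference, a minimal sketch of the alphabet as a Python constant. The string below is the Base58 dictionary as commonly documented for the XRP Ledger. The constant name and the helper `is_xrp_base58` are illustrative and do not come from the original analysis code.

```python
# Base58 dictionary used by the XRP Ledger (58 distinct characters).
# It omits the easily confused characters '0', 'O', 'I' and 'l'.
XRP_ALPHABET = "rpshnaf39wBUDNEGHJKLM4PQRST7VWXYZ2bcdeCg65jkm8oFqi1tuvAxyz"


def is_xrp_base58(text):
    """Return True if every character of text belongs to the XRP alphabet."""
    return all(ch in XRP_ALPHABET for ch in text)


if __name__ == "__main__":
    print(len(XRP_ALPHABET), len(set(XRP_ALPHABET)))
```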

The leading 'r' was removed before analysis, as it is uniform across all account IDs.

This is the tally of the number of occurrences of each character in the complete account set:

The percentage of the total account ID set that each character represents:
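A tally and percentage breakdown of this kind could be computed roughly as follows. This is a sketch only: the function names are hypothetical, the sample strings are toy values rather than real account IDs, and the published analysis code may work differently.

```python
from collections import Counter


def char_tally(accounts):
    """Count every character across all account IDs, skipping the leading 'r'."""
    counts = Counter()
    for account in accounts:
        counts.update(account[1:])
    return counts


def char_percentages(counts):
    """Convert raw counts into each character's percentage of all characters."""
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}


# Toy input for illustration only (not real account IDs).
sample = ["rabc", "rabd"]
tally = char_tally(sample)
percentages = char_percentages(tally)
```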

We can look at the distribution of characters for each individual character position in the account ID:
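One way to build such a per-position distribution is to keep a separate counter for each index. The sketch below assumes positions are numbered from 0 starting after the leading 'r'; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def positional_counts(accounts):
    """Map each character position (after the leading 'r') to a Counter of characters."""
    by_position = defaultdict(Counter)
    for account in accounts:
        for index, ch in enumerate(account[1:]):
            by_position[index][ch] += 1
    return by_position


# Toy input for illustration only (not real account IDs).
positions = positional_counts(["rab", "rac"])
```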

Account IDs have a standard length of 34 characters (including the leading 'r'), though not all accounts are of this length:
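The length distribution itself is a simple histogram over string lengths. A minimal sketch, with a hypothetical function name and toy input:

```python
from collections import Counter


def length_histogram(accounts):
    """Count how many account IDs have each total length (leading 'r' included)."""
    return Counter(len(account) for account in accounts)


# Toy input for illustration only (not real account IDs).
histogram = length_histogram(["rabc", "rabd", "rab"])
```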

In total, there are 50850633 characters in the account ID set, requiring about 48MB of space to store it (if using 8-bit characters).
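The storage figure follows directly from the character count. The sketch below reproduces the arithmetic from the two totals quoted in this report, assuming one byte per character and binary megabytes (2^20 bytes); the function names are illustrative.

```python
TOTAL_CHARACTERS = 50_850_633  # total characters, as reported above
ACCOUNT_COUNT = 1_497_352      # number of accounts analyzed, as reported above


def storage_mib(characters, bytes_per_character=1):
    """Storage needed in binary megabytes (MiB) for the given character count."""
    return characters * bytes_per_character / 2**20


def average_length(characters, accounts):
    """Mean account ID length in characters."""
    return characters / accounts


if __name__ == "__main__":
    print(round(storage_mib(TOTAL_CHARACTERS), 2))
    print(round(average_length(TOTAL_CHARACTERS, ACCOUNT_COUNT), 2))
```

The average length comes out just under 34, which matches the observation that a minority of account IDs are shorter than the standard length.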

We start by analyzing accounts for common substrings. The following are the most and least common 3, 5, and 7 character substrings in the list of IDs, with the number of occurrences in the account set:
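A substring count of this kind can be sketched with a sliding window. The version below is an assumption about the method: it counts every overlapping occurrence, so a substring that appears twice in one ID is counted twice. The function names are hypothetical.

```python
from collections import Counter


def substring_counts(accounts, n):
    """Count every length-n substring across all IDs, skipping the leading 'r'."""
    counts = Counter()
    for account in accounts:
        body = account[1:]
        # Slide a window of width n across the ID body.
        for start in range(len(body) - n + 1):
            counts[body[start:start + n]] += 1
    return counts


def most_and_least_common(counts, k=10):
    """Return the k most common and the k least common (substring, count) pairs."""
    ranked = counts.most_common()
    return ranked[:k], ranked[-k:][::-1]


# Toy input for illustration only (not a real account ID).
pairs = substring_counts(["rabcab"], 2)
```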

Looking at the complete 3 character set, we see 195111 combinations, consistent with the 58^3 = 195112 possible combinations of XRP alphabet characters. The distribution of these sequences in the data is shown below, expressed as the number of standard deviations (σ) from the mean (μ):
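Expressing each count as a number of standard deviations from the mean is a z-score, z = (x - μ) / σ. A minimal sketch follows, assuming the population standard deviation is used (the report does not say whether the population or sample form was applied); the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(counts):
    """Map each substring to (count - mean) / standard deviation."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All counts are identical, so nothing deviates from the mean.
        return {key: 0.0 for key in counts}
    return {key: (value - mu) / sigma for key, value in counts.items()}


# Toy counts for illustration only.
scores = z_scores({"abc": 1, "abd": 3})
```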

With the 5 and 7 character substring analysis, we see approximately 40-42 million unique substrings in the data set, which is significantly less than the corresponding 58^5 (about 656 million) and 58^7 (about 2.2 trillion) total combinations. We also see that repeated substrings are far less frequent at greater lengths. This can be explained by the greater number of permutations at the longer string lengths, many of which have not yet been generated in the ledger.
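The gap between observed and possible substrings can be made concrete with a little arithmetic. The sketch below compares the size of each substring space with an upper bound on how many windows the data set can contain at all. It assumes, as an approximation, that every ID body is 33 characters long after the leading 'r' is removed; the function names are illustrative.

```python
ALPHABET_SIZE = 58
ACCOUNT_COUNT = 1_497_352  # number of accounts analyzed, as reported above
BODY_LENGTH = 33           # assumed ID length after removing the leading 'r'


def space_size(n):
    """Number of possible length-n strings over the XRP alphabet."""
    return ALPHABET_SIZE ** n


def max_windows(accounts, n, body_length=BODY_LENGTH):
    """Upper bound on the number of length-n substrings the data set can hold."""
    return accounts * (body_length - n + 1)


if __name__ == "__main__":
    for n in (3, 5, 7):
        print(n, space_size(n), max_windows(ACCOUNT_COUNT, n))
```

Under these assumptions the data set holds at most roughly 40 to 43 million windows of length 5 or 7, so the number of unique substrings is capped by the amount of data rather than by the size of the substring space.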

Next we perform the same 3, 5, and 7 character substring analysis, this time ignoring case. The following are the most and least common case-insensitive substrings, with the corresponding number of occurrences in the account set:
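Ignoring case can be sketched by lowercasing each ID before windowing. Note that case folding shrinks the effective alphabet: the 58 characters collapse to 35 distinct symbols (26 letters and 9 digits). The names below are hypothetical and the sliding-window counting is an assumption about the method.

```python
from collections import Counter

XRP_ALPHABET = "rpshnaf39wBUDNEGHJKLM4PQRST7VWXYZ2bcdeCg65jkm8oFqi1tuvAxyz"

# Distinct symbols that remain once upper and lower case are merged.
FOLDED_ALPHABET = sorted(set(XRP_ALPHABET.lower()))


def substring_counts_ignoring_case(accounts, n):
    """Count length-n substrings across all IDs, treating upper and lower case alike."""
    counts = Counter()
    for account in accounts:
        body = account[1:].lower()
        for start in range(len(body) - n + 1):
            counts[body[start:start + n]] += 1
    return counts


# Toy input for illustration only (not a real account ID).
folded = substring_counts_ignoring_case(["rAbab"], 2)
```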

Again we plot the distribution of the case-insensitive 3 character combinations in the data, expressed as the number of standard deviations (σ) from the mean (μ):

Conclusion/Next Steps:

Because of its relatively low overlap, the 5 character substring can be seen as a good mechanism for referencing account IDs in an informal context (investigations, discussions pertaining to ledger activity, etc.). The case-insensitive version of account IDs should provide enough variance, though care should always be taken to ensure that the accounts in a set being analyzed or discussed can truly be distinguished by the fixed-length substring being used. The case-sensitive versions afford additional variance at the expense of additional referential complexity.
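The check that a set of accounts can truly be distinguished might look like the sketch below. It makes one assumption the text leaves open: the short reference is taken to be the first n characters after the leading 'r'. The function names and toy IDs are hypothetical.

```python
from collections import defaultdict


def short_reference(account, n=5, ignore_case=True):
    """Take the first n characters after the leading 'r' as a short reference."""
    ref = account[1:1 + n]
    return ref.lower() if ignore_case else ref


def ambiguous_references(accounts, n=5, ignore_case=True):
    """Return the short references that are shared by more than one account."""
    groups = defaultdict(list)
    for account in accounts:
        groups[short_reference(account, n, ignore_case)].append(account)
    return {ref: ids for ref, ids in groups.items() if len(ids) > 1}


# Toy input for illustration only (not real account IDs).
accounts = ["rAbCdEfgh", "rabcdeXYZ", "rQwErTyui"]
collisions = ambiguous_references(accounts)
```

An empty result means every account in the set has a unique short reference. In the toy set above the first two IDs collide when case is ignored but not when case is preserved, which illustrates the trade-off between variance and referential complexity.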

This work can be extended to incorporate algorithmic generation of "human-friendly" account mnemonics, based on ID parsing cross-referenced with a database of substrings rated as more significant. Perhaps this could be a community effort driven by a public resource where these substrings can be voted upon and ranked.