I Have 500 Million Keys But What’s In My Redis DB?!?
Or: The Bigger They Are, The Harder They F(etch)all
While you could consider them a “Rich People Problems,” big databases do present big challenges. Scaling and performance are perhaps the most popularly-debated of these, but even trivial operations can become insurmountable as the size of your database grows. Recently, while working on a memory fragmentation issue with one of our users’ databases, I was reminded of that fact.
For my investigation of the fragmentation issue, I needed information about the database’s keys. Specifically, to understand what had led to the high fragmentation ratio that the database exhibited, I wanted to estimate the size of the keys’ values, their TTL and whether they were being used. The only problem was that that particular database had an excess of 500,000,000 keys. As such, iterating over all the keys to obtain the information I was looking for wasn’t a practical option. So, instead of using the brute force approach, I developed a little Python script that helped me quickly arrive at a good estimate of the data. Here’s a sample of the script’s output:
Instead of crunching the entire mountain of data, my script basically uses a small (definable) number of random samples to generate the data I needed (i.e. average data sizes, TTLs and so forth). While the script’s results aren’t as accurate as a fetch-and-process-all maneuver, it gave me the information I was looking for. You can find the script’s source immediately below and I hope you’ll find it useful.