In the last 4 months I’ve been working on how to implement a good hash table for OPIC (Object Persistence in C). During development I ran a lot of experiments, not only to get better performance, but also to understand more deeply what’s happening inside the hash table. Many of the findings are surprising and inspiring. Since my project is getting mature, I’m taking a pause to start writing a hash table deep dive series. I had a lot of fun discovering these properties. Hope you enjoy it as much as I do.
Same disclaimer. I now work at Google, and this project (OPIC, including the hash table implementation) is approved by the Google Invention Assignment Review Committee as my personal project. The work is done only in my spare time on my own machine, and does not use and/or reference any Google internal resources.
The hash table is one of the most commonly used data structures. Most standard libraries use a chaining hash table, but there are more options in the wild. In contrast to chaining, open addressing does not create a linked list on a bucket with a collision; it inserts the item into another bucket instead. By inserting the item into a nearby bucket, open addressing gains better cache locality and has been shown to be faster in many benchmarks. The action of searching through candidate buckets for insertion, lookup, or deletion is known as probing. There are many probing strategies: linear probing, quadratic probing, double hashing, robin hood hashing, hopscotch hashing, and cuckoo hashing. This first post examines and analyzes the probe distribution among these strategies.
To write a good open addressing table, there are several factors to consider:

1. load: load is the number of occupied buckets over the bucket capacity. The higher the load, the better the memory utilization. However, higher load also means a higher probability of collision.
2. probe numbers: the number of probes is the number of lookups needed to reach the desired item. Setting cache efficiency aside, the lower the total probe count, the better the performance.
3. CPU cache hit and page fault: we can count both cache hits and page faults analytically and from CPU counters. I’ll write up that analysis in a later post.
Linear probing can be represented as a hash function of a key and a probe number: $h(k, i) = (h(k) + i) \bmod N$. Similarly, quadratic probing is usually written as $h(k, i) = (h(k) + i^2) \bmod N$. Double hashing is defined as $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod N$.
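The three formulas above can be sketched directly as functions. This is only an illustration: the function names are mine, and the base hash values are passed in as plain integers rather than computed from a key.

```python
def linear_probe(h, i, n):
    """i-th bucket visited by linear probing: (h + i) mod n."""
    return (h + i) % n


def quadratic_probe(h, i, n):
    """i-th bucket visited by quadratic probing: (h + i^2) mod n."""
    return (h + i * i) % n


def double_hash_probe(h1, h2, i, n):
    """i-th bucket visited by double hashing: (h1 + i * h2) mod n."""
    return (h1 + i * h2) % n


n = 11  # illustrative table size
print([linear_probe(3, i, n) for i in range(5)])          # [3, 4, 5, 6, 7]
print([quadratic_probe(3, i, n) for i in range(5)])       # [3, 4, 7, 1, 8]
print([double_hash_probe(3, 4, i, n) for i in range(5)])  # [3, 7, 0, 4, 8]
```

Linear probing stays next to the home bucket, quadratic probing moves away from it in growing steps, and double hashing jumps by a key-dependent stride from the very first collision.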
Quadratic probing is used by dense hash map. To my knowledge this is the fastest hash map with wide adoption. Dense hash map sets the default maximum load to 50%, and its table capacity is restricted to powers of 2. Given a table of size $2^n$, inserting $2^{n-1} + 1$ items triggers a table expansion, after which the load is about 25%. We can claim that if the user only inserts and queries items, the table load always stays between 25% and 50% (once the table has expanded at least once).
I implemented a generic hash table to simulate dense hash map probing behavior. Its performance is identical to dense hash map. The major difference is that I allow non-power-of-2 table sizes; see my previous post for why the performance does not degrade.
I set up the test with 1M inserted items. Each test differs in its load (by adjusting the capacity) and probing strategy. Although a hash table is O(1) on amortized lookup, we would still hope the worst case is no larger than O(log(N)), which is $\log_2(10^6) \approx 20$ in this case. Let’s first look at linear probing, quadratic probing and double hashing under 30%, 40%, and 50% load.
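A scaled-down version of this kind of experiment can be sketched as follows. This is not the OPIC test harness. It assumes random hash values in place of real keys, and a prime capacity so that double hashing can reach every bucket and quadratic probing is guaranteed to find a free bucket at loads up to 50%. The names `next_prime` and `probe_histogram` are mine.

```python
import random
from collections import Counter


def next_prime(n):
    """Smallest prime >= n (trial division; fine for demo sizes)."""
    def is_prime(m):
        if m < 2:
            return False
        d = 2
        while d * d <= m:
            if m % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n


def probe_histogram(n_items, load, strategy, seed=0):
    """Insert n_items random hashes; return Counter {probe count: items}."""
    rng = random.Random(seed)
    capacity = next_prime(int(n_items / load))
    occupied = [False] * capacity
    hist = Counter()
    for _ in range(n_items):
        h1 = rng.randrange(capacity)
        step = 1 + rng.randrange(capacity - 1)  # second hash, in 1..capacity-1
        for i in range(capacity):
            if strategy == "linear":
                idx = (h1 + i) % capacity
            elif strategy == "quadratic":
                idx = (h1 + i * i) % capacity
            elif strategy == "double":
                idx = (h1 + i * step) % capacity
            else:
                raise ValueError(strategy)
            if not occupied[idx]:
                occupied[idx] = True
                hist[i] += 1  # probe number 0 means no collision
                break
        else:
            raise RuntimeError("probe sequence found no empty bucket")
    return hist


for strategy in ("linear", "quadratic", "double"):
    hist = probe_histogram(20000, 0.5, strategy)
    print(strategy, "max probe:", max(hist), "no-collision inserts:", hist[0])
```

Plotting each returned counter on a log-scale Y axis gives a probe histogram for one strategy at one load. For loads above 50% the quadratic sequence in this sketch is no longer guaranteed to terminate, so a real implementation needs a different capacity or sequence choice there.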
The histogram of probe counts (with the Y axis in log scale) shows that, other than for linear probing, most probes are below 15. Double hashing gives us the smallest probe counts; however, each of its probes has a high probability of triggering a CPU cache miss, so it is slower in practice. Next, we look at these methods under high load.
The probe distribution now has a very high variance. Many probes exceed the threshold of 20, and some even reach 800. Linear probing in particular has very bad variance under high load. Quadratic probing is slightly better, but still has some probes higher than 100. Double hashing still gives the best probe statistics. Below is the zoomed-in view for each probing strategy:
The robin hood hashing heuristic is simple and clever. When a collision occurs, compare the two items’ probe counts: the one with the larger probe count stays and the other continues to probe. Repeat until the probing item finds an empty spot. For a more detailed analysis, check out the original paper. Using this heuristic, we can reduce the variance dramatically.
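A minimal sketch of that insertion rule, assuming linear probing and a table stored as a Python list of `(key, home_bucket)` pairs. The layout and the name `robin_hood_insert` are illustrative, not how OPIC stores its buckets.

```python
def robin_hood_insert(table, key, home):
    """Insert (key, home) into table using linear probing with robin hood swaps.

    table is a list whose entries are None or (key, home_bucket) tuples.
    """
    n = len(table)
    cur = (key, home % n)
    idx = home % n
    dist = 0  # probe count of the item currently being placed
    for _ in range(n):
        slot = table[idx]
        if slot is None:
            table[idx] = cur
            return
        resident_dist = (idx - slot[1]) % n  # how far the resident sits from home
        if resident_dist < dist:
            # The probing item has travelled further, so it takes the bucket
            # and the displaced resident continues probing.
            table[idx], cur = cur, slot
            dist = resident_dist
        idx = (idx + 1) % n
        dist += 1
    raise RuntimeError("table is full")


table = [None] * 8
robin_hood_insert(table, "a", 0)
robin_hood_insert(table, "d", 1)
robin_hood_insert(table, "b", 0)  # collides with "a", then displaces "d"
print(table[:3])  # [('a', 0), ('b', 0), ('d', 1)]
```

Without the swap, `"b"` would end up two buckets from home while `"d"` stayed at distance zero. With it, both sit one bucket from home, which is how the heuristic trades a little extra insert work for lower variance.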
Linear probing now has a worst case no larger than 50, quadratic probing has a worst case no larger than 10, and double hashing has a worst case no larger than 5! Robin hood hashing adds some extra cost on insertion and deletion, but if your table is read heavy, it’s really suitable for the job.
From an engineering perspective, the statistics are sufficient to make design decisions and move on to the next steps (though hopscotch and cuckoo hashing were not tested). That’s what I did 3 months ago. However, I could never stop asking why. How do we explain the differences? Can we model the distribution mathematically?
The analysis of linear probing can be traced back to 1963, to Donald Knuth. (It was an unpublished memo dated July 22, 1963, with the annotation “My first analysis of an algorithm, originally done during Summer 1962 in Madison”.) Later papers worth reading are:
Unfortunately, this research is super hard. Just linear probing (and its robin hood variant) is very challenging. Due to my poor survey ability, I have yet to find a good reference that explains what causes linear probing, quadratic probing and double hashing to differ in probe distribution. Though building a full distribution model is hard, creating a simpler one to convince myself turned out not to be too hard.
The main reason why linear probing (and probably quadratic probing) gets high probe counts is that the rich get richer: if you have a big chunk of elements, it is more likely to get hit; when it gets hit, the size of the chunk grows, and it just gets worse.
Let’s look at a simplified case. Say the hash table only has 5 items, and all the items are in one consecutive block. What is the expected probe number for the next inserted item?
See the linear probing example above. If the element gets inserted at bucket 1, it has to probe 5 times to reach the first empty bucket. (Here we start the probe sequence from index 0; probe number = 0 means the item was inserted into an empty spot without collision.) Starting at buckets 2 through 5 takes 4, 3, 2 and 1 probes respectively, so the expected probe number for the next inserted item is $\frac{5+4+3+2+1}{N} = \frac{15}{N}$.
For quadratic probing, you have to look at each of the items and track where it first probes outside of the block.
The expected probe number for the next item in quadratic probing is $\frac{3+2+2+2+1}{N} = \frac{10}{N}$. Double hashing is the easiest: $1\cdot\frac{5}{N}+2\cdot(\frac{5}{N})^2+3\cdot(\frac{5}{N})^3+\cdots$. If we only look at the first order (because $N \gg 5$), then we can simplify it to $\frac{5}{N}$.
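The per-bucket probe counts behind these numerators can be checked with a few lines of code. The sketch assumes the block occupies offsets 0 to 4, the rest of the table is empty, and $N$ is large enough that probe sequences never wrap around. `probes_to_escape` is a name I made up for this illustration.

```python
def probes_to_escape(start, block, offset):
    """Probe number at which a key hashing to `start` first leaves the block.

    The occupied block covers offsets 0..block-1, and offset(i) is the
    displacement of the i-th probe from the home bucket.
    """
    i = 0
    while 0 <= start + offset(i) < block:
        i += 1
    return i


block = 5
linear = [probes_to_escape(s, block, lambda i: i) for s in range(block)]
quadratic = [probes_to_escape(s, block, lambda i: i * i) for s in range(block)]
print(linear, sum(linear))        # [5, 4, 3, 2, 1] 15
print(quadratic, sum(quadratic))  # [3, 2, 2, 2, 1] 10
```

Dividing each sum by $N$ gives the expected probe number contributed by keys that hash into the block.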
The expected probe number of the next item shows that linear probing is worse than the other methods, but not by much. Next, let’s look at the probability that the block grows.
To calculate the probability that the block grows on the next insert, we have to account for the two buckets adjacent to the block. For linear probing, the probability is $\frac{5+2}{N}$. For quadratic probing, we add the adjacent buckets, but we also have to remove the buckets whose probe sequences jump out past them. For double hashing, the probability of growing the block has little to do with the size of the block, because you only need to care about the case where the item is inserted into one of the 2 adjacent buckets.
Using the same calculation, but making the block size a variable, we can now visualize the block growth of linear probing, quadratic probing, and double hashing.
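One way to reproduce that calculation is to count, for each block size, how many home buckets lead to an insertion that lands right next to the block. This is my reading of the simplified model, not a formal result. It assumes a single block at offsets 0 to `block - 1`, an otherwise empty table, and no wraparound. Double hashing is approximated to first order by the two adjacent buckets only. The function names are illustrative.

```python
def landing_offset(start, block, offset):
    """Offset of the first empty bucket reached by a key hashing to `start`."""
    i = 0
    while 0 <= start + offset(i) < block:
        i += 1
    return start + offset(i)


def growth_numerator(block, offset):
    """Number of home buckets whose insertion lands adjacent to the block."""
    adjacent = (-1, block)
    return sum(
        1
        for start in range(-1, block + 1)
        if landing_offset(start, block, offset) in adjacent
    )


for block in (5, 10, 20, 40):
    lin = growth_numerator(block, lambda i: i)       # equals block + 2
    quad = growth_numerator(block, lambda i: i * i)
    dbl = 2  # first-order approximation: direct hits on the two neighbours
    print(block, lin, quad, dbl)
```

Dividing each count by $N$ gives the growth probability. For a block of 5 this yields 7 for linear probing and 4 for quadratic probing, and the linear count keeps growing with the block size while the double hashing count stays flat.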
This is not a very formal analysis. However, it gives us a sense of why linear probing degrades so much faster than the others. We learn not only which one is better than the other, but also how large the differences are.
How about the robin hood variants of these three probing methods? Unfortunately, I wasn’t able to build a good model that can explain the differences. A formal analysis of robin hood hashing using linear probing was developed by Viola. I have yet to find a good analysis of applying robin hood to the other probing methods. If you find one, please leave a comment!
Writing a (chaining) hash table to pass an interview is trivial, but writing a good one turns out to be very hard. The key to writing high performance software is to stop guessing.
Measure, measure, and measure. Program elapsed time is just one sample point, and can be biased by many things. To understand the program’s runtime performance, we need to look further at program internal statistics (like the probe distribution in this article), CPU cache misses, memory usage, page fault count, etc. Capture the information, and analyze it scientifically. This is the only way to push the program to its limit.
This is my first article in the “Learn hash table the hard way” series. In the following posts I’ll present more angles for examining hash table performance. Hope you enjoy it!
At the beginning of 2021, ArenaNet announced that they would no longer be supporting the macOS client for Guild Wars 2. As a longtime player, I found this… unfortunate.
There are several ways to run the Windows client on a Mac including dual-booting Windows with Bootcamp, but that’s not an option for me. Running in a virtual machine like VMWare or Parallels is too slow, so that leaves some form of Wine which is a Windows compatibility layer. The old macOS 32-bit Guild Wars 2 client actually used a version of Wine to run. It wasn’t as good as when they released a 64-bit native client, but it worked for a time.
I tried several ways to run Guild Wars 2 using Wine before finding one that works for me. In this post I’ll explain how to set it up.