Hash Tables
Source: Skiena
Hashing and Strings
Hash tables are a very practical way to maintain a dictionary. They exploit the fact that looking an item up in an array takes constant time once you have its index.
Hash function
A hash function is a mathematical function that maps keys to integers.
We will use the value of our hash function as an index into an array, and store our item at that position.
The first step of the hash function is usually to map each key to a big integer. Let α be the size of the alphabet on which a given string S is written. Let char(c) be a function that maps each symbol of the alphabet to a unique integer from 0 to α − 1. The function
$$H(S) = \sum_{i=0}^{|S|-1} \alpha^{|S|-(i+1)} \times \mathrm{char}(s_i)$$
maps each string to a unique (but large) integer by treating the characters of the string as “digits” in a base-α number system.
The result is unique identifier numbers, but they are so large they will quickly exceed the number of slots in our hash table (denoted by m). We must reduce this number to an integer between 0 and m−1, by taking the remainder of H(S) mod m. This works on the same principle as a roulette wheel. The ball travels a long distance around and around the circumference-m wheel ⌊H(S)/m⌋ times before settling down to a random bin. If the table size is selected with enough finesse (ideally m is a large prime not too close to 2^i − 1), the resulting hash values should be fairly uniformly distributed.
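As a concrete illustration, here is a minimal Python sketch of this scheme. The alphabet size α = 128 (ASCII-style) and the table size m = 101 are illustrative assumptions, not fixed by the text; reducing mod m at every step is equivalent to computing H(S) mod m at the end, but keeps the intermediate numbers small.

```python
def string_hash(s, m, alpha=128):
    """Hash string s into a table of m slots by treating its characters
    as digits of a base-alpha number, then reducing mod m."""
    h = 0
    for c in s:
        # Horner's rule: shift left by one base-alpha "digit" and add the next.
        h = (h * alpha + ord(c)) % m
    return h

# Example: index into a table with m = 101 slots (a prime).
print(string_hash("ALGORITHM", 101))
```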
Collision Resolution
No matter how good our hash function is, we had better be prepared for collisions, because two distinct keys will occasionally hash to the same value.
Chaining is the easiest approach to collision resolution.
Represent the hash table as an array of m linked lists. The ith list will contain all the items that hash to the value of i. Thus search, insertion, and deletion reduce to the corresponding problem in linked lists. If the n keys are distributed uniformly in a table, each list will contain roughly n/m elements, making them a constant size when m ≈ n.
Chaining is very natural, but devotes a considerable amount of memory to pointers. This is space that could be used to make the table larger, and hence the “lists” smaller.
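A minimal sketch of a chained hash table in Python, assuming Python's built-in hash as the hash function and an illustrative bucket count of m = 101:

```python
class ChainedHashTable:
    """Hash table with collision resolution by chaining:
    an array of m buckets, each a list of (key, value) pairs."""

    def __init__(self, m=101):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key):
        return hash(key) % self.m      # any function mapping keys to 0..m-1

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # overwrite an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]
```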
The alternative is something called open addressing.
The hash table is maintained as an array of elements (not buckets), each initialized to null. On an insertion, we check to see if the desired position is empty. If so, we insert it. If not, we must find some other place to insert it instead. The simplest possibility (called sequential probing) inserts the item in the next open spot in the table. If the table is not too full, the contiguous runs of items should be fairly small, hence this location should be only a few slots from its intended position.
Searching for a given key now involves going to the appropriate hash value and checking to see if the item there is the one we want. If so, return it. Otherwise we must keep checking through the length of the run.
Deletion in an open addressing scheme can get ugly, since removing one element might break a chain of insertions, making some elements inaccessible. We have no alternative but to reinsert all the items in the run following the new hole.
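A sketch of open addressing with sequential probing, including the reinsert-the-run deletion just described (Python; the table size and the use of the built-in hash are illustrative assumptions, and the sketch assumes the table never fills completely):

```python
class OpenAddressTable:
    """Open addressing with sequential (linear) probing.
    Each slot holds a (key, value) pair or None."""

    def __init__(self, m=101):
        self.m = m
        self.slots = [None] * m

    def insert(self, key, value):
        i = hash(key) % self.m
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.m        # probe the next slot in the run
        self.slots[i] = (key, value)

    def search(self, key):
        i = hash(key) % self.m
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.m        # keep scanning through the run
        return None

    def delete(self, key):
        i = hash(key) % self.m
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.m
        if self.slots[i] is None:
            return                      # key not present
        self.slots[i] = None            # make a hole ...
        # ... then reinsert every item in the run that follows the hole,
        # so nothing becomes unreachable.
        j = (i + 1) % self.m
        while self.slots[j] is not None:
            k, v = self.slots[j]
            self.slots[j] = None
            self.insert(k, v)
            j = (j + 1) % self.m
```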
Chaining and open addressing both require O(m) to initialize an m-element hash table to null elements prior to the first insertion. Traversing all the elements in the table takes O(n + m) time for chaining, because we have to scan all m buckets looking for elements, even if the actual number of inserted items is small. This reduces to O(m) time for open addressing, since n must be at most m.
When using chaining with doubly-linked lists to resolve collisions in an m-element hash table, the dictionary operations for n items can be implemented in the following expected and worst case times:
| Dictionary operation | Hash table (expected) | Hash table (worst case) |
|---|---|---|
| Search(L, k) | O(n/m) | O(n) |
| Insert(L, x) | O(1) | O(1) |
| Delete(L, x) | O(1) | O(1) |
| Successor(L, x) | O(n + m) | O(n + m) |
| Predecessor(L, x) | O(n + m) | O(n + m) |
| Minimum(L) | O(n + m) | O(n + m) |
| Maximum(L) | O(n + m) | O(n + m) |
Pragmatically, a hash table is often the best data structure to maintain a dictionary. However, the applications of hashing go far beyond dictionaries.
Efficient String Matching via Hashing
Strings are sequences of characters where the order of the characters matters, since ALGORITHM is different than LOGARITHM. Text strings are fundamental to a host of computing applications, from programming language parsing/compilation, to web search engines, to biological sequence analysis.
The primary data structure for representing strings is an array of characters. This allows us constant-time access to the ith character of the string. Some auxiliary information must be maintained to mark the end of the string - either a special end-of-string character or (perhaps more usefully) a count of the n characters in the string.
The most fundamental operation on text strings is substring search, namely:
- Problem: Substring Pattern Matching
- Input: A text string t and a pattern string p.
- Output: Does t contain the pattern p as a substring, and if so where?
The simplest algorithm to search for the presence of pattern string p in text t overlays the pattern string at every position in the text, and checks whether every pattern character matches the corresponding text character. This runs in O(nm) time, where n = |t| and m = |p|.
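For concreteness, a sketch of this brute-force matcher in Python (the function name is illustrative):

```python
def brute_force_match(p, t):
    """Return the first position where pattern p occurs in text t,
    or -1 if it does not occur.  O(nm) time in the worst case."""
    n, m = len(t), len(p)
    for i in range(n - m + 1):          # overlay p at every position of t
        if t[i:i + m] == p:             # character-by-character comparison
            return i
    return -1
```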
This quadratic bound is worst-case. More complicated, worst-case linear-time search algorithms do exist. But here we give a linear expected-time algorithm for string matching, called the Rabin-Karp algorithm. It is based on hashing. Suppose we compute a given hash function on both the pattern string p and the m-character substring starting from the ith position of t. If these two strings are identical, clearly the resulting hash values must be the same. If the two strings are different, the hash values will almost certainly be different. These false positives should be so rare that we can easily spend the O(m) time it takes to explicitly check the identity of two strings whenever the hash values agree.
This reduces string matching to n−m+2 hash value computations (the n−m+1 windows of t, plus one hash of p), plus what should be a very small number of O(m) time verification steps. The catch is that it takes O(m) time to compute a hash function on an m-character string, and O(n) such computations seem to leave us with an O(mn) algorithm again.
But let’s look more closely at our previously defined hash function, applied to the m characters starting from the jth position of string S:
$$H(S, j) = \sum_{i=0}^{m-1} \alpha^{m-(i+1)} \times \mathrm{char}(s_{i+j})$$
What changes if we now try to compute H(S, j + 1), the hash of the next window of m characters? Note that m − 1 characters are the same in both windows, although the number of times each is multiplied by α differs by one. A little algebra reveals that
$$H(S, j+1) = \alpha \left( H(S, j) - \alpha^{m-1} \, \mathrm{char}(s_j) \right) + \mathrm{char}(s_{j+m})$$
This means that once we know the hash value from the jth position, we can find the hash value from the (j + 1)st position for the cost of two multiplications, one addition, and one subtraction. This can be done in constant time (the value of α^{m−1} can be computed once and used for all hash value computations). This math works even if we compute H(S, j) mod M, where M is a reasonably large prime number, thus keeping the size of our hash values small (at most M) even when the pattern string is long.
Rabin-Karp is a good example of a randomized algorithm (if we pick M in some random way). We get no guarantee the algorithm runs in O(n+m) time, because we may get unlucky and have the hash values regularly collide with spurious matches. Still, the odds are heavily in our favor - if the hash function returns values uniformly from 0 to M − 1, the probability of a false collision should be 1/M . This is quite reasonable: if M ≈ n, there should only be one false collision per string, and if M ≈ nk for k ≥ 2, the odds are great we will never see any false collisions.
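Putting the pieces together, here is a Rabin-Karp sketch in Python. The alphabet size α = 128 and the prime M = 2^31 − 1 are illustrative choices; the rolling update follows the recurrence above, and any hash match is verified explicitly to rule out false positives.

```python
def rabin_karp(p, t, alpha=128, M=(1 << 31) - 1):
    """Return the first position where pattern p occurs in text t, or -1.
    Hash values are kept mod the prime M."""
    n, m = len(t), len(p)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(alpha, m - 1, M)          # alpha^(m-1) mod M, computed once
    hp = ht = 0
    for i in range(m):                   # hash of p and of t's first window
        hp = (hp * alpha + ord(p[i])) % M
        ht = (ht * alpha + ord(t[i])) % M

    for j in range(n - m + 1):
        if hp == ht and t[j:j + m] == p:  # verify to rule out a false positive
            return j
        if j < n - m:                     # roll the hash to the next window
            ht = (alpha * (ht - high * ord(t[j])) + ord(t[j + m])) % M
    return -1
```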
Duplicate Detection Via Hashing
The key idea of hashing is to represent a large object (be it a key, a string, or a substring) using a single number. The goal is a representation of the large object by an entity that can be manipulated in constant time, such that it is relatively unlikely that two different large objects map to the same value.
Hashing has a variety of clever applications beyond just speeding up search. I once heard Udi Manber - then Chief Scientist at Yahoo - talk about the algorithms employed at his company. The three most important algorithms at Yahoo, he said, were hashing, hashing, and hashing.
Consider the following problems with nice hashing solutions:
- Is a given document different from all the rest in a large corpus? A search engine with a huge database of n documents spiders yet another webpage. How can it tell whether this adds something new to the database, or is just a duplicate page that exists elsewhere on the Web?
Explicitly comparing the new document D to all n documents is hopelessly inefficient for a large corpus. But we can hash D to an integer, and compare it to the hash codes of the rest of the corpus. Only when there is a collision is D a possible duplicate. Since we expect few spurious collisions, we can explicitly compare the few documents sharing the exact hash code with little effort.
- Is part of this document plagiarized from a document in a large corpus? A lazy student copies a portion of a Web document into their term paper. "The Web is a big place," they smirk. "How will anyone ever find which one?"
This is a more difficult problem than the previous application. Adding, deleting, or changing even one character from a document will completely change its hash code. Thus the hash codes produced in the previous application cannot help for this more general problem.
However, we could build a hash table of all overlapping windows (substrings) of length w in all the documents in the corpus. Whenever there is a match of hash codes, there is likely a common substring of length w between the two documents, which can then be further investigated. We should choose w to be long enough so such a co-occurrence is very unlikely to happen by chance.
The biggest downside of this scheme is that the size of the hash table becomes as large as the documents themselves. Retaining a small but well-chosen subset of these hash codes (say those which are exact multiples of 100) for each document leaves us likely to detect sufficiently long duplicate strings; a sketch of this scheme appears after this list.
- How can I convince you that a file isn't changed? In a closed-bid auction, each party submits their bid in secret before the announced deadline. If you knew what the other parties were bidding, you could arrange to bid $1 more than the highest opponent and walk off with the prize as cheaply as possible. Thus the "right" auction strategy is to hack into the computer containing the bids just prior to the deadline, read the bids, and then magically emerge the winner.
How can this be prevented? What if everyone submits a hash code of their actual bid prior to the deadline, and then submits the full bid after the deadline? The auctioneer will pick the largest full bid, but checks to make sure the hash code matches the one submitted prior to the deadline. Such cryptographic hashing methods provide a way to ensure that the file you give me today is the same as the original, because any change to the file will result in a changed hash code.
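To make the plagiarism-detection application above concrete, here is a sketch of the windowed fingerprinting scheme in Python. The window length w = 50 and the use of Python's built-in hash are illustrative assumptions; the keep-only-multiples-of-100 rule follows the text.

```python
def fingerprints(doc, w=50):
    """Hash every overlapping window of length w in doc, and retain
    only those hash codes that are exact multiples of 100."""
    codes = set()
    for i in range(len(doc) - w + 1):
        h = hash(doc[i:i + w])           # any string hash will do here
        if h % 100 == 0:                 # keep a small, well-chosen subset
            codes.add(h)
    return codes

def likely_shared_substring(doc1, doc2, w=50):
    """Two documents likely share a length-w substring if their fingerprint
    sets intersect; any hit can then be verified explicitly."""
    return bool(fingerprints(doc1, w) & fingerprints(doc2, w))
```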
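And a minimal sketch of the bid commitment using SHA-256 from Python's hashlib (the bid strings are illustrative; a real protocol would also hash in a random nonce so a short bid cannot simply be guessed):

```python
import hashlib

def commit(bid: str) -> str:
    """Publish this hash code before the deadline, keeping the bid secret."""
    return hashlib.sha256(bid.encode()).hexdigest()

def verify(bid: str, commitment: str) -> bool:
    """After the deadline, check the revealed bid against the earlier hash."""
    return hashlib.sha256(bid.encode()).hexdigest() == commitment

c = commit("$1,500,000")        # submitted before the deadline
print(verify("$1,500,000", c))  # True: the bid was not changed
print(verify("$1,500,001", c))  # False: any change alters the hash code
```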
Although the worst-case bounds on anything involving hashing are dismal, with a proper hash function we can confidently expect good behavior. Hashing is a fundamental idea in randomized algorithms, yielding linear expected-time algorithms for problems otherwise Θ(n log n), or Θ(n²) in the worst case.
Hashmap vs. Hash Table: Understanding the Differences
https://ethans.co.in/blogs/hashmap-vs-hash-table-understanding-the-differences/
Introduction
Hashing is a technique used in computer science to store and retrieve data in a fast and efficient manner. Hashing involves converting a key into an index in an array, where the data associated with that key can be stored. The two most commonly used hash-based data structures are the hash table and the hashmap. Both these data structures have their own unique set of features and trade-offs.
Both hash tables and hash maps are valuable data structures that can be used to efficiently store and retrieve key-value pairs. While they have some similarities, they also have some key differences that set them apart. When deciding which to use, it is important to consider the specific requirements of your use case.
What is a HashMap?
A HashMap is a data structure that is used to store key-value pairs. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. A HashMap is also known as a hash map, map, dictionary, or associative array. In a HashMap, keys are unique, but values can be duplicated.
What is a Hash Table?
A Hash Table is a data structure that uses a hash function to map keys to index values in an array. It stores key-value pairs in a bucket, where the index of the bucket is determined by the hash code of the key. Hash Tables are also known as hash maps, maps, dictionaries, or associative arrays. In a Hash Table, keys are unique, but values can be duplicated.
Differences Between HashMaps and Hash Tables
Data Structure
NOTE: This may not be exactly accurate. Chaining is not specific to HashMaps: hash tables in general can resolve collisions by either chaining or open addressing. See the explanation from Skiena above.
The main difference between HashMaps and Hash Tables is their underlying data structure. HashMaps are implemented using an array of linked lists, where each element in the array is a linked list of key-value pairs. Hash Tables, on the other hand, use an array of buckets, where each bucket is a key-value pair.
Performance
When it comes to performance, both HashMaps and Hash Tables offer constant-time search, insertion, and deletion operations on average. However, the performance of both structures can degrade in the case of collisions. In HashMaps, collisions are handled by chaining, where multiple key-value pairs are stored in the same bucket using a linked list. In Hash Tables, collisions are handled by open addressing, where a different bucket is selected for the key-value pair using a secondary hash function.
Collision Handling
As mentioned earlier, both HashMaps and Hash Tables can suffer from collisions. The main difference in collision handling lies in the method used to resolve them. In HashMaps, collisions are resolved using chaining, where multiple key-value pairs are stored in the same bucket using a linked list.
In Hash Tables, collisions are resolved using open addressing, where a different bucket is selected for the key-value pair using a secondary hash function.
Order of Keys
In a hash table, the keys are stored in sorted order, which means that iteration over the keys will always return them in the same order. In contrast, the keys in a hash map are not stored in any particular order.
Memory Consumption
Hash tables typically require more memory than hash maps, as each bucket needs to store both a key and a value. In contrast, hash maps only store the value in an array, which results in less memory consumption.
Thread Safety
Hash tables are typically thread-safe, which means that they can be accessed by multiple threads concurrently without causing data corruption. In contrast, hash maps are not inherently thread-safe, and additional synchronization is required to prevent data corruption.
When to Use Hash Tables vs Hash Maps
Use Hash Tables When:
- You need the keys to be stored in sorted order
- You need thread-safe access to the data structure
Use Hash Maps When:
- You need faster performance
- You have memory constraints
- You don’t require the keys to be stored in sorted order