Algorithms & Data Structures : Static Hashing

14 min readJun 13, 2021

Overview

Introduction
Static Hashing
-- Hash Tables
-- Hash Functions
---- Division
---- Mid-Square
---- Folding
---- Digit Analysis
---- Converting Keys to Integers
-- Overflow Handling
---- Open Addressing
---- Chaining
-- Theoretical Evaluation of Overflow Techniques

Introduction

Binary Search Tree 最糟可以在 O(N) 的執行時間下進行搜尋，插入元素，以及刪除元素。而 Balanced Binary Search Tree 可以加速到 O(logN)，而有個技巧可以在元素個數沒有誇張地多的時候，實現更快速的 O(1) 搜尋、插入元素，以及刪除元素，這便是 Hashing 。

Static Hashing

Hash Tables

在 static hash 裡，我們的 key 會使用 Hash Function 轉換城 Bucket 的編號。一個 Bucket 可以用 1 到多個的 Slot。

還是找不到的話，就往原 bucket 的下 ²² = 4 個 bucket 找空位。Definition :

key density : the ratio n/T, where n is the number of pairs in the table and T is the total number of possible keys.

n : 我們想放進 hash table 的鍵 : 值對總數。

T : 所有 key 的可能性。假設我們定義 key 是一個 3 個字母組成的短字串，則 key 的所有可能性為 26 + 26 * 26 + 26 * 26 * 26。

通常 T 的值會相當的大，導致 key density 的值相當的小。且 bucket 的數量相對於 T 的值來說，也是相當的小，這導致我們有可能將不同的 key 分配到同一個 bucket。

Synonyms : Two keys, k1 and k2, are said to be synonyms with respect to hash_function if hash_function(k1) == hash_function(k2).

兩個 key 經由 hash_function 計算後，是要放在同一個 bucket ，那這兩個 key 就是同義詞( Synonyms )。

Loading density : n / sb

n : 我們想放進 hash table 的鍵 : 值對總數。

s : hash table 的 Slot 個數

b : hash table 的 bucket 個數

該值越大，發生碰撞 ( collision ) 的機率越高

overflow : 當我們想把一對 key : value 塞進一個 Slot 已經用完的 Bucket，則 overflow 就會發生。

collision : 當我們想把一對 key : value 塞進一個非空的 bucket，則 collision 會發生。

Example

這裡以圖(一)舉一個例子，來運用上述介紹的名詞。

我們想要插入 5 對鍵值對 : [ABC, 1], [ACB, 2], [AAA, 3], [BBB, 4], [CCC, 5]
→ n = 5

Slot 個數為 2

Bucket 個數為 26

我們定義一個 key 是由三個字母組成
→ T = 26 + 26 * 26 + 26 * 26 * 26

key density = n / T = 5 / (26 + 26 * 26 + 26 * 26 * 26)

loading density = n / sb = 5 / 52

hash_function : 第一個字母相同的 key ，會被放進同一個 bucket。

鍵值 ABC 以及 ACB 會被放進同一個 Bucket ，於是 ABC, ACB 互為 Synonyms。

在我們放入鍵值對 [ABC, 1] 後，再試著放入 [ACB, 2]，因為被分配到的 Bucket 裡已經存在 [ABC, 1] ，所以 Bucket 非空，這時候會發生碰撞 ( collision )。

在我們放入鍵值對 [ABC, 1], [ACB, 2] 後，再試著放入 [AAA, 3]，這時候因為被分配到的 Bucket 已經用完所有 Slot 了，於是會發生 overflow 。

當 overflow 發生時，就只能含淚把這個資料丟掉了嗎？看來等等需要想個辦法處理 overflow 的情況。

Hash Table 厲害的一點就是，插入，刪除，搜尋的動作跟我們要處理的資料總量 ( 也就是上述名詞介紹裡的 n ) 脫鉤了，不像 Binary Search Tree 、Heap等等，計算速度跟資料總量有緊密的關係。

Hash Table 的速度取決於 Hash Function 的計算速度，以及每一個 Bucket 裡的 Slot 數量。而要怎麼在一個 Bucket 裡搜尋 Slot 可以取決於怎麼實作，簡單的話就使用 sequential search （一個個遍尋）。

Hash Function 最理想的狀態下，是可以快速計算，且碰撞(collision)的機率越小越好。

Overflow 是怎常發生的情況，必須想些辦法來處理 overflow 。

Hash Functions

Division

hash_function( kay ) = kay % D

這代表我們的 bucket 數量需要 b >= D。

在現實世界中的應用中， overflow 的機率很大一部份跟 D 有關。雖然理論上，上面的 hash_function 可以讓每一個數字分到某個特定 bucket 的機率相同 ( uniform hash function )，但真實世界的應用可能不會那麼簡單。

假如我們簡單地把 D =2 ，則偶數跑去 bucket 0，奇數跑去 bucket 1。在真實世界的應用下，很容易有大部分的 key 為奇數，或大部分的數為偶數的情況。

根據 Reference 2 所述， D 的質因數越大越好，或甚至將 hash table 的 bucket 數量設為某個質數，並取其為 D ( 使 D 為質數 )。

Mid-Square

將 key 取其 2 次方後，取其中間幾個 bit。例如 hash table 的 bucket 數量為 8，則取其中間 3 個 bit。

Folding

Shift Folding : 一開始可以選定 x 個位數，並每 x 位數取出，並相加。
舉個例子：123456789102，且設 x = 3
則 P1 = 123, P2 = 456, P3 = 789, P4 = 102
結果 hash_function (123456789102) = P1 + P2 + P3 + P4 = 1470

Folding at the boundaries : 上述那個方法，在偶數的部份將數倒轉就是了。
舉個例子：P2' = 654, P4' = 201
結果 hash_function(123456789102) = P1 + P2' + P3 + P4' = 1767

Digital Analysis

可用在我們已經預先知道所有 key 的可能性了，並對這組 key 設置一個有效的 hash_function。

這部份我找不太到那麼多資料，所以只是猜測。

例如我們的 hash table 的大小為 8，所以我們只需要取 3 個 bit。那我們要取哪幾個 bit 呢? 因為我們已經知道所有 key 的可能性了，這時候我們可以先取第 0, 1, 2 個 bit，並進行計算。接下來試試看取第 0, 5, 6 個 bit。實驗到最後，發覺取 4, 7, 8 個 bit 的 collision 機率最小，此時我們就可以選取第 4, 7, 8 個 bit 作為 hash_function。

Converting Keys to Integers

有時候鍵值不一定會是非負整數，有時候可能是一個字串。這時候為求方便，可以先將字串轉換成非負整數。

方法 1: 簡單的把所有 char 值加起來

unsigned int stringToInt(char *key) {
  int number = 0;
  while(*key) {
    number += *key++;
  }
}

方法 2 : 也可以做些偏移後再加在一起

unsigned int stringToInt(char *key) {
  int number = 0;
  while(*key) {
    number += *key++;
    if(*key) number += ((int) *key++) << 8;
  }
}

Overflow Handling

有兩種很熱門的 overflow handling 技巧，其中一個是 Open Addressing，另一個就是 chaining 了。而 Open Address 又能分為許多種類，例如下面會稍微介紹的 linear probing ( linear open addressing)，quadratic probing, rehashing, random probing。

Open Addressing : Linear Probing

Linear Probing 可以簡單地用上面的算式表示。ht 代表 hash table, h 代表 hash function，b 代表該 hash table 的 bucket 數量。當 h(k) 算出的 bucket 已經沒有位子了，就再試試看 (h(k) + 1) % b 的地方能不能放入我們的鍵值對，假如還是不行的話，就試試看 (h(k) + 2) % b。。。以此類推。

簡易的範例程式碼：

typedef struct element element;
struct element {
  int key;
  int data;
};
element *search(int k) {
  // 此 hash table 只有一個 slot
  // 在此 hash table 找不到此 key，就回傳 NULL
  // 找到的話，就回傳指向此鍵值對的指標
  int homeBucket, currentBucket;
  homeBucket = h(k);
  for ( currentBucket = homeBucket; ht[currentBucket] && ht[currentBucket]->key != k;) {
    currentBucket = (currentBucket + 1) % b; 
    if(currentBucket == homeBucket) {
      return NULL;
    }
  }
  if (ht[currentBucket]->key == k) {
    return ht[currentBucket];
  }
  return NULL;
}

要注意到，這種 overflow handling 方式雖然總比較次數的期望值可能很低，但 worst-case 的次數有可能很高。

Open Addressing : quadratic probing

或是

在 Linear Probing 時，當我們發生了 overflow ，就會往 hash table 原 bucket 的下 1 個 bucket 找空位。還是找不到的話，就往原 bucket 的下 2 個 bucket 找空位。還是找不到的話，就往原 bucket 的下 3個 bucket 找空位。。。以此類推。

而 quadratic probing 就是當 overflow 發生時，就會往 hash table 原 bucket 的下 1^2 = 1 個 bucket 找空位。還是找不到的話，就往原 bucket 的下 2^2 = 4 個 bucket 找空位。還是找不到的話，就往原 bucket 的下 3^2 = 9 個 bucket 找空位。。。以此類推。

Open Addressing : rehashing

準備很多個 hash function : h_1, h_2, h_3, ... h_n，當第一個 hash function 發生 overflow 後，再使用第二個 hash function。。。以此類推。

Open Addressing : random probing

s(i) 的意思是一個 pseudo random number generator( 虛擬亂數產生器 )，每當 overflow 發生時，我們就用虛擬亂數產生器來看下次要嘗試哪一個 bucket。

Chaining

把所有同義詞 ( synonyms ) 通通放在同一個 bucket。看是要用 linked list 串起來 { worst-case : O(n) }, 或是用 balanced binary search tree { worst-case : O(logn) }。

假若是用 linked list 將同一個 bucket 內的鍵值對串起來，並且使用 uniform hash function ( 使用 hash function 計算某個隨機鍵值對的 bucket 編號時，每一個 bucket 被分配到的機率相等，則此 hash function 便是 uniform hash function )，則總比對次數的期望值 ( the expected average number of key comparisons for a successful search ) 為

一個簡單的 linked list 的 chaining 範例:

typedef struct element element;
struct element {
  int key;
  int data;
  element* link;
};
element* search(int k) {
  element *current;
  int homeBucket = h(k);
  for ( current = ht[homeBucket]; current;current = current->link ) {
  if(current->key == k) {
    return current;
  }
  return NULL;
}

Theoretical Evaluation of Overflow Techniques

這邊簡單介紹如何推導出 uniform hash function + chaining 理論上的平均所需比較次數。

U_n :
the expected number of key comparisons when a search is made for a key not in the hash table.
當我們在一張 hash table 裡想搜尋一個鍵值對，並且找不到時，所需比較次數的期望值

S_n :
the average number of comparisons needed to find the jth key k_j, averaged over 1 ≤ j ≤ n, with each j equally likely, and averaged over all b^n hash sequences, assuming each of these also to be equally likely.
我們在 hash table 裡有一群鍵值對，共 n 個。當我們想要查詢所有 n 個鍵值對所需比較次數的期望值。

Theorem :

Proof:
首先來證明 U_n。U_n 除了可以解釋成 “ 當我們在一張 hash table 裡想搜尋一個鍵值對，並且找不到時，所需比較次數的期望值 ” ，也可以解釋成 “每個 bucket 的 chain 內鍵值對個數的期望值 ”，因為所需比較次數就是 chain 的長度。

而 “每個 bucket 的 chain 內鍵值對個數的期望值 ” 也就是 loading density，所以