7. Hashing - swchen1234/Algorithm GitHub Wiki

理论

著名hash算法

  • MD5, SHA-1/2, 可能都已过时,时间复杂度为 O(size(key))
  • Robin Karp:用sliding window的方式实现字符串查找

Hash 的实现

  • closed hashing(数组实现,若占位则顺延)=》 效率低
  • open hashing 由链表实现
  • hash饱和度 = 实际存储元素(size)/总共开辟空间大小(capacity), 一般 >= 1/10时就要rehash了
  • rehash效率低,建议一开始就设置足够大的capacity

String Hashing 的例子

def hash(string s):
sum = 0
for i in range(len(s)):
    sum = sum * 31 + int(s[i]) # 31 由经验得
    sum %= HASH_TABLE_SIZE
return sum

常见应用:

  1. Array one pass
  2. Hash每个字母或pattern
  3. Hash 2D矩阵
  4. 用Counter
  5. Design 类
  6. Sparse Vector

Problems

1. Array One Pass

1. Two Sum用dict存已访问的元素的val -> index
454. 4Sum IITwo Sum的延伸. 两个hash表中分别存a+b, 以及c+d
128. Longest Consecutive SequenceUse a hash set to store all values, and only check from the beginning of the consecutive sequence(the previous value is not in the set). For each such value, we search the set consecutive until we cannot find consecutive, then this is the length of subsequence starting at this value.
217. Contains Duplicate
290. Word Pattern 判断一个字符串与word list之间是否有一对应 => 2 hash map
219. Contains Duplicate II 128. Longest Consecutive Sequence化为set,对每个v, 若没有v-1,则v可能为左端点,while v in set: v+=1
1169. Invalid Transactions{name:{time:[city, idx]}}, 以及result set.
1554. Strings Differ by One Character用robin karp将每个单词hash成一个数字,从每一位开始,对每一个单词,判断去掉改字母后的hash值是否之前已经出现过
1590. Make Sum Divisible by P计算target_prefix_module = sum(nums)%p, 一次遍历,若(prefix + p - target_prefix_module)%p 已存在,则找到一个满足条件的subarray, 更新答案;hash[prefix] = i\

2. Hash每个字母或pattern

792. Number of Matching Subsequences用waiting存{startLetter:[remain part of each word to be matched]} 2062. Count Vowel Substrings of a String用lastSeen[vowel]=最近出现的位置,以及lastConsonant的位置, res += max(min(lastSeen.values()) - lastConsonant, 0)
249. Group Shifted Strings每个单词化为其字母的相对位置差(ord(s[i]-ord(s[i-1])+26)%26)
1072. Flip Columns For Maximum Number of Equal Rowshash每一行的pattern\

3. Hash 2D矩阵

36. Valid Sudoku对行、列、方块分布建立9*9 boolean hash
348. Design Tic-Tac-Toe同数独,row[], col[], diag, antiDiag中存放累积1或-1.
1074. Number of Submatrices That Sum to Target计算所有以左上角为顶点的presum, for row1, for row2, 化为找到row1和row2间的和为target的子矩阵(类Q560)\

4. 用Counter

350. Intersection of Two Arrays II
49. Group Anagrams对每个单词统计每个字母出现次数,将这26个数转为tuple,作为map的key
809. Expressive Words将s和word都化为[(c, cnt)]的形式(顺序重要), 再配对比较
825. Friends Of Appropriate Ages1)将年龄用counter合并,对相同年龄一一处理 2)sort[bisect范围最左,bisect最右]
1152. Analyze User Website Visit Patternhash[user]=[(t,city)], 按时间排序后,用set(combinations(cities,3))得到三个城市组合,放入freq中
3404. Count Special Subsequences化为找到pqrs,满足q /p = r /s, 遍历r { 令q = r - 2, 遍历p, ratio[nums[q]/nums[p]] += 1, 遍历 q, res += 满足条件的ratio cnt}, O(n^2)
916. Word Subsets将words2并为一个counter, words1中的每个word逐个与之比较\

5. Design 类

146. LRU Cache

需要支持(1)数据结构变化 => Linked List (2)判断是否存在 => Hash Map (3) 支持删除操作 解决方案:(1)双向链表 + hash{key:Node}, 其中key, value都存在list node中 (2)Singly LinkedList + hash{key:prev node},因为要支持删除操作。

460. LFU Cache可用dll做;但更方便的是存{key:(val, freq)}, {freq:OrderedDict(key,key)},其中OrderedDict可以确保按照least recently used排序
359. Logger Rate Limiter1)用dictionary of message, O(1)时间,O(m)空间。 2)当message很多时,用queue + set, 每次有new message, left pop出queue, 前端过时msg, 更新set;在判断新的msg是否在set中,决定是否加入O(n),O(n)
380. Insert Delete GetRandom O(1)记录list[], map[val]=index;删除时与尾部交换
1244. Design A Leaderboard存mp[player]=score, 每次topK建立min heap, O(nlgk)
1396. Design Underground System存check[id]=time, total[(start,end)]=(totalTime, totalCnt)]
710. Random Pick with Blacklist

产生randint in [0, n - len(blacklist)], 建立map[blacklist number]-> number(in [len(blk), n]), 这样如果随机数正好是blacklist的,则map到[len(blk), n];建立map时跳过[len(blk), n]中的黑名单数字。 1348. Tweet Counts Per Frequency提前设好start-endTime有几个slots,遍历time[tweetName]的时间,对应的time slot += 1\

6. Sparse Vector

1570. Dot Product of Two Sparse Vectors
311. Sparse Matrix Multiplication 用dict(dict)存compressed matrix. 对于每个mat1中的非零元素m1[i1][j1],for j2 in m2[j1]: res[i1][j2]+=v1*v2, O(mkn)\