Trie Tree - HolmesJJ/Data-Structures-and-Algorithms GitHub Wiki

入门链接

图解

详解

字典树(前缀树) 数据结构与算法：字典树（前缀树） Trie 树构造原理、应用场景与复杂度分析

标准写法

步骤1: 创建TrieNode

public class TrieNode {

    // 英文只有26个字母
    private final static int MAX_SIZE = 26;

    // 当前节点储存的字母
    private char data;

    // 单词是否完结
    private boolean isEnd;

    // 子树（数组形式，和BinaryTree不同，不一定只有两个叶结点）
    private TrieNode[] subtrees;

    // 以当前单词结尾的单词数量
    private int count;

    // 初始化一个结点
    public TrieNode() {
        subtrees = new TrieNode[26];
        isEnd = false;
        count = 0;
    }

    public static int getMaxSize() {
        return MAX_SIZE;
    }

    public char getData() {
        return data;
    }

    public void setData(char data) {
        this.data = data;
    }

    public boolean isEnd() {
        return isEnd;
    }

    public void setEnd(boolean end) {
        isEnd = end;
    }

    public TrieNode[] getSubtrees() {
        return subtrees;
    }

    public void setSubtrees(TrieNode[] subtrees) {
        this.subtrees = subtrees;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }
}

步骤2: 创建TrieTree

public class TrieTree {

    // 插入一个新单词
    public void insert(TrieNode root, String str) {
        if (root == null || str.length() == 0) {
            return;
        }
        // 单词全部转成小写
        str = str.toLowerCase();
        // ASCII码 a ==> 97, b ==> 98
        // chars['a'] = a[97] 转成 chars['a' - 'a'] = a[97 - 97] = a[0]
        // 可以得到a -> 0, b -> 1, c -> 2... 避免浪费空间
        char[] chars = str.toCharArray();
        // 单词的每一个字母插入到一个TireNode中，这里避免使用递归
        for (char ch : chars) {
            // 把26个字母转成0-25之间的数字
            // a是97，b是98... 按顺序排列
            int loc = ch - 'a';
            // 若当前的TireNode是null，就插入
            if (root.getSubtrees()[loc] == null) {
                root.getSubtrees()[loc] = new TrieNode();
                root.getSubtrees()[loc].setData(ch);
            }
            // 每插入一个TireNode，就跳去改TireNode结点，去操作它的子结点
            root = root.getSubtrees()[loc];
        }
        // 遍历完成后，设置一个flag说明这是一个单词
        root.setEnd(true);
        // 以该节点结尾的单词数+1
        root.setCount(root.getCount() + 1);
    }

    // 查找该单词是否存在，如果存在返回数量，不存在返回-1
    public int find(TrieNode root, String str) {
        if(root == null || str.length()==0){
            return -1;
        }
        // 单词全部转成小写
        str = str.toLowerCase();
        // ASCII码 a ==> 97, b ==> 98
        // chars['a'] = a[97] 转成 chars['a' - 'a'] = a[97 - 97] = a[0]
        // 可以得到a -> 0, b -> 1, c -> 2... 避免浪费空间
        char[] chars = str.toCharArray();
        for (char ch : chars) {
            // 把26个字母转成0-25之间的数字
            // a是97，b是98... 按顺序排列
            int loc = ch - 'a';
            // 若当前结点不为空，代表该单词前面部分依然匹配
            if (root.getSubtrees()[loc] != null) {
                root = root.getSubtrees()[loc];
            }
            // 否则，该单词不匹配
            else {
                return -1;
            }
        }
        // 若当前结点是单词的终结，返回数量
        if (root.isEnd()) {
            return root.getCount();
        } else {
            return -1;
        }
    }
}

要点

假设所有字符串长度之和为n，即size of text文本大小，构建字典树的时间复杂度为O(n)
假设要查找的字符串长度为k，查找的时间复杂度为O(k)，若查找不到，甚至更快，因为匹配不下去就会直接返回false，（查找与字符串的数量无关，与文本的大小无关）
假设S为字符集（最大子结点数），L为字符串最长长度（最大层数），空间复杂度达到O(S^L)，上面的写法是26^L（26个英文字母），若经过优化没有浪费的数组空间，就是O(size of text)，或O((size of text)*overhead)，size of text代表文本大小，overhead代表每个结点中储存的指针，flag等额外数据，然而O(size of text)也是在文本中每个单词的重合度都很低的情况下才会达到
Ascii字符集: 256，即TrieNode children[] = new TrieNode[256];，非常耗费空间
树的degree度是固定的

与传统方法比较

传统方法的时间复杂度：比较两个字符串的时间复杂度是min(A.length, B.length)为L，该树的高度为h，总时间复杂度为O(hL)

适用场景

统计，查找搜索引擎中用于分词，词频统计（TF/IDF），自动补全机制，压缩算法等

字符串字典
- 搜索
- 排序/枚举字符串（每个子树数组都是排好序的）
部分字符串操作:
- 前序查询Prefix queries: 查找所有以pi开头的字符串
- 最长前缀Long prefix: "pickling"中最长的前缀是什么
- 通配符Wildcards: 找到形式为"pi??le"的字符串