搜索算法 - HongZhaoHua/jstarcraft-rns GitHub Wiki

搜索引擎教程


搜索引擎解决什么问题

专门解决搜索领域的海量结构化,半结构化,非结构化数据的实时搜索问题.

Lucene与Elasticsearch, Nutch, Solr之间的关系


搜索的分类

索引分为三类:

  • 基于树的索引(tree-based index)

  • 基于哈希的索引(hash-based index)

  • 基于倒排的索引(inverted-based index)

搜索结构化数据

索引原理:对列值排序存储,数据结构={列值,行地址}.利用二分查找找到行地址,再根据行地址获取行数据.

搜索非结构化数据

  • 顺序扫描

  • 倒排索引


搜索的过程

retrieval

索引过程(Index)

构建文档->分析文档->索引文档

检索过程(Retrieve)

构建查询->分析查询->检索查询

分析过程(Tokenize)

分词(Tokenization)->词元(Token)->语言处理->词项(Term)

对于英语,语言处理分为以下两步:
    将单词缩减为词根形式(drivers->driver),称为Stemming;
    将单词转变为词根形式(drove->drive),称为:Lemmatization;

Stemming和Lemmatization的异同:
* 相同之处
    Stemming和Lemmatization都要使词汇成为词根形式.
* 不同之处:
    Stemming基于规则(drivers->driver,driving->driver)
    Lemmatization基于词典(drove->drive,driving->driver)

文档与字段

名称 说明
IntPoint
FloatPoint
LongPoint
DoublePoint
BinaryDocValuesField 单值
SortedDocValues 单值
SortedSetDocValues 多值
NumericDocValuesField 单值
SortedNumericDocValuesField 多值
StringField
TextField
StoredField

分词

评估资源icwb2-data中包含了 (Academia Sinica, City University, Peking University, Microsoft Research)四家单位的训练集,测试集.

在统一环境下评估若干分词软件,使用的模型/词库为各分词软件自带.


索引


查询

测试查询

名称 说明
        // 全部匹配查询
        Query query = new MatchAllDocsQuery();
        TopDocs search = searcher.search(query, 1000000);
        Assert.assertEquals(1681, search.totalHits.value);
        // 全不匹配查询
        Query query = new MatchNoDocsQuery();
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(0, search.totalHits.value);

词项查询

名称 说明
        // 词项查询
        Query query = new TermQuery(new Term("title", "Toy"));
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(1, search.totalHits.value);
        // 范围查询
        Query query = new TermRangeQuery("title", new BytesRef("Toa"), new BytesRef("Toz"), true, true);
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(22, search.totalHits.value);
        // 前缀查询
        PrefixQuery query = new PrefixQuery(new Term("title", "Touc"));
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(2, search.totalHits.value);
        // 通配符查询
        {
            // *代表0个或者多个字母
            Query query = new WildcardQuery(new Term("title", "*ouc*"));
            TopDocs search = searcher.search(query, 1000);
            Assert.assertEquals(2, search.totalHits.value);
        }
        {
            // ?代表0个或者1个字母
            Query query = new WildcardQuery(new Term("title", "?ouc?"));
            TopDocs search = searcher.search(query, 1000);
            Assert.assertEquals(2, search.totalHits.value);
        }
        // 正则查询
        RegexpQuery query = new RegexpQuery(new Term("title", "To[a-z]"));
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(7, search.totalHits.value);
        // 模糊查询
        Query query = new FuzzyQuery(new Term("title", "Stori"));
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(32, search.totalHits.value);

短语查询

名称 说明
        // 短语查询
        // 设置短语之间的跨度为2,也就是说Story和The之间的短语小于等于均可检索
        PhraseQuery build = new PhraseQuery.Builder().setSlop(2).add(new Term("title", "Story")).add(new Term("title", "The")).build();
        TopDocs search = searcher.search(build, 1000);
        Assert.assertEquals(2, search.totalHits.value);
        // 多短语查询
        Term[] terms = new Term[] { new Term("title", "NeverEnding"), new Term("title", "Xinghua,") };
        Term term = new Term("title", "The");
        // add之间认为是OR操作,即"NeverEnding", "Xinghua,"和"The"之间的slop不大于3
        MultiPhraseQuery multiPhraseQuery = new MultiPhraseQuery.Builder().add(terms).add(term).setSlop(3).build();
        TopDocs search = searcher.search(multiPhraseQuery, 1000);
        Assert.assertEquals(2, search.totalHits.value);
        // 相当于TermQuery,区别是使用SpanTermQuery可以得到词项的跨度信息
        Query query = new SpanTermQuery(new Term("title", "Toy"));
        TopDocs search = searcher.search(query, 1000);
        Assert.assertEquals(1, search.totalHits.value);
        // 临近查询(匹配域中[0,n]范围内的词项)
        SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Story"));
        SpanFirstQuery firstQuery = new SpanFirstQuery(spanQuery, 5);
        TopDocs search = searcher.search(firstQuery, 1000);
        Assert.assertEquals(5, search.totalHits.value);
        // 跨度查询
        SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
        {
            // 不考虑顺序的情况
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, false);
            TopDocs search = searcher.search(nearQuery, 1000);
            Assert.assertEquals(3, search.totalHits.value);
        }
        {
            // 考虑顺序的情况
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
            TopDocs search = searcher.search(nearQuery, 1000);
            Assert.assertEquals(1, search.totalHits.value);
        }
        {
            // 考虑间隔的情况
            {
                SpanNearQuery.Builder builder = SpanNearQuery.newOrderedNearQuery("title");
                builder.addClause(spanQueries[0]).addGap(2).setSlop(5).addClause(spanQueries[1]);
                SpanNearQuery nearQuery = builder.build();
                TopDocs search = searcher.search(nearQuery, 1000);
                Assert.assertEquals(1, search.totalHits.value);
            }
            {
                SpanNearQuery.Builder builder = SpanNearQuery.newOrderedNearQuery("title");
                builder.addClause(spanQueries[0]).addGap(3).setSlop(5).addClause(spanQueries[1]);
                SpanNearQuery nearQuery = builder.build();
                TopDocs search = searcher.search(nearQuery, 1000);
                Assert.assertEquals(0, search.totalHits.value);
            }
        }
        SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
        {
            SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Angels"));
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
            SpanNotQuery notQuery = new SpanNotQuery(nearQuery, spanQuery);
            TopDocs search = searcher.search(notQuery, 1000);
            Assert.assertEquals(1, search.totalHits.value);
        }
        {
            SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Day"));
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
            SpanNotQuery notQuery = new SpanNotQuery(nearQuery, spanQuery);
            TopDocs search = searcher.search(notQuery, 1000);
            Assert.assertEquals(0, search.totalHits.value);
        }
        SpanQuery[] spanQueries = new SpanQuery[] { new SpanTermQuery(new Term("title", "The")), new SpanTermQuery(new Term("title", "Story")) };
        SpanNotQuery leftQuery = null;
        {
            SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Angels"));
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
            leftQuery = new SpanNotQuery(nearQuery, spanQuery);
        }
        SpanNotQuery rightQuery = null;
        {
            SpanQuery spanQuery = new SpanTermQuery(new Term("title", "Day"));
            SpanNearQuery nearQuery = new SpanNearQuery(spanQueries, 5, true);
            rightQuery = new SpanNotQuery(nearQuery, spanQuery);
        }
        SpanOrQuery orQuery = new SpanOrQuery(new SpanQuery[] { leftQuery, rightQuery });
        TopDocs search = searcher.search(orQuery, 1000);
        Assert.assertEquals(1, search.totalHits.value);

数值查询

名称 说明
        // 精确查询
        Query exactQuery = IntPoint.newExactQuery("id", 1);
        TopDocs search = searcher.search(exactQuery, 1000);
        Assert.assertEquals(1, search.totalHits.value);
        // 范围查询
        Query rangeQuery = IntPoint.newRangeQuery("id", 501, 1000);
        TopDocs search = searcher.search(rangeQuery, 1000);
        Assert.assertEquals(500, search.totalHits.value);
        // 集合查询
        Query setQuery = IntPoint.newSetQuery("id", 1, 10, 100, 1000);
        TopDocs search = searcher.search(setQuery, 1000);
        Assert.assertEquals(4, search.totalHits.value);

组合查询

名称 说明

功能查询

名称 说明

解析查询

名称 说明