Lucene

This article illustrates the usage of Lucene, a popular and mature library for information retrieval. I use it heavily in my research, so this article is biased toward adapting it for use as a research tool.

The code examples are written in Scala.

How it works

Both indexing and retrieval rely on the Analyzer, which handles parsing (text/XML/HTML/PDF, etc.), tokenization, stemming, and so on. The other key component is the Similarity, which collects term statistics during indexing and defines the scoring function used at query time.

val luceneVersion = Version.LUCENE_46
val analyzer = new EnglishAnalyzer(luceneVersion)
val similarity = new LMDirichletSimilarity(200)

Another common variable is the index location, an instance of the Directory class:

val indexDirectory: Directory = new NIOFSDirectory(new File("index"))
// or create an in-memory index (do not use it for huge indexes)
//val indexDirectory = new RAMDirectory()

Indexing

val indexWriterConfig = new IndexWriterConfig(luceneVersion, analyzer)
indexWriterConfig.setSimilarity(similarity)  // use the same Similarity at index time
val indexWriter = new IndexWriter(indexDirectory, indexWriterConfig)
val doc = new Document()
val idField = new StringField("_id", "007", Field.Store.YES)
val textField = new TextField("text", "example text", Field.Store.NO)
doc.add(idField)
doc.add(textField)
indexWriter.addDocument(doc)
indexWriter.close()  // commit the changes and release the write lock

Query Parsing

Lucene provides built-in query parsing functions. However, here we illustrate the process by building a simple parser ourselves.

def parseQuery(analyzer: Analyzer, text: String,
               moreStopwords: Set[String] = Set.empty): Query = {
  // build a disjunction of term queries from the analyzed tokens
  val query = new BooleanQuery()
  val tokenStream = analyzer.tokenStream("text", text)
  val termAtt = tokenStream.addAttribute(classOf[CharTermAttribute])
  tokenStream.reset()
  while (tokenStream.incrementToken()) {
    val token = termAtt.toString
    // skip single-character tokens (mostly punctuation) and extra stop words
    if (token.length > 1 && !moreStopwords.contains(token)) {
      val term = new Term("text", token)
      query.add(new TermQuery(term), BooleanClause.Occur.SHOULD)
    }
  }
  tokenStream.end()
  tokenStream.close()

  query
}

Retrieval

Retrieval is performed through the IndexSearcher.search() method.

val numResultsPerQuery = 10  // number of hits to keep per query
val reader = DirectoryReader.open(indexDirectory)
val searcher = new IndexSearcher(reader)
searcher.setSimilarity(similarity)
val collector = TopScoreDocCollector.create(numResultsPerQuery, true)
val query = parseQuery(analyzer, "example query")
searcher.search(query, collector)
val hits = collector.topDocs().scoreDocs
for (hit <- hits) {
  val score = hit.score
  val docId = searcher.doc(hit.doc).getField("_id").stringValue()
  println(s"$docId\t$score")
}

Cleanup

Finally, the reader and the indexDirectory should be closed.

reader.close()
indexDirectory.close()

Internals

Similarity

The Similarity class determines how normalization and weighting are computed.

At index time, the indexer calls the computeNorm(FieldInvertState) method to compute and store a per-document normalization value for the field, which can later be retrieved using AtomicReader#getNormValues(String).
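
As a small illustration, here is a sketch of reading those stored norms back; it assumes the reader opened in the Retrieval section is still open and that the "text" field was indexed with norms:

val leaves = reader.leaves()
for (i <- 0 until leaves.size()) {
  val norms = leaves.get(i).reader().getNormValues("text") // null if the field has no norms
  if (norms != null)
    println(s"encoded norm of the first doc in segment $i: ${norms.get(0)}")
}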

At query time,

  • Before searching, computeWeight(float, CollectionStatistics, TermStatistics...) is called to compute collection-level statistics, which are encoded in a Similarity.SimWeight object.
  • While the query is being normalized, Similarity.SimWeight#getValueForNormalization() is called for each leaf query node and Similarity#queryNorm(float) is called for the top-level query node; these values are then passed to Similarity.SimWeight#normalize(float, float) to produce the normalized weight.
  • During retrieval, simScorer(SimWeight, AtomicReaderContext) is called to obtain a SimScorer object, which is used to compute the score of each matching document.

Query

When we call IndexSearcher.search(Query, Collector), it will create a normalized Weight by calling IndexSearcher.createNormalizedWeight(Query):

/**
 * Creates a normalized weight for a top-level {@link Query}.
 * The query is rewritten by this method and {@link Query#createWeight} called,
 * afterwards the {@link Weight} is normalized. The returned {@code Weight}
 * can then directly be used to get a {@link Scorer}.
 * @lucene.internal
 */
public Weight createNormalizedWeight(Query query) throws IOException {
  query = rewrite(query);
  Weight weight = query.createWeight(this);
  float v = weight.getValueForNormalization();
  float norm = getSimilarity().queryNorm(v);
  if (Float.isInfinite(norm) || Float.isNaN(norm)) {
    norm = 1.0f;
  }
  weight.normalize(norm, 1.0f);
  return weight;
}

The returned Weight is then used to create a Scorer, as in IndexSearcher.search(List<AtomicReaderContext>, Weight, Collector):

Scorer scorer = weight.scorer(ctx, !collector.acceptsDocsOutOfOrder(), true, ctx.reader().getLiveDocs()); //line 618 in IndexSearcher.java

and then the score method of the Scorer is used to do the actual scoring:

scorer.score(collector); //line 621 in IndexSearcher.java

Implementing a New Retrieval Model

The ranking functions are defined in Similarity subclasses, so to use a custom retrieval model we only need to provide our own Similarity implementation. The Similarity class drives both the indexing and the searching process. To make this easier, Lucene provides the SimilarityBase class, which already implements most of the machinery and exposes a much simpler interface.
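
As a minimal sketch (assuming Lucene 4.6's SimilarityBase API, with a toy TF-IDF-style formula chosen purely for illustration), a custom model only needs to override score and toString:

import org.apache.lucene.search.similarities.{BasicStats, SimilarityBase}

class ToySimilarity extends SimilarityBase {
  // score one (term, document) pair: raw term frequency times a smoothed log-IDF
  override protected def score(stats: BasicStats, freq: Float, docLen: Float): Float = {
    val idf = math.log(1.0 + (stats.getNumberOfDocuments + 1.0) / (stats.getDocFreq + 1.0))
    (freq * idf).toFloat
  }

  override def toString: String = "ToySimilarity"
}

To take effect, the custom similarity has to be set on both sides, e.g. indexWriterConfig.setSimilarity(new ToySimilarity()) before indexing and searcher.setSimilarity(new ToySimilarity()) before searching.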
