Lucene - cllu/.rc GitHub Wiki
This article illustrates the usage of Lucene, a popular and mature package for information retrieval. I use it heavily in my research, so this article is biased toward modifying it for use as a research tool.
The code is written in Scala.
How it works
Both indexing and retrieval rely on the Analyzer, which handles parsing (text/XML/HTML/PDF, etc.), tokenization, stemming, and so on.
Another key component is the Similarity, which collects term statistics during indexing.
val luceneVersion = Version.LUCENE_46
val analyzer = new EnglishAnalyzer(luceneVersion)
val similarity = new LMDirichletSimilarity(200)
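As an aside (not from the original article), LMDirichletSimilarity(200) scores with the Dirichlet-smoothed language model of Zhai and Lafferty, using μ = 200. A minimal sketch of the per-term formula in plain Scala, where tf is the term frequency in the document, docLen the document length, and p the term's collection probability (Lucene additionally clamps negative scores to zero):

```scala
// Dirichlet-smoothed language-model score for one (term, document) pair.
// tf: term frequency in the document, docLen: document length,
// p: P(term | collection), mu: Dirichlet smoothing parameter.
def lmDirichletScore(tf: Float, docLen: Float, p: Float, mu: Float = 200f): Float =
  (math.log(1 + tf / (mu * p)) + math.log(mu / (docLen + mu))).toFloat
```

Larger μ smooths more aggressively toward the collection model, which favors longer documents; μ around the average document length is a common starting point.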
Another common variable is the index location, of class Directory:
val indexDirectory: Directory = new NIOFSDirectory(new File("index"))
// or create an in-memory index (do not use it for huge indexes)
//val indexDirectory = new RAMDirectory()
Indexing
val indexWriterConfig = new IndexWriterConfig(luceneVersion, analyzer)
// use the same Similarity at index time so the stored statistics match
indexWriterConfig.setSimilarity(similarity)
val indexWriter = new IndexWriter(indexDirectory, indexWriterConfig)
val doc = new Document()
val idField = new StringField("_id", "007", Field.Store.YES)
val textField = new TextField("text", "example text", Field.Store.NO)
// the fields must be added to the document before indexing it
doc.add(idField)
doc.add(textField)
indexWriter.addDocument(doc)
indexWriter.close()
Query Parsing
Lucene provides built-in query parsing functions. However, here we illustrate the process by building a simple parser.
def parseQuery(analyzer: Analyzer, text: String): Query = {
  // build a boolean OR query over the analyzed terms
  val query = new BooleanQuery()
  val tokenStream = analyzer.tokenStream("text", text)
  val termAtt = tokenStream.addAttribute(classOf[CharTermAttribute])
  tokenStream.reset()
  while (tokenStream.incrementToken()) {
    val token = termAtt.toString
    // skip single-character tokens and additional stopwords
    // (moreStopwords is a user-defined Set[String] of extra stopwords)
    if (token.length > 1 && !moreStopwords.contains(token)) {
      val term = new Term("text", token)
      query.add(new TermQuery(term), BooleanClause.Occur.SHOULD)
    }
  }
  tokenStream.end()
  tokenStream.close()
  query
}
Retrieval
Retrieval is performed by the IndexSearcher.search() method.
val reader = DirectoryReader.open(indexDirectory)
val searcher = new IndexSearcher(reader)
searcher.setSimilarity(similarity)
val numResultsPerQuery = 10 // number of hits to keep per query
val collector = TopScoreDocCollector.create(numResultsPerQuery, true)
val query = parseQuery(analyzer, "example query")
searcher.search(query, collector)
val hits = collector.topDocs().scoreDocs
for (hit <- hits) {
  val score = hit.score
  val docId = searcher.doc(hit.doc).getField("_id").stringValue()
  println(s"$docId $score")
}
Cleanup
The reader and the indexDirectory should be closed when done.
reader.close()
indexDirectory.close()
Internals
Similarity
The Similarity class controls normalization and weighting.
At index time, the indexer calls the computeNorm(FieldInvertState) method to store a per-document value for each field, which can later be retrieved via AtomicReader#getNormValues(String).
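For intuition (an illustrative aside, not from the original article), the classic length norm computed by DefaultSimilarity is essentially 1/√(number of terms in the field), times the field boost; the stored value is further compressed into a single byte, which loses precision. Ignoring boost and encoding:

```scala
// Classic length norm: shorter fields get larger norms, so matches in
// short fields score higher. The byte encoding (omitted here) is lossy.
def lengthNorm(numTerms: Int): Float =
  (1.0 / math.sqrt(numTerms.toDouble)).toFloat
```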
At query time:
- Before the query runs, computeWeight(float, CollectionStatistics, TermStatistics...) is called to compute collection-level statistics, encoded in a Similarity.SimWeight object.
- When analyzing queries, Similarity.SimWeight#getValueForNormalization() is called for each query leaf node, and Similarity#queryNorm(float) is called for the top-level query node. These values are used in Similarity.SimWeight#normalize(float, float) to get the normalized value.
- When performing retrieval, simScorer(SimWeight, AtomicReaderContext) is called to get a SimScorer object, which is used to compute the document score.
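As an illustrative sketch (assuming the default TF-IDF behavior; other Similarity implementations may simply return 1.0), queryNorm is typically 1/√(sum of squared weights), which makes scores comparable across queries:

```scala
// Default-style query normalization factor.
def queryNorm(sumOfSquaredWeights: Float): Float =
  (1.0 / math.sqrt(sumOfSquaredWeights.toDouble)).toFloat
```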
Query
When we call IndexSearcher.search(Query, Collector), it creates a normalized Weight by calling IndexSearcher.createNormalizedWeight(Query):
/**
 * Creates a normalized weight for a top-level {@link Query}.
 * The query is rewritten by this method and {@link Query#createWeight} called,
 * afterwards the {@link Weight} is normalized. The returned {@code Weight}
 * can then directly be used to get a {@link Scorer}.
 * @lucene.internal
 */
public Weight createNormalizedWeight(Query query) throws IOException {
  query = rewrite(query);
  Weight weight = query.createWeight(this);
  float v = weight.getValueForNormalization();
  float norm = getSimilarity().queryNorm(v);
  if (Float.isInfinite(norm) || Float.isNaN(norm)) {
    norm = 1.0f;
  }
  weight.normalize(norm, 1.0f);
  return weight;
}
The returned Weight is then used to create a Scorer, as in IndexSearcher.search(List<AtomicReaderContext>, Weight, Collector):
Scorer scorer = weight.scorer(ctx, !collector.acceptsDocsOutOfOrder(), true, ctx.reader().getLiveDocs()); // line 618 in IndexSearcher.java
and then the score method of the Scorer class performs the scoring.
scorer.score(collector); //line 621 in IndexSearcher.java
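Conceptually, score(collector) just walks the matching documents and hands each doc id to the collector. A toy, self-contained sketch of that loop (the class and names here are illustrative stand-ins, not Lucene's):

```scala
// Toy stand-in for a Lucene Scorer: iterates a fixed list of matching doc ids.
val NO_MORE_DOCS = Int.MaxValue // Lucene uses DocIdSetIterator.NO_MORE_DOCS

class ToyScorer(docs: Array[Int]) {
  private var i = -1
  def nextDoc(): Int = { i += 1; if (i < docs.length) docs(i) else NO_MORE_DOCS }
}

// The score(collector) loop, simplified: collect every matching document.
val collected = scala.collection.mutable.ArrayBuffer[Int]()
val scorer = new ToyScorer(Array(2, 5, 9))
var doc = scorer.nextDoc()
while (doc != NO_MORE_DOCS) {
  collected += doc // a real collector would also read scorer.score()
  doc = scorer.nextDoc()
}
```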
Implement New Retrieval Model
The ranking functions are defined in Similarity subclasses, so we just need to provide our own implementation of Similarity to use a custom retrieval model. The Similarity class drives both the indexing and searching processes.
To make this easier, Lucene provides a SimilarityBase class, which already implements many of the methods and exposes a much simpler interface.
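As a sketch of what such a subclass can look like (the ranking formula here is an illustrative, untuned TF-IDF variant, not a recommended model), in Lucene 4.x SimilarityBase only requires overriding score and toString:

```scala
import org.apache.lucene.search.similarities.{BasicStats, SimilarityBase}

// Illustrative custom retrieval model: length-normalized TF times smoothed IDF.
// SimilarityBase calls score() once per matching (term, document) pair.
class MySimilarity extends SimilarityBase {
  override protected def score(stats: BasicStats, freq: Float, docLen: Float): Float = {
    val tf  = freq / docLen
    val idf = math.log((stats.getNumberOfDocuments + 1.0) / (stats.getDocFreq + 1.0))
    (tf * idf).toFloat
  }
  override def toString: String = "MySimilarity"
}
```

Pass an instance to both indexWriterConfig.setSimilarity(...) and searcher.setSimilarity(...), so that indexing and searching agree on the model.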