Tutorial_English - fujimizu/stupa GitHub Wiki

Introduction

Stupa is an associative search engine. You can search related documents with high performance and high precision. Since document data and inverted indexes are kept in memory, stupa reflects updates of documents in search results in real time.

A server implementation of Stupa is possible by using Thrift.

What's new

2010/06/30: stupa-0.1.2 Released
- modified the install path of header files
2010/04/19: stupa-0.1.1, stupa-thrift-0.1.1 Released

License

GPL2 (Gnu General Public License Version 2)

Download

Please download the latest version.

Installation

C++ library of Stupa

Download the latest source package, and type the following.

I recommend that you should install google-sparsehash in advance to run stupa faster.

% tar xvzf stupa-*.*.*.tar.gz
% cd stupa-*.*.*
% ./configure
% make
% make check
% sudo make install

Server implementation using Thrift

Download the latest source package, and type the following. You must install C++ library of Stupa and Thrift in advance.

% tar xvzf stupa-thrift-*.*.*.tar.gz
% cd stupa-thrift-*.*.*
% make

Format of input data

Stupa can read two formats of input data, tab-separated text format files and binary format files. In this section I show the details about text format, and binary format is described in command-line tools section.

List of input documents

The file of the list of input documents needs to be in a tab-separated text format as bellow. Each line must contain the information about only one document, the identifier of a document followed by the keywords in the document. Blank lines must not be inserted.

document_id1 \t key1-1 \t key1-2 \t key1-3 \t ...\n
document_id2 \t key2-1 \t key2-2 \t key2-3 \t ...\n
...

document_id: This is the identifier of a document, and must be the unique string which contains no tab characters.
key: This is the keyword in a document, and must be the string which contains no tab characters.

key needs to reflect the degree of the feature appropriately. So, you should apply weighting scheme (e.g. tf-idf) to input data, and pick up the characteristic keywords in each document. And the number of the keywords of each document should be nearly equal.

Examples of input data

Alex   Pop      R&B      Rock
Bob    Jazz     Reggae
Dave   Classic  World    
Ted    Jazz     Metal    Reggae
Fred   Pop      Rock     Hip-hop
Sam    Classic  Rock

In the above example, 'Alex', 'Bob', 'Ted' are the identifiers of the documents, and 'Alex' has the features as the genres of music (Pop, R&B, Rock) which he often listens.

Usage of command line tools

You can search related documents interactively, and convert tab-separated text format files to binary format files.

stpctl: Stupa Search utility

Usage:
 % stpctl search [-b][-f] file [invsize]
 % stpctl save [-b] infile outfile [invsize]
    -b        read binary format file
    -f        search by feature strings
              (default: search by document identifier strings)
    invsize   maximum size of inverted indexes (default:100)

Search related documents interactively

You can search related documents interactively when you specify search option. By default Stupa reads input data as a text format file. When you specify -b option, Stupa reads input data as a binary format file.

% stpctl search data/test1.tsv 
Reading input documents (Text, invsize:100) ... 6 documents
Query>

You can specify the maximum size of each posting list of inverted indexes. If you set a small value, the search time might be shortened, but old documents might rarely be in the search results.

% stpctl search data/test1.tsv 3
Reading input documents (Text, invsize:3) ... 6 documents
Query>

Maximum size of posting lists

I show the maximum size of posting lists of inverted indexes in detail. Stupa keeps inverted indexes to search related documents quickly. For each keyword, inverted indexes keep a posing list which contains the list of document identifiers whose document contains the keyword.

key1  =>  document_id1  document_id2  document_id3 ...
key2  =>  document_id4  document_id5  document_id5 ...
...

You can specify the maximum size of posting lists. When documents are added beyond the maximum size of posing lists, Stupa removes the oldest documents.

Search by document identifiers

By default Stupa searches related documents by documents identifiers.

When you specify a document identifier after "Query>" prompt, pairs of document identifiers which are related to query documents, and points are displayed. The maximum number of search results is 20. And you can specify multiple document identifiers using tab delimiters.

% cat data/test1.tsv
Alex   Pop      R&B     Rock
Bob    Jazz     Reggae
Dave   Classic  World
Ted    Jazz     Metal   Reggae
Fred   Pop      Rock    Hip-hop
Sam    Classic  Rock

% stpctl search data/test1.tsv
Read input documents ...
Query> Alex
Alex    2.995732
Fred    1.609438
Sam     0.693147
(search time: 0.03ms)

Query> Dave
Dave    2.302585
Sam     0.916291
(search time: 0.02ms)

If you specify the document identifiers which are not included in input data, no search results are displayed.

Search by feature identifiers

With -f option, you can search related documents by keywords in input documents.

% stpctl -f search data/test1.tsv 
Reading input documents (Text, invsize:100) ... 6 documents
Query> Rock
Sam     0.693147
Fred    0.693147
Alex    0.693147
(search time: 0.06ms)

Query> Classic  Rock
Sam     1.609438
Dave    0.916291
Fred    0.693147
Alex    0.693147
(search time: 0.05ms)

When you search related documents by document identifiers, query documents must be added to Stupa preliminarily. If you want to search using documents which aren't added to Stupa yet, you can search related documents using the keywords in input documents as a query.

Save input documents to a binary format file

Reading text format files needs some processes such as parsing text format file, adding identifiers to inverted indexes, and so on. So you should use binary format files if you often use same large scale files as input. To use binary format files, convert text format files into binary format files using save option. The arguments of save option are the path of input file, the path of output file, the maximum size of posting lists of inverted indexes.

% stpctl save input.tsv output_binary 100

A server implementation of Stupa reads only binary format files.

Server implementation using Thrift

A server implementation of Stupa using Thrift contains ThreadPoolServer using a thread pool schema, and NonblockingServer using non-blocking I/O schema. NonblockingServer may be more high performance.

Start up server

Start up ThreadPoolServer

% ./stupa_thread

Start up NonblockingServer

% ./stupa_nonblock

Start-up options

You can specify the following options on both ThreadPoolServer and NonblockingServer.

-p port     port number (default:9090)
-d num      maximum number of keeping documents (default: no limit)
-i size     maximum size of inverted indexes (default:100)
-w nworker  number of worker threads (default:4)
-f file     load a file (binary format file only)
-h          show help message

Save status

Stupa keeps document data and inverted indexes in memory. Therefore, their data will be cleared when Stupa server is stopped. Calling 'save' method from clients, the status is saved to a binary format file. Saved data can be used as input of Stupa server start-up, the command-line search tool, and 'load' method.

Perl client

ThreadPoolServer

% ./client_sample.pl --buffered
Add: Alex       Pop, R&B, Rock
Add: Bob        Jazz, Reggae
Add: Dave       Classic, World
Add: Ted        Jazz, Metal, Reggae
Add: Fred       Pop, Rock, Hip-hop
Add: Sam        Classic, Rock
Size: 6
Search Result (by document ids):
 1      Fred    2.995732
 2      Alex    1.609438
 3      Sam     0.693147
Search Result (by feature ids):
 1      Fred    1.609438
 2      Alex    1.609438
 3      Sam     0.693147

NonblockingServer

% ./client_sample.pl --framed
Add: Alex       Pop, R&B, Rock
Add: Bob        Jazz, Reggae
Add: Dave       Classic, World
Add: Ted        Jazz, Metal, Reggae
Add: Fred       Pop, Rock, Hip-hop
Add: Sam        Classic, Rock
Size: 6
Search Result (by document ids):
 1      Fred    2.995732
 2      Alex    1.609438
 3      Sam     0.693147
Search Result (by feature ids):
 1      Fred    1.609438
 2      Alex    1.9438
 3      Sam     0.693147

Make client libraries in other programming languages

The definition of Thrift interfaces of Stupa is included in "stupa.thrift" file. You can make client libraries in other programming languages using this definition file.

For details, see Thrift documents.