Alternate Data Access Methods - ghrgriner/anki-stats GitHub Wiki

Accessing the SQLite Database Directly

This page discusses the different ways to extract the cards and review data needed to generate the statistical reports. The initial implementation of this package only supported input in text file format. However, the run-time of the export is relatively slow.

The run-time can be improved by following the suggestion in the Anki Manual and accessing the SQLite database directly. This also allows calculation of FSRS retrievability that matches the figure in the statistics window.

These two options are available by setting INPUT_MODE = INPUT_MODE_TEXT or INPUT_MODE = INPUT_MODE_SQLITE in parameters.py. Please note the warning in parameters.py.

# We recommend using INPUT_MODE_TEXT. If INPUT_MODE_SQLITE is used, the
# INPUT_FILE must be the path to a collection.anki2 file. We only provide a
# connection function in `db.py` that is meant to open the database read-only,
# but nevertheless, you should not attempt to write to the Anki database as
# this can easily corrupt the collection. Therefore, only use INPUT_MODE_SQLITE
# if you have reviewed the code in this package and are satisfied with the
# database operations being performed. Alternatively, make a copy of your
# `collection.anki2` for use with this package and specify the path to the
# copy.

Some additional comments are as follows:

Anki will lock the database when it uses it, so Anki must be closed before accessing the database externally.
Similarly, if the database is accessed externally and Anki is opened when the connection is open, Anki will give an error (or at least, it did for us while testing).
The Anki documentation warns against directly writing to the database. We therefore open a read-only connection when connecting to the database. In the example code below, using uri=True with ?mode=ro appended to the file name will open the database read-only.
The SQLite database file is named collection.anki2 in a directory whose name is the user's Anki profile name (one database file per profile). Refer to the Anki Manual for the location of these profile directories.
Since pandas is already used in the package, we have also used it for the data extraction.
The package extracts all the records from the tables it accesses and then performs any necessary additional exclusions once extracted. There is a parameter DECK_NAME that can be set to restrict to cards whose deck or original deck are a given name or are in subdecks of such a deck. Users will have to update the code if other filtering is desired.

Below is some simple example code that retrieves and prints some of the tables in an Anki collection. Again, running this on a copy of the collection is recommended.

import sqlite3
import pandas as pd

con = sqlite3.connect("file:/path/to/my/TEST/collection.anki2?mode=ro", uri=True)

# Uncommenting this line should give an error, but if testing, do so on a copy only.
#con.execute("CREATE TABLE test_readonly (test_col)")

df_col = pd.read_sql_query("select * from col", con)
df_revlog = pd.read_sql_query("select * from revlog", con)
df_cards = pd.read_sql_query("select * from cards", con)
df_decks = pd.read_sql_query("select * from decks", con)
df_config = pd.read_sql_query("select * from config", con)
df_deck_config = pd.read_sql_query("select * from deck_config", con)
# an automatic table created by SQLite that lists the DB objects
df_master = pd.read_sql_query("select * from sqlite_master", con)
con.close()

print(df_col)
# etc...

Performance

Exporting a Text File

On our test deck of about 32,000 cards (but only 7,000 of which are not new, suspended or buried) with about 110,000 reviews, the export alone took about 25 seconds with another 9 seconds spent waiting for the browser to select all the cards before the program was run.

Access Using SQLite

The anki_stats.py program took 1-2 seconds to run the extraction and creation of the analysis data sets, with another second to generate and print the tables. The same time was required when the exported text file was used as input to anki_stats.py.

Limitations

There is legacy code in Anki for three different cases when scheduling.

No collection creation local timezone offset stored in the database (V1 scheduler)
Creation local timezone offset stored, but no rollover hour (V2 scheduler - legacy)
Creation local timezone offset stored and rollover hour stored (V2 scheduler - new)

We have included code to handle all three cases, but have only tested the last case.

For limitations of the retrievability output, see the Limitations section of the README of the repository.

Extracting FSRS Parameters

We initially thought the stored FSRS parameters would be necessary for calculating FSRS retrievability, but this is not the case. Retrievability is a relatively simple function of stability and the days elapsed since the last review. However, we include here for reference how to extract the FSRS parameters from the database.

FSRS parameters are stored with other deck configuration parameters as a serialized byte string (protocol buffer) in deck_config.config. The FSRS parameters can be extracted using the following steps:

Download a compiler that can compile the proto3 language. On Linux we used protoc, which is the same one used during the Anki build.
Make a subdirectory called proto and copy deck_config.proto and any of its dependencies (collection.proto, decks.proto, generic.proto, sync.proto) to this directory.
Modify each file to remove anki from the package names and import statements.

...
//package anki.deck_config;
package deck_config;

//import "anki/generic.proto";
import "generic.proto";
//import "anki/collection.proto";
import "collection.proto";
//import "anki/decks.proto";
import "decks.proto";
...

See here for the reason we manually changed these files. This is one way to avoid an error when importing the generated deck_config_pb2.py in our Python program.

Compile the files by running within the proto directory:

> protoc --python_out=. -I. *.proto

Copy the *_pb2.py files to the package directory. These files were generated in the same directory as the .proto files.
Install the google3 and protobuf packages in the Python (virtual) environment.
Extract the data by instantiating a class from the generated metaclass:

import deck_config_pb2 

...

def pb2_test():
    dc_df = db.read_sql_query("select * from deck_config")
    bstr = dc_df.loc[dc_df.name == 'MyDeck', 'config'].values[0]
    print(bstr)
    deck_config = deck_config_pb2.DeckConfig.Config()
    deck_config.ParseFromString(bstr)
    print(deck_config)
    print(deck_config.fsrs_params_5)