Home - NPLPackages/TableDB GitHub Wiki
On the basis of NPL Raft implementation, TableDB Raft implementation will be much easier. But the implementation can differ much, with the compare between actordb and rqlite.
- rqlite is easy, it simply use SQL statement as Raft Log Entry.
- actordb goes more complicate:
Actors are replicated using the Raft distributed consensus protocol. Raft requires a write log to operate. Because our two engines are connected through the SQLite WAL module, Raft replication is a natural fit. Every write to the database is an append to WAL. For every append we send that data to the entire cluster to be replicated. Pages are simply inserted to WAL on all nodes. This means the leader executes the SQL, but the followers just append to WAL. 
utilize the msg in IORequest:Send, the log entry looks like below:
function RaftLogEntryValue:new(query_type, collection, query) 
    local o = {
      query_type = query_type, 
      collection = collection:ToData(),
      query = query, 
      cb_index = index,
      serverId = serverId,
    };
    setmetatable(o, self);
    return o;
endand commint in state machine looks like below:
--[[
 * Commit the log data at the {@code logIndex}
 * @param logIndex the log index in the logStore
 * @param data 
 ]]--
function RaftTableDB:commit(logIndex, data)
    -- data is logEntry.value
    local raftLogEntryValue = RaftLogEntryValue:fromBytes(data);
    local cbFunc = function(err, data)
        local msg = {
            err = err,
            data = data,
            cb_index = raftLogEntryValue.cb_index,
        }
        -- send Response
        RTDBRequestRPC(nil, raftLogEntryValue.serverId, msg)
    end;
    collection[raftLogEntryValue.query_type](collection, raftLogEntryValue.query.query,
            raftLogEntryValue.query.update or raftLogEntryValue.query.replacement,
            cbFunc);
    self.commitIndex = logIndex;
endthe client interface keeps unchanged, we provide a script/TableDB/RaftSqliteStore.lua, This also need to add StorageProvider:SetStorageClass(raftSqliteStore) method to StorageProvider. RaftSqliteStore will send the Log Entry above to the raft cluster in each interface and could also consider consistency levels.
Like the original interfaces, callbacks is also implemented in a async way. But we make the connect to be sync to alleviate the effect.
Like rqlite, we use sqlite's Online Backup API to make snapshot.
Raft core logic is in RaftServer. TableDB is implemented as Raft Statemachine, see RaftTableDBStateMachine.
below is Figure1 in the raft thesis. It tells the process of a Raft replicated statemachine architechture.

TableDBApp:App -> create RpcListener && TableDB StateMachine -> set RaftParameters -> create RaftContext -> RaftConsensus.run
	RaftConsensus.run -> create RaftServer -> start TableDB StateMachine -> RpcListener:startListening
		RaftServer: has several PeerServers, and ping heartbeat at small random interval when it is the leader
		RpcListener: at Raft level, route message to RaftServer
		TableDB StateMachine: TableDB level, call messageSender to send message to RaftServerTableDBApp:App -> create RaftSqliteStore && TableDB StateMachine -> create RaftClient -> RaftSqliteStore:setRaftClient -> send commands to the cluster
	commands: appendEntries, addServer, removeServer.
	send commands to the clusterοΌwill retry
	RaftSqliteStore: use RaftClient send various commands to Cluster and handle response.
	TableDB StateMachine: send response to the client when the command is committed.below is the structure of a server directory
.
βββ assets.log
βββ cluster.json
βββ config.properties
βββ log.txt
βββ server.state
βββ snapshot
β   βββ 12-1_User_s.db
β   βββ 12.cnf
β   βββ 18-1_User_s.db
β   βββ 18.cnf
β   βββ 24-1_User_s.db
β   βββ 24.cnf
β   βββ 6-1_User_s.db
β   βββ 6.cnf
βββ store.data
βββ store.data.bak
βββ store.idx
βββ store.idx.bak
βββ store.sti
βββ store.sti.bak
βββ temp
    βββ test_raft_database
        βββ User.db
        βββ User.db-shm
        βββ User.db-wal
- 
assets.logandlog.txtare NPL's log files.
- 
cluster.jsonis the cluster servers's information, see below.
- 
config.propertiesis this server's id.
- 
server.statestores this server's Raft state, which isterm,commitIndex,votedFor.
- 
snapshotis the directory contains the snapshot of the statemachine.- each snapshot contains 2 files.
- 
$logIndex-$term_$collection_s.dbis the snapshot data of the statemachine.
- 
$logIndex.cnfis the cluster configure at thelogIndex.
 
- 
store.datastores this server's Raft log entries data, andstore.data.bakis its backup.
- 
store.idxstores the index data ofstore.data, index data is the file position of log entries instore.data.store.idx.bakis its backup.
- 
store.stistores this server's log start index. andstore.sti.bakis its backup.
- 
temp/test_raft_databaseis the rootFolder of TableDatabase.
Note: we may move some state above into sqlite like actordb.
To use TableDB Raft, we need to add a tabledb.config.xml file in the Tabledb rootFolder.
Note
npl_package/mainshould not have therootFolder, otherwise it will affect the application'srootFolderand its config. AND also care about theDevelopment directoryandWorkingDir.
<tabledb>
	<providers>
		<provider type="TableDB.RaftSqliteStore" name="raft" file="(g1)npl_mod/TableDB/RaftSqliteStore.lua">./,localhost,9004,4,rtdb
		</provider>
	</providers>
	<tables>
		<table provider="raft" name="User1"/>
		<table provider="raft" name="User2"/>
		<table provider="raft" name="default" />
	</tables>
</tabledb>Above is an example of the Raft provider config.
- one providerhas 4 configs,- 
typeis used incommonlib.gettable(provider.attr.type).
- 
fileis used inNPL.load(provider.attr.file).
- 
nameis the name of the provider, which is used in thetableconfig.
- 
valuesis the init args used in providertype's init method.
 
- 
- 
tablehave 2 config,- 
provideris the name of the provider, correspond to thenamein provider,
- 
nameis the table name. If the name isdefault, the corresponding provider will be set to be default provider of the database. Ifdefaultis not set, default provider will besqlite.
 
- 
There are 4 values(args) in the above for raft init and separated by ,.
- 
./is the working directory for the node. The working directory should containclustor.jsonfile below.
- 
localhostis the hostname of the node.
- 
9004is port number for the node.
- 
rtdbis the thread name running theRaftServer.
At each server's start up stage, they should know all their peers, this is a json config file like below (setup/init-cluster.json or the cluster.json in the node's working directory):
{
	"logIndex": 0,
	"lastLogIndex": 0,
	"servers":[
	   {
		   "id": 1,
		   "endpoint": "tcp://localhost:9001"
	   },
	   {
		   "id": 2,
		   "endpoint": "tcp://localhost:9002"
	   },
	   {
		   "id": 3,
		   "endpoint": "tcp://localhost:9003"
	   }
	]
}there are several scripts (setup.bat, addsrv.bat, stopNPL.bat) to facillitate the deployment. see setup folder
The Communication use npl_mod/Raft/Rpc.lua, include RaftRequestRPC and RTDBRequestRPC. RaftRequestRPC is used in Raft level and RTDBRequestRPC in TableDB level. They are both used in a Full Duplex way, that is, they are not only used to send request but also recv response.
- For RaftRequestRPC, seeRpcListener:startListeningandPeerServer:SendRequest.
- For RTDBRequestRPC, seeRaftTableDBStateMachine:start,RaftClient:tryCurrentLeaderandRaftTableDBStateMachine:start2.
- Snapshot is made for every collection in TableDB, that is, one collection one snapshot. And each collection can have multiple snapshots at different log index. The snapshot is not deleted at the moment(should we?).
- Snapshot is created in a different thread, see RaftTableDBStateMachine:createSnapshot. If one Snapshot is failed to create, we try to copy the prev snapshot to current log index.
Phase 2 will mainly boost the throughput performance with the following way:
- use sqlite's WAL log as the Raft log.οΌmainly)
- compress the log.
- actordb use 2 engines (sqlite and lmdb) which are connected by WAL. we could employ this, and should in a progressive way.
- LMDB is fast. See Reference 1. and that's(30 Million per node) enough in our situation.
On smaller runs (30Million or less), LMDB came out on top on most of the metrics except for disk size. This is actually what weβd expect for B-trees: theyβre faster the fewer keys you have in them.
- actordb will change the architecture in the next version (other than sqlite and lmdb), this may be a clue about the new achitecture.
- either way need to change the current raft implementation.
- snapshot handle of the statemachine is not handled in actoredb. we could leave it as the previous.
a more practical roadmap:
- hack sqlite wal to get the wal data(may be frame) as the raft log.
- reorganize the Raft implementation to adapt to the wal, include deal with the sqlite checkpoint.
- test the throuput and fix.
- 
InsertThroughputNoIndex:a millionnon-indexed insert operation with1000concurrent jobs takes2167.81s, while non-raft implementation takes150.346s, both withparaengineclient_d.exe, best effort with NPL.