
##ActivityNet Core Components Introduction

The ActivityNet system is built on the following core technologies:
#####1. Big data access techniques based on Memory Map
#####2. Storage techniques of heterogeneous graphs
#####3. Storage techniques of dynamic graphs (under development)
#####4. Distributed computing techniques based on Vertex Program Interface
#####5. General indexing and searching framework
#####6. RPC component
The details of these techniques are as follows:
###Big data access techniques based on Memory Map
Memory mapping is a modern file access technique supported by mainstream operating systems: it maps a file directly into the process's address space. In SAE big data applications there are two data access patterns: sequential access when conducting large-scale computation, and random access when conducting small-scale computation.
Compared with ordinary file I/O, memory mapping has advantages in both patterns. In large-scale computation it avoids extra in-memory copies of the data, saving time. In small-scale computation it not only saves time but also loads only the pages that are actually needed, saving memory.
The system's data structures are designed specifically for memory mapping, so that loading and computation run as fast as possible.
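As a minimal illustration (not SAE's actual loader), the following sketch memory-maps a binary file of 64-bit integers on a POSIX system and reads it like an ordinary array; the file name is hypothetical.

    // Minimal POSIX memory-mapping sketch (illustrative only, not SAE's loader).
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdint>

    int main() {
        const char* path = "nodes.bin";           // hypothetical data file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);                           // file size determines mapping length

        // Map the whole file read-only; pages are loaded lazily on first access.
        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        // Treat the mapped region as an array of 64-bit integers.
        const int64_t* data = static_cast<const int64_t*>(base);
        size_t n = st.st_size / sizeof(int64_t);
        if (n > 0) std::printf("first value: %lld\n", (long long)data[0]);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }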
###Storage techniques of heterogeneous graphs
In real-life social networks the network structure is complex: the data and data structures vary across different nodes and links. For example, in ArnetMiner's citation network there are different node types such as Author, Publication, and Conference.
To support searching and computing on heterogeneous networks, we designed and implemented a heterogeneous graph storage system.
In the storage system, several linear vectors are used to store and index the heterogeneous data.
#####1. Node Index Table. The node index table has the following data fields:

  • Global node number, global_id
  • Node number within its data kind, local_id
  • Node data kind number, data_type
  • First link number of the node; all links of a node form a vector

#####2. Link Index Table. The link index table includes a global link index table, a forward link index table, a backward link index table, and per-kind index tables. The data fields of a link include:

  • Global link number, global_id
  • Link number within its kind, local_id
  • Link kind number, data_type
  • Link source node and target node

#####3. Node data kind info table. This table records information about every kind of node. The data fields include:

  • Data kind name
  • The number of nodes of this kind

#####4. Link data kind info table. This table records information about every kind of link. The data fields include:

  • Data kind name
  • The number of links of this kind

#####5. Node data table. This table stores the data of each node.
#####6. Link data table. This table stores the data of each link.
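To make the tables above concrete, here is a simplified sketch of how the node and link index entries might be declared as C++ structs; the type and field names are illustrative assumptions, not the actual SAE definitions.

    // Illustrative sketch of the index tables described above; the real SAE
    // definitions may differ in names and layout.
    #include <cstdint>
    #include <string>
    #include <vector>

    using vid_t = uint64_t;   // global node id (assumed)
    using eid_t = uint64_t;   // global link id (assumed)

    struct NodeIndexEntry {
        vid_t    global_id;   // global node number
        vid_t    local_id;    // node number within its data kind
        uint32_t data_type;   // node data kind number
        eid_t    first_edge;  // first link of the node; its links form a vector
    };

    struct LinkIndexEntry {
        eid_t    global_id;   // global link number
        eid_t    local_id;    // link number within its kind
        uint32_t data_type;   // link kind number
        vid_t    source;      // source node
        vid_t    target;      // target node
    };

    struct DataKindInfo {
        std::string name;     // data kind name
        uint64_t    count;    // number of nodes (or links) of this kind
    };

    // The storage system keeps such entries in linear vectors, which map well
    // onto the memory-mapped files described earlier.
    std::vector<NodeIndexEntry> node_index;
    std::vector<LinkIndexEntry> forward_link_index;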

The advantages of our storage techniques are:
#####1. Comprehensive indexing system. Different kinds of nodes and links are connected through the index tables, so most data can be accessed in O(1) time with high efficiency.
#####2. Separation of data and index. Index information and data are stored separately, which improves search efficiency.
#####3. Using a file code stream for data storage. This lets us process data within a unified framework.

###Storage techniques of dynamic graphs (under development)
The dynamic storage technique implements the following functions:
#####1. Basic graph search functions
#####2. Adding and deleting nodes in the graph
#####3. History version management, i.e. the ability to view the graph structure of a past version

With these functions, dynamic graph storage can be implemented in SAE: the system can efficiently add, delete, and update data, save the historical information at the same time, and roll back to a past version, implementing a 4-D (space plus time) dynamic graph. Dynamic graph storage requires data structures that can grow dynamically, satisfy all kinds of search needs, and provide back-end server interfaces for front-end applications, which is very practical in a big data environment and makes dynamic storage harder than static storage. More importantly, how to manage and process large amounts of historical data, how to use historical data to rebuild a past graph, and how to mine more information from past data are challenges both in technique and in concept.

The essence of dynamic graph storage is to store node info and link info separately in different Time Split B-Trees (TSB):
#####1. In the Node TSB, the key is the node number and the other info is the value. A user can look up a node's value at a given time by its node id, and dynamically add a new node or new node info to the B-tree. (To delete a node, simply add a tag indicating that the node is deleted.)
#####2. In the Link TSB, the link's start node, end node, and link id form the key, and the link info is the value. We can look up a value at a given time by the key, and we can also find all neighboring nodes and links of a given node. As with the Node TSB, it is convenient to add, delete, and update link values in the tree.
#####3. Indexes over the data in the graph must be built by the user manually.
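To illustrate the time-keyed lookup idea (not the actual TSB implementation), the following sketch uses a plain std::map keyed by (node id, timestamp); all names are assumptions.

    // Illustrative only: a std::map standing in for the real Time Split B-Tree.
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    using vid_t = uint64_t;

    struct NodeVersion {
        std::string data;     // node info at this version
        bool deleted = false; // deletion is recorded as a tag, not a removal
    };

    // Key: (node id, timestamp). All versions of a node are kept.
    std::map<std::pair<vid_t, uint64_t>, NodeVersion> node_tsb;

    // Find the latest version of `id` that is not newer than `timestamp`.
    std::optional<NodeVersion> lookup(vid_t id, uint64_t timestamp) {
        auto it = node_tsb.upper_bound({id, timestamp});
        if (it == node_tsb.begin()) return std::nullopt;
        --it;
        if (it->first.first != id || it->second.deleted) return std::nullopt;
        return it->second;
    }

    // Adding a node or updating its info just inserts a new (id, time) version.
    void update(vid_t id, uint64_t timestamp, const std::string& data) {
        node_tsb[{id, timestamp}] = NodeVersion{data, false};
    }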

This part is currently under development. After trials, it will serve, together with the heterogeneous graph storage technique, as the core of the storage component.

###Vertex Program Computing Interface
When dealing with large-scale graph and social network computing, a distributed architecture is usually necessary. Traditional computing is usually based on shared memory, but this approach is not suitable for distributed social computing because the spatio-temporal locality of graph structures is poor. We therefore use the Vertex Program interface. The interface originated with Google Pregel, was developed further by GraphLab and PowerGraph, and evolved into the GAS model we use today, i.e. the Gather/Apply/Scatter model.

###Index System
Besides large-scale computing, another common need is to find related nodes efficiently in social network data. Without an appropriate index system we would have to walk the whole network to find a single node, which has high time complexity.
Therefore, we developed an index system suitable for graph structures and built an inverted index over the graph data, which speeds up queries.

###RPC Component
RPC is the abbreviation of Remote Procedure Call. We designed an RPC framework to provide services to front-end applications. In SAE DB, RPC is needed in two places. The first is during distributed computing, where the system needs message passing between machines. The other is when an application uses SAE DB and the system needs an interface to communicate with the front-end server. Common RPC messaging methods such as XML-RPC need to encode data as text. However, in practice we usually use binary data to avoid losing data during transmission, so we designed our own RPC messaging scheme. Meanwhile, we want the scheme to support as many programming languages as possible, such as Python and Java. With these considerations, we chose the combination of ZeroMQ and Protocol Buffers. ZeroMQ is a lightweight, cross-platform messaging library that is easy to scale and supports multiple messaging patterns.
Protocol Buffers is a serialization library created by Google. It defines a data description language for serializing structured data. Compared with XML, Protocol Buffers is faster, lighter, and easier to use in code.
We apply the combination of ZeroMQ and Protocol Buffers in our system. On the one hand, cross-platform messaging is difficult and not our core research focus; our system requires high I/O throughput and easy scaling, which matches the goals of ZeroMQ. On the other hand, we need a general encoding scheme so that messages can be handled in multiple programming languages. Although XML is more popular, it is not suitable for large-scale data transfer. Protocol Buffers provides a cross-language, cross-platform encoding method, which meets our requirements for performance and generality.
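As a rough illustration of this combination (not the actual SAE RPC code), the sketch below serializes a Protocol Buffers message and sends it over a ZeroMQ REQ socket using the standard zmq C API; the EchoRequest message type, the generated header, and the endpoint are hypothetical.

    // Illustrative sketch: serialize a protobuf message and send it over ZeroMQ.
    // Assumes "echo.pb.h" was generated from a hypothetical .proto defining
    // message EchoRequest { string text = 1; } in package demo.
    #include <string>
    #include <zmq.h>
    #include "echo.pb.h"

    int main() {
        demo::EchoRequest req;            // hypothetical generated message class
        req.set_text("hello");

        std::string payload;
        req.SerializeToString(&payload);  // protobuf: encode to a binary string

        void* ctx = zmq_ctx_new();
        void* sock = zmq_socket(ctx, ZMQ_REQ);
        zmq_connect(sock, "tcp://localhost:40000");   // hypothetical endpoint

        zmq_send(sock, payload.data(), payload.size(), 0);

        char buf[1024];
        int n = zmq_recv(sock, buf, sizeof(buf), 0);  // wait for the binary reply
        (void)n;

        zmq_close(sock);
        zmq_ctx_destroy(ctx);
        return 0;
    }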
###API
For convenience, we provide API documentation for every component.

####Serialization
To store different kinds of data to disk and load them back, we implemented a high-efficiency serialization library. We use serialize-stream classes for serializing and loading data. The library serializes data according to user-defined serialization functions and stores it on disk in binary form. The default mode is POD, which copies the data's in-memory representation directly without any user-provided code.
For example, for struct Data { int i; std::vector<int> j; }, we can handle it as follows:
Define the serialization and deserialization functions:

namespace sae{namespace serialization{namespace custom_serialization_impl{

   template <>
   struct serialize_impl<OSerializeStream, Data> {
       static void run(OSerializeStream& ostr, const Data& d) {
           ostr << d.i << d.j;
       }
   };

   template <>
   struct deserialize_impl<ISerializeStream, Data> {
       static void run(ISerializeStream& istr, Data& d) {
           istr >> d.i >> d.j;
       }
   };

}}}

Serialization:

    Data d;
    ofstream fout("data.bin", ofstream::binary);
    OSerializeStream encoder(&fout);
    encoder << d;

Deserialization:

    Data d;
    ifstream fin("data.bin", ifstream::binary);
    ISerializeStream decoder(&fin);
    decoder >> d;
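For plain-old-data types no custom functions are needed: in the default POD mode the library copies the object's bytes directly. A minimal sketch (the Point type here is only an example):

    struct Point { int x; int y; };     // POD: no custom serialize_impl needed

    Point p{1, 2};
    ofstream fout("point.bin", ofstream::binary);
    OSerializeStream encoder(&fout);
    encoder << p;                       // default POD mode copies the raw bytes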

####Heterogeneous network storage

  1. MGraph
  2. GraphBuilder
  3. VertexIterator
  4. EdgeIterator

####Dynamic network storage (under development)

#####1. VertexIterator, the node iterator interface:
- vid_t GlobalId(): get the global id of the current node
- DataType& Data(): get the data of the current node
- void Next(): move the iterator to the next node
- void NextOfType(): move the iterator to the next node of the same type
- void AllVersions(): get all versions of the current node
- void MoveTo(vid_t): move the iterator to the node with the given id
- bool Alive(): check whether the iterator points to a valid object
- std::string TypeName(): get the type name of the current node
- uint32_t TypeId(): get the type id of the current node
- vid_t Count(): get the total number of nodes
- eid_t InEdgeCount(): get the number of links pointing to the current node
- eid_t OutEdgeCount(): get the number of links starting from the current node
- EdgeIteratorPtr InEdges(): get an iterator over the incoming links of the current node
- EdgeIteratorPtr OutEdges(): get an iterator over the outgoing links of the current node

#####2. EdgeIterator, the link iterator interface:
- eid_t GlobalId(): get the global id of the current link
- vid_t SourceId(): get the source node id of the current link
- vid_t TargetId(): get the target node id of the current link
- VertexIteratorPtr Source(): get a pointer to the source node of the current link
- VertexIteratorPtr Target(): get a pointer to the target node of the current link
- void AllVersions(): get all versions of the current link
- DataType& Data(): get the data of the current link
- void Next(): move the iterator to the next link
- bool Alive(): check whether the iterator points to a valid object
- std::string TypeName(): get the type name of the current link
- uint32_t TypeId(): get the type id of the current link
- eid_t Count(): get the total number of links

#####3. TSB related interface:
- TSB* Open(const char* filename) - Parameters: filename - Return value: TSB tree pointer
- int VertexTypeIdOf(const char* type) - Parameters: node type name - Return value: node type id
- int EdgeTypeIdOf(const char* type) - Parameters: link type name - Return value: link type id
- vid_t VertexCount() - Return value: the number of nodes
- eid_t VertexCountOfType(const char* type) - Parameters: node type name - Return value: the number of nodes of that type
- eid_t EdgeCount() - Return value: the number of links
- eid_t EdgeCountOfType(const char* type) - Parameters: link type name - Return value: the number of links of that type
- VertexIteratorPtr Vertices(uint64_t timestamp = current) - Parameters: timestamp - Return value: a node iterator at that timestamp
- VertexIteratorPtr VerticesOfType(const char*) - Return value: a node iterator over that type
- EdgeIteratorPtr ForwardEdges() - Return value: the forward link iterator
- EdgeIteratorPtr BackwardEdges() - Return value: the backward link iterator
- EdgeIteratorPtr Edges(uint64_t timestamp = current) - Parameters: timestamp - Return value: a link iterator at that timestamp
- bool insertVertex(string typename, DataType data, int optype) - Parameters: node type, node data, operation type - Return value: whether the operation succeeded
- bool insertEdge(string typename, vid_t source, vid_t target, DataType data, int optype) - Parameters: link type, source node, target node, link data, operation type - Return value: whether the operation succeeded
- void Sync() - Function: incrementally sync in-memory data with the disk file
- void Close() - Function: close the open data file
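Based on the interfaces listed above, a usage sketch might look like the following; since this component is still under development, the header name, the data file, and the exact calling conventions (e.g. whether Open is a static member) are assumptions.

    // Illustrative sketch of using the TSB / iterator interfaces listed above.
    #include <cstdio>
    #include "tsb.hpp"   // assumed header name

    int main() {
        TSB* tsb = TSB::Open("citation_graph.tsb");   // open the dynamic graph file
        std::printf("nodes: %llu, links: %llu\n",
                    (unsigned long long)tsb->VertexCount(),
                    (unsigned long long)tsb->EdgeCount());

        // Walk over all nodes of one type at the current timestamp.
        for (auto v = tsb->VerticesOfType("Author"); v->Alive(); v->Next()) {
            std::printf("author %llu has %llu out-links\n",
                        (unsigned long long)v->GlobalId(),
                        (unsigned long long)v->OutEdgeCount());
        }

        tsb->Sync();    // flush pending updates to the disk file
        tsb->Close();
        return 0;
    }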

####Computing
#####VertexProgram interfaces:
void init(message)
The message is used to initialize the node.
gather_type gather(vertex, edges)
In each iteration, every node first gathers related information from its neighboring nodes and links; the return value has type gather_type. After gather returns, execution moves to the apply step.
void apply(vertex, gather_result)
In the apply step, the node modifies its own value based on the gathered result.
void scatter(vertex, edges)
In the scatter step, the computed result is distributed to the neighboring nodes.

Based on the definitions above, a PageRank program looks as follows:

double RESET_PROB = 0.15;
double TOLERANCE = 1.0E-2;

struct VData {
   double pagerank;
};

struct EData {
   int type;
};

typedef saedb::sae_graph graph_type;

class pagerank:
public saedb::IAlgorithm<graph_type, double, message_data_type>
{
public:
   void init(icontext_type& context, vertex_type& vertex, const message_type& msg) {
       vertex.update<double>(msg);
   }

   edge_dir_type gather_edges(icontext_type& context,
                              const vertex_type& vertex) const{
       return saedb::IN_EDGES;
   }

   double gather(icontext_type& context, const vertex_type& vertex,
                edge_type& edge) const {
       return ((1.0 - RESET_PROB) / edge.source().num_out_edges()) * edge.source().parse<double>();
   }

   void apply(icontext_type& context, vertex_type& vertex,
              const gather_type& total){
       const double newval = total + RESET_PROB;
       vertex.update<double>(newval);
   }

   edge_dir_type scatter_edges(icontext_type& context,
                               const vertex_type& vertex) const{
       return saedb::OUT_EDGES;
   }

   void scatter(icontext_type& context, const vertex_type& vertex,
                edge_type& edge) const {
       context.signal(edge.target());
   }

};

This programming pattern makes such algorithms very convenient to write.

To start a computation, first load the graph data and create a computing engine as follows:

   graph_type graph;
   graph.load_format(filepath);
   saedb::IEngine<pagerank> *engine = new saedb::EngineDelegate<pagerank>(graph);
   engine->signalAll();
   engine->start();

   // for debug purpose, print out all
   if (FLAGS_printall) {
       for (auto i = 0; i < graph.num_local_vertices(); i ++) {
           cout << "v[" << i << "]: " << graph.vertex(i).parse<double>() << endl;
       }
   }
   delete engine;

####Index and query

The interfaces of the indexing component are as follows:

#####1. Index building interface
- void addSingle(int doc, int field, TokenStream* stream, double score) - add a document to the index - Parameters: doc id, doc field id, doc token stream, score
- void optimize() - optimize the index after all documents have been added
- static Index build(DocumentCollection) - build an index from a DocumentCollection in a single call
  - DocumentCollection = std::map<int, Document>
  - Document = {id, std::vector<Field>}
  - Field = {name, value}
- double bm25(int freq, int total_tokens, double avg_len) - compute a score based on the BM25 algorithm

#####2. TokenStream interface
TokenStream is an interface that represents text data after preliminary processing. It uses an iterator structure to process the text: first a LetterTokenizer splits the text into words, then a LowerCaseFilter converts all words to lower case, and finally a StemFilter stems the words and a StopFilter removes stop words.
In SAE DB, we implement ArnetAnalyzer as a general English text processor.
######TokenStream interface:
- bool next(Token& token) - get the next token
- void reset() - reset the stream to its initial state

#####3. Search interface
- QueryItem = {int doc_id, double score}
- SearchResult = std::vector<QueryItem>
- Searcher::Searcher(const Index& index) - construct a searcher over an index
- SearchResult Searcher::search(TokenStream* query) - search the index with a query; returns the related documents and their scores
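A usage sketch based on the interfaces above could look like this; the header name, the load_documents helper, and the ArnetAnalyzer factory call are assumptions, since the construction of a TokenStream is not specified here.

    // Illustrative sketch of building an index and querying it with the
    // interfaces listed above; header name and analyzer call are assumed.
    #include <iostream>
    #include "indexing.hpp"                           // assumed header name

    DocumentCollection load_documents();              // hypothetical helper

    int main() {
        // Assume a DocumentCollection (std::map<int, Document>) filled elsewhere.
        DocumentCollection docs = load_documents();

        Index index = Index::build(docs);             // single-call index build

        // Tokenize the query text; ArnetAnalyzer is SAE's English text processor,
        // but the exact factory method shown here is an assumption.
        TokenStream* query = ArnetAnalyzer::tokenStream("social network mining");

        Searcher searcher(index);
        SearchResult result = searcher.search(query); // vector of {doc_id, score}

        for (const QueryItem& item : result) {
            std::cout << "doc " << item.doc_id << " score " << item.score << "\n";
        }
        return 0;
    }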

####RPC

We implemented the RPC component to be as simple as possible.
First, use RpcServer::CreateServer(port, threads) to create a multi-threaded RPC server on a port. Then use server->addMethod(method_name, method_func) to register a function with the server. method_func is the handler that processes the request and fills in the reply to be returned via protobuf. An example method_func is as follows:

bool echo(const string& input, string& output) {
   output = input;
   return true;
}
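Putting this together, a server set-up sketch following the CreateServer/addMethod calls described above might be (the port, thread count, and the blocking run() call are assumptions):

    // Illustrative sketch of wiring the echo handler into the RPC server,
    // following the CreateServer/addMethod calls described above.
    #include <string>
    #include "rpc_server.hpp"   // assumed header for RpcServer

    bool echo(const std::string& input, std::string& output);  // defined above

    int main() {
        // Create a multi-threaded RPC server listening on port 40000 with 4 workers.
        auto server = RpcServer::CreateServer(40000, 4);

        // Register the echo handler under the method name "echo".
        server->addMethod("echo", echo);

        // The server now answers requests sent over ZeroMQ from any language
        // that can encode the Protocol Buffers messages.
        server->run();          // assumed blocking call to start serving
        return 0;
    }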

Thus, a client written in any language can communicate with the server on this port via ZeroMQ.

####Other Components
We have also integrated several open source libraries for some non-critical components for convenience. Their code is embedded in the SAE project, so no extra downloading is needed. These two libraries are simple, so we do not introduce them in detail here.
#####gflags
gflags provides parsing of command line flags; details can be found here: http://google-gflags.googlecode.com/svn/trunk/doc/gflags.html
#####glog
glog provides logging functions; details can be found here: http://google-glog.googlecode.com/svn/trunk/doc/glog.html
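For reference, a minimal sketch of how these two libraries are typically used together (the flag mirrors the FLAGS_printall flag used in the computing example above; the log message is just an example):

    // Minimal gflags + glog usage sketch; the flag and log message are examples.
    #include <gflags/gflags.h>
    #include <glog/logging.h>

    DEFINE_bool(printall, false, "Print every vertex value after computing.");

    int main(int argc, char* argv[]) {
        google::ParseCommandLineFlags(&argc, &argv, true);  // parse --printall etc.
        google::InitGoogleLogging(argv[0]);                 // initialize glog

        LOG(INFO) << "printall = " << FLAGS_printall;
        return 0;
    }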
