Blogbench Introduction - liweizhao/webbench GitHub Wiki

1. Summary

Blogbench is a database benchmark to simulate database operations of real-done blog applications. Its main features are the authenticity of the simulation of the web application, easy to use and high flexibility. It can be used to test whether a database is suitable for web application.

2. Why develop blogbench

The reason why develop blogbench is that the frequently-used benchmarks available have such disadvantages:

DBT2: DBT2 is a open source implementation of TPS-C, the feature of its application is more of update transactions, less of read transactions. This is different from blog application and lack of flexibility. It can’t be used to only test some type of simple operation, and go against performance optimization.
Sysbench: sysbench has high flexibility. It includes 9 types of operation. Each operation can be tested separately. This is very useful to performance optimization. But the disadvantages of sysbench are that it only uses a fixed length table and the operation is much different from web application like blog. Such as range query on primary key and sort after query according secondary key rarely occurred in web application. So the result of sysbench is not helpful to proving that database adapts to web application.

3. Goal of designing blogbench

In view of this，the design objective of blogbench is:

Simulate real-done blog application. Table schema, feature of data in table, types of operation, and data distributions are designed according our statistic information of blog application.
High performance. The overhead of benchmark itself is much lower than server’s overhead.
Flexible. Each type of operation can be tested separately in order to facilitate analyze and optimize performance bottlenecks.

4. Table schema and data generation

4.1 Table schema

The test table schema of blogbench is shown bellow (of MySQL):

CREATE TABLE Blog ( ID BIGINT NOT NULL PRIMARY KEY, UserID BIGINT, Title VARCHAR(255), Abstract VARCHAR(2000), Content MEDIUMTEXT, AllowView SMALLINT, PublishTime BIGINT, AccessCount INT, CommentCount INT, KEY IDX_BLOG_UID_PUBTIME(UserID, PublishTime, AllowView) );

The corresponding table definition of PostgreSQL and Oracle is shown bellow：

PostgreSQL

CREATE TABLE Blog ( ID BIGINT NOT NULL PRIMARY KEY, UserID BIGINT, Title VARCHAR(255), Abstract VARCHAR(2000), Content TEXT, AllowView SMALLINT, PublishTime BIGINT, AccessCount INT, CommentCount INT );

CREATE INDEX IDX_BLOG_UID_PUBTIME ON Blog(UserID, PublishTime, AllowView);

Oracle

CREATE TABLE Blog ( ID NUMBER(20) NOT NULL PRIMARY KEY, UserID NUMBER(20), Title VARCHAR2(255), Abstract VARCHAR2(2000), Content CLOB, AllowView NUMBER(5), PublishTime NUMBER(20), AccessCount NUMBER(11), CommentCount NUMBER(11) );

CREATE INDEX IDX_BLOG_UID_PUBTIME ON Blog(UserID, PublishTime, AllowView);

4.2 Index building process

For PostgreSQL and Oracle，default behavior is creating primary index when table created, creating secondary index when finish loading data;
For InnoDB storage engine, default behavior is creating primary index and secondary index when table created;
For NTSE(Non Transactional Storage Engine developed by NetEase) storage engine, default behavior is create index none when table created, and create primary index and secondary index when finish loading data.

4.3 Data generate

The data generating rules of blogbench is shown bellow:

ID: from 1 to N，among them N is total number of records;
UserID: from 1 to N/5,，it means that every user have 5 blogs averagely. UserIDs is as the law of zipf distribution(user-zipf-pct%, user-zipf-res%, user-zipf-part). That means user-zipf-pct% users have user-zipf-res% total blogs, and parted into user-zipf-part levels;
Title: Title is a string filled with random characters , of which length is as the law of uniform distribution(min-title-size, max-title-size);
Abstract: Abstract is a string from a real blog article, of which length is as the law of uniform distribution(min-abs-size, max-abs-size);
Content: Abstract is a string from a real blog article, of which length is as the law of gama distribution(min-cnt-size, max-cnt-size, avg-cnt-size);
AllowView: random value, 91% is -100, 1% is 100, 8% is 10000;
PublishTime: current time in format of milliseconds;
AccessCount: initialized to 0;
CommentCount: initialized to 0;

You can use many threads to load data concurrently.