Database status

1. Introduction

When refactoring the whole core project, we stumble upon the database part, which provide a data set for Machine Learning algorithm.

The whole database consist in two main tables : compilations and incremental_compilations.

Note : We will mark Primary Key with bold font and will mark Foreign Key with field_name(table.field_name_origin). When both, it will be a fusion of the 2.

The two tables tables are divided as follow :

compilations

Field	Type	Meaning
cid	int(11)	The unique id of each compilation made. It stands for compilation id. Note that each compilation are unique, but the result can be redundant.
compilation_date	datetime	When the compilation have been made.
compilation_time	float	How many time have elapsed during the compilation. If negative, it mainly means that the compilation failed. Time is in seconds.
config_file	longblob	The configuration file used to specify the Linux kernel configuration. Present in a compressed form.
stdlog_file	longblob	The standard output of the compilation, different from what was display to the user.
errlog_file	longblob	The error output of the compilation, different from what was display to the user.
output_file	longblob	What are display to the user.
core_size	int(11)	The size of the compiled kernel. Will be 0 if the compilation failed. Size is in Byte.
compressed_size	text	The size of the compiled kernel, depending on the compression algorithm used. Size is in Byte.
dependencies	longtext	All package present on the docker image in order to successfully compile the kernel.
gcc_version	varchar(32)	Version of the GCC and G++ compiler.
libc_version	varchar(32)	Version of the LDD linker.
core_used	int(11)	The number of cores used during the compilation.
incremental_mod	tinyint(1)	States if this compilation have been made in incremental mode.
tuxml_version	varchar(32)	The version of TuxML when doing this compilation.
git_branch	varchar(32)	(?) Always blank.
docker_image	varchar(32)	(?) Always blank.
os	varchar(32)	The system kernel on which we compile. Could be Linux, Windows, MacOS or Java.
distribution	varchar(32)	Assuming we are using Linux, the distribution name (e.g. debian, fedora, ...).
distrib_version	varchar(32)	Assuming we are using Linux, the distribution version.
kernel	varchar(32)	The system kernel version.
arch	varchar(32)	The architecture on which we are running the compilation and for which we compile (e.g. x86_x64, arm64, ...).
cpu	varchar(128)	The brand name of the cpu which is used for the compilation.
cpu_cores	int(11)	The total number of thread available for this CPU.
cpu_freq	varchar(32)	A sample of this cpu frequencies.
ram	int(11)	The ram size, in kB.
mechanical_drive	tinyint(1)	Determine if we are using an HDD or something else.

incremental_compilations

Field	Type	Meaning
*cid_incmod*(compilations.cid)	int(11)	The corresponding cid of a compilation made with incremental mode.
*cid_origin*(compilations.cid)	int(11)	The corresponding cid of the compilation which resulting object file are reused to compile the *cid_incmod*.

2. Problematics

This design present some flaws :

The database size : for each compilation, we have a record for every field which isn't size compliant, because the database size can explode easily and some record are a lot redundant.
The naming : some field name could be enigmatic, even if we know what we are working with. Also, they can lead to some ambiguity : for example, the kernel field is about the kernel name, kernel version, the kernel on which we compile, the kernel that we compile, etc.
The relevance : some field lack accuracy or are interdependent. For example, the cpu_freq field is the actual frequency of the cpu before the compilation. It will be more useful to know the actual range of frequency for the processor rather than some value which isn't really relevant with the compilation.
Extensibility : right now, we are try to include some new field, about different version of linux kernel we want to compile, or the boot time and boot log. But it will need to be added directly to the compilations table, which already contains a lot of data and will take time to modify.

3. Proposition

In order to respond to this flaws, we will propose a division in five different table.

Note : From now on, all fields and tables name will follow Snake case with no uppercase. (If you don't what is it : https://en.wikipedia.org/wiki/Snake_case)

compilations : From the old compilations table, retains field which is directly related to the compilation result.

Field	Type	Explanations
cid	int(11)	Old field compilations.cid.
compilation_date	datetime	Old field compilations.compilation_date.
compilation_time	float	Old field compilations.compilation_time.
config_file	longblob	Old field compilations.config_file.
stdout_log_file	longblob	Old field compilations.stdlog_file.
stderr_log_file	longblob	Old field compilations.errlog_file.
user_output_file	longblob	Old field compilations.output_file.
compiled_kernel_size	int(11)	Old field compilations.core_size.
compressed_compiled_kernel_size	text	Old field compilations.compressed_size.
dependencies	longtext	Old field compilations.dependencies. Note that it is compilation dependent because it can contain additional package who aren't present on the base docker image.
number_cpu_core_used	int(11)	Old field compilations.core_used.
compiled_kernel_version	varchar(32)	The version of the kernel we compile. It's a new field, and it allow us to prepare for multiple version of the Linux kernel.
sid(software_environment.sid)	int(11)	See software_environment table for more.
hid(hardware_environment.hid)	int(11)	See hardware_environment table for more.

incrementals_compilations_relation : it retains the same information as the old incrementals_compilations.

Field	Type	Explanations
*cid*(compilations.cid)	int(11)	Old field incremental_compilations.cid.
*cid_base*(compilations.cid)	int(11)	Old field incremental_compilations.cid.
incremental_level	int(3)	Old field compilations.incremental_mod. See explanation below for more.

Why incremental_level? Actually, when we use the incremental mode, we use the first compilation result as a base for all the other, which is an incremental_level of one. Therefore, if we want to make a chain of compilation, it will only add an incremental_level (2+). Different type of incremental compilation

hardware_environment : Since each hardware is unique but can be same across some compilations, we create a table for each specific hardware, which is link to some cid. This behaviour could retain a lot of database size.

Field	Type	Explanations
hid	int(11)	The unique hardware id.
architecture	varchar(32)	Old field compilations.arch.
cpu_brand_name	varchar(128)	Old field compilations.cpu.
number_cpu_core	int(11)	Old field compilations.cpu_core.
cpu_max_frequency	varchar(32)	The max possible frequency of the cpu.
ram_size	int(11)	Old field compilations.ram.
mechanical_disk	boolean	Old field compilations.mechanical_drive.

Why boolean instead of tinyint(1) : https://dev.mysql.com/doc/refman/8.0/en/numeric-type-overview.html. TLDR: Boolean type is similar to tinyint(1) but carry a lot more meaning than tinyint(1).

software_environment : As the hardware, the software shouldn't change a lot across all the compilation, so we apply the same behaviour.

Field	Type	Explanations
sid	int(11)	The unique software id.
system_kernel	varchar(32)	Old field compilations.os.
system_kernel_version	varchar(32)	Old field compilations.kernel.
linux_distribution	varchar(32)	Old field compilations.distribution. Note that it will be empty if the kernel isn't Linux.
linux_distribution_version	varchar(32)	Old field compilations.distrib_version. Note that it will be empty if the kernel isn't Linux.
gcc_version	varchar(32)	Old field compilations.gcc_version.
libc_version	varchar(32)	Old field compilations.libc_version.
tuxml_version	varchar(32)	Old field compilations.tuxml_version.

boot : This database will retain boot testing result, if only the test has be done.

Field	Type	Explanations
*cid*(compilations.cid)	int(11)	Reference to compilations.cid.
boot_time	float	The time spend to boot the kernel. Will be negative if the kernel hasn't be able to boot.
boot_log_file	longblob	The log of the boot.

other table ? : in the future, each new test or new functionality could be added as a new table, which we will not have a huge impact on the existent database and respect an easier way to extend the project result.

4. Conclusion

The proposed change have been made, with an adjustment about boolean/tinyint(1) : since the database automatically convert boolean to tinyint(1), we will use tinyint(1) for boolean field.

19_02_06_BDD - TuxML/ProjetIrma GitHub Wiki

Database status

1. Introduction

2. Problematics

3. Proposition

4. Conclusion

⚠️ GitHub.com Fallback ⚠️

19_02_06_BDD - TuxML/ProjetIrma GitHub Wiki

Database status

1. Introduction

2. Problematics

3. Proposition

4. Conclusion

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️