19_02_06_BDD - TuxML/ProjetIrma GitHub Wiki

Database status

1. Introduction

When refactoring the whole core project, we stumble upon the database part, which provide a data set for Machine Learning algorithm.

The whole database consist in two main tables : compilations and incremental_compilations.

Note : We will mark Primary Key with bold font and will mark Foreign Key with field_name(table.field_name_origin). When both, it will be a fusion of the 2.

The two tables tables are divided as follow :

  • compilations
Field Type Meaning
cid int(11) The unique id of each compilation made. It stands for compilation id. Note that each compilation are unique, but the result can be redundant.
compilation_date datetime When the compilation have been made.
compilation_time float How many time have elapsed during the compilation. If negative, it mainly means that the compilation failed. Time is in seconds.
config_file longblob The configuration file used to specify the Linux kernel configuration. Present in a compressed form.
stdlog_file longblob The standard output of the compilation, different from what was display to the user.
errlog_file longblob The error output of the compilation, different from what was display to the user.
output_file longblob What are display to the user.
core_size int(11) The size of the compiled kernel. Will be 0 if the compilation failed. Size is in Byte.
compressed_size text The size of the compiled kernel, depending on the compression algorithm used. Size is in Byte.
dependencies longtext All package present on the docker image in order to successfully compile the kernel.
gcc_version varchar(32) Version of the GCC and G++ compiler.
libc_version varchar(32) Version of the LDD linker.
core_used int(11) The number of cores used during the compilation.
incremental_mod tinyint(1) States if this compilation have been made in incremental mode.
tuxml_version varchar(32) The version of TuxML when doing this compilation.
git_branch varchar(32) (?) Always blank.
docker_image varchar(32) (?) Always blank.
os varchar(32) The system kernel on which we compile. Could be Linux, Windows, MacOS or Java.
distribution varchar(32) Assuming we are using Linux, the distribution name (e.g. debian, fedora, ...).
distrib_version varchar(32) Assuming we are using Linux, the distribution version.
kernel varchar(32) The system kernel version.
arch varchar(32) The architecture on which we are running the compilation and for which we compile (e.g. x86_x64, arm64, ...).
cpu varchar(128) The brand name of the cpu which is used for the compilation.
cpu_cores int(11) The total number of thread available for this CPU.
cpu_freq varchar(32) A sample of this cpu frequencies.
ram int(11) The ram size, in kB.
mechanical_drive tinyint(1) Determine if we are using an HDD or something else.
  • incremental_compilations
Field Type Meaning
cid_incmod(compilations.cid) int(11) The corresponding cid of a compilation made with incremental mode.
cid_origin(compilations.cid) int(11) The corresponding cid of the compilation which resulting object file are reused to compile the cid_incmod.

2. Problematics

This design present some flaws :

  • The database size : for each compilation, we have a record for every field which isn't size compliant, because the database size can explode easily and some record are a lot redundant.
  • The naming : some field name could be enigmatic, even if we know what we are working with. Also, they can lead to some ambiguity : for example, the kernel field is about the kernel name, kernel version, the kernel on which we compile, the kernel that we compile, etc.
  • The relevance : some field lack accuracy or are interdependent. For example, the cpu_freq field is the actual frequency of the cpu before the compilation. It will be more useful to know the actual range of frequency for the processor rather than some value which isn't really relevant with the compilation.
  • Extensibility : right now, we are try to include some new field, about different version of linux kernel we want to compile, or the boot time and boot log. But it will need to be added directly to the compilations table, which already contains a lot of data and will take time to modify.

3. Proposition

In order to respond to this flaws, we will propose a division in five different table.

Note : From now on, all fields and tables name will follow Snake case with no uppercase. (If you don't what is it : https://en.wikipedia.org/wiki/Snake_case)

  • compilations : From the old compilations table, retains field which is directly related to the compilation result.
Field Type Explanations
cid int(11) Old field compilations.cid.
compilation_date datetime Old field compilations.compilation_date.
compilation_time float Old field compilations.compilation_time.
config_file longblob Old field compilations.config_file.
stdout_log_file longblob Old field compilations.stdlog_file.
stderr_log_file longblob Old field compilations.errlog_file.
user_output_file longblob Old field compilations.output_file.
compiled_kernel_size int(11) Old field compilations.core_size.
compressed_compiled_kernel_size text Old field compilations.compressed_size.
dependencies longtext Old field compilations.dependencies. Note that it is compilation dependent because it can contain additional package who aren't present on the base docker image.
number_cpu_core_used int(11) Old field compilations.core_used.
compiled_kernel_version varchar(32) The version of the kernel we compile. It's a new field, and it allow us to prepare for multiple version of the Linux kernel.
sid(software_environment.sid) int(11) See software_environment table for more.
hid(hardware_environment.hid) int(11) See hardware_environment table for more.
  • incrementals_compilations_relation : it retains the same information as the old incrementals_compilations.
Field Type Explanations
cid(compilations.cid) int(11) Old field incremental_compilations.cid.
cid_base(compilations.cid) int(11) Old field incremental_compilations.cid.
incremental_level int(3) Old field compilations.incremental_mod. See explanation below for more.

Why incremental_level? Actually, when we use the incremental mode, we use the first compilation result as a base for all the other, which is an incremental_level of one. Therefore, if we want to make a chain of compilation, it will only add an incremental_level (2+). Different type of incremental compilation

  • hardware_environment : Since each hardware is unique but can be same across some compilations, we create a table for each specific hardware, which is link to some cid. This behaviour could retain a lot of database size.
Field Type Explanations
hid int(11) The unique hardware id.
architecture varchar(32) Old field compilations.arch.
cpu_brand_name varchar(128) Old field compilations.cpu.
number_cpu_core int(11) Old field compilations.cpu_core.
cpu_max_frequency varchar(32) The max possible frequency of the cpu.
ram_size int(11) Old field compilations.ram.
mechanical_disk boolean Old field compilations.mechanical_drive.

Why boolean instead of tinyint(1) : https://dev.mysql.com/doc/refman/8.0/en/numeric-type-overview.html. TLDR: Boolean type is similar to tinyint(1) but carry a lot more meaning than tinyint(1).

  • software_environment : As the hardware, the software shouldn't change a lot across all the compilation, so we apply the same behaviour.
Field Type Explanations
sid int(11) The unique software id.
system_kernel varchar(32) Old field compilations.os.
system_kernel_version varchar(32) Old field compilations.kernel.
linux_distribution varchar(32) Old field compilations.distribution. Note that it will be empty if the kernel isn't Linux.
linux_distribution_version varchar(32) Old field compilations.distrib_version. Note that it will be empty if the kernel isn't Linux.
gcc_version varchar(32) Old field compilations.gcc_version.
libc_version varchar(32) Old field compilations.libc_version.
tuxml_version varchar(32) Old field compilations.tuxml_version.
  • boot : This database will retain boot testing result, if only the test has be done.
Field Type Explanations
cid(compilations.cid) int(11) Reference to compilations.cid.
boot_time float The time spend to boot the kernel. Will be negative if the kernel hasn't be able to boot.
boot_log_file longblob The log of the boot.
  • other table ? : in the future, each new test or new functionality could be added as a new table, which we will not have a huge impact on the existent database and respect an easier way to extend the project result.

4. Conclusion

The proposed change have been made, with an adjustment about boolean/tinyint(1) : since the database automatically convert boolean to tinyint(1), we will use tinyint(1) for boolean field.

⚠️ **GitHub.com Fallback** ⚠️