2016 Key Workflows Analysis (Historic)

Create Trial

Open Trial Sequence Diagram

Profiling observations

We define a Mega Trial as a trial with 50 germplasm entries, 3 replications, 25 traits, and 200 locations, each location having one location property (LOCATION_ABBR) and one condition (SITE_SOIL_PH). That is 50 x 3 = 150 plots per location, or 30,000 observation units in total. No measurements are included, since measurements cannot be created before the first save.

There are two major areas where most of the backend time is spent:

  • Experimental design: generating a randomized complete block design for 200 environments takes an average of 38,552 ms. BMS spawns the VSNi BVDesign engine (a Windows executable) in a loop, once per environment, which is time consuming. Generating the design per location may be a limitation of the VSNi engine; we are not entirely certain.
  • Insert trial data records: the average time spent on the backend saving trial data to the database was 26,943 ms. Most of the time goes into inserts related to observation units (the nd_experiment table) and their design-related properties (the nd_experimentprop table). See the query stats below for details.

Query Stats

Core query stats for creating the mega trial:

  • Number of selects: 215,720. Of these, 214,026 are select @@session.tx_read_only, eliminated by adding useLocalSessionState=true to the MySQL driver connection properties (see the example connection URL after this list). That leaves 1,694 selects from BMS: 722 prepare-select statements and 1,448 select executions.
  • Number of inserts: 151,934 (includes prepared statements)
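For reference, a minimal sketch of that driver setting as a JDBC connection URL (host and schema name are placeholders, not taken from the profiled environment):

     jdbc:mysql://<host>:3306/<crop_database>?useLocalSessionState=true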

Key categories of database inserts:

  1. Create trial environment/instance/location (nd_geolocation/nd_geolocationprop) records: 4 inserts per environment, as below. 800 inserts in total.

    insert into nd_geolocation (altitude, description, geodetic_datum, latitude, longitude) values (NULL, '1', NULL, NULL, NULL)

    insert into nd_geolocationprop (nd_geolocation_id, rank, type_id, value) values (1, 2, 8189, 'Loc1')   <-- 8189 = LOCATION_ABBR
    insert into nd_geolocationprop (nd_geolocation_id, rank, type_id, value) values (1, 3, 8135, '10110')  <-- 8135 = EXPT_DESIGN
    insert into nd_geolocationprop (nd_geolocation_id, rank, type_id, value) values (1, 4, 8131, '3')      <-- 8131 = NREP

  2. Create Basic trial information (project/projectprop) records.

    3 inserts into project (3 datasets). 161 inserts into projectprop (we have only one extra attribute, STUDY_INSTITUTE = CIMMYT, in the settings tab).

    3 inserts for the property name (its cvterm id, name, and definition) and 1 insert for the property value; e.g. the study name is stored with:

     insert into projectprop (project_id, rank, type_id, value) values (25007, 2, 1805, 'STUDY_NAME')					<-- 1805 = Study Detail
     insert into projectprop (project_id, rank, type_id, value) values (25007, 2, 1060, 'Study - assigned (DBCV)')		<-- 1060 = Variable description
     insert into projectprop (project_id, rank, type_id, value) values (25007, 2, 1070, '8005')							<-- 1070 = Standardvariableid
     insert into projectprop (project_id, rank, type_id, value) values (25007, 2, 8005, 'Mega Trial 764')				<-- 8005 = STUDY_NAME
    
  3. Create observation unit (plot) and related information records. MAJOR HOTSPOT (see the batching sketch after this list)

    200 inserts into phenotype for the SITE_SOIL_PH value of each environment condition:

     insert into phenotype (assay_id, attr_id, cvalue_id, name, observable_id, uniquename, value, phenotype_id) values (NULL, NULL, NULL, '8270', 8270, NULL, '6', 500)   <-- 8270 = SITE_SOIL_PH

    200 inserts into the nd_experiment_phenotype join table for the SITE_SOIL_PH value of each environment condition:

     insert into nd_experiment_phenotype (nd_experiment_id, phenotype_id, nd_experiment_phenotype_id) values (501, 500, 500)

    30,000 inserts into nd_experiment. 30,000 inserts into nd_experiment_project (join table). 60,000 inserts into nd_experimentprop (two properties per plot, PLOT_NO and REP_NO):

       insert into nd_experimentprop (nd_experiment_id, rank, type_id, value) values (702, 9, 8210, '1')   <-- 8210 = REP_NO
       insert into nd_experimentprop (nd_experiment_id, rank, type_id, value) values (702, 10, 8200, '2')  <-- 8200 = PLOT_NO
    

    30,000 inserts into nd_experiment_stock (join table):

     insert into nd_experiment_stock (nd_experiment_id, stock_id, type_id) values (701, 1, 1000)  <-- 1000 = IBDB structure. We don't know why we do this!
     insert into nd_experiment_stock (nd_experiment_id, stock_id, type_id) values (702, 2, 1000)
     insert into nd_experiment_stock (nd_experiment_id, stock_id, type_id) values (703, 3, 1000)
     ...
     insert into nd_experiment_stock (nd_experiment_id, stock_id, type_id) values (30700, 21, 1000)
    

    50 inserts into stock (one per entry). 50 inserts into stockprop (one ENTRY_TYPE value per entry):

     insert into stockprop (rank, stock_id, type_id, value) values (5, 1, 8255, '10170')      <-- 8255 = ENTRY_TYPE, 10170 = Test Entry
    
  4. Save a snapshot of the germplasm list for the trial (ListData_Project - list copy)

    1 insert into listnms to save the list copy name:

     insert into listnms (listdate, listdesc, eDate, listlocn, listref, listname, notes, lhierarchy, program_uuid, projectId, sdate, liststatus, listtype, listuid) values (20160616, 'Test of large germplasm list import', NULL, NULL, 2, '50 maize list test', '', NULL, '36ad2439-9e16-47ae-bf72-355be4c1a764', 25007, NULL, 1, 'TRIAL', 1)

    50 inserts into listdata_project to save list entries like:

     insert into listdata_project (check_type, designation, duplicate_notes, entry_code, entry_id, germplasm_id, group_name, list_id, seed_source) values (10170, 'CML502-123', NULL, '1', 1, 5, 'CML176/CML264', 3, 'AF06A-251-2')
    
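The nd_experiment and nd_experimentprop volumes in category 3 are the obvious batching targets. A hedged sketch (this is not what BMS currently emits): the per-row property inserts could be collapsed into multi-row statements, and the MySQL JDBC driver can perform this rewriting automatically for JDBC batches when rewriteBatchedStatements=true is set:

     insert into nd_experimentprop (nd_experiment_id, rank, type_id, value)
     values (702, 9, 8210, '1'),   -- REP_NO for plot 702
            (702, 10, 8200, '2'),  -- PLOT_NO for plot 702
            (703, 9, 8210, '1'),   -- REP_NO for plot 703
            (703, 10, 8200, '3');  -- PLOT_NO for plot 703 (values illustrative)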

JAMon Timing markers

  • On a Windows environment (quad core, 16GB RAM, on Google Cloud) with the VSNi BVDesign engine: JamonMarkersWindowsWithBVDesign
  • On Mac OS X (2.4 GHz, 8GB RAM) with a mocked design engine (hence the design generation part shows low timings): Timing Markers for Create Mega Trial

Open Trial

Query stats:

  • Total select queries executed: 1,241. Of these, 527 prepare select statements, 617 are actual executions of selects, and 43 are other queries, e.g. connection pool test queries (SELECT 1).
  • Of the 617 selects:
    • 68 selects from nd_experiment, batched at 500 rows at a time, to load the 30,000 observation units.
    • 65 selects from nd_experimentprop to load the experiment properties (design coordinates), batched at 500 experiments per query. There are 60,000 props in total for our mega trial.
    • 65 selects from nd_experiment_phenotype (join table to get to a plot's phenotypes), in batches of 500 phenotype units at a time.
    • 65 selects from nd_experiment_stock (join table to get a plot's germplasm info), in batches of 500 as well. There are 30,000 records in this table for our mega trial.
    • 200 selects from nd_geolocationprop - this can easily be reduced by adding batch loading (see the sketch after this list).
    • 88 selects to cvterm and related tables.
    • Other minor queries against the stock, stockprop, project, projectprop, user, etc. tables.
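As a sketch of the batch loading suggested for nd_geolocationprop (the IDs are illustrative), one IN-clause select per batch of environments would replace the 200 per-environment selects:

     select nd_geolocation_id, type_id, value
     from nd_geolocationprop
     where nd_geolocation_id in (1, 2, 3 /* ... up to 500 ids per batch */);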

JAMon Timing markers

OpenTrialJamonStats

Profiling observations

For a trial with 200 locations (each with one location attribute and one condition), 25 traits, 3 reps, and 50 germplasm entries. One environment containing complete measurements = 150 plots x 25 traits = 3,750 phenotypes.

Backend

  • Average time spent on the backend: 38,505 ms (WorkbookBuilder). Breakdown of time spent:
    • 14,429 ms (~37%) spent in processing to handle the "special" variables that represent Breeding Methods.
    • A large remaining portion is spent retrieving observation units (nd_experiment), germplasm (stock), and the related join tables and generic properties tables. Query volume is directly proportional to the number of observation units, with multiple extra rows per unit stored as metadata in the props tables.

UI

  • Average time to render the page: 1.9 minutes. Most of this time is spent rendering the measurement data table UI.
  • After enabling the deferRender option of the data table, the render time drops to 15 seconds.

Recommendations for discussion

  • Fix the modelling of Breeding Methods as ontology variables (after fixing this, the average backend time drops to ~25 sec).
  • Do not load information about all observation units, traits and phenotypes on screen. Currently this is done to display the entire measurements table, which is a major contributor to the performance issues with large trials.
    • Limit the number of observation units displayed.
    • Consider server-side pagination (see the sketch after this list). Sorting by traits with server-side pagination is going to be extremely complicated with the current schema structure.
    • Only show a certain number of observations and associated phenotypes on screen; beyond a fixed limit, use Excel export to see all the data. This depends on how breeders actually want to interact with and use the whole trial data.
    • Consider a completely different way of visualising the measurement table data.
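To illustrate the server-side pagination option (a sketch only; the trait-sorting caveat above still applies, because phenotype values live in rows rather than columns):

     select nde.nd_experiment_id
     from nd_experiment nde
     inner join nd_experiment_project ep on ep.nd_experiment_id = nde.nd_experiment_id
     where ep.project_id = 25007   -- illustrative plot dataset id
     order by nde.nd_experiment_id
     limit 50 offset 0;            -- first page of 50 observation units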

Import Export Measurements

Import Measurements

The Import Measurements flow shows no significant time spent importing measurements. The bottlenecks occur inside the Save Nursery/Trial section of the product, which is covered in the Create/Open Trial section of this page.

Export Measurements

The BMS is littered with transformations, suggesting that we have

  • a mismatch in the data structures we transpose throughout the application
  • semi-generic structures that need filtering and refining

The Breeding Method codes are found in the Methods tables, but the keys that identify these codes are varied and stored in the Ontology. Switching between ids, names and codes for Breeding Methods at the plot level is costing us dearly in performance, as the lookups sketched below illustrate.
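A hedged sketch of that switching cost (the specific values are illustrative, and we assume the methods table keys shown here): the same breeding method record is reached three different ways depending on whether the plot-level value holds an id, a code, or a name, so each change of representation means another lookup:

     select mid, mcode, mname from methods where mid = 101;                     -- lookup by id
     select mid, mcode, mname from methods where mcode = 'SLF';                 -- lookup by code
     select mid, mcode, mname from methods where mname = 'Self fertilization';  -- lookup by name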

Open Trial Query Improvements

Open trial performance can be further optimised by writing a complex query. Below is a query that retrieves all measurement-related information for a very large trial in a couple of seconds.

    SET SESSION group_concat_max_len = 1000000;

    set @sql = NULL;

SELECT 
    GROUP_CONCAT(DISTINCT 
		CONCAT(
			'Max(IF(cvterm_variable.name = ''',
            cvt.name,
            ''', ph.value, NULL)) AS ',
            cvt.name
		)
	 ) INTO @sql 
FROM
    projectprop pp
        INNER JOIN
    cvterm cvt ON cvt.name = pp.value
WHERE
    type_id = 1808
        AND project_id = (SELECT 
            p.project_id
        FROM
            project_relationship pr
                INNER JOIN
            project p ON p.project_id = pr.subject_project_id
        WHERE
            pr.object_project_id = 2007
                AND name LIKE '%PLOTDATA');
          

SET @sql = CONCAT('SELECT 
    nde.nd_experiment_id,
    gl.description AS TRIAL_INSTANCE,
    (SELECT 
            iispcvt.definition
        FROM
            stockprop isp
                INNER JOIN
            cvterm ispcvt ON ispcvt.cvterm_id = isp.type_id
                INNER JOIN
            cvterm iispcvt ON iispcvt.cvterm_id = isp.value
        WHERE
            isp.stock_id = s.stock_id
                AND ispcvt.name = ''ENTRY_TYPE'') ENTRY_TYPE,
    s.dbxref_id AS GID,
    s.name DESIGNATION,
    s.uniquename ENTRY_NO,
    (SELECT 
            isp.value
        FROM
            stockprop isp
                INNER JOIN
            cvterm ispcvt1 ON ispcvt1.cvterm_id = isp.type_id
        WHERE
            isp.stock_id = s.stock_id
                AND ispcvt1.name = ''SEED_SOURCE'') SEED_SOURCE,
    (SELECT 
            ndep.value
        FROM
            nd_experimentprop ndep
                INNER JOIN
            cvterm ispcvt ON ispcvt.cvterm_id = ndep.type_id
        WHERE
            ndep.nd_experiment_id = ep.nd_experiment_id
                AND ispcvt.name = ''REP_NO'') REP_NO,
    (SELECT 
            ndep.value
        FROM
            nd_experimentprop ndep
                INNER JOIN
            cvterm ispcvt ON ispcvt.cvterm_id = ndep.type_id
        WHERE
            ndep.nd_experiment_id = ep.nd_experiment_id
                AND ispcvt.name = ''PLOT_NO'') PLOT_NO,',@sql,

 '               FROM
    Project p
        INNER JOIN
    project_relationship pr ON p.project_id = pr.subject_project_id
        INNER JOIN
    nd_experiment_project ep ON pr.subject_project_id = ep.project_id
        INNER JOIN
    nd_experiment nde ON nde.nd_experiment_id = ep.nd_experiment_id
        INNER JOIN
    nd_geolocation gl ON nde.nd_geolocation_id = gl.nd_geolocation_id
        INNER JOIN
    nd_experiment_stock es ON ep.nd_experiment_id = es.nd_experiment_id
        INNER JOIN
    Stock s ON s.stock_id = es.stock_id
                LEFT JOIN
    nd_experiment_phenotype neph ON neph.nd_experiment_id = nde.nd_experiment_id
            LEFT JOIN
    phenotype ph ON neph.phenotype_id = ph.phenotype_id
           LEFT JOIN
    cvterm cvterm_variable ON cvterm_variable.cvterm_id = ph.observable_id   
  
WHERE
    p.project_id = (Select p.project_id from project_relationship pr
INNER JOIN project p on p.project_id = pr.subject_project_id
where (pr.object_project_id = 2007 and name LIKE ''%PLOTDATA''))
GROUP BY nde.nd_experiment_id, TRIAL_INSTANCE, ENTRY_TYPE, SEED_SOURCE, REP_NO, PLOT_NO
	
');
                

PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
 

Obviously, this query can be optimised further, but it gives a flavour of what is possible by restructuring our queries. One obvious avenue is sketched below.
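For example (a sketch; we have not verified which of these indexes already exist in the crop schema), covering indexes on the property-lookup columns that the correlated subqueries touch should help:

     create index ndep_experiment_type on nd_experimentprop (nd_experiment_id, type_id);
     create index stockprop_stock_type on stockprop (stock_id, type_id);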

Advance

Advance Nursery Sequence Diagram

Advancing is hampered only by the unique-name checking for germplasm, which most systems require. These candidate names should first be 'allocated' from a Name Registry that can determine up front, for a batch of records, whether a range of names is unique.
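A sketch of what that up-front batch check could look like against the names table (table and column names as in the germplasm schema; the candidate names are illustrative). One set-based query flags every clash for the batch, instead of one uniqueness check per advanced line:

     select nval
     from names
     where nval in ('CML502-B-1', 'CML502-B-2', 'CML502-B-3');  -- candidate names for the batch
     -- any row returned is a candidate name that is already taken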

Germplasm Search

Germplasm Search Sequence Diagram

Key Stats

  • Total germplasm search took 2,513 ms
  • Count query took 50 ms
  • Search for germplasm took 38 ms
  • Pedigree generation took 1,853 ms, i.e. 73% of the total processing time

Recommendations

  1. Optimise pedigree string retrieval by storing the pedigree string (see the sketch after this list).
    • Lazy generation of the pedigree string.
    • A batch processor to update the pedigree string when it is invalidated.
  2. Possible database optimisations by moving the selection history out of the germplasm table.
  3. Alternative approaches to storing the ancestry tree.
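A sketch of recommendation 1 (a hypothetical table, not part of the current schema): store the generated string keyed by GID, with a staleness flag that lazy generation and the batch processor can share:

     create table pedigree_cache (
       gid int not null primary key,      -- references germplsm.gid
       pedigree varchar(255) not null,    -- cached pedigree string
       stale tinyint not null default 0   -- set to 1 when ancestry changes; regenerate lazily or in batch
     );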

Interesting finding

Retrieval of the pedigree tree could have been faster if MySQL supported recursive queries. Unfortunately, MySQL does not. We can work around this using a stored procedure.

# Usage: the standard syntax:
#   WITH RECURSIVE recursive_table AS
#    (initial_SELECT
#     UNION ALL
#     recursive_SELECT)
#   final_SELECT;
# should be translated by you to 
# CALL WITH_EMULATOR(recursive_table, initial_SELECT, recursive_SELECT,
#                    final_SELECT, 0, "").

# ALGORITHM:
# 1) we have an initial table T0 (actual name is an argument
# "recursive_table"), we fill it with result of initial_SELECT.
# 2) We have a union table U, initially empty.
# 3) Loop:
#   add rows of T0 to U,
#   run recursive_SELECT based on T0 and put result into table T1,
#   if T1 is empty
#      then leave loop,
#      else swap T0 and T1 (renaming) and empty T1
# 4) Drop T0, T1
# 5) Rename U to T0
# 6) run final select, send result to client

# This is for *one* recursive table.
# It would be possible to write a SP creating multiple recursive tables.

delimiter |

CREATE PROCEDURE WITH_EMULATOR(
recursive_table varchar(100), # name of recursive table
initial_SELECT text, # seed a.k.a. anchor
recursive_SELECT text, # recursive member
final_SELECT text, # final SELECT on UNION result
max_recursion int unsigned, # safety against infinite loop, use 0 for default
create_table_options text # you can add CREATE-TABLE-time options
# to your recursive_table, to speed up initial/recursive/final SELECTs; example:
# "(KEY(some_column)) ENGINE=MEMORY"
)

BEGIN
  declare new_rows int unsigned;
  declare show_progress int default 0; # set to 1 to trace/debug execution
  declare recursive_table_next varchar(120);
  declare recursive_table_union varchar(120);
  declare recursive_table_tmp varchar(120);
  set recursive_table_next  = concat(recursive_table, "_next");
  set recursive_table_union = concat(recursive_table, "_union");
  set recursive_table_tmp   = concat(recursive_table, "_tmp"); 
  # Cleanup any previous failed runs
  SET @str =
    CONCAT("DROP TEMPORARY TABLE IF EXISTS ", recursive_table, ",",
    recursive_table_next, ",", recursive_table_union,
    ",", recursive_table_tmp);
  PREPARE stmt FROM @str;
  EXECUTE stmt; 
 # If you need to reference recursive_table more than
  # once in recursive_SELECT, remove the TEMPORARY word.
  SET @str = # create and fill T0
    CONCAT("CREATE TEMPORARY TABLE ", recursive_table, " ",
    create_table_options, " AS ", initial_SELECT);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  SET @str = # create U
    CONCAT("CREATE TEMPORARY TABLE ", recursive_table_union, " LIKE ", recursive_table);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  SET @str = # create T1
    CONCAT("CREATE TEMPORARY TABLE ", recursive_table_next, " LIKE ", recursive_table);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  if max_recursion = 0 then
    set max_recursion = 100; # a default to protect the innocent
  end if;
  recursion: repeat
    # add T0 to U (this is always UNION ALL)
    SET @str =
      CONCAT("INSERT INTO ", recursive_table_union, " SELECT * FROM ", recursive_table);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
    # we are done if max depth reached
    set max_recursion = max_recursion - 1;
    if not max_recursion then
      if show_progress then
        select concat("max recursion exceeded");
      end if;
      leave recursion;
    end if;
    # fill T1 by applying the recursive SELECT on T0
    SET @str =
      CONCAT("INSERT INTO ", recursive_table_next, " ", recursive_SELECT);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
    # we are done if no rows in T1
    select row_count() into new_rows;
    if show_progress then
      select concat(new_rows, " new rows found");
    end if;
    if not new_rows then
      leave recursion;
    end if;
    # Prepare next iteration:
    # T1 becomes T0, to be the source of next run of recursive_SELECT,
    # T0 is recycled to be T1.
    SET @str =
      CONCAT("ALTER TABLE ", recursive_table, " RENAME ", recursive_table_tmp);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
    # we use ALTER TABLE RENAME because RENAME TABLE does not support temp tables
    SET @str =
      CONCAT("ALTER TABLE ", recursive_table_next, " RENAME ", recursive_table);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
    SET @str =
      CONCAT("ALTER TABLE ", recursive_table_tmp, " RENAME ", recursive_table_next);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
    # empty T1
    SET @str =
      CONCAT("TRUNCATE TABLE ", recursive_table_next);
    PREPARE stmt FROM @str;
    EXECUTE stmt;
  until 0 end repeat;
  # eliminate T0 and T1
  SET @str =
    CONCAT("DROP TEMPORARY TABLE ", recursive_table_next, ", ", recursive_table);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  # Final (output) SELECT uses recursive_table name
  SET @str =
    CONCAT("ALTER TABLE ", recursive_table_union, " RENAME ", recursive_table);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  # Run final SELECT on UNION
  SET @str = final_SELECT;
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  # No temporary tables may survive:
  SET @str =
    CONCAT("DROP TEMPORARY TABLE ", recursive_table);
  PREPARE stmt FROM @str;
  EXECUTE stmt;
  # We are done :-)
END|

delimiter ;

Example usage with the germplasm table

CALL WITH_EMULATOR (
	"germplsm_recursive",
	"
	  SELECT *
	  FROM germplsm g 
	  where g.gid = 500012",
	"
	  SELECT kid.* from germplsm kid
	  INNER join germplsm p ON p.gpid1 = kid.gid or p.gpid2 = kid.gid
	  where p.gid in (Select gid from germplsm_recursive) 
	",
	"
	  SELECT *
	  FROM germplsm_recursive;
	",
	10000000,
	"ENGINE=MEMORY"
) ;

This workaround would have been great, but it does not work in a transactional system, as the creation of temporary tables effectively kills the transaction.