Shifu 0.2.5 Stats Step Scalability Improvement

Stats in Shifu 0.2.4

The default stats algorithm in Shifu 0.2.4 is 'SPDT'. On large data sets, such as 100MM records with 1800 variables, the stats job fails because the last Hadoop job does not scale out well.

Stats in Shifu 0.2.5

In Shifu 0.2.5, the new stats algorithm 'SPDTI' scales much better. For 22MM records, the running time drops from 50 minutes in Shifu 0.2.4 to 20 minutes in Shifu 0.2.5. A data set of 100MM records with 1800 variables has also been tested, and its running time is only 30 minutes.
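The reported timings can be turned into a speedup and a rough throughput figure. The sketch below is only arithmetic on the numbers quoted above. The helper names `speedup` and `records_per_minute` are made up for illustration and are not part of Shifu. It also assumes the 22MM-record runs used the same data and cluster in both versions, which the page does not state.

```python
# Illustrative arithmetic on the timings quoted on this page.
# Helper names are hypothetical; nothing here calls Shifu itself.

def speedup(old_minutes, new_minutes):
    """Ratio of the old running time to the new running time."""
    return old_minutes / new_minutes


def records_per_minute(records_mm, minutes):
    """Throughput in millions of records (MM) per minute."""
    return records_mm / minutes


# 22MM records: 50 minutes in Shifu 0.2.4, 20 minutes in Shifu 0.2.5.
spdt_throughput = records_per_minute(22, 50)
spdti_throughput = records_per_minute(22, 20)

# 100MM records with 1800 variables: 30 minutes in Shifu 0.2.5.
large_throughput = records_per_minute(100, 30)

print(f"Speedup on 22MM records: {speedup(50, 20):.1f}x")
print(f"0.2.4 throughput: {spdt_throughput:.2f} MM records/min")
print(f"0.2.5 throughput: {spdti_throughput:.2f} MM records/min")
print(f"0.2.5 throughput on 100MM records: {large_throughput:.2f} MM records/min")
```

The two data sets may differ in variable count and cluster size, so the throughput on 100MM records is not directly comparable to the 22MM figures. It only shows that the larger job, which failed under 'SPDT', now completes.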