Experiments on MNIST

Setup

[01-06-2017] We have performed some experiments on MNIST to evaluate the impact of various components:

  • model: do more layers help? is a convolutional network more appropriate than an MLP?
  • trainer: what is the variance of the test error across all supported batch & stochastic optimizers?
  • loss function: is the negative log-likelihood loss better than the logistic loss?
  • iterator: does adding noise or randomly warping the training samples improve performance?
  • activation functions: is pwave actually better than snorm or tanh?

Each configuration was repeated 10 times; the tables below report the average, the deviation and the best and worst test results, formatted as mean+/-deviation[min,max]. More details can be found by examining the Python scripts in scripts/experiment_mnist_*.py.
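
As an illustration, such a summary can be computed from the per-run results as follows (a minimal numpy sketch; the summarize helper and the sample values are ours for illustration, not part of the library, and whether the deviation is the sample standard deviation is an assumption):

import numpy as np

def summarize(values):
    # report repeated runs as mean+/-deviation[min,max]
    v = np.asarray(values, dtype=float)
    return "{:.4f}+/-{:.4f}[{:.4f},{:.4f}]".format(
        v.mean(), v.std(ddof=1), v.min(), v.max())

# e.g. the test errors of 10 runs of the same configuration (made-up values)
print(summarize([0.0149, 0.0175, 0.0159, 0.0161, 0.0155,
                 0.0165, 0.0152, 0.0158, 0.0163, 0.0156]))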

To reproduce these results, check out this version of the library:

./apps/info --git-hash 
6f09d99d86ece82e65d19a63f192cfb2acdf15e2
./apps/info --version
0.4

Model evaluation

cd scripts && python3 experiment_mnist_models.py
|----------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| model    | trainer        | iterator | loss     | test value                     | test error                     | epochs                            | convergence speed              | duration (sec)                                |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| convnet1 | stoch_adadelta | default  | classnll | 0.0491+/-0.0032[0.0470,0.0577] | 0.0159+/-0.0008[0.0149,0.0175] | 24.6000+/-1.6465[22.0000,27.0000] | 0.9374+/-0.0045[0.9261,0.9419] | 3603.3422+/-2.5635[3600.2790,3607.5740]       |
| convnet2 | stoch_adadelta | default  | classnll | 0.0337+/-0.0022[0.0310,0.0388] | 0.0111+/-0.0008[0.0102,0.0125] | 34.4000+/-6.5013[25.0000,45.0000] | 0.9766+/-0.0014[0.9751,0.9790] | 19107.9697+/-2423.3408[14454.9850,21646.6260] |
| convnet3 | stoch_adadelta | default  | classnll | 0.0336+/-0.0025[0.0288,0.0367] | 0.0108+/-0.0005[0.0100,0.0115] | 28.0000+/-4.1366[22.0000,34.0000] | 0.9751+/-0.0016[0.9725,0.9779] | 19826.0114+/-1886.7604[18019.0510,21635.0370] |
| convnet4 | stoch_adadelta | default  | classnll | 0.0371+/-0.0015[0.0348,0.0393] | 0.0114+/-0.0008[0.0099,0.0123] | 26.9000+/-5.4863[19.0000,39.0000] | 0.9734+/-0.0032[0.9669,0.9793] | 21625.9774+/-2394.2700[18033.3940,25250.4140] |
| mlp0     | stoch_adadelta | default  | classnll | 0.2688+/-0.0016[0.2659,0.2721] | 0.0750+/-0.0006[0.0740,0.0760] | 67.6000+/-9.0946[56.0000,84.0000] | 0.9743+/-0.0012[0.9725,0.9771] | 26.0135+/-1.7388[24.7220,30.2830]             |
| mlp1     | stoch_adadelta | default  | classnll | 0.0703+/-0.0030[0.0656,0.0760] | 0.0217+/-0.0009[0.0202,0.0229] | 53.0000+/-4.1899[46.0000,60.0000] | 0.9893+/-0.0003[0.9889,0.9898] | 29557.7306+/-2265.5117[25259.3040,32439.7630] |
| mlp2     | stoch_adadelta | default  | classnll | 0.0705+/-0.0022[0.0669,0.0748] | 0.0219+/-0.0010[0.0208,0.0242] | 32.9000+/-2.4698[28.0000,36.0000] | 0.9839+/-0.0004[0.9834,0.9847] | 26306.1946+/-1723.9888[25207.5390,28809.3390] |
| mlp3     | stoch_adadelta | default  | classnll | 0.0753+/-0.0030[0.0702,0.0801] | 0.0221+/-0.0014[0.0203,0.0243] | 28.0000+/-6.6165[18.0000,39.0000] | 0.9831+/-0.0033[0.9804,0.9894] | 29182.4965+/-3948.8050[25202.3350,36001.8940] |
| mlp4     | stoch_adadelta | default  | classnll | 0.0819+/-0.0053[0.0726,0.0912] | 0.0236+/-0.0012[0.0210,0.0252] | 24.5000+/-5.0166[19.0000,35.0000] | 0.9831+/-0.0038[0.9794,0.9901] | 31705.7371+/-2836.3995[28823.5250,36056.7990] |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|

Conclusion

  • convolutional networks perform much better than networks of fully-connected layers (even a single convolution layer greatly decreases the error rate)
  • this is expected for image classification tasks, where adjacent pixels are correlated and convolutions exploit this local structure
  • convolutional networks also use far fewer parameters, so overfitting is less of a problem (see the parameter-count sketch below)
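
To make the parameter-count argument concrete, here is a back-of-the-envelope comparison; the layer shapes below are hypothetical, chosen only to illustrate the point, and do not correspond to the exact convnet/mlp architectures evaluated above:

# weights of a convolution layer: 64 output maps, 32 input maps, 3x3 kernels
conv_params = 64 * 32 * 3 * 3                   # = 18432
# weights of a fully-connected layer joining activation volumes of similar size
fc_params = (32 * 26 * 26) * (64 * 24 * 24)     # = 797442048
print(conv_params, fc_params)

Weight sharing across spatial positions is what keeps the convolution's parameter count independent of the image size.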

Trainer evaluation

cd scripts && python3 experiment_mnist_trainers.py
|----------|----------------|----------|----------|--------------------------------|--------------------------------|--------------------------------------|--------------------------------|--------------------------------------------------|
| model    | trainer        | iterator | loss     | test value                     | test error                     | epochs                               | convergence speed              | duration (sec)                                   |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|--------------------------------------|--------------------------------|--------------------------------------------------|
| convnet4 | batch_cgd      | default  | classnll | 0.0592+/-0.0084[0.0482,0.0781] | 0.0187+/-0.0021[0.0162,0.0237] | 99.9000+/-0.3162[99.0000,100.0000]   | 0.9943+/-0.0004[0.9935,0.9949] | 40350.8240+/-1505.0732[39614.7320,43207.6710]    |
| convnet4 | batch_gd       | default  | classnll | 0.2417+/-0.0530[0.1803,0.3419] | 0.0666+/-0.0131[0.0503,0.0879] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9974+/-0.0004[0.9966,0.9980] | 114503.4713+/-33112.1426[79210.1300,194400.7050] |
| convnet4 | batch_lbfgs    | default  | classnll | 0.0520+/-0.0058[0.0478,0.0668] | 0.0151+/-0.0016[0.0138,0.0185] | 93.7000+/-5.8699[82.0000,100.0000]   | 0.9838+/-0.0016[0.9819,0.9862] | 18029.9128+/-13.0436[18000.3170,18051.6930]      |
| convnet4 | stoch_adadelta | default  | classnll | 0.0363+/-0.0031[0.0311,0.0404] | 0.0109+/-0.0009[0.0099,0.0127] | 29.2000+/-3.4897[24.0000,36.0000]    | 0.9753+/-0.0021[0.9718,0.9787] | 22714.0031+/-1744.1755[21603.8390,25254.0210]    |
| convnet4 | stoch_adagrad  | default  | classnll | 0.0378+/-0.0021[0.0340,0.0416] | 0.0120+/-0.0008[0.0109,0.0134] | 75.5000+/-13.3437[50.0000,97.0000]   | 0.9936+/-0.0004[0.9927,0.9942] | 42489.0094+/-3714.9598[32417.4480,46805.4240]    |
| convnet4 | stoch_adam     | default  | classnll | 0.0422+/-0.0032[0.0352,0.0457] | 0.0132+/-0.0010[0.0115,0.0146] | 47.4000+/-3.8930[42.0000,54.0000]    | 0.9887+/-0.0005[0.9877,0.9896] | 72394.8855+/-4305.2067[68429.0480,79259.5130]    |
| convnet4 | stoch_ag       | default  | classnll | 0.1167+/-0.0104[0.1052,0.1357] | 0.0343+/-0.0037[0.0300,0.0416] | 4.5000+/-0.8498[3.0000,6.0000]       | 0.9913+/-0.0019[0.9881,0.9938] | 3638.4053+/-4.3553[3633.0380,3645.4210]          |
| convnet4 | stoch_agfr     | default  | classnll | 0.0364+/-0.0020[0.0331,0.0403] | 0.0113+/-0.0006[0.0105,0.0122] | 27.6000+/-2.9515[24.0000,32.0000]    | 0.9835+/-0.0011[0.9816,0.9848] | 23053.2341+/-1852.6145[21603.0270,25212.4940]    |
| convnet4 | stoch_aggr     | default  | classnll | 0.0359+/-0.0024[0.0325,0.0399] | 0.0114+/-0.0008[0.0100,0.0124] | 49.5000+/-5.3800[42.0000,56.0000]    | 0.9881+/-0.0009[0.9869,0.9897] | 28834.9832+/-2932.8482[25240.9690,32432.8320]    |
| convnet4 | stoch_asgd     | default  | classnll | 0.1052+/-0.0063[0.0981,0.1170] | 0.0300+/-0.0020[0.0273,0.0333] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9978+/-0.0001[0.9977,0.9979] | 54372.4635+/-1138.1578[54002.7710,57611.6330]    |
| convnet4 | stoch_ngd      | default  | classnll | 0.1351+/-0.0074[0.1244,0.1489] | 0.0381+/-0.0028[0.0343,0.0437] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9977+/-0.0001[0.9976,0.9978] | 43226.4358+/-7.9474[43217.3190,43247.1510]       |
| convnet4 | stoch_rmsprop  | default  | classnll | 0.0412+/-0.0032[0.0371,0.0466] | 0.0128+/-0.0009[0.0117,0.0150] | 49.3000+/-8.3938[44.0000,72.0000]    | 0.9896+/-0.0011[0.9882,0.9920] | 72393.4876+/-4302.8101[68443.3620,82821.4090]    |
| convnet4 | stoch_sg       | default  | classnll | 0.1039+/-0.0047[0.0963,0.1118] | 0.0298+/-0.0017[0.0269,0.0322] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9978+/-0.0001[0.9977,0.9979] | 43230.1251+/-9.8925[43221.7790,43251.2870]       |
| convnet4 | stoch_sgm      | default  | classnll | 0.1083+/-0.0059[0.1006,0.1184] | 0.0308+/-0.0013[0.0284,0.0327] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9979+/-0.0001[0.9978,0.9980] | 54378.1816+/-1146.8653[54000.4740,57641.9900]    |
| convnet4 | stoch_svrg     | default  | classnll | 0.1043+/-0.0044[0.0983,0.1095] | 0.0303+/-0.0013[0.0283,0.0321] | 100.0000+/-0.0000[100.0000,100.0000] | 0.9984+/-0.0000[0.9984,0.9985] | 65538.8344+/-2279.6746[64810.7410,72026.8760]    |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|--------------------------------------|--------------------------------|--------------------------------------------------|

Conclusion

  • stochastic optimization algorithms generally converge faster and to better solutions (e.g. AdaDelta reaches a 1% error rate in about 30 epochs, while L-BFGS needs around 100 epochs to reach a 1.5% error rate)
  • the variance of the error rate is large for most methods, presumably due to the non-convexity of the loss; this does not explain, however, why the variance is relatively small for some methods such as AdaDelta
  • AdaDelta produces the best average error rate, with one of the smallest variances and the fastest convergence (a sketch of its update rule follows this list)
  • the second best results are obtained with the gradient-restart (AGGR) and function-value-restart (AGFR) variants of Nesterov's accelerated gradient descent (AG)
  • these results need more analysis
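
For reference, the AdaDelta update rule (Zeiler, 2012) can be sketched as follows; this is a generic numpy formulation, not the library's implementation:

import numpy as np

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    # one AdaDelta update: no learning rate, two running RMS accumulators
    Eg2, Edx2 = state                        # running averages of grad^2 and dx^2
    Eg2 = rho * Eg2 + (1 - rho) * grad**2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx**2
    return x + dx, (Eg2, Edx2)

Note that AdaDelta has no global learning rate to tune, which may partly explain the small run-to-run variance observed above.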

Loss function evaluation

cd scripts && python3 experiment_mnist_losses.py
|----------|----------------|----------|--------------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| model    | trainer        | iterator | loss         | test value                     | test error                     | epochs                            | convergence speed              | duration (sec)                                |
|----------|----------------|----------|--------------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| convnet4 | stoch_adadelta | default  | classnll     | 0.0357+/-0.0023[0.0310,0.0385] | 0.0112+/-0.0008[0.0096,0.0124] | 23.6000+/-4.0332[17.0000,30.0000] | 0.9725+/-0.0016[0.9698,0.9755] | 20183.8252+/-1849.6577[18018.5640,21643.8510] |
| convnet4 | stoch_adadelta | default  | sexponential | 0.5650+/-0.1425[0.4287,0.8851] | 0.0229+/-0.0058[0.0176,0.0360] | 10.4000+/-4.1687[4.0000,17.0000]  | 0.9702+/-0.0066[0.9570,0.9779] | 15147.5868+/-1509.9022[14403.7450,18017.6040] |
| convnet4 | stoch_adadelta | default  | slogistic    | 0.0733+/-0.0054[0.0661,0.0838] | 0.0097+/-0.0010[0.0083,0.0111] | 31.3000+/-4.1110[27.0000,37.0000] | 0.9766+/-0.0015[0.9746,0.9787] | 23421.1597+/-1893.0941[21622.0450,25233.2690] |
|----------|----------------|----------|--------------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|

Conclusion

  • the logistic loss produces significantly better error rates than either the exponential loss or the class negative log-likelihood loss
  • why?! this needs further investigation (the three losses are sketched below for reference)
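
For reference, plausible definitions of the three losses for a K-class problem with targets coded as +/-1 per class; these are the usual textbook formulations and may differ in detail from the library's implementations:

import numpy as np

def classnll(scores, targets):
    # negative log-likelihood of a softmax over the class scores
    return np.log(np.exp(scores).sum()) - scores[targets > 0].sum()

def slogistic(scores, targets):
    # sum of per-class logistic losses
    return np.log1p(np.exp(-targets * scores)).sum()

def sexponential(scores, targets):
    # sum of per-class exponential losses
    return np.exp(-targets * scores).sum()

scores = np.array([+2.0, -1.0, -0.5])     # class scores for one sample
targets = np.array([+1.0, -1.0, -1.0])    # the first class is the correct one
print(classnll(scores, targets), slogistic(scores, targets), sexponential(scores, targets))

One hypothesis worth checking: the exponential loss amplifies the gradient of badly misclassified samples exponentially, while the logistic loss bounds it, making the latter less sensitive to difficult or noisy samples.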

Iterator evaluation

cd scripts && python3 experiment_mnist_iterators.py
|----------|----------------|----------|----------|--------------------------------|--------------------------------|------------------------------------|--------------------------------|-----------------------------------------------|
| model    | trainer        | iterator | loss     | test value                     | test error                     | epochs                             | convergence speed              | duration (sec)                                |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|------------------------------------|--------------------------------|-----------------------------------------------|
| convnet4 | stoch_adadelta | default  | classnll | 0.0355+/-0.0024[0.0315,0.0391] | 0.0109+/-0.0008[0.0101,0.0126] | 25.6000+/-3.5024[19.0000,31.0000]  | 0.9737+/-0.0016[0.9695,0.9751] | 20552.3711+/-1729.0942[18030.7770,21656.0240] |
| convnet4 | stoch_adadelta | noise05  | classnll | 0.0352+/-0.0026[0.0316,0.0400] | 0.0108+/-0.0008[0.0098,0.0123] | 28.2000+/-4.0770[20.0000,35.0000]  | 0.9746+/-0.0018[0.9717,0.9770] | 21995.4170+/-1133.3705[21606.2300,25220.7970] |
| convnet4 | stoch_adadelta | noise10  | classnll | 0.0371+/-0.0025[0.0327,0.0405] | 0.0114+/-0.0007[0.0104,0.0124] | 29.9000+/-6.7404[19.0000,43.0000]  | 0.9757+/-0.0030[0.9713,0.9820] | 23066.1760+/-3029.6966[18034.2240,28824.1490] |
| convnet4 | stoch_adadelta | noise20  | classnll | 0.0330+/-0.0027[0.0298,0.0381] | 0.0102+/-0.0006[0.0092,0.0109] | 32.4000+/-4.9261[24.0000,40.0000]  | 0.9794+/-0.0006[0.9785,0.9805] | 24145.6745+/-2416.2109[21626.0290,28808.1970] |
| convnet4 | stoch_adadelta | noise50  | classnll | 0.0299+/-0.0021[0.0257,0.0322] | 0.0090+/-0.0008[0.0079,0.0106] | 71.6000+/-12.1582[55.0000,95.0000] | 0.9896+/-0.0004[0.9891,0.9905] | 41063.0737+/-3852.7101[32454.9300,46803.6670] |
| convnet4 | stoch_adadelta | noise99  | classnll | 0.0326+/-0.0020[0.0293,0.0355] | 0.0106+/-0.0006[0.0092,0.0114] | 97.8000+/-2.8206[90.0000,100.0000] | 0.9944+/-0.0002[0.9939,0.9946] | 43565.7419+/-1137.1512[43202.4660,46802.1210] |
| convnet4 | stoch_adadelta | warp     | classnll | 0.0247+/-0.0024[0.0218,0.0301] | 0.0076+/-0.0008[0.0062,0.0088] | 83.1000+/-9.6200[60.0000,96.0000]  | 0.9898+/-0.0006[0.9889,0.9908] | 44313.1111+/-2411.5104[39631.1780,46814.7200] |
|----------|----------------|----------|----------|--------------------------------|--------------------------------|------------------------------------|--------------------------------|-----------------------------------------------|

Conclusion

  • adding noise to the training images does not improve the error rate much (only noise50 gives a small gain, from roughly 1.1% to 0.9%, at the cost of roughly three times more epochs)
  • randomly warping the training images (following http://leon.bottou.org/projects/infimnist) results in much better performance, as the error rate decreases from roughly 1.1% to 0.76% (a sketch of both augmentations follows)
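
A minimal sketch of the two augmentations, assuming grayscale images with pixel values in [0, 1]; the noise level and warp parameters are illustrative, and the library's random warping (infimnist-style) also includes translations and other deformations not shown here:

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def add_noise(image, sigma=0.10, rng=None):
    # additive Gaussian pixel noise, clipped back to the valid [0, 1] range
    rng = np.random.default_rng() if rng is None else rng
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def random_warp(image, alpha=2.0, sigma=4.0, rng=None):
    # elastic-style warp: move pixels along a smoothed random displacement field
    rng = np.random.default_rng() if rng is None else rng
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing="ij")
    return map_coordinates(image, [ys + dy, xs + dx], order=1, mode="nearest")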

Activation function evaluation

cd scripts && python3 experiment_mnist_activations.py
|--------------------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| model              | trainer        | iterator | loss     | test value                     | test error                     | epochs                            | convergence speed              | duration (sec)                                |
|--------------------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|
| convnet4_act_pwave | stoch_adadelta | default  | classnll | 0.0324+/-0.0022[0.0277,0.0353] | 0.0101+/-0.0007[0.0092,0.0114] | 23.7000+/-3.5917[18.0000,29.0000] | 0.9879+/-0.0015[0.9857,0.9910] | 46831.0034+/-6784.5308[39631.1760,61243.2470] |
| convnet4_act_sin   | stoch_adadelta | default  | classnll | 0.0401+/-0.0030[0.0367,0.0451] | 0.0120+/-0.0009[0.0110,0.0135] | 17.5000+/-3.4075[15.0000,23.0000] | 0.9868+/-0.0013[0.9851,0.9895] | 47185.8255+/-3583.7582[43200.1500,54016.6260] |
| convnet4_act_snorm | stoch_adadelta | default  | classnll | 0.0369+/-0.0023[0.0322,0.0400] | 0.0107+/-0.0011[0.0094,0.0127] | 31.4000+/-6.0773[26.0000,47.0000] | 0.9893+/-0.0019[0.9869,0.9920] | 57635.9678+/-9141.5537[46859.7240,72046.1430] |
| convnet4_act_tanh  | stoch_adadelta | default  | classnll | 0.0369+/-0.0025[0.0345,0.0434] | 0.0113+/-0.0009[0.0105,0.0135] | 25.9000+/-2.6854[23.0000,31.0000] | 0.9883+/-0.0018[0.9857,0.9918] | 51866.6445+/-8182.5230[43210.2730,72053.8780] |
|--------------------|----------------|----------|----------|--------------------------------|--------------------------------|-----------------------------------|--------------------------------|-----------------------------------------------|

Conclusion

  • the pwave and snorm activation functions produce significantly better results than tanh or sin
  • a possible explanation is that snorm and pwave do not saturate as fast as tanh
  • another possible explanation is that pwave reduces the L2-norm of the parameters, so overfitting is delayed
  • these results need more analysis (a numerical comparison of the activations follows)
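
To see the saturation argument numerically, one can tabulate the activations and note how quickly they flatten. The exact definitions of snorm and pwave live in the library's source; below, snorm is assumed to be x/sqrt(1+x^2), whose gradient decays only polynomially (versus exponentially for tanh), and pwave is a placeholder guess, both labeled as assumptions:

import numpy as np

x = np.linspace(-6.0, 6.0, 7)
print("x     ", x)
print("tanh  ", np.tanh(x))                # saturates to +/-1 exponentially fast
print("sin   ", np.sin(x))                 # periodic, never saturates
print("snorm ", x / np.sqrt(1 + x**2))     # assumed definition: x/sqrt(1+x^2)
print("pwave ", x / (1 + x**2))            # placeholder guess; check the source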