Evaluation

We run multiple experiments to compare our preprocessing methods and evaluate whether they improve the performance of our models, measured as AUC on the validation and test sets.
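
To make the comparisons below concrete, here is a minimal sketch of how per-label and mean AUC could be computed for a single run. The pathology names and the `y_true`/`y_pred` arrays are placeholder assumptions for illustration, not values from our training pipeline.

```python
# Sketch: per-label and mean AUC for a multi-label classifier (assumed setup).
import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed placeholder data: ground-truth labels and predicted probabilities
# for 100 samples and a few example pathologies (hypothetical names).
labels = ["Cardiomegaly", "Edema", "Consolidation", "Atelectasis", "Pleural Effusion"]
y_true = np.random.randint(0, 2, size=(100, len(labels)))  # 0/1 ground truth
y_pred = np.random.rand(100, len(labels))                  # predicted probabilities

# Per-label AUC, then the unweighted mean over all labels.
per_label_auc = {name: roc_auc_score(y_true[:, i], y_pred[:, i])
                 for i, name in enumerate(labels)}
mean_auc = np.mean(list(per_label_auc.values()))

for name, auc in per_label_auc.items():
    print(f"{name}: {auc:.3f}")
print(f"mean AUC: {mean_auc:.3f}")
```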

Augmentation:
  • we obtain better results without augmentation
  • interestingly, augmentation yields a better validation AUC but a worse test AUC
  • a possible explanation: in the test set all images are centered and have no blank borders, whereas the training set (and therefore the validation set split from it) contains unclean images
  • we were able to confirm the negative effect of augmentation by reimplementing the JF Healthcare architecture, where models trained with augmentation also performed worse (a sketch of a typical augmentation setup follows after this list)
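
As a reference for what "augmentation" means here, below is a minimal sketch of an image-augmentation setup using Keras' ImageDataGenerator. The specific transforms and parameter values are illustrative assumptions, not the exact configuration used in our training runs.

```python
# Sketch: a typical augmentation configuration for chest X-ray training images.
# The transform choices and ranges are assumptions for illustration only.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenting_datagen = ImageDataGenerator(
    rotation_range=10,        # small random rotations
    width_shift_range=0.05,   # slight horizontal shifts
    height_shift_range=0.05,  # slight vertical shifts
    zoom_range=0.1,           # mild zoom in/out
    horizontal_flip=True,     # mirror images left/right
    rescale=1.0 / 255,        # scale pixel values to [0, 1]
)

# No-augmentation baseline: only rescaling, no random transforms.
plain_datagen = ImageDataGenerator(rescale=1.0 / 255)
```

Random shifts and zooms move image content relative to the border, which is one plausible reason why augmented models generalize worse to the centered, border-free test images mentioned above.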
Upsampling:
  • clearly yields better performance than training without it (see the sketch below)
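
The term is ambiguous from this page alone; assuming "upsampling" refers to oversampling under-represented positive samples in the training split (rather than increasing image resolution), a minimal pandas-based sketch could look like the following. The dataframe layout and column names are assumptions.

```python
# Sketch: upsampling under-represented positive samples in a training dataframe.
# Assumes a hypothetical dataframe with one binary column per pathology.
import pandas as pd

def upsample_positives(df: pd.DataFrame, label: str, target_ratio: float = 0.5,
                       seed: int = 42) -> pd.DataFrame:
    """Duplicate positive rows for `label` until they make up `target_ratio` of the data."""
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0]
    # Number of positive rows needed to reach the target ratio.
    n_needed = int(target_ratio * len(negatives) / (1 - target_ratio))
    if len(positives) == 0 or n_needed <= len(positives):
        return df
    extra = positives.sample(n=n_needed - len(positives), replace=True, random_state=seed)
    return pd.concat([df, extra]).sample(frac=1, random_state=seed).reset_index(drop=True)

# Hypothetical usage on a toy dataframe:
train_df = pd.DataFrame({"path": [f"img_{i}.jpg" for i in range(10)],
                         "Edema": [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]})
balanced_df = upsample_positives(train_df, label="Edema")
print(balanced_df["Edema"].value_counts())
```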
Uncertainty Encodings:
  • no comparison possible yet
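
For context, "uncertainty encodings" here presumably refers to how uncertain labels (commonly stored as -1 in CheXpert-style label files) are mapped before training. A minimal sketch of two common strategies, often called U-zeros and U-ones, follows; whether these are exactly the encodings we will compare is an assumption.

```python
# Sketch: two common encodings for uncertain (-1) labels in CheXpert-style data.
# Column names and the -1 convention are assumptions for illustration.
import pandas as pd

def encode_uncertain(df: pd.DataFrame, label_columns: list, strategy: str = "zeros") -> pd.DataFrame:
    """Map uncertain labels (-1) to 0 ("zeros") or 1 ("ones")."""
    encoded = df.copy()
    replacement = 0 if strategy == "zeros" else 1
    encoded[label_columns] = encoded[label_columns].replace(-1, replacement)
    return encoded

# Hypothetical usage:
labels_df = pd.DataFrame({"Edema": [1, -1, 0], "Consolidation": [-1, 0, 1]})
u_zeros = encode_uncertain(labels_df, ["Edema", "Consolidation"], strategy="zeros")
u_ones = encode_uncertain(labels_df, ["Edema", "Consolidation"], strategy="ones")
```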
Best Metric:
  • no comparison possible yet