Things to watch out for when using deep learning

Deep learning has provided the world of data science with highly effective tools that can address problems in virtually any domain, and using nearly any kind of data. However, the non-intuitive features deduced and used by deep learning algorithms require a very careful experimental design, and a failure to meet that requirement can lead to miserably flawed results, regardless of the quality of the data or the structure of the deep learning network.

I first noticed such flaws almost ten years ago, when I applied algorithms that used non-intuitive features for the purpose of automatic face recognition. With the most common face recognition benchmarks of that time (FERET, ORL, YaleB, JAFFE, and others), the algorithms could identify the correct face even when given just a small, seemingly blank part of the background. This was normally a small sub-image from the top-left corner of the original image that did not contain any part of the face, hair, clothes, or anything else that could allow the recognition of a person (1).

I ran the experiments as they were intended, but instead of using the full face images I used a very small part of the background taken from the top-left corner of each image. The algorithms were able to identify the faces with very high accuracy, sometimes as high as 100%, even though no faces were in the images that were analyzed. In other words, the algorithms performed face recognition without faces.

The top left 100×100 pixels from the face images of the first 10 subjects of the FERET face recognition dataset. No face, hair or clothes were in the images, but algorithms could still recognize the “face” (1).

Face recognition without a face is clearly not possible, which means that something in the experimental design must have gone wrong. The source of the problem was probably the data acquisition process: for the convenience of the human subjects, the set of photos of each person was acquired in one batch. Therefore, subtle changes in the lighting conditions, the position of the camera, or even the temperature of the CCD when the photo was taken could lead to differences that might not be noticeable to the naked eye. Deep learning algorithms can still identify these differences and use them to classify the images, providing very good face recognition accuracy without any proof that the images are being classified by the faces, or that the network can indeed recognize faces. Needless to say, thousands of scientific papers were published based on these datasets.
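The effect is easy to reproduce with synthetic data. The sketch below is not the method used in (1); it is a minimal illustration that assumes each subject was photographed in one session and that each session adds its own small brightness offset to the whole frame. The images contain no face at all, only noise plus that offset, and the offset size, patch size, and nearest-centroid classifier are all arbitrary choices made for this example.

```python
import random


def crop_top_left(image, size):
    """Return the size x size top-left corner of a 2D image (list of rows)."""
    return [row[:size] for row in image[:size]]


def patch_mean(patch):
    """Average pixel value of a patch; the only feature used here."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels)


def make_blank_backgrounds(rng, n_subjects=10, shots=6, side=32):
    """Synthetic 'photos' with no face: sensor noise plus a per-session offset.

    Assumption for illustration: one session per subject, and each session
    shifts the brightness of the whole frame by a fixed amount.
    """
    data = []
    for subject in range(n_subjects):
        session_offset = 5.0 * subject  # hypothetical lighting difference
        for _ in range(shots):
            image = [[session_offset + rng.gauss(0.0, 1.0) for _ in range(side)]
                     for _ in range(side)]
            data.append((image, subject))
    return data


def background_only_accuracy(data, patch_size=8):
    """Train and test a nearest-centroid classifier on corner patches only."""
    features = [(patch_mean(crop_top_left(img, patch_size)), label)
                for img, label in data]
    train, test = features[0::2], features[1::2]

    # One centroid (mean feature value) per subject, from training patches.
    sums = {}
    for value, label in train:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}

    correct = 0
    for value, label in test:
        predicted = min(centroids, key=lambda c: abs(centroids[c] - value))
        if predicted == label:
            correct += 1
    return correct / len(test)


rng = random.Random(0)
accuracy = background_only_accuracy(make_blank_backgrounds(rng))
print(f"accuracy on face-free corner patches: {accuracy:.2f} (chance is 0.10)")
```

If a classifier scores far above chance on patches like these, the benchmark result says more about the acquisition sessions than about faces.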

Similar observations were also made with automatic object recognition datasets, a field in which deep learning has demonstrated significant improvements on benchmarks such as ImageNet. Using just a very small part of each image, too small to allow the recognition of the object or scene, led to very good automatic classification accuracy on many of the common object recognition datasets (2).

The 20×20 pixel sub-images from the bottom-right corner of the first five objects of the NEC Animals dataset. There is no information in the images that can identify an animal, but algorithms were still able to classify the images correctly (2).

The same happened not just with image data, but also with audio data (3). Experiments in automatic accent identification using non-intuitive features were replicated with very high accuracy when using just the first 0.5 seconds of each recording sample. That 0.5 seconds does not contain any audible information, yet because of the background noise the algorithms could identify the correct “accent” without any accent information in the recording.

While these experiments were designed by computer scientists and engineers, one might assume that biologists are more experienced in sound experimental design. However, a quick look into some of the most basic experiments in bioimage informatics shows the exact same problem: experiments in automatic analysis of microscopy images of cells could be replicated after removing all cells from the images. Again, the experiments led to the same results regardless of the cells in the images, showing that the analysis is driven by the background and not by the biological content (4).

Whether the images contained cells or just white rectangles, automatic cell recognition algorithms provided virtually identical results.
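A control of this kind can be sketched in a few lines: overwrite the region that holds the content of interest with a blank rectangle, then rerun the same analysis on the masked images. The function below is a hypothetical helper written for this illustration, not the code used in (4), and it assumes the bounding box of the content is already known.

```python
def mask_region(image, top, left, height, width, fill=255):
    """Return a copy of a 2D image with one rectangle overwritten by `fill`.

    `image` is a list of rows of pixel values; the original is left untouched.
    The rectangle is clipped to the image borders.
    """
    masked = [list(row) for row in image]
    for r in range(top, min(top + height, len(masked))):
        for c in range(left, min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


# Toy 4x4 "micrograph": bright pixels in the middle stand in for a cell.
image = [
    [10, 10, 10, 10],
    [10, 200, 180, 10],
    [10, 190, 210, 10],
    [10, 10, 10, 10],
]
masked = mask_region(image, top=1, left=1, height=2, width=2)
for row in masked:
    print(row)
```

If the classification accuracy on the masked images stays close to the accuracy on the originals, the classifier is not relying on the content that was removed.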

The use of non-intuitive features can lead to results that might at first seem to solve a certain problem, but in fact provide no reliable evidence that the problem is solved. These results confused not just novices, but also misled a very large number of experienced researchers who have a deep understanding of data analysis and experimental design.

Therefore, when using non-intuitive features the experimental design must be extremely careful, with solid controls, and no assumptions can be made about the data. For instance, in the face recognition example the data should have been collected one sample at a time, in separate acquisition sessions, rather than relying on the assumption that acquiring several samples in a single batch is equivalent to acquiring one sample in each of several different sessions.

The common practice of using cross-validation in machine learning can also introduce some risks. If the training samples are not collected independently of the test samples, cross-validation might show a strong signal that is driven by the data acquisition process rather than by the problem itself. These considerations must be examined very carefully when using machine learning with non-intuitive features.
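One way to reduce this risk is to split by acquisition session instead of by individual sample, so that no session contributes to both the training and the test side of a fold. The sketch below is one possible implementation, assuming a session identifier was recorded for every sample; the function name and the session labels are made up for this example. It only helps when each class appears in more than one session. If every subject was photographed in a single batch, no split can separate the subject from the session, and the data has to be collected again.

```python
from collections import defaultdict


def leave_one_session_out(session_ids):
    """Yield (train_indices, test_indices) pairs, one fold per session.

    All samples from the held-out session go to the test side, so a session
    never appears in both the training and the test indices of a fold.
    """
    by_session = defaultdict(list)
    for index, session in enumerate(session_ids):
        by_session[session].append(index)

    for held_out, test_idx in by_session.items():
        train_idx = [i for session, indices in by_session.items()
                     if session != held_out
                     for i in indices]
        yield train_idx, test_idx


# Hypothetical example: eight samples recorded over two sessions.
sessions = ["mon", "mon", "fri", "fri", "mon", "mon", "fri", "fri"]
for train_idx, test_idx in leave_one_session_out(sessions):
    train_sessions = {sessions[i] for i in train_idx}
    test_sessions = {sessions[i] for i in test_idx}
    assert train_sessions.isdisjoint(test_sessions)
    print(sorted(test_sessions), "held out;", len(train_idx), "training samples")
```

A large drop in accuracy when moving from a random split to a session-based split suggests that the earlier result was driven by the acquisition process.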


  1. Shamir, L., Evaluation of face datasets as tools for assessing the performance of face recognition methods, International Journal of Computer Vision, 79(3), 225-230, 2008.
  2. Model, I., Shamir, L., Comparison of dataset bias in object recognition benchmarks, IEEE Access, 3(1), 1953-1962, 2015.
  3. Bock, B., Shamir, L., Assessing the efficacy of benchmarks for automatic speech accent recognition, 8th International Conference on Mobile Multimedia Communications, 133-136, 2015.
  4. Shamir, L., Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis, Journal of Microscopy, 243(3), 284-292, 2011.