Students at Delft University of Technology in the Netherlands carried out a crowdsourcing study as part of the Crowd Computing course, which Asst. Prof. Ujwal Gadiraju and Prof. Alessandro Bozzon designed around one key challenge: the creation and consumption of (high-quality) data. Course participants presented several brilliant group projects at the Crowd Computing Showcase event held on 06.07.2021. The group consisting of Xinyue Chen, Dina Chen, Siwei Wang, Ye Yuan, and Meng Zheng was judged to be among the best. The details of their study are described below.
Saliency maps are an important aspect of Computer Vision and Machine Learning. Annotating saliency maps, like all data labeling, can be done in a variety of ways; in this case, crowdsourcing was used since it is considered to be one of the fastest methods. The goal was to obtain annotated maps that could be used to acquire a valid explanation for model classifications. Four task designs were used in the experiment.
As a first step, an ImageNet-pretrained Inception V3 model was used to extract saliency maps from original images. The maps were subsequently fine-tuned using CornellLab’s NABirds dataset, which contains over 500 images of bird species. Eleven of those were selected for the project. SmoothGrad was used to reduce noise in the maps.
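The noise-reduction step can be illustrated with a minimal sketch of the SmoothGrad idea: the gradient-based saliency is averaged over several noisy copies of the input. This is not the group's code. The `grad_fn` callable, the toy gradient, and the `n_samples` and `noise_level` defaults are all assumptions made for illustration; in the study the gradients would come from the Inception V3 model.

```python
import numpy as np

def smoothgrad(grad_fn, image, n_samples=50, noise_level=0.15, seed=0):
    """Average grad_fn over noisy copies of image (the SmoothGrad idea).

    grad_fn is a hypothetical callable returning d(score)/d(input)
    with the same shape as its input.
    """
    rng = np.random.default_rng(seed)
    # Noise scale is set relative to the value range of the image.
    sigma = noise_level * (image.max() - image.min())
    total = np.zeros_like(image, dtype=float)
    for _ in range(n_samples):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy stand-in for a model: score = sum(x**2), so the gradient is 2*x.
def toy_grad(x):
    return 2.0 * x

image = np.linspace(0.0, 1.0, 16).reshape(4, 4)
saliency = np.abs(smoothgrad(toy_grad, image))
```

Averaging over noisy inputs cancels out sharp local fluctuations in the gradient, which is why the resulting map tends to look cleaner than one computed from a single pass.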
Fig. 1 Example image of a saliency map
Four types of tasks were used in the course of the experiment: one control task that became the baseline and three experimental tasks. Those three were: training, easy tagging (ET), and training + ET. Each task consisted of 74 images that took approximately three minutes to process. Each saliency map was annotated by three different crowd workers.
Three functional requirements had to be met in this part of the experiment:
Fig. 2 Baseline interface
The performers were asked to complete a set of training tasks that were designed using Toloka, a crowdsourcing platform. A training pool with three 3-minute tasks was created. The performers had to finish all of the tasks with a minimum accuracy of 70% in order to proceed to the experimental tasks. After this was achieved, the main study began.
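The admission rule can be expressed as a small check. The sketch below assumes the 70% threshold applies to each training task separately; the text does not say whether it is per task or overall, and the function name and inputs are hypothetical rather than part of Toloka's API.

```python
def passes_training(task_accuracies, threshold=0.70, required_tasks=3):
    """Return True if every required training task meets the threshold.

    task_accuracies: accuracies in [0, 1], one per completed training task.
    Assumption: the 70% minimum is enforced per task, not on the average.
    """
    if len(task_accuracies) < required_tasks:
        return False  # the worker has not finished all training tasks
    return all(acc >= threshold for acc in task_accuracies)

# Example: one task below 70% blocks access to the experimental tasks.
admitted = passes_training([0.9, 0.75, 0.7])
rejected = passes_training([0.9, 0.65, 0.8])
```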
Task: Easy Tagging (ET)
As part of the experimental task, the crowd workers had to recognize and label various body parts of bird species. A reference picture was provided to help them do so. Since the group’s pilot study showed that color was among the most commonly annotated characteristics, color checkboxes were provided to make color attribute annotations easier for the subjects. In addition, all input boxes offered both “suggestion” and “free input” options, for cases where the performers wished to annotate non-color attributes or the colors provided in the answer box did not match the colors displayed in the image.
Fig. 3 Easy Tagging Interface
Quality control mechanisms were consistent across all four tasks. The performers were asked to use only desktops or laptops during the study so that labeling objects with bounding boxes was easy and done in the same way throughout. In addition, all of the subjects were required to have secondary education and be proficient in English. CAPTCHA and fast response filtering were used to filter out dishonest workers. The answers were checked manually and accepted based on the following criteria:
Intersection over Union (IOU) was used to evaluate the accuracy of the bounding boxes. It is calculated by dividing the area of intersection of two bounding boxes by the area of their union. The final IOU score is a composite average of multiple IOU values.
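As a minimal sketch of this calculation, the functions below compute the score for two axis-aligned boxes and a plain mean over several pairs. The `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions, and the simple mean is only one possible reading of "composite average".

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pairs):
    """Plain average of IoU over (worker_box, reference_box) pairs."""
    scores = [iou(a, b) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7.
example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```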
Vocabulary diversity consists of two values: entity diversity (the number of distinct words) and attribute diversity (the number of adjectives used to describe one entity).
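One way to compute these two values is sketched below. It assumes annotations arrive as `(entity, attribute)` pairs and that attribute diversity is reported as the mean number of distinct attributes per entity; both the data layout and the averaging are illustrative assumptions, not the group's exact definition.

```python
from collections import defaultdict

def vocabulary_diversity(annotations):
    """Return (entity_diversity, attribute_diversity) for (entity, attribute) pairs.

    entity_diversity: number of distinct entity words.
    attribute_diversity: mean number of distinct attributes per entity (assumed).
    """
    attrs_by_entity = defaultdict(set)
    for entity, attribute in annotations:
        # Normalize case and whitespace so "Beak" and "beak" count once.
        attrs_by_entity[entity.strip().lower()].add(attribute.strip().lower())
    entity_diversity = len(attrs_by_entity)
    if entity_diversity == 0:
        return 0, 0.0
    total_attrs = sum(len(attrs) for attrs in attrs_by_entity.values())
    return entity_diversity, total_attrs / entity_diversity

sample = [("beak", "yellow"), ("Beak", "curved"), ("wing", "black"), ("wing", "black")]
diversity = vocabulary_diversity(sample)
```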
Completeness measures how fully a saliency map has been annotated. It is calculated by dividing the value of the annotated saliency patches by the value of the ground truth annotations.
Accuracy represents the percentage of valid entity-attribute descriptions. The value is calculated by aggregating and averaging the results from three different crowd workers.
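A minimal sketch of this aggregation, assuming each worker's descriptions have already been judged valid or invalid and that the per-worker rates are averaged with equal weight (the exact aggregation is not specified in the text, and the function name is hypothetical):

```python
def map_accuracy(worker_judgements):
    """Average the share of valid descriptions across workers.

    worker_judgements: one list of booleans per worker,
    where True marks a valid entity-attribute description.
    """
    # Workers with no descriptions are skipped to avoid division by zero.
    rates = [sum(flags) / len(flags) for flags in worker_judgements if flags]
    return sum(rates) / len(rates) if rates else 0.0

# Three workers with 100%, 50%, and 0% valid descriptions.
accuracy = map_accuracy([[True, True], [True, False], [False, False]])
```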
Accept rate is calculated by dividing the number of accepted annotations by the total number of submissions.
Completion time reflects the average duration of the annotation tasks.
The final metric is the total number of distinct crowd workers who participated in the experiment.
Two conclusions can be drawn from this study. One is that performance values depend on what type of interface is being used. In this respect, shortcuts can both help and hinder by either lifting some of the performer’s cognitive load or backfiring and making the performer too relaxed and unfocused. The second conclusion is that training can increase bounding box and description accuracy; however, it can also take away from the subject’s creativity. As a result, requesters have to consider this trade-off before making a decision regarding task design.
Certain limitations of the study should also be taken into account. The most obvious one is that this study should ideally have been conducted as a between-group experiment. Unfortunately, this was not possible. The second limitation is the small number of participants in the tasks that required training, so the resulting values are likely to be skewed. The last major limitation has to do with applicability: since only aggregated averages across multiple granularities were used as the final values, these figures are unlikely to accurately represent most non-experimental settings.
Since one of the findings suggests that input shortcuts can increase accuracy while diminishing creativity, future studies should look at different study designs with multiple shortcuts (e.g. shape and pattern). In this scenario, the negative side effects of decreased creativity and boredom may be countered with more sophisticated interfaces that are practical and user-friendly. Finally, the authors propose a switch from written to video instructions, as these will likely be more effective and result in a greater number of subjects finishing the training phase.
Project in a nutshell
Saliency maps are an integral part of ML’s advance towards improved Computer Vision. As with other forms of data labeling, annotating saliency maps is at the core of training models and their classification. Using crowd workers from the Toloka crowdsourcing platform and a dataset of birds from CornellLab’s NABirds, this project examined how crowdsourcing can be used in saliency map annotations. To do so, four types of tasks were used: one became the baseline, and the other three (training, easy tagging (ET), and training + ET) were the main tasks. Several metrics were used for evaluation, including IOU score, vocabulary diversity, completeness, accuracy, accept rate, and completion time, among others. Results showed that the choice of interface had a major effect on performance. In addition, training increased both bounding box and description accuracy but also diminished the subjects’ creativity. Implications of these findings and suggestions for future studies are discussed.