Comparison of the Top Cloud APIs for Computer Vision

What is computer vision?

Nowadays, computer vision (CV) is one of the most widely used areas of machine learning. Its main task is to understand the contents of an image. It is used in almost all spheres of modern technology, such as image and video classification, content filtering, security, and face detection, which is even built into your smartphone’s camera app.

The computer vision field evolves continuously. Building a model for visual recognition is a difficult and time-consuming task. Fortunately, there are a lot of ready-to-use solutions on the market, developed by companies such as Google, Microsoft, IBM, Amazon, and others. These solutions are typically provided as APIs that you can integrate into your apps.

In this article, we give a brief overview of the capabilities of these cloud APIs.

Google

Google is one of the most renowned companies in the field of machine learning. It provides a number of cloud computing services for computer vision, exposed as APIs. The Vision API helps your application understand what is in an image by classifying the content into known categories and returning labels.

It is also capable of detecting landmarks (e.g. buildings, monuments, natural structures) and logos, and of performing character recognition in a wide variety of languages. Facial detection locates a face together with the person’s emotion and headwear. Unfortunately, there is no support for facial recognition. On top of that, you may use the API to search for similar images on the web and to filter explicit or violent content.

Google also offers the Video Intelligence API for video analysis, classification, and labeling. This allows searching through videos based on the extracted metadata. It is also possible to detect scene changes and to filter explicit content.

All these features are available through a REST API for easy integration.
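As a rough illustration of what such a REST integration involves, the sketch below builds the JSON body for a Vision API `images:annotate` request using only the Python standard library. The helper name `build_annotate_request`, the chosen feature types, and the placeholder image bytes are assumptions made for this example; authentication and the actual HTTP call are left out.

```python
import base64
import json

# Endpoint of the Vision API's annotate method (v1).
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes, feature_types, max_results=10):
    """Build the request body for one image and several feature types.

    Hypothetical helper written for this article, not part of any Google library.
    """
    return {
        "requests": [
            {
                # The image content is sent as a base64-encoded string.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # One entry per requested feature, e.g. LABEL_DETECTION.
                "features": [
                    {"type": feature, "maxResults": max_results}
                    for feature in feature_types
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes; a real call would read an image file instead.
    body = build_annotate_request(
        b"<raw image bytes>", ["LABEL_DETECTION", "LANDMARK_DETECTION"]
    )
    print(json.dumps(body, indent=2))
```

The resulting body would be sent with an HTTP POST to `VISION_ENDPOINT`, along with credentials. Several features can be requested for the same image in a single call, which avoids uploading it more than once.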

Microsoft

Microsoft Azure is another cloud computing service. Azure provides several computer vision services, wrapped as different APIs: the Computer Vision API for general-purpose CV tasks, the Face API for face detection and recognition, Content Moderator for filtering, and several others that are still in preview. Let’s take a closer look.

The Computer Vision API classifies image content by providing a comprehensive list of tags and attempting to build a natural language description of the scene. The API is also capable of recognizing celebrities and landmarks.

Another feature is optical character recognition (OCR) of printed text. OCR for handwritten text is also available as a preview, but so far only for English.

The Face API is used to detect faces in images and to retrieve their bounding rectangles and facial attributes such as emotional state, gender, age, facial hair, smile score, and facial landmarks. One more feature is face recognition, which identifies a person by matching the face against a database. This feature may be useful for security. Another is similar face search, which returns a list of faces that look similar to the input face.

The Content Moderator can be used for video and image filtering. Unwanted content is filtered using machine-learning-based classifiers and optical character recognition.

The Video Indexer and the Custom Vision Service are currently available only as previews. The Video Indexer is used to extract insights from videos. It is capable of sentiment analysis, keyword and metadata extraction, and people detection. The Custom Vision Service allows creating fine-tuned computer vision models for a specific use case. This service supports incremental learning: your model improves over time with each image supplied.

Amazon

Amazon Rekognition is a computer vision service developed by Amazon. It has deep learning at its core and integrates with other Amazon services. It is provided as an API for both images and videos. Rekognition can identify which objects and people are in a scene and what is happening. It can work as a filter for adult content. In addition, it can read text in an image.
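To make the label output more concrete, here is a minimal sketch of post-processing a label detection result. It assumes the response is a dictionary with a `Labels` list whose items carry `Name` and `Confidence` (a percentage), which is the general shape of a Rekognition label response. The helper `labels_above`, the threshold, and the sample data are made up for illustration, and no API call is made.

```python
def labels_above(response, min_confidence=80.0):
    """Return (name, confidence) pairs at or above the threshold, highest first.

    Hypothetical helper for this article; `response` is assumed to look like
    the dictionary returned by a label detection call.
    """
    kept = [
        (label["Name"], label["Confidence"])
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]
    # Sort by confidence so the most certain labels come first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented sample data, not output from a real request.
sample_response = {
    "Labels": [
        {"Name": "Person", "Confidence": 99.1},
        {"Name": "Bicycle", "Confidence": 62.4},
        {"Name": "Street", "Confidence": 91.7},
    ]
}

if __name__ == "__main__":
    for name, confidence in labels_above(sample_response):
        print(f"{name}: {confidence:.1f}%")
```

Choosing the threshold is a trade-off: a higher value yields fewer but more reliable labels, while a lower one surfaces more candidates at the cost of false positives.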

One of Rekognition’s strengths is the ability to detect, recognize, and identify people. It can accurately identify a person in a photo or video using a private dataset of face images, and it can also recognize famous people in your imagery. In addition, it is able to analyze sentiment, age, the presence of eyewear and headwear, facial hair, and other features. For videos, it is possible to track how these features change over time. Rekognition can also track people throughout a video, even if they turn away from the camera or leave the scene.

Clarifai

Clarifai is a comparatively young company that provides computer vision as a service. Clarifai works solely with CV and has lots of different features available. Each specific task is solved by a corresponding model. Some of the models are still in beta and are improved continually. For example, there is a model that detects faces in the image. There is a dedicated model for each of the following tasks: age, gender, and ethnicity prediction, and celebrity recognition.

The General model is the most versatile. It is able to recognize the objects present in an image, the theme, and more, so it may be used for any image analysis. You can also build your own model and train it on your images for the best results.

Clarifai also provides some narrow-use models, such as pattern identification, wedding-related and travel-related models, and a dominant color detection model. There is a model for clothing and accessory identification, another for food identification, and one for logos and brand names.

There are two models that produce embeddings, one for faces and one for general items. They are based on the Face detection and General models respectively. The embeddings give you low-level control over the machine learning process.
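One common way to use such embeddings is to compare them yourself, for instance with cosine similarity, to find the stored item closest to a query image. The sketch below is a generic pure-Python illustration of that idea, not Clarifai’s API: the function names and the tiny two-dimensional vectors are assumptions, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as not similar to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, catalog):
    """Return the key of the catalog embedding closest to the query."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


if __name__ == "__main__":
    # Invented toy embeddings keyed by image name.
    catalog = {"beach.jpg": [0.9, 0.1], "forest.jpg": [0.1, 0.9]}
    print(most_similar([0.8, 0.3], catalog))
```

A linear scan like this is fine for small collections; larger ones typically call for an approximate nearest-neighbor index.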

There is also a model to check if the image contains unsafe content like drugs or nudity.

IBM Watson Visual Recognition

IBM Watson Visual Recognition does not have as many models bundled in, but it allows building a custom one. The default is a general model that identifies the objects in the image and its color theme. Another model is for face detection (not recognition), one is for food detection, and OCR is in private beta at the moment. The API also allows you to export a model in a Core ML (Apple iOS) compatible format.

Kairos

Kairos is all about face detection. With the help of its products, it is possible to detect faces in photos or videos and to identify and verify people. Via Kairos, you may detect a person’s emotional state, age group (e.g. child, young, adult, senior), and gender, as well as facial features such as eyes and eyebrows.

Kairos is available either as a cloud API or as an SDK for offline integrations.

Overall comparison

For your convenience, we prepared a table with a quick overall comparison of the most popular Cloud APIs for computer vision highlighting their main features.

Updated: August 2018

Conclusions

There are many different cloud APIs for computer vision on the market, and the field is developing rapidly. In this article, we gave a brief overview of the various providers. At first sight, all of them provide fairly similar capabilities, yet some, like Kairos, put an emphasis on face recognition, while others, like IBM and Azure, emphasize building custom models.

However, if you need to accomplish a very specific task, you will still have to build the model yourself using deep learning frameworks.