Yupoo Picture Manager serves tens of millions of users and manages tens of billions of pictures. As user galleries keep growing, Yupoo has an urgent business need for a solution that can quickly locate images: when a user submits an image, the system should find its original and similar images in the gallery. The search-by-image service was developed as an effective answer to this problem.
The search-by-image service has evolved through two generations.
This article describes the technology selection and basic principles behind the two generations of the search-by-image system, based on my own experience on this project.
What is an image?
We must know what an image is before we can process one. The answer: an image is a collection of pixels.
For example, the part inside the red box of the image above is really just a series of pixels.
Each image can be represented by a matrix. Each pixel in the image corresponds to an element in the matrix.
Each pixel of a binary image is either black or white, so it can be represented by a single bit: 0 or 1.
For example, the matrix representation of a 4 * 4 binary image is:
0 1 0 1
1 0 0 0
1 1 1 0
0 0 1 0
The three primary colors (red, green, and blue) can be mixed to produce any color. For RGB images, each pixel carries the basic information of the three RGB channels. If each channel uses an 8-bit number (256 levels) to represent its intensity, then the mathematical representation of a pixel is:
([0 .. 255], [0 .. 255], [0 .. 255])
Taking a 4 * 4 RGB image as an example:
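As a concrete sketch, a 4 * 4 RGB image is simply a 4 x 4 x 3 array; the pixel values below are made up for illustration:

```python
import numpy as np

# A 4 * 4 RGB image is a 4 x 4 x 3 matrix: one (R, G, B) triple per pixel.
# The pixel values here are illustrative, not from the article's figure.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[0, 0] = (255, 0, 0)      # a pure red pixel
img[1, 2] = (0, 255, 0)      # a pure green pixel
img[3, 3] = (255, 255, 255)  # a white pixel

print(img.shape)  # (4, 4, 3)
```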
The essence of image processing is to process these pixel matrices.
If you are looking for the original image, that is, an image with exactly the same pixels, you can directly compare MD5 values. However, images uploaded to the Internet are often compressed or watermarked, and even a tiny change produces a completely different MD5 result. With any pixel-level inconsistency, MD5 comparison cannot find the original image.
For a search-by-image system, we want to search for images with similar content. That requires solving two basic problems: how to abstract an image into a comparable feature, and how to search those features efficiently at scale.
Feature extraction — image abstraction
The first-generation search-by-image system uses the perceptual hash (pHash) algorithm for feature extraction. How does this algorithm work?
As shown in the figure above, the pHash algorithm performs a series of transformations on the image to produce a hash value. During these transformations, the algorithm progressively abstracts the image, pushing the results for similar images closer to each other.
How to calculate the similarity between the pHash values of two images? The answer is to use the Hamming distance. The smaller the Hamming distance, the more similar the images’ content.
What is Hamming distance? It is the number of different bits.
Value 1: 0 1 0 1 0
Value 2: 0 0 0 1 1
There are two different bits in the above two values, so the Hamming distance between them is 2.
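In code, the Hamming distance between two hash values is just the popcount of their XOR. A minimal sketch:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")

# The two 5-bit values from the text: 01010 and 00011
v1 = 0b01010
v2 = 0b00011
print(hamming_distance(v1, v2))  # 2
```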
Now we know how to calculate similarity. The next question is: how do we compute Hamming distances against hundreds of millions of pHash values computed from hundreds of millions of pictures? In short, how do we search for similar images at scale?
In the early stage of the project, I could not find a satisfactory tool (or computing engine) for fast Hamming-distance calculation, so I changed my plan.
My idea was this: if the Hamming distance between two pHash values is small, then when I split the values into segments, the corresponding segments are likely to be equal.
Value 1: 8 a 0 3 0 3 f 6
Value 2: 8 a 0 3 0 3 d 8
If we divide the two values above into eight segments, six of the segments are exactly the same. From this we can infer that their Hamming distance is small and thus that the two images are similar.
With this transformation, the problem of calculating Hamming distance becomes a problem of equivalence matching: if I divide each pHash value into eight segments and more than five of them match exactly, then the two pHash values are similar.
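The segmentation idea can be sketched in a few lines (the segment count and threshold follow the text; the helper names are mine):

```python
def split_segments(phash_hex: str, n: int = 8) -> list:
    """Split a pHash hex string into n equal-length segments."""
    seg_len = len(phash_hex) // n
    return [phash_hex[i * seg_len:(i + 1) * seg_len] for i in range(n)]

def is_similar(a: str, b: str, min_match: int = 6) -> bool:
    """Similar if at least min_match of the 8 segments are identical
    ("more than five" in the text)."""
    matches = sum(s1 == s2 for s1, s2 in zip(split_segments(a), split_segments(b)))
    return matches >= min_match

# The two example hashes from the text: 6 of 8 segments match.
print(is_similar("8a0303f6", "8a0303d8"))  # True
```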
Equivalence matching is simple to solve: we can use the classical term filtering of a traditional database or search system.
Concretely, I use multi-term matching in Elasticsearch and specify the required degree of matching with minimum_should_match (this article does not cover ES internals; you can study them on your own).
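To make the idea concrete, here is a sketch of how such a query body could be built. The field name phash_segments and the position-tagged segment terms are my assumptions for illustration, not the actual production mapping:

```python
def build_query(phash_hex: str, n: int = 8, min_match: int = 6) -> dict:
    """Build an Elasticsearch bool query requiring min_match of n
    position-tagged pHash segments (e.g. "0_8") to match."""
    seg_len = len(phash_hex) // n
    segments = [phash_hex[i * seg_len:(i + 1) * seg_len] for i in range(n)]
    return {
        "query": {
            "bool": {
                # One term per segment; the position prefix keeps segment 0
                # from matching segment 5 with the same value.
                "should": [
                    {"term": {"phash_segments": f"{i}_{seg}"}}
                    for i, seg in enumerate(segments)
                ],
                "minimum_should_match": min_match,
            }
        }
    }

q = build_query("8a0303f6")
print(len(q["query"]["bool"]["should"]))  # 8
```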
Why Elasticsearch? First, it provides the search capability described above. Second, the Picture Manager project already uses ES for full-text search, so reusing existing resources is very economical.
The first-generation search-by-image system therefore chose the pHash + Elasticsearch solution.
But the limitation of this system is obvious: pHash is an abstraction of the entire image. Once the integrity of the image is broken, for example by adding a black border, it is almost impossible to judge similarity against the original.
To break through such limitations, the second-generation image search system with a completely different underlying technology emerged.
The second-generation search-by-image system chose the CNN + Milvus solution. It is based on feature vectors and provides a much stronger technical foundation.
In the field of computer vision, the use of artificial intelligence has become the mainstream. Similarly, the feature extraction of the second-generation search-by-image system uses convolutional neural network (CNN) as the underlying technology.
The inner workings of CNNs are hard to cover here, so we focus on answering two questions: what can a CNN do, and what does it have to do with search by image?
There are many competitions in the AI field and image classification is one of the most important. The job of image classification is to determine whether the content of the picture is about a cat, a dog, an apple, a pear, or other types of objects.
What can a CNN do? It can extract features and recognize objects. It extracts features along multiple dimensions and measures how close an image's features are to those of cats, dogs, and so on; we pick the closest as the identification result, which tells us whether the image shows a cat, a dog, or something else.
What is the connection between CNN-based object identification and search by image? What we want is not the final identification result but the feature vector extracted along those dimensions: the feature vectors of two images with similar content must be close.
Which model should we choose? The answer is VGG16. Why? First, VGG16 has good generalization capability, that is, it is very versatile. Second, its feature vectors have 512 dimensions: too few dimensions would hurt accuracy, while too many would make storing and computing the vectors expensive.
Using CNN to extract image features is a mainstream solution. We can use VGG16 as the model and Keras + TensorFlow for technical implementation. Here is the official example of Keras:
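The Keras example referred to here is, roughly, the standard VGG16 feature-extraction recipe. The sketch below assumes TensorFlow is installed; imports are kept local to the function so the module can be loaded without TensorFlow:

```python
def extract_features(img_path):
    """Extract a 512-dimensional feature vector from an image with VGG16.

    A sketch based on the standard Keras feature-extraction recipe;
    assumes TensorFlow / Keras is installed.
    """
    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.preprocessing import image

    # include_top=False drops the classification layers;
    # pooling="max" reduces the output to a (1, 512) feature vector.
    model = VGG16(weights="imagenet", include_top=False, pooling="max")
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)   # add the batch dimension
    x = preprocess_input(x)         # VGG16-specific channel preprocessing
    return model.predict(x)[0]      # shape: (512,)
```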
The features extracted here are feature vectors.
To facilitate subsequent operations, we usually normalize the features:
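Normalization here means L2-normalizing the vector; a minimal NumPy sketch:

```python
import numpy as np

def normalize(vec):
    """L2-normalize a feature vector so that comparisons depend
    only on direction, not magnitude."""
    vec = np.asarray(vec, dtype=np.float32)
    return vec / np.linalg.norm(vec)

v = normalize([3.0, 4.0])
print(v)  # [0.6 0.8]
```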
What is used in all subsequent steps is also the normalized feature vector.
The image is loaded with Keras' image.load_img method. Under the hood this is a TensorFlow method called through Keras; for details, see the TensorFlow documentation. The final image object is actually a PIL Image instance (the PIL bundled with TensorFlow).
In practice, image content usually arrives over the network. Therefore, instead of loading images from a path, we prefer converting raw bytes directly into image objects, that is, PIL Images:
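A sketch of that bytes-to-PIL conversion, using Pillow directly (the helper name and default target size are my assumptions):

```python
from io import BytesIO

from PIL import Image

def bytes_to_image(data: bytes, size=(224, 224)) -> Image.Image:
    """Turn raw image bytes into a PIL Image ready for feature extraction."""
    img = Image.open(BytesIO(data))
    img = img.convert("RGB")  # must convert to RGB (PNGs may be RGBA, etc.)
    img = img.resize(size)    # must resize to the model's input size
    return img
```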
The resulting img is the same as what the image.load_img method returns. There are two things to pay attention to:
• You must do the RGB conversion.
• You must resize the image (the second parameter of the image.load_img call).
Images such as screenshots often come with black borders. These borders carry no useful information and cause considerable interference, so removing them is a common preprocessing step.
A black border is essentially a row or column in which every pixel is (0, 0, 0) in an RGB image. Removing black borders means finding those rows or columns and deleting them, which in NumPy is just an operation over a 3-D array.
An example of removing horizontal black borders:
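A minimal NumPy sketch of the idea (find the all-zero rows and drop them); the helper name is mine:

```python
import numpy as np

def remove_horizontal_black_borders(img: np.ndarray) -> np.ndarray:
    """Drop rows where every pixel is (0, 0, 0); img has shape (H, W, 3)."""
    # A row is a black border if none of its channel values is nonzero.
    keep = img.reshape(img.shape[0], -1).any(axis=1)
    return img[keep]

# A 4 x 2 RGB image whose first and last rows are black borders.
img = np.zeros((4, 2, 3), dtype=np.uint8)
img[1:3] = 128
print(remove_horizontal_black_borders(img).shape)  # (2, 2, 3)
```

Vertical borders can be removed the same way by reducing over columns (axis 1 of the image) instead of rows.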
That covers using CNN to extract image features and the related image processing. Now let's take a look at vector search engines.
The problem of extracting feature vectors from images has been solved. The remaining problems are how to store the feature vectors and how to search them for similar (nearest-neighbor) vectors.
The open-source vector search engine Milvus can solve these two problems. So far, it has been running well in our production environment.
Extracting feature vectors from an image is far from enough. We also need to manage these feature vectors dynamically (addition, deletion, and update), calculate the similarity between vectors, and return the nearest-neighbor results. The open-source vector search engine Milvus performs these tasks quite well.
The rest of this article will describe specific practices and points to be noted.
To use Milvus, your CPU must support the avx2 instruction set. For Linux systems, use the following command to check which instruction sets your CPU supports:
cat /proc/cpuinfo | grep flags
Then you get something like:
What follows flags are the instruction sets your CPU supports, which is far more than we need here. To check whether a specific instruction set such as avx2 is supported, just add another grep to filter:
cat /proc/cpuinfo | grep flags | grep avx2
If no result is returned, this instruction set is not supported and you will need a different machine.
Capacity planning is our first consideration when we design a system. How much data do we need to store? How much memory and disk space does the data require?
Let's do some quick math. Each dimension of a vector is a float32, and a float32 takes up 4 bytes, so a vector of 512 dimensions requires 2 KB of storage. By the same token:
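The back-of-envelope math above can be checked in a few lines (pure arithmetic, no Milvus required):

```python
BYTES_PER_FLOAT32 = 4
DIMS = 512

bytes_per_vector = DIMS * BYTES_PER_FLOAT32  # 2048 bytes = 2 KB
gb_per_million = bytes_per_vector * 1_000_000 / 1024 ** 3

print(bytes_per_vector)          # 2048
print(round(gb_per_million, 2))  # roughly 1.91 GB per million vectors
```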
If we want to store all the data in the memory, then the system needs at least the corresponding memory capacity.
It is recommended that you use the official size calculation tool: Milvus sizing tool.
In reality, our memory may not be that large. (That is fine; Milvus automatically flushes data to disk.) Besides the raw vector data, we also need to consider the storage of other data such as logs.
For more information about the system configuration, see the Milvus documentation:
Collection and Partition
The underlying implementation of a partition is actually the same as that of a collection, except that a partition lives under a collection. Partitions make data organization more flexible, and we can query a specific partition within a collection to achieve better query results.
How many collections and partitions can we have? The basic information on collections and partitions lives in the metadata, which Milvus manages with either SQLite (integrated internally) or MySQL (requires an external connection). With the default SQLite, performance degrades severely once there are too many collections and partitions, so their total number should not exceed 50,000 (Milvus 0.8.0 will limit this number to 4,096). If you need more, it is recommended to use MySQL via an external connection.
The data structure supported by Milvus' collections and partitions is very simple: ID + vector. In other words, a table has only two columns, ID and vector data.
When we use a traditional database, we can specify field values as filtering conditions. Milvus does not filter in exactly the same way, but we can implement simple conditional filtering with collections and partitions. For example, if we have a large amount of image data and each image belongs to a specific user, we can partition the data by user; using the user as a filter condition then simply means specifying the partition.
Structured Data and Vector Mapping
Milvus only supports the ID + vector data structure, but in business scenarios what we need is structured data bearing business meaning. In other words, we need to find structured data through vectors, so we must maintain the mapping between structured data and vectors through the ID:
structured data ID <--> mapping table <--> Milvus ID
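A minimal sketch of such a mapping table, using in-memory dicts for illustration (in production this would be a database table):

```python
# Milvus ID -> structured-data (business) ID, and the reverse direction.
milvus_to_business = {}
business_to_milvus = {}

def register(business_id: str, milvus_id: int) -> None:
    """Record the mapping for one vector (hypothetical helper name)."""
    milvus_to_business[milvus_id] = business_id
    business_to_milvus[business_id] = milvus_id

register("image_42", 1001)
print(milvus_to_business[1001])  # image_42
```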
You can refer to the following articles:
Types of index: https://www.milvus.io/docs/v0.10.1/index.md
How to select index: https://medium.com/@milvusio/how-to-choose-an-index-in-milvus-4f3d1...
The search results of Milvus are a collection of ID + distance pairs.
Filtering Data Whose ID is -1
When a collection contains too little data, the search results may contain entries whose ID is -1. We need to filter these out ourselves.
Vector search works quite differently from a traditional database query: the results are sorted in descending order of similarity, and the topK most similar results are returned (topK is specified by the user at query time).
Milvus does not support pagination. We need to implement the pagination function by ourselves if we need it for business. For example, if we have ten results on each page and only want to display the third page, we need to specify that topK = 30 and only return the last ten results.
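A sketch of that client-side pagination scheme (the paginate helper is a hypothetical name, not a Milvus API):

```python
def paginate(page: int, page_size: int):
    """Milvus has no pagination: fetch topK = page * page_size results
    and keep only the last page_size of them on the client side."""
    top_k = page * page_size
    start = (page - 1) * page_size
    return top_k, slice(start, top_k)

# Page 3 with 10 results per page: ask Milvus for topK = 30,
# then keep results[20:30].
top_k, window = paginate(page=3, page_size=10)
results = list(range(top_k))  # stand-in for Milvus search results
print(top_k, results[window])
```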
Similarity Threshold for Business
The distance between the vectors of two images falls between 0 and 1. Whether two images count as similar in a given business scenario depends on a threshold within this range: the images are similar if their distance is below the threshold, and quite different if it is above. You need to tune the threshold to meet your own business needs.
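A minimal sketch of applying such a business threshold to Milvus-style (id, distance) results (the helper name and sample values are made up for illustration):

```python
def filter_similar(results, threshold: float):
    """Keep only hits whose distance is below the business threshold.
    `results` is a list of (id, distance) pairs, as returned by a search."""
    return [(i, d) for i, d in results if d < threshold]

hits = [(101, 0.12), (102, 0.48), (103, 0.91)]
print(filter_similar(hits, threshold=0.5))  # [(101, 0.12), (102, 0.48)]
```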