Milvus aims to achieve efficient similarity search and analytics for massive-scale vectors. A standalone Milvus instance can easily handle vector search for billion-scale vectors. However, for 10 billion, 100 billion, or even larger datasets, a Milvus cluster is needed. The cluster can be used as a standalone instance for upper-level applications and can meet the business needs of low latency, high concurrency for massive-scale data. A Milvus cluster can resend requests, separate reading from writing, scale horizontally, and expand dynamically, thus providing a Milvus instance that can expand without limit. Mishards is a distributed solution for Milvus.
This article will briefly introduce components of the Mishards architecture. More detailed information will be introduced in the upcoming articles.
Mishards is responsible for breaking up upstream requests and routing sub-requests to sub-services. The results are summarized to return to upstream.
As is indicated in the chart above, after accepting a TopK search request, Mishards first breaks up the request into sub-requests and send the sub-requests to the downstream service. When all sub-responses are collected, the sub-responses are merged and returned to upstream.
Because Mishards is a stateless service, it does not save data or participate in complex computation. Thus, nodes do not have high configuration requirements and the computing power is mainly used in merging sub-results. So, it is possible to increase the number of Mishards nodes for high concurrency.
Milvus nodes are responsible for CRUD related core operations, so they have relatively high configuration requirements. Firstly, the memory size should be large enough to avoid too many disk IO operations. Secondly, CPU configurations can also affect performance. As the cluster size increases, more Milvus nodes are required to increase system throughput.
Read-only nodes and writable nodes
Only one writable node is allowed
Read-only node scalability
In a distributed system, Milvus writable nodes are the only producer of metadata. Mishards nodes, Milvus writable nodes, and Milvus read-only nodes are all consumers of metadata. Currently, Milvus only supports MySQL and SQLite as the storage backend of metadata. In a distributed system, the service can only be deployed as highly-available MySQL.
Keywords: Apache Zookeeper, etcd, Consul, Kubernetes
Service discovery provides information about all Milvus nodes. Milvus nodes register their information when going online and log out when going offline. Milvus nodes can also detect abnormal nodes by periodically checking the health status of services.
Service discovery contains a lot of frameworks, including etcd, Consul, ZooKeeper, etc. Mishards defines the service discovery interfaces and provides possibilities for scaling by plugins. Currently, Mishards provides two kinds of plugins, which correspond to Kubernetes cluster and static configurations. You can customize your own service discovery by following the implementation of these plugins. The interfaces are temporary and need a redesign. More information about writing your own plugin will be elaborated in the upcoming articles.
Keywords: Nginx, HAProxy, Kubernetes
Service discovery and load balancing are used together. Load balancing can be configured as polling, hashing, or consistent hashing.
The load balancer is responsible for resending user requests to the Mishards node.
Each Mishards node acquires the information of all downstream Milvus nodes via the service discovery center. All related metadata can be acquired by metadata service. Mishards implements sharding by consuming these resources. Mishards defines the interfaces related to routing strategies and provides extensions via plugins. Currently, Mishards provides a consistent hashing strategy based on the lowest segment level. As is shown in the chart, there are 10 segments, s1 to s10. Per the segment-based consistent hashing strategy, Mishards routes requests concerning s1, 24, s6, and s9 to the Milvus 1 node, s2, s3, s5 to the Milvus 2 node, and s7, s8, s10 to the Milvus 3 node.
Based on your business needs, you can customize routing by following the default consistent hashing routing plugin.
Keywords: OpenTracing, Jaeger, Zipkin
Given the complexity of a distributed system, requests are sent to multiple internal service invocations. To help pinpoint problems, we need to trace the internal service invocation chain. As the complexity increases, the benefits of an available tracing system are self-explanatory. We choose the CNCF OpenTracing standard. OpenTracing provides platform-independent, vendor-independent APIs for developers to conveniently implement a tracing system.
The previous chart is an example of tracing during search invocation. Search invokes
do_search also invokes
The whole tracing record forms the following tree:
The following chart shows examples of request/response info and tags of each node:
OpenTracing has been integrated to Milvus. More information will be covered in the upcoming articles.
Keywords: Prometheus, Grafana
Milvus has integrated Prometheus to collect metric data. Grafana implements metric-based monitoring and Alertmanager is used for alerting. Mishards will also integrate Prometheus.
Keywords: Elastic, Logstash, Kibana
For clustering service, log files are distributed in different nodes. To pinpoint problems, you need to log to the corresponding servers to get the logs. Because multiple log files need to be analyzed together, server log analysis with the ELK stack is a good choice.
As the service middleware, Mishards integrates service discovery, routing request, result merging, and tracing. Plugin-based expansion is also provided. Currently, distributed solutions based on Mishards still have the following setbacks:
In the future, we will continue to improve Mishards so it can be applied to the production environment more conveniently. We welcome any feedback and suggestions and hope we can build a better open-source tool together!