Alibaba senior technical experts: how do we keep systems stable under Double 11's trillion-level traffic?

Tair history

Tair is widely used across Alibaba. Whether you browse products and orders on Taobao and Tmall, or open Youku to watch a video, Tair silently supports the huge traffic behind the scenes. Tair's history is as follows:

  • 2010.04: Tair v1.0 officially launched in Taobao's core systems;
  • 2012.06: Tair v2.0 launched the LDB persistence product to meet persistent-storage requirements;
  • 2012.10: Launched the RDB cache product, introducing Redis interfaces to meet the storage requirements of complex data structures;
  • 2013.03: Launched the FastDump product on top of LDB, cutting offline-data import time and access latency;
  • 2014.07: Tair v3.0 officially launched, with performance improved several times over;
  • 2016.11: The Tair intelligent operations platform went live, helping the 2016 Double 11 handle billions of calls;
  • 2017.11: Performance leap: hotspot hashing and resource scheduling supported trillion-level traffic.
Tair is a high-performance, distributed, scalable, and reliable key/value structured storage system. Its features are mainly reflected in the following areas:

  • High performance: low latency guaranteed at high throughput. Tair is one of the most heavily called systems in Alibaba Group, with a Double 11 peak of 500 million calls per second and average latency under 1 millisecond;
  • High availability: automatic failover, rate limiting, auditing, data-center disaster recovery, and multi-unit, multi-region deployment ensure the system keeps working under any circumstances;
  • Large scale: deployed in data centers around the world; every BU in Alibaba uses it;
  • Business coverage: e-commerce, Ant Financial, Cainiao, Amap, Alibaba Health, and so on.
Beyond the common key/value operations such as get, put, delete, and batch interfaces, Tair provides some additional practical features that widen its application scenarios. Tair's application scenarios fall into the following four categories:

  • MDB typical scenarios: caching that reduces pressure on backend databases, e.g. Taobao product details are cached in Tair; temporary data storage where some data loss has little impact on the service, e.g. login sessions;
  • LDB typical scenarios: general KV storage, transaction snapshots, security and risk control, such as blacklist/whitelist data with high read QPS; counters that are updated very frequently and whose data must not be lost;
  • RDB typical scenarios: caching and storing complex data structures, such as playlists and live-streaming data;
  • FastDump typical scenarios: periodically importing offline data into a Tair cluster quickly so that new data takes effect fast, with very strict online read requirements: low read latency and no latency spikes.

How do we meet the Double 11 challenge?

As the figures for 2012-2017 show, GMV grew from under 20 billion RMB in 2012 to 168.2 billion in 2017; peak transaction creation rose from 14,000 to 325,000 per second, and peak Tair QPS grew from 13 million to nearly 500 million.

As the figure shows, Tair's access growth far exceeds the growth of peak transaction creation, which in turn exceeds GMV growth. At midnight on Double 11, the challenge for Tair is how to guarantee low latency while keeping cost growth below business growth.

For distributed storage systems, hotspot issues are hard to solve, and in a caching system with such heavy traffic they are especially prominent. For the 2017 Double 11, we completely solved the cache-hotspot problem with hotspot hashing.

Meanwhile, to carry 325,000 transactions per second, Alibaba's technical architecture has evolved into a multi-region, multi-unit architecture, with units not only on Alibaba Cloud but also co-located with offline workloads. The challenge for us is how to deploy and tear down remote clusters quickly and flexibly.

Multi-region, multi-unit deployment

Let's look at the overall deployment architecture and Tair's position in it. As the diagram shows, we run a multi-region, multi-unit deployment. The system goes from the traffic access layer down to the application layer; the application layer depends on various middleware such as message queues and configuration centers; at the bottom is the data layer, Tair and the databases. At the data layer, we perform whatever data synchronization the business requires, so that the business layers above can stay stateless.

Besides guarding against black swans, another important role of multi-region, multi-unit deployment is that a new unit can be brought online quickly to take over part of the traffic. Tair has built a complete control system to enable fast, flexible site building.

Flexible site building

Tair itself is a very complex and very large distributed storage system, so we built an operations and management platform. Through task scheduling, task execution, verification, and delivery, it enables fast one-click site building, from rapid cluster provisioning to rapid mixed deployment. After deployment, the platform runs a series of system-level, cluster-level, and instance-level connectivity checks to confirm the service is complete before delivering it for online use; any overlooked issue could trigger a large-scale failure once business traffic ramps up. For a persistent cluster that already holds data, after deployment we also wait for stock-data migration and data synchronization to finish before entering the verification phase.
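The pipeline described above can be sketched as a small state machine. This is a minimal illustration, assuming hypothetical stage names (deploy, verify, deliver) and placeholder checks; it is not Tair's actual platform code.

```python
# Sketch of a one-click site-building pipeline: deploy, optionally wait for
# stock-data migration (persistent clusters), verify at three levels, then
# deliver. Stage names and checks are illustrative placeholders.

class BuildPipeline:
    """Runs site-building stages in order; any failed check aborts delivery."""

    def __init__(self):
        self.log = []

    def deploy(self, cluster):
        self.log.append(f"deploy:{cluster}")

    def verify(self, cluster):
        # system-, cluster-, and instance-level connectivity checks
        checks = ["system", "cluster", "instance"]
        for level in checks:
            self.log.append(f"verify:{level}")
        return all(self._check(cluster, level) for level in checks)

    def _check(self, cluster, level):
        return True  # placeholder: real checks would probe connectivity

    def deliver(self, cluster):
        self.log.append(f"deliver:{cluster}")
        return "online"

    def run(self, cluster, has_stock_data=False):
        self.deploy(cluster)
        if has_stock_data:
            # persistent clusters wait for migration before verification
            self.log.append("wait:data-migration")
        if not self.verify(cluster):
            raise RuntimeError("verification failed; refusing to deliver")
        return self.deliver(cluster)

status = BuildPipeline().run("unit-a", has_stock_data=True)
```

The key design point is that delivery is gated behind verification, matching the text's warning that a missed check surfaces as a large-scale failure under real traffic.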

Each Tair business cluster has a different water level (utilization). During each full-link stress test before Double 11, the Tair resources a business uses change as its traffic model changes, and the water levels change with them. We therefore schedule Tair resources across clusters: servers are moved from low-water-level clusters to high-water-level ones until every cluster reaches its target water level.
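The cross-cluster scheduling above can be sketched as a greedy rebalancer. The cluster names and the 70% target water level here are illustrative assumptions, not Tair's real configuration.

```python
# Greedy rebalancing sketch: move servers from low-water-level clusters to
# high-water-level ones until every cluster is at or below a target level.

def rebalance(clusters, target=0.70):
    """clusters: {name: (load, servers)}; water level = load / servers.
    Returns {name: servers} after moving whole servers between clusters."""
    servers = {n: s for n, (_, s) in clusters.items()}
    load = {n: l for n, (l, _) in clusters.items()}

    def level(name):
        return load[name] / servers[name]

    moved = True
    while moved:
        moved = False
        hot = max(servers, key=level)
        cold = min(servers, key=level)
        # move one server if the hot cluster is over target and the move
        # does not push the cold cluster above the hot one
        if level(hot) > target and servers[cold] > 1:
            if load[cold] / (servers[cold] - 1) < level(hot):
                servers[cold] -= 1
                servers[hot] += 1
                moved = True
    return servers

plan = rebalance({"buy": (90, 100), "search": (20, 100)})
```

Here the "buy" cluster starts at a 90% water level and "search" at 20%; the rebalancer shifts servers until "buy" drops below the target.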

Data synchronization

With multi-region, multi-unit deployment, the data layer must synchronize data and support multiple business read/write modes. For unitized businesses, we provide the ability to access Tair locally within the unit; for some non-unitized businesses, we also provide more flexible access models. Reducing synchronization latency is something we keep working on; in 2017, cross-unit synchronization reached tens of millions of records per second during Double 11. How to better resolve write conflicts on non-unitized data across multiple units is a question we continue to think about.
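One common way to resolve cross-unit write conflicts is last-writer-wins with a deterministic tiebreaker. This is a generic sketch of that approach, assuming each write carries a (timestamp, unit_id) version; the source does not say this is the scheme Tair actually uses.

```python
# Last-writer-wins sketch for cross-unit conflicts on non-unitized data.
# Each record is (value, timestamp, unit_id): the higher timestamp wins,
# and ties break on unit_id so every unit converges to the same value.

def merge(local, remote):
    merged = dict(local)
    for key, rec in remote.items():
        cur = merged.get(key)
        if cur is None or (rec[1], rec[2]) > (cur[1], cur[2]):
            merged[key] = rec
    return merged

a = {"k": ("v1", 100, "unit-a")}
b = {"k": ("v2", 100, "unit-b")}
# merge order does not matter: both units converge to the same state
converged = merge(a, b)
assert converged == merge(b, a)
```

The deterministic tiebreaker is what makes the merge commutative: without it, two units receiving the writes in different orders could keep different values forever.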

Performance optimization to cut costs

Even as traffic grows linearly, server costs must not grow with it: we cut unit server cost by 30% to 40% per year. We achieve this goal mainly through server-side performance optimization, client-side performance optimization, and solutions tailored to different businesses.

First, let's look at how we improve performance and reduce costs on the server side. The work falls into two major parts: avoiding thread scheduling and context switches by reducing lock contention and going lock-free; and using a user-space network stack with DPDK to achieve run-to-completion processing.

In-memory data structures

After the process starts, we allocate a large chunk of memory up front and format it ourselves, mainly into a slab allocator, a hashmap, and memory pools. When memory fills up, data is evicted along the LRU chain. As server CPU counts keep growing, overall performance is hard to improve unless lock contention is well controlled.
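The slab-allocator idea mentioned above can be illustrated with a minimal sketch: pre-carved memory is divided into fixed size classes, and each allocation comes from the smallest class that fits. The size classes and page capacity here are illustrative, not Tair's real parameters.

```python
# Minimal slab-allocator sketch: pages are carved into fixed-size slots per
# size class, so allocation/free are O(1) list operations with no heap calls.

class SlabAllocator:
    SIZE_CLASSES = [64, 128, 256, 512, 1024]  # bytes; illustrative

    def __init__(self, page_slots=4):
        self.page_slots = page_slots
        # free slot lists per size class, refilled one page at a time
        self.free = {size: [] for size in self.SIZE_CLASSES}
        self.next_page = 0

    def _class_for(self, size):
        for c in self.SIZE_CLASSES:
            if size <= c:
                return c
        raise ValueError("object too large for slab classes")

    def alloc(self, size):
        c = self._class_for(size)
        if not self.free[c]:
            # carve a fresh page into fixed-size slots for this class
            page = self.next_page
            self.next_page += 1
            self.free[c] = [(page, slot) for slot in range(self.page_slots)]
        return c, self.free[c].pop()

    def free_slot(self, c, slot):
        self.free[c].append(slot)

sa = SlabAllocator()
cls, slot = sa.alloc(100)  # a 100-byte object lands in the 128-byte class
```

Rounding every object up to a size class wastes a little space per object but eliminates fragmentation-driven heap churn, which is why slab allocation is the standard choice for cache engines.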

Drawing on the literature and on the needs of Tair's own engine, we used fine-grained locks, lock-free data structures, CPU-friendly data structures, and the RCU mechanism to improve engine parallelism. The figure on the left shows per-module CPU consumption before optimization: the network part and the data-lookup part consume the most. After optimization (right), 80% of processing time is spent on networking and data lookup, which matches our expectation.
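Fine-grained locking can be illustrated with lock striping: the map is split across N independent locks so threads touching different buckets rarely contend. This is a generic sketch of the technique (the stripe count is arbitrary), not Tair's engine code, which is lock-free in places and written in a systems language.

```python
# Lock-striping sketch: a hashmap partitioned across N locks, so only
# threads hashing to the same stripe contend with each other.

import threading

class StripedMap:
    def __init__(self, stripes=16):
        self.stripes = stripes
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.buckets = [dict() for _ in range(stripes)]

    def _stripe(self, key):
        return hash(key) % self.stripes

    def put(self, key, value):
        i = self._stripe(key)
        with self.locks[i]:  # only ~1/N of writers contend on this lock
            self.buckets[i][key] = value

    def get(self, key, default=None):
        i = self._stripe(key)
        with self.locks[i]:
            return self.buckets[i].get(key, default)

m = StripedMap()
threads = [threading.Thread(target=m.put, args=(f"k{i}", i)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Compared with one global lock, contention drops roughly in proportion to the stripe count, which is why this is often the first step before moving to fully lock-free structures.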

User-space network stack

After the lock optimizations, we found that a lot of CPU time was still being spent in the kernel, so we replaced the kernel network stack with DPDK + Alisocket. Alisocket uses DPDK to receive packets from the NIC in user space and provides a socket API on top of its own protocol stack, making it easy to integrate. Comparing Tair with memcached and the industry-leading seastar framework, we measured a performance gain of over 10% relative to seastar.

Memory consolidation

As performance improves, the memory needed per unit of QPS drops, so memory becomes the scarce resource. There is another reality: Tair is a multi-tenant system, and because each business behaves differently, pages in a slab are often allocated but far from full, while only a small number of slabs are genuinely fully occupied. The result is that capacity appears exhausted, yet new data cannot be allocated.

We therefore implemented a feature that merges sparsely used pages within the same slab, freeing a large amount of memory. As the figure shows, within each slab we track each page's usage and hang pages on buckets of different utilization levels. During merging, data from low-usage pages is moved into high-usage pages. All associated data structures, including the LRU chain, must be updated accordingly, which amounts to reorganizing the whole memory layout. This feature is especially effective in shared public clusters and, depending on the scenario, can significantly improve memory-usage efficiency.
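The merge itself can be sketched as a two-pointer compaction: fill the fullest pages from the emptiest ones until the sparse pages drain completely. Page capacity and contents here are illustrative, and the sketch omits the LRU-chain fixups the text describes.

```python
# Page-merging sketch: within one slab, move records from low-usage pages
# into high-usage pages, then count how many pages were freed entirely.

def merge_pages(pages, capacity=8):
    """pages: list of lists of records. Returns (packed_pages, freed_count)."""
    # fullest pages first, so sparse pages at the tail drain completely
    pages = sorted((list(p) for p in pages), key=len, reverse=True)
    packed, freed = [], 0
    i, j = 0, len(pages) - 1
    while i <= j:
        dst = pages[i]
        while len(dst) < capacity and i < j:
            src = pages[j]
            take = min(capacity - len(dst), len(src))
            dst.extend(src[:take])
            del src[:take]
            if not src:       # the sparse page is now empty: reclaim it
                freed += 1
                j -= 1
        packed.append(dst)
        i += 1
    return packed, freed

packed, freed = merge_pages([[1, 2], [3], [4, 5, 6], [7]], capacity=8)
```

Four partly used pages (7 records total) compact into a single page, freeing three; in the real engine each record move also has to relink the record in the LRU chain.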

Client optimization

Those were the server-side changes; now for client performance. Our client library runs on the business application servers and consumes their resources, so the lower we push its resource consumption, the better for the cost of the whole system. We optimized the client in two areas: replacing the network framework and adapting to coroutines, moving from Mina to Netty, which raised throughput by 40%; and serialization optimization, integrating Kryo and Hessian, which raised throughput by more than 16%.

Memory grid

How can we reduce the overall cost of Tair and the business together? Tair provides multi-level storage integration to solve this. Take security risk control as an example: reads and writes are heavy and there is a lot of local computation, so we store the data a business machine needs on that machine itself. Most reads then hit the local store, and writes are merged over a time window before the merged result is written to the remote Tair cluster as the final store. With read/write-through, write merging, and Tair's own multi-replica capability, this reduced reads against Tair to 27.68% and writes to 55.75% during Double 11.
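The memory-grid pattern above can be sketched as a local store that absorbs reads plus a per-key write buffer flushed periodically to the remote cluster. The `RemoteStore` stub and the risk-control key name are illustrative stand-ins, not Tair's client API.

```python
# Memory-grid sketch: local reads, coalesced writes, periodic remote flush.

class RemoteStore:
    """Stub standing in for the remote Tair cluster."""
    def __init__(self):
        self.data = {}
        self.write_calls = 0

    def put(self, key, value):
        self.write_calls += 1
        self.data[key] = value

class MemoryGrid:
    def __init__(self, remote):
        self.remote = remote
        self.local = {}
        self.pending = {}  # coalesced writes: at most one entry per key

    def put(self, key, value):
        self.local[key] = value
        self.pending[key] = value  # later writes overwrite earlier ones

    def get(self, key):
        if key in self.local:      # most reads hit the local store
            return self.local[key]
        return self.remote.data.get(key)  # read-through on a local miss

    def flush(self):
        # called periodically: push merged writes to the remote cluster
        for key, value in self.pending.items():
            self.remote.put(key, value)
        self.pending.clear()

remote = RemoteStore()
grid = MemoryGrid(remote)
for i in range(10):
    grid.put("risk:score", i)  # ten local writes to one hot key
grid.flush()                    # ...become a single remote write
```

Ten writes collapse into one remote call, which is the mechanism behind the large reductions in remote read/write volume the text reports.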

The hotspot problem, solved

Cache breakdown

Caches have evolved from single nodes into distributed systems organized by data sharding, but each shard still behaves as a single point. During a big promotion or a breaking-news event, the hot data often lands on one shard, creating single-point access; that one cache node cannot withstand such pressure, and a large number of requests go unanswered. One self-protection measure for a caching system is rate limiting, but for the system as a whole, rate limiting does not help: after limiting, part of the traffic falls through to the database, which can withstand it even less, and the whole system fails anyway.

So the only real solution is for the caching system itself to serve as the terminus of the traffic, whether it comes from a big promotion, breaking news, or a bug in the business. The cache should absorb this traffic and let the business see where the hotspots are.

Hotspot hashing

After exploring a variety of schemes, we adopted hotspot hashing. We had evaluated client-side local caching and a secondary-cache tier; both can mitigate hotspots to some extent, but each has drawbacks: the number of secondary-cache servers is hard to estimate, and local caching affects the application server's memory and performance. Hotspot hashing instead adds a hotzone area directly on the data nodes, and the hotzone stores the hot data. The key steps of the scheme are:

Intelligent identification. Hot data changes constantly, whether frequency hotspots or traffic hotspots. Internally we use a multi-level LRU structure, assigning different weights to place keys on different LRU levels. When the LRU is full, eviction proceeds from the lowest-level LRU chain first, ensuring that high-weight entries are retained.

Real-time feedback and dynamic hashing. When a hotspot is accessed, the appserver and the server cooperate: according to a preset access model, the hot key is dynamically hashed to the hotzones of other data nodes. Every node in the cluster takes on this role.

In this way, traffic that would otherwise hit a single point is spread across a set of machines in the cluster.
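The routing change can be sketched in a few lines. This is a simplified model under assumed names (a `hotzone` label, a request id used to spread traffic); the real scheme's preset access model and appserver/server coordination are more involved.

```python
# Hotspot-hashing sketch: normal keys route to their home shard, but a key
# flagged hot is re-hashed across every node's hotzone, so no single node
# bears the full hot-key load.

def route(key, hot_keys, nodes, request_id):
    """Returns (node, zone). request_id spreads one hot key's requests."""
    if key in hot_keys:
        node = hash((key, request_id)) % nodes  # any hotzone can serve it
        return node, "hotzone"
    return hash(key) % nodes, "data"            # normal single-shard path

nodes = 8
hot = {"sku:big-sale"}
# a hot key's requests fan out over many nodes...
targets = {route("sku:big-sale", hot, nodes, r)[0] for r in range(1000)}
# ...while a cold key keeps its stable home shard
cold = route("user:42", set(), nodes, 0)
```

Note that this sketch only covers reads; keeping the hotzone copies consistent with the home shard is the part the cluster-side machinery has to handle.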

The whole project is very complex, and hotspot hashing delivered a very significant effect on Double 11, absorbing more than 8 million requests per second of hotspot traffic at peak. In the figure on the right, the red line is the projected water level had hotspot hashing not been turned on, and the green line is the actual water level with it on. Without it, many clusters would have exceeded the death level, which is 130% of cluster capacity; with it on, hashing hotspots across the whole cluster brought the water level below the safety line. In other words, without hotspot hashing, many clusters could have failed.

Write hotspots

Write hotspots are handled similarly to read hotspots, mainly by merging write operations. The first step is still to identify the hotspot: if a write is hot, the request is dispatched to a dedicated hotspot-merging thread, which merges write requests over a time window; a timer thread then submits the merged request to the engine layer at the configured merge interval. This significantly reduces the pressure on the engine layer.
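The merging step can be sketched with a hot counter, a common write-hotspot case. The `Engine` stub and the single-window flush are illustrative simplifications of the merge-thread/timer-thread split described above.

```python
# Write-merging sketch: increments to a hot key are buffered within a merge
# window and submitted to the engine as one combined write per key.

class Engine:
    """Stub standing in for the storage engine layer."""
    def __init__(self):
        self.data = {}
        self.commits = 0

    def apply_delta(self, key, delta):
        self.commits += 1
        self.data[key] = self.data.get(key, 0) + delta

class HotWriteMerger:
    def __init__(self, engine):
        self.engine = engine
        self.buffer = {}  # key -> merged delta in the current window

    def incr(self, key, delta=1):
        # handled by the hotspot-merging thread: no engine call yet
        self.buffer[key] = self.buffer.get(key, 0) + delta

    def flush(self):
        # called by the timer thread at each merge cycle
        for key, delta in self.buffer.items():
            self.engine.apply_delta(key, delta)
        self.buffer.clear()

engine = Engine()
merger = HotWriteMerger(engine)
for _ in range(10000):
    merger.incr("like:hot-video")  # 10,000 hot writes in one window
merger.flush()                      # ...reach the engine as one commit
```

The trade-off is bounded staleness: within one merge window the engine lags the true count, which is acceptable for counters but is why only identified hot writes take this path.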

Having passed the test of a full Double 11, we can safely say that read and write hotspots in Tair's caching and KV storage are completely solved.