The Availability and Performance Analysis of Sina Weibo Comments
Background
Sina Weibo is a Chinese microblogging website, often described as the Chinese version of Twitter. According to the Weibo annual report, Weibo has about 225 million average DAUs (Daily Active Users).
Requirement
Users should be able to comment on any Weibo post or comment.
Capacity Estimation and Constraints
Post Weibo
Let's assume each user makes, on average, 1 text-only post per day. The total number of posts is then around 250 million per day (rounding up from 225 million DAUs for easier math).
Most people use Weibo during 8:00~9:00 in the morning, 12:00~13:00 at noon, and 20:00~22:00 in the evening. Assuming these 4 peak hours account for 60% of all posts, the average TPS of posting during these hours is:
250 million * 60% / (4 * 3600) ≈ 10 K/s
Read Weibo
Let's assume each post gets, on average, 100 views. The total number of views is then 25 billion per day. The QPS of reading Weibo is:
25 billion * 60% / (4 * 3600) ≈ 1000 K/s
Comment Weibo
Assume each post gets 5 comments on average. The total number of comments is then 250 million * 5 = 1.25 billion per day. Using the same peak-hour assumption as posting Weibo, the TPS is:
1.25 billion * 60% / (4 * 3600) ≈ 50 K/s
Read Comment
Assuming comments receive the same total number of views as posts, we have 25 billion comment reads per day. The QPS of reading comments is:
25 billion * 60% / (4 * 3600) ≈ 1000 K/s
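As a sanity check, here is a minimal Python sketch of the back-of-envelope math above, using only the figures already stated (250 million posts/day, 100 views and 5 comments per post, 60% of daily traffic in 4 peak hours):

```python
# Back-of-envelope traffic estimates from the assumptions stated in the text.
PEAK_SECONDS = 4 * 3600   # 4 peak hours
PEAK_SHARE = 0.60         # 60% of daily traffic falls in the peak

def peak_rate(daily_total: float) -> float:
    """Average requests per second during the peak hours."""
    return daily_total * PEAK_SHARE / PEAK_SECONDS

posts_per_day = 250e6
post_views_per_day = posts_per_day * 100    # 25 billion
comments_per_day = posts_per_day * 5        # 1.25 billion
comment_views_per_day = 25e9                # same as post views, per the assumption above

print(f"Post Weibo TPS:   {peak_rate(posts_per_day):,.0f}/s")         # ~10 K/s
print(f"Read Weibo QPS:   {peak_rate(post_views_per_day):,.0f}/s")    # ~1000 K/s
print(f"Comment TPS:      {peak_rate(comments_per_day):,.0f}/s")      # ~50 K/s
print(f"Read comment QPS: {peak_rate(comment_views_per_day):,.0f}/s") # ~1000 K/s
```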
Post Weibo
Analysis
Posting Weibo is a write operation, so we rely on load balancing to spread the traffic.
With such a volume, we will use a multi-level load-balancing architecture covering DNS -> F5 -> Nginx -> Gateway.
Design
1. The load-balancing algorithm
Only logged-in users can make a post, and the login status is generally stored in a distributed cache, so a posting request can be sent to any server. Here we can use a round-robin or random algorithm, as sketched below.
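A minimal sketch of these two stateless strategies, assuming an illustrative list of server names (nothing here is Weibo's real topology):

```python
import itertools
import random

# Because login state lives in a distributed cache rather than on any single
# server, any server can handle any request, so stateless picking is safe.
servers = ["post-svc-01", "post-svc-02", "post-svc-03"]  # placeholder names

round_robin = itertools.cycle(servers)

def pick_round_robin() -> str:
    """Rotate through the servers in order."""
    return next(round_robin)

def pick_random() -> str:
    """Pick any server uniformly at random."""
    return random.choice(servers)

for _ in range(4):
    print(pick_round_robin(), pick_random())
```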
2. Number of servers
Posting Weibo involves several key steps: content auditing, writing the data to storage (handled by the storage system), and writing the data to the cache (handled by the cache system). We therefore estimate that a single business server can process about 500 requests per second; handling 10 K/s TPS requires 20 servers, and with some buffer, 25 servers should be enough.
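A small sizing sketch using the numbers above; the 25% headroom factor is an assumption chosen to reproduce the "20 servers plus some buffer ≈ 25" estimate:

```python
import math

# Sizing the posting service: 10 K/s peak TPS, ~500 requests/s per server.
peak_tps = 10_000
per_server_capacity = 500
headroom = 0.25  # assumed buffer factor

base = math.ceil(peak_tps / per_server_capacity)   # 20 servers
with_buffer = math.ceil(base * (1 + headroom))     # 25 servers
print(base, with_buffer)
```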
Read Weibo
Analysis
Reading Weibo is a read-heavy operation, and since a post cannot be edited after it is published, it is a typical use case for caching. With such a volume of requests (25 billion per day), we also need the multi-level load-balancing architecture.
Design
1. The load-balancing algorithm
Anyone can view Weibo, even without logging in, so the request can be sent to any server. We can use a round-robin or random algorithm.
2. Number of servers
Assuming the CDN handles 90% of user traffic, the remaining 10% of read requests hit the system directly, so the request QPS is 1000 K/s * 10% = 100 K/s. Since the logic of reading Weibo is relatively simple (it mainly reads from the cache system), we assume a single business server can handle 1000 requests per second, which gives 100 servers. With a 20% reserve, the final number is 120 servers.
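The same sizing math for the read path, with the CDN offload folded in (all numbers are the assumptions stated above):

```python
import math

# Sizing the read service: 1000 K/s total QPS, 90% absorbed by the CDN,
# ~1000 requests/s per server, and a 20% reserve.
total_qps = 1_000_000
cdn_offload = 0.90
per_server_capacity = 1_000
reserve = 0.20

origin_qps = total_qps * (1 - cdn_offload)          # 100 K/s reaches our servers
base = math.ceil(origin_qps / per_server_capacity)  # 100 servers
final = math.ceil(base * (1 + reserve))             # 120 servers
print(origin_qps, base, final)
```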
Multi-level load-balancing architecture
The caching architecture
Post Comment
Similar to posting Weibo, this is a write-heavy operation, so we will use the same multi-level load-balancing architecture.
At 50 K/s TPS and 500 requests per second per server, we will need 100 servers + a 20% buffer (20 servers) = 120 servers.
Design
Since the comment feature does not require a very strict latency SLA, we can process comments asynchronously to make the system more scalable and efficient.
We push all comment events into a message queue, and workers consume those events (jobs) and update the cache and database asynchronously.
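A minimal in-process sketch of this pipeline, using Python's standard queue and a background thread as stand-ins for a real message broker and worker fleet; the handler name and event fields are illustrative:

```python
import queue
import threading

# In production this queue would be a real message broker (e.g. Kafka) and the
# worker would update the cache and database; both are simplified here.
comment_queue: "queue.Queue[dict]" = queue.Queue()

def post_comment(user_id: int, post_id: int, text: str) -> None:
    """API handler: enqueue the comment event and return immediately."""
    comment_queue.put({"user_id": user_id, "post_id": post_id, "text": text})

def comment_worker() -> None:
    """Background worker: consume events and persist them asynchronously."""
    while True:
        event = comment_queue.get()
        # write_to_database(event) and update_cache(event) would go here
        print("persisted comment", event)
        comment_queue.task_done()

threading.Thread(target=comment_worker, daemon=True).start()

post_comment(42, 1001, "Nice post!")
comment_queue.join()  # wait until the worker has drained the queue
```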
Read Comment
Similar to reading Weibo, this is a read-heavy operation, so we will use the same architecture (caching + multi-level load balancing).
With the same QPS, we need the same number of servers: 120.
The Weibo Architecture
Hot Events
When a hot event (a trending incident) is going on, there are several actions we can take to protect our servers.
Service downgrade
What is service downgrade?
When server pressure increases sharply, we strategically stop processing some services and pages, or process them in a simplified way, based on the actual business usage and traffic. This frees up server resources to keep the core business running normally and efficiently.
In our case, the core business is posting Weibo and reading Weibo.
The non-core business is posting comments and viewing comments.
So when a downgrade is needed, we can downgrade posting comments (temporarily stop accepting new comments, or queue them on the client and retry later) and viewing comments (temporarily not showing any comments), as sketched below.
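A minimal sketch of this downgrade logic using in-memory feature flags; the flag names, responses, and trigger are illustrative assumptions, not a real Weibo mechanism:

```python
# When the system is under heavy pressure, non-core features (posting and
# viewing comments) are switched off while core features stay available.
FLAGS = {"post_comment": True, "read_comment": True}

def enter_downgrade_mode() -> None:
    """Called when server pressure crosses an operational threshold."""
    FLAGS["post_comment"] = False
    FLAGS["read_comment"] = False

def handle_post_comment(request: dict) -> dict:
    if not FLAGS["post_comment"]:
        # Ask the client to retry later (or queue the comment client-side).
        return {"status": 503, "message": "Comments are temporarily disabled"}
    return {"status": 200}

def handle_read_comment(post_id: int) -> dict:
    if not FLAGS["read_comment"]:
        # Degrade gracefully: show the post without its comments.
        return {"status": 200, "comments": []}
    return {"status": 200, "comments": ["..."]}
```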
Service Fuse (Circuit Breaking)
A service fuse works like the fuse in our home: when a downstream service is unavailable or its responses time out, calls to that service are temporarily stopped to prevent the entire system from avalanching.
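A minimal circuit-breaker sketch of this idea: after a number of consecutive failures, calls are rejected immediately for a cool-down period instead of piling up. The threshold and cool-down values are illustrative:

```python
import time

class CircuitBreaker:
    """Reject calls to a failing downstream service until it has time to recover."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.failure_count >= self.failure_threshold:
            if time.time() - self.opened_at < self.cooldown_seconds:
                # Circuit is open: fail fast without calling downstream.
                raise RuntimeError("circuit open: downstream call skipped")
            self.failure_count = 0  # cool-down over, allow a retry
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failure_count = 0  # success resets the failure counter
        return result
```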
Rate Limit (Flow control)
If our servers still cannot handle the volume after the service downgrade (with only the core business available), we can use rate limiting to protect them by simply capping the request volume, so that we can at least still serve some users.
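A minimal token-bucket rate-limiter sketch; the rate and capacity values are illustrative, not Weibo's real limits:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, then cap throughput at `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate_per_second=1000, capacity=2000)
for _ in range(2500):
    if not limiter.allow():
        print("429 Too Many Requests")  # requests beyond the budget are rejected
        break
```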
Copyright notice: this is an original article by InfoQ author David.
Original link: http://xie.infoq.cn/article/a36b27c155930f10f5098c8fc. Please contact the author before reprinting.