细微之处决定胜败：从云厂商事故报告中学习经验教训

作者：WuKongCoder

2023-11-28
浙江
本文字数：5110 字
阅读完需：约 17 分钟

介绍：

在云计算时代，各大主流云服务提供商在全球范围内提供了强大的基础设施和服务。然而，随着云服务的不断普及，事故不可避免地发生。

软件时代，可以说没有一个不存在 bug 的软件，同样，在云计算时代，我们也可以说没有一个云服务是绝对没有事故的。

为了透明化运营、分享教训，各大云厂商也会发布相关事故报告，详细说明出现故障的根本原因以及采取的应对措施。

在这篇文章中，我们将深入研究各大云厂商的事故报告，从中汲取经验教训。我们将学习它们如何从事故中吸取的宝贵教训。让我们一同揭示这些云计算领域的"细微之处"，它们可能决定了胜败的关键。

省流版：

国外：

AWS，Azure，Google 各有千秋，Azure，Google 的事故报告更为全面以及详细，AWS 只是对一些影响面较大的事件做了详细的报告，总体上都很不错。

国内：

阿里云一枝独秀，对齐国外云厂商，风格上也是有自己的理解，偏时间线叙述，也是对一些影响面较大的事件会做详细的根因分析以及改进措施。

腾讯云和华为云基本上还没有关注这块的建设，有待提高。

同时腾讯云竟然海外版还会专门写一篇详尽的事件报告，反而国内这块每次只是相当于发布一个公告而已，让人很是困惑。

详细篇

AWS 云篇

AWS 事后摘要：

https://aws.amazon.com/cn/premiumsupport/technology/pes/

AWS Health Dashboard 提供全球 AWS 服务的最新运行状况信息以及过去 12 个月的服务历史记录。当某个问题对客户产生广泛而重大的影响，导致很大比例的控制平面 API 调用失败，影响很大比例的服务基础设施、资源或 API，或者是由于总电源故障或严重网络故障导致时，AWS 致力于在问题结束后提供公开的事后摘要 (AWS Post-Event Summaries)。事件后摘要将保留至少 5 年，并将提供问题的影响范围、导致问题的因素以及为解决已识别风险而采取的行动。

截止 23 年 11 月份底，官网目前保留的事后摘要一共 16 篇，从 2011 到 2023 年跨度。这些事故可以说 AWS 中影响比较大的事故了，推荐大家有时间阅读下，尤其是云计算从业者，可以说这些资料都是很有用处的，学习前人踩过的坑，有益于自己避坑。

阿里云篇

阿里云的事故报告虽然只包含最近一年的（23 年截止目前 17 篇），但是属于所有事件并且每个事件按清晰的时间线叙述，很方便大家阅读。

阿里云历史事件（展示最近一年阿里云发生过的事件）：

https://status.aliyun.com/#/historyEvent

同时阿里云上比较重大的事故也会有对应的事故报告，比如最近 11 月 12 号控制台服务异常，也会有事件清晰的发生时间线以及后面的事故报告，这点还是值得称赞的。

华为云篇

目前这块比较欠缺，待完善，对齐其他云厂商

华为云服务历史事件看板：

https://status.service.huaweicloud.com/intl/#/summary

这里只标识了三个数字，未解决问题数（0），历史已解决问题数（1），问题总数（1），具体什么问题以及内容一概没有，只能说这个页面相对单薄了一点。

难道历史上真的只有过一次问题发生？我理解华为云的服务稳定性不会比阿里云高出一两个级别吧？这个历史事件看板如此单薄，我个人理解只能是目前云厂商没有注重或者敢于将事故报告公布给广大客户，直面问题的勇气。

腾讯云篇

分国内版，海外版两者差异很大，希望后续能够保持一致

国内版：

腾讯云事件后摘要看板：

https://cloud.tencent.com/announce

对的，你没看错。。。，我通过官网点击进来竟然是公告板，可能目前腾讯云没有将事件后摘要单独分类出来，暂且理解，我通过搜索关键字“故障”，还是能够查询到从 18 年到现在，大概 4 页的相关事件的公告的。具体内容我大致看了下，基本都是一行字，大致的含义是：相关服务发生故障，当前已恢复，等等。

海外版：

海外版就比较有意思了。腾讯云事件后摘要看板（海外版）：

https://www.tencentcloud.com/zh/render/details7

官网声称包含至少五年内的事件后摘要 notice：post-event summaries will remain available for a minimum of 5 years

海外版，虽然只有一篇事件报告，这个暂且不问原因，我们就先“学习下”内容。就是一篇超级标准的事故包含，堪称优秀。这里我就不做翻译了，直接原文摘录，供大家一起学习。

Problem Statement
On Wednesday, April 18, 2023, there was an issue with the database instance IP used by a production environment in the cloud monitoring platform. As a result, certain parts of the cloud monitoring console experienced abnormal functionality. The issue persisted from 17:00 to 17:43 UTC+8. We sincerely apologize for the inconvenience caused by the abnormal status of the cloud monitoring service, which had a negative impact on your user experience.
Incident Background
In a production environment of the cloud monitoring platform, there was an unfortunate incident where a database instance was mistakenly detached from its migration identifier before the migration process was completed. As a result, the database without the necessary identifier became inaccessible through the old IP address, particularly during high-load high-availability (HA) switching scenarios. This resulted in abnormal database connections within the production environment services.
What was the specific reason for the unsuccessful switch?
The high-load condition prompted a routine high-availability master-slave switch for the CDB. However, the database instance had not completed its migration and was mistakenly labelled as migrated. As a result, the previous Virtual IP (VIP) became invalid during the switch, leaving only the new VIP accessible. This discrepancy led to abnormal database connections for the old VIP. Consequently, any traffic that had not completely transitioned to the new IP was unable to access the database, resulting in connection failures and subsequent service unavailability.
Was there any data loss?
We want to assure you that no data loss occurred during the incident.
What happened during the incident?
Start time of the incident: At UTC time 16:38, a high-load alert was triggered in the cluster, indicating that the CPU utilization exceeded 95%.
Trigger of the incident: At 16:38 UTC, the occurrence of slow queries began to rise. This was followed by a decline in the success rate of the business layer at 16:58 UTC. Simultaneously, the number of slow queries reached its peak.
Troubleshooting process: The troubleshooting process involved examining the service logs of the business layer. It was discovered that a specific database IP was encountering abnormal access. Upon investigation, it was determined that the database instance associated with that IP had been experiencing a high load and had undergone a high-availability failover, rendering the old IP invalid. Subsequent investigation revealed that the instance had been incorrectly labelled as migrated.
Steps Taken:
All configurations utilizing the old IP in the configuration center were scanned, and the database access IP address was updated to an available IP address.
The database operations team was contacted to manually initiate high-availability (HA) and restore the availability of the old address through manual intervention.
Incident Recovery: After restoring the database access, the success rate of the business layer service interface improved, and the functionality of the cloud monitoring console returned to normal.
Impact
1.The alarm console became non-functional, affecting the alarm history display and hindering users from performing regular console operations.2.Alarm notifications encountered issues with retrieving alarm notification message content, resulting in the failure to send alarm notifications as intended.3.The Dashboard console became inaccessible, preventing users from accessing monitoring data and displaying error messages indicating operation failures.
Next Steps and Action Plan
The following measures will be implemented to prevent a recurrence of the incident.
A thorough review of the migration status for all database instances requiring migration will be conducted. The database instances will be marked as completed only when no access records are associated with the old IP.
Accelerated the migration progress to ensure the completion of all pending database instance migrations by the first half of 2023.
Implemented standardization of database usage, enhanced monitoring capabilities, performed proactive capacity expansion for high-load instances, and optimized slow query logic to eliminate any inefficiencies.