The Devil Is in the Details: Lessons Learned from Cloud Providers' Incident Reports
Introduction:
In the era of cloud computing, the major cloud service providers offer powerful infrastructure and services worldwide. Yet as cloud services keep spreading, incidents inevitably happen.
Just as there is no software without bugs, there is no cloud service that is absolutely free of incidents.
To operate transparently and share lessons learned, the major cloud providers publish incident reports that explain in detail the root cause of a failure and the countermeasures taken.
In this article, we dig into the incident reports of the major cloud providers and draw lessons from them. Let's uncover together the "small details" of cloud computing that can decide success or failure.
TL;DR:
International providers:
AWS, Azure, and Google each have their strengths. Azure's and Google's incident reports are more comprehensive and detailed, while AWS publishes detailed reports only for events with a large blast radius. Overall, all three do a good job.
Domestic providers:
Alibaba Cloud stands out and is on par with the international providers. It has its own take on the format, leaning toward timeline-style narration, and it also publishes detailed root-cause analyses and improvement measures for high-impact events.
Tencent Cloud and Huawei Cloud have essentially not invested in this area yet and have room to improve.
Puzzlingly, Tencent Cloud's international site even publishes a dedicated, thorough incident report, while the domestic site each time posts little more than an announcement.
In detail
AWS
AWS Post-Event Summaries:
https://aws.amazon.com/cn/premiumsupport/technology/pes/
The AWS Health Dashboard provides up-to-date status information for AWS services worldwide, plus a 12-month service history. When an issue has a broad and significant impact on customers, for example when a large percentage of control-plane API calls fail, a large percentage of a service's infrastructure, resources, or APIs are affected, or the cause is a total power failure or a severe network failure, AWS commits to publishing a public post-event summary (AWS Post-Event Summaries) once the issue has ended. Post-event summaries are retained for at least 5 years and describe the scope of impact, the contributing factors, and the actions taken to address the identified risks.
As of the end of November 2023, the site holds 16 post-event summaries, spanning 2011 to 2023. These cover AWS's more impactful incidents, and they are well worth reading when you have time, especially for cloud practitioners: learning from the pitfalls others have already hit helps you avoid them yourself.
Alibaba Cloud
Alibaba Cloud's incident reports only cover the most recent year (17 so far in 2023), but they include every event, and each one is narrated along a clear timeline, which makes them easy to read.
Alibaba Cloud event history (showing events on Alibaba Cloud over the past year):
https://status.aliyun.com/#/historyEvent
Major incidents on Alibaba Cloud also get dedicated incident reports. The console outage on November 12, for example, came with a clear timeline of events and a follow-up incident report, which deserves credit.
Huawei Cloud
This area is currently lacking and needs to be built out to match the other cloud providers.
Huawei Cloud service event history dashboard:
https://status.service.huaweicloud.com/intl/#/summary
The page shows only three numbers: unresolved issues (0), historically resolved issues (1), and total issues (1). There is nothing about what the issue actually was or its details, so the page feels rather thin.
Has there really been only one issue in its entire history? I doubt Huawei Cloud's service stability is a level or two above Alibaba Cloud's. My personal reading of such a sparse event history dashboard is that the provider has not yet prioritized publishing incident reports to its customers, or lacks the courage to face problems head-on.
Tencent Cloud
There are separate domestic and international versions, and the two differ greatly; hopefully they will be brought in line in the future.
Domestic version:
Tencent Cloud post-event summary dashboard:
https://cloud.tencent.com/announce
Yes, you read that right: clicking through from the official site lands you on a general announcement board. Perhaps Tencent Cloud simply has not broken post-event summaries out into their own category yet, which I can live with for now. Searching for the keyword "故障" (failure) does turn up roughly 4 pages of related announcements from 2018 to the present. Skimming them, they are almost all a single line to the effect of: the affected service experienced a failure and has now recovered.
International version:
The international version is more interesting. Tencent Cloud post-event summary dashboard (international):
https://www.tencentcloud.com/zh/render/details7
The site states that post-event summaries are kept for at least five years: "post-event summaries will remain available for a minimum of 5 years".
The international version has only one incident report so far; let's set aside the question of why and "study" its content first. It is a textbook incident report, genuinely excellent. I quote it verbatim below so we can learn from it together.
Problem Statement
On Wednesday, April 18, 2023, there was an issue with the database instance IP used by a production environment in the cloud monitoring platform. As a result, certain parts of the cloud monitoring console experienced abnormal functionality. The issue persisted from 17:00 to 17:43 UTC+8. We sincerely apologize for the inconvenience caused by the abnormal status of the cloud monitoring service, which had a negative impact on your user experience.
Incident Background
In a production environment of the cloud monitoring platform, there was an unfortunate incident where a database instance was mistakenly detached from its migration identifier before the migration process was completed. As a result, the database without the necessary identifier became inaccessible through the old IP address, particularly during high-load high-availability (HA) switching scenarios. This resulted in abnormal database connections within the production environment services.
What was the specific reason for the unsuccessful switch?
The high-load condition prompted a routine high-availability master-slave switch for the CDB. However, the database instance had not completed its migration and was mistakenly labelled as migrated. As a result, the previous Virtual IP (VIP) became invalid during the switch, leaving only the new VIP accessible. This discrepancy led to abnormal database connections for the old VIP. Consequently, any traffic that had not completely transitioned to the new IP was unable to access the database, resulting in connection failures and subsequent service unavailability.
Was there any data loss?
We want to assure you that no data loss occurred during the incident.
What happened during the incident?
Start time of the incident: At UTC time 16:38, a high-load alert was triggered in the cluster, indicating that the CPU utilization exceeded 95%.
Trigger of the incident: At 16:38 UTC, the occurrence of slow queries began to rise. This was followed by a decline in the success rate of the business layer at 16:58 UTC. Simultaneously, the number of slow queries reached its peak.
Troubleshooting process: The troubleshooting process involved examining the service logs of the business layer. It was discovered that a specific database IP was encountering abnormal access. Upon investigation, it was determined that the database instance associated with that IP had been experiencing a high load and had undergone a high-availability failover, rendering the old IP invalid. Subsequent investigation revealed that the instance had been incorrectly labelled as migrated.
Steps Taken:
All configurations utilizing the old IP in the configuration center were scanned, and the database access IP address was updated to an available IP address.
The database operations team was contacted to manually initiate high-availability (HA) and restore the availability of the old address through manual intervention.
Incident Recovery: After restoring the database access, the success rate of the business layer service interface improved, and the functionality of the cloud monitoring console returned to normal.
Impact
1. The alarm console became non-functional, affecting the alarm history display and hindering users from performing regular console operations.
2. Alarm notifications encountered issues with retrieving alarm notification message content, resulting in the failure to send alarm notifications as intended.
3. The Dashboard console became inaccessible, preventing users from accessing monitoring data and displaying error messages indicating operation failures.
Next Steps and Action Plan
The following measures will be implemented to prevent a recurrence of the incident.
A thorough review of the migration status for all database instances requiring migration will be conducted. The database instances will be marked as completed only when no access records are associated with the old IP.
Accelerated the migration progress to ensure the completion of all pending database instance migrations by the first half of 2023.
Implemented standardization of database usage, enhanced monitoring capabilities, performed proactive capacity expansion for high-load instances, and optimized slow query logic to eliminate any inefficiencies.
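A quick technical aside on the report above: its first action item, marking a database instance as migrated only when no access records are associated with the old IP, is the key safeguard. The following is a minimal, hypothetical sketch of that check in Python; the in-memory access log, the instance names, and the seven-day quiet window are illustrative assumptions, not Tencent Cloud's actual tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Purely illustrative in-memory stand-ins for access logs and migration state;
# these are hypothetical, not Tencent Cloud's actual data model.
ACCESS_LOG: Dict[str, List[datetime]] = {
    "10.0.0.8": [datetime(2023, 4, 17, 9, 30)],  # old VIP that still receives traffic
    "10.0.1.9": [],                               # old VIP that has gone quiet
}
MIGRATION_STATE: Dict[str, str] = {"cdb-001": "migrating", "cdb-002": "migrating"}

def old_ip_is_idle(old_ip: str, window: timedelta, now: datetime) -> bool:
    """True only if no connection has hit old_ip within the given window."""
    return all(ts < now - window for ts in ACCESS_LOG.get(old_ip, []))

def try_finalize_migration(instance_id: str, old_ip: str, now: datetime) -> bool:
    """Mark a migration complete only when the old VIP is truly unused.

    Labelling an instance as migrated while traffic still reaches the old VIP
    means the next HA failover strands those clients, which is exactly the
    failure mode described in the report.
    """
    if not old_ip_is_idle(old_ip, window=timedelta(days=7), now=now):
        print(f"{instance_id}: traffic still seen on {old_ip}, keeping the old VIP")
        return False
    MIGRATION_STATE[instance_id] = "migrated"
    print(f"{instance_id}: {old_ip} idle for 7 days, migration marked complete")
    return True

if __name__ == "__main__":
    now = datetime(2023, 4, 18, 17, 0)
    try_finalize_migration("cdb-001", "10.0.0.8", now)  # blocked: old VIP still in use
    try_finalize_migration("cdb-002", "10.0.1.9", now)  # allowed: old VIP idle
```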
What a well-written incident report: clearly structured, detailed in its description, with a clear follow-up plan. But I do have one question.
Why does only the international version get an incident report this detailed, while the domestic version settles for a one-liner? I don't want to speculate too much here; I only hope the domestic version will eventually publish equally detailed reports for the developers in China who use these services.
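One more takeaway from the report's "Steps Taken" section before we move on: recovery involved scanning the configuration center for entries that still pointed at the old IP and switching them to a reachable address. A client that re-resolves its database endpoint from such a configuration store on every connection attempt, rather than caching a fixed IP, degrades more gracefully when a VIP disappears after an HA failover. The sketch below is a generic illustration of that pattern; the CONFIG_CENTER stand-in, the service name, and the connect_with_refresh helper are hypothetical, not part of any Tencent Cloud API.

```python
import socket
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for a configuration center. During recovery,
# entries that still pointed at the old VIP were updated to a reachable address.
CONFIG_CENTER: Dict[str, Tuple[str, int]] = {
    "monitoring-db": ("10.0.1.8", 3306),  # assumed new VIP after the update
}

def resolve_endpoint(service_name: str) -> Tuple[str, int]:
    """Look up the current database endpoint instead of hard-coding an IP."""
    return CONFIG_CENTER[service_name]

def connect_with_refresh(service_name: str, attempts: int = 3,
                         timeout: float = 2.0) -> socket.socket:
    """Connect to a service, re-resolving its endpoint before every attempt.

    If an HA failover or a migration has invalidated a previously cached VIP,
    the next attempt picks up the updated address from the configuration
    center instead of retrying a dead IP forever.
    """
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        host, port = resolve_endpoint(service_name)  # refresh before each try
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_error = err  # stale or unreachable endpoint; try again
    raise ConnectionError(f"could not reach {service_name}") from last_error

# Example usage (succeeds only if something is actually listening there):
# conn = connect_with_refresh("monitoring-db")
```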
Azure
Azure status history:
https://azure.status.microsoft/en-us/status/history/
It retains the last five years of events and supports searching by service. Each incident comes with a root cause analysis (RCA), and every write-up follows a fixed format covering:
What happened?
What went wrong and why?
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
These are very detailed, worth reading in depth, and highly recommended.
Google
Google Cloud incident history:
https://status.cloud.google.com/summary
Google's incident history dashboard is also organized by product, but unlike Azure's it does not support search: you first have to locate the product, and only then can you see its historical incidents. The individual incident descriptions are very detailed and also follow a timeline narration, similar in style to Alibaba Cloud's, but with very thorough detail.
Summary:
By studying the incident reports of the major cloud providers, we can draw valuable lessons that help us better understand and respond to the challenges of the digital era. The transparency and professionalism that the international providers show when facing problems set an example worth learning from.
At the same time, we should recognize that domestic cloud providers still have room to improve in the transparency of their incident reports and in their remediation measures. Although cloud computing has developed rapidly in China, we also need to focus on improving transparency, digging into root causes, and continuously improving service quality. In the digital era, an excellent cloud service needs not only an efficient technical architecture but also fast incident response and comprehensive remediation.
By studying and borrowing from the experience of international cloud providers, domestic providers can better understand and mitigate potential risks and improve the reliability and stability of their services. This matters not only for the providers' own growth but also for the overall security and stability of the digital era. As the technology keeps evolving, I am confident that domestic cloud providers will catch up and deliver more reliable and efficient cloud services to their users.
Cloud computing has become a must-learn subject for today's engineers, and I encourage you to keep learning about it and sharpening your professional skills. If you are interested and have questions or want to chat, feel free to add my personal WeChat coder_wukong (note: 云计算) or follow my public account WuKongCoder, where I occasionally publish articles and reflections.
If you found this article worthwhile, please like, comment, share, and bookmark it. Thank you all, and see you in the next article.
References:
Alibaba Cloud event history (events on Alibaba Cloud over the past year)
https://status.aliyun.com/#/historyEvent
Google Cloud History of incidents reported by product
https://status.cloud.google.com/summary
Azure status history
https://azure.status.microsoft/en-us/status/history/
AWS Post-Event Summaries
https://aws.amazon.com/cn/premiumsupport/technology/pes/
Huawei Cloud health status
https://status.service.huaweicloud.com/intl/#/home
Tencent Cloud health status
https://status.cloud.tencent.com/
Tencent Cloud health status (international)
https://www.tencentcloud.com/zh/support/security/health-dashboard
Tencent Cloud Incident Report (tcop)
Copyright notice: this is an original article by InfoQ author WuKongCoder.
Original link: http://xie.infoq.cn/article/59081d0e7c119fda3f4ac5f96. Please contact the author before republishing.