Investigating Kernel issue - Rolling outages on Stack 2
Scheduled Maintenance Report for Zoey
Completed
As of this time we consider this incident closed. On December 27, 2016 at approximately 8pm EST, Zoey experienced a failure in one of the computing nodes in its Stack 2 Computing Cluster. The failure was corrected with a simple restart of services, and the outage lasted approximately twenty minutes. Over the following ten days, computing nodes in our Stack 2 cluster continued to fail at random, one at a time. Despite intensive troubleshooting and more than fifteen distinct attempted fixes, we were initially unable to resolve the problem. Each of these outages was short, lasting less than twenty minutes, and affected less than 10% of Zoey customers at any given time. Through further troubleshooting we discovered that the error was caused by a bug in the underlying Linux kernel, and we worked with the appropriate third-party team to implement a patch. At this time Zoey believes this error has been corrected. Further information about the patch can be found at https://patchwork.ozlabs.org/patch/712373/ - please note this is a very short and technical summary of the overall issue.
Posted Jan 09, 2017 - 12:09 EST
Update
We continue to see no interruptions in connectivity and no impact to underlying services. Although we believe that the issue has been mitigated, given the complicated nature of this problem we will leave this incident open until mid-day Monday, January 9, 2017 Eastern Time.
Posted Jan 07, 2017 - 15:40 EST
Update
At this time we are turning back on all underlying Zoey Services that were disabled during troubleshooting. These services are internal tools that Zoey uses, and none of them affect the transactional capabilities of your Store. We believe that the root cause of the issue has been identified and a fix has been put in place. Although we believe that the issue has been mitigated, given the complicated nature of this problem we will leave this incident open until mid-day Monday, January 9, 2017 Eastern Time. A complete write-up will be available shortly thereafter. Thank you again for your patience while we worked through this issue.
Posted Jan 06, 2017 - 15:16 EST
Update
After beginning the store migrations, we identified another possible set of fixes for the problem we are experiencing. We have finished implementing these fixes and are now monitoring the stability of all computing nodes on Stack 2. If this final set of fixes does not correct the issue, we will begin migrating stores to the infrastructure that was set up to receive them. We will post an update this afternoon on our expected plan of action. Please note that these rolling outages have affected a small percentage of Zoey customers at a time and generally result in 15 minutes of downtime once a day. Although we fully understand that this is not acceptable, we want to clarify the scope: this is by no means a total outage, nor has there been any sustained period of downtime. Our Engineers have been monitoring the situation 24x7 and responding at all hours of the day and night whenever an issue has come up. Thank you for your continued patience.
Posted Jan 06, 2017 - 10:52 EST
Update
The fixes that we have deployed have not provided the stability that we have been looking for. Therefore, our plan of action is to create new servers and migrate all affected stores to them. The new servers were created last night, and we are beginning the process of transferring stores to them. We will have more information about this process once it is underway.
Posted Jan 05, 2017 - 11:26 EST
Update
We are currently finishing updating the remaining four computing nodes with our fix. We expect this work to take twenty to thirty minutes. We will post an update as soon as this work is done.
Posted Jan 04, 2017 - 18:48 EST
Verifying
Our Engineering team is currently applying a new patch, which may impact connectivity with your store. If you are affected, downtime should be limited to about twenty minutes. Further information will be available in a postmortem.
Posted Jan 04, 2017 - 16:47 EST
In progress
Ongoing 20-minute rolling outages have been reported across Stack 2. Work continues to investigate and resolve the root cause of this issue.
Posted Jan 04, 2017 - 00:51 EST
Verifying
A handful of sites experienced a momentary outage at 7:56pm EST. This was resolved and stores were back online by 8:13pm EST.
Posted Jan 03, 2017 - 20:18 EST
In progress
We are currently performing an additional step that is causing a brief outage of about twenty minutes for a limited subset of customers. Thank you for your patience.
Posted Jan 03, 2017 - 20:08 EST
Verifying
All Stores are verified to be back online. The majority of affected stores were down for only approximately twenty minutes. We are currently confirming that Stores are functioning and accessible. If you experience any issues accessing your store, please open a Support Ticket at http://support.zoey.com
Posted Jan 03, 2017 - 17:48 EST
Update
Maintenance has been completed and Stores are now coming back online. We appreciate your patience. Please note that the maintenance window extends to 9pm.
Posted Jan 03, 2017 - 17:07 EST
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Jan 03, 2017 - 16:30 EST
Scheduled
Zoey customers may experience rolling outages from 5 p.m. to 9 p.m. EST today. This is the result of an ongoing error condition that causes connectivity to stores to stop working. Our Engineering team first noticed this issue on December 28 and has been working to track down the root cause; the failure is random and occurs with no specific trigger. Our team plans to use this time to investigate the issue further.
Posted Jan 03, 2017 - 16:21 EST