Storage Connectivity Issues

Incident Report for Zoey

Resolved

We are marking this incident resolved, as all stores have been back online and processing transactions for the past couple of hours.

The root cause was a bug in the storage array. Under very rare circumstances, if an unexpected controller restart or power failure occurs, an internal save may not complete correctly. This leaves the array in a state that requires support intervention and the discarding of cached data. Because that cached data was lost, we had to rebuild several key pieces of the affected infrastructure.

Due to the architecture of Zoey, it is possible, although highly unlikely, that up to 1 second of data was lost. This window would have been at approximately 7:10am Eastern Daylight Time. The most probable manifestation would be a customer placing an order during that 1-second window, receiving approval from the payment gateway, and Zoey being unable to complete the transaction due to the outage. The chances of this are extremely low given how short the window is, but if you notice any such issue please open a ticket with Support.

We are now applying updated software to resolve this issue and will also check the remainder of our infrastructure for this potential bug. If you notice any other issues please open a ticket with Support. Thank you for your patience during this time.

Posted Mar 30, 2017 - 20:28 EDT

Monitoring

All stores should now be back online and operational with no data loss. We apologize for this problem and thank you for your patience. We are continuing our final testing and cleanup procedures, so there may be a brief additional window of downtime (less than 20 minutes) as we finish restoration. We will post an additional update when we have completed this work.

Posted Mar 30, 2017 - 18:35 EDT

Update

At this time all but one computing node have been restored, and stores on the restored nodes should be up and running. We are commencing final testing and cleanup procedures, so there may be a brief additional window of downtime (less than 20 minutes) as we finish restoration. We will post an additional update when we have completed this work.

Posted Mar 30, 2017 - 17:22 EDT

Update

We are now in the final stages of bringing the remaining stores up. We expect to fully resolve this issue within the next few hours barring any unforeseen circumstances.

Posted Mar 30, 2017 - 16:20 EDT

Update

Our recovery efforts appear to be successful on one server, and we are currently verifying that the recovery was complete. We will then extend this process to the remaining servers to restore the remaining stores.

Posted Mar 30, 2017 - 14:33 EDT

Update

We are in the final stages of testing a recovery method to bring sites back online without data loss. We expect these tests to be finished shortly. If they are successful, we will bring additional stores back online soon afterward. A root cause of the issue has also been identified, and once service is restored we will begin taking steps to prevent it from happening again. Thank you for your patience during this process.

Posted Mar 30, 2017 - 13:58 EDT

Update

We continue to work on a recovery plan for the affected sites. We will post more information as soon as possible.

Posted Mar 30, 2017 - 13:18 EDT

Identified

We have identified the issue and are working to bring all sites back online as quickly as possible.

Posted Mar 30, 2017 - 11:45 EDT

Update

We are still working directly with our storage provider to assess and resolve this issue.

Posted Mar 30, 2017 - 11:15 EDT

Update

We are currently in direct contact with our storage provider to assess the issue and develop a resolution plan. Please continue to monitor this page for more information.

Posted Mar 30, 2017 - 10:17 EDT

Update

We are still investigating the root cause of this storage connectivity issue.

Posted Mar 30, 2017 - 09:31 EDT

Investigating

We are currently investigating a storage connectivity issue affecting a number of our stores. Updates will be posted shortly.

Posted Mar 30, 2017 - 08:38 EDT