Degraded API performance and connection timeouts
Incident Report for Gridly
Postmortem

Impact

  • Critical incident
  • Outage on api.gridly.com for 3 hours and 8 minutes

Timeline on 2021-09-07 UTC

  • 07:07 PM - High rate of network connectivity failures to api.gridly.com
  • 10:10 PM - Restarted the proxy layer and deployed a hotfix
  • 10:15 PM - API is back to normal

Root cause analysis (RCA)

  • A brief interruption occurred on our internal load balancer between microservices, most likely an IP change or a broken peering connection; it lasted a very short time (estimated ~1 minute or less)
  • Our edge proxies were not yet properly configured to tolerate such interruptions, so when the internal load balancer blipped, the nginx proxies stopped serving traffic (see the configuration sketch below)
  • With the edge proxies down and their upstream connections lost, the entire API endpoint suffered an outage
  • We hit a similar issue on Aug 28 and deployed a hotfix at the time, but that fix was apparently either no longer in effect or did not cover all interruption cases
  • We have deployed a new strategy for handling a wider range of interruption cases
  • We will continue monitoring for this kind of issue over the next few days
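
As an illustration of the kind of edge-proxy configuration this refers to, the sketch below shows one common nginx pattern for tolerating a brief internal interruption: re-resolve the load balancer's address at request time and retry transient failures instead of failing immediately. The resolver address, hostname, and certificate paths are placeholders, not our production values, and this is a sketch of the approach rather than the exact change that was deployed.

    # Fragment for the server context of an nginx edge proxy (placeholder values).
    server {
        listen 443 ssl;
        server_name api.gridly.com;

        ssl_certificate     /etc/nginx/tls/api.gridly.com.crt;   # placeholder paths
        ssl_certificate_key /etc/nginx/tls/api.gridly.com.key;

        # Re-resolve internal names every 10s instead of only at config load.
        resolver 10.0.0.2 valid=10s;                             # example internal DNS resolver

        location / {
            # Using a variable forces nginx to resolve the name per request via the
            # resolver above, so an IP change on the internal LB is picked up quickly.
            set $internal_lb https://lb.internal.example;        # hypothetical LB hostname
            proxy_pass $internal_lb;

            proxy_connect_timeout 5s;
            proxy_read_timeout    30s;

            # If the name resolves to several addresses, retry the next one on
            # connection errors, timeouts, or 502/503/504 responses.
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_next_upstream_tries 3;
        }
    }
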
Posted Sep 07, 2021 - 22:34 UTC

Resolved
We are currently investigating a high rate of network connectivity failures to api.gridly.com. We have identified the cause of the issue and are working towards a resolution.
Posted Sep 07, 2021 - 19:00 UTC