Degraded API performance and connection timeouts
Incident Report for Gridly
Postmortem

Impact

  • Critical incident
  • Outage on api.gridly.com for 3 hours and 8 minutes

Timeline on 2021-09-07 UTC

  • 07:07 PM - High rate of network connectivity failures to api.gridly.com
  • 10:10 PM - Restarted the proxy layer and deployed a hotfix
  • 10:15 PM - API is back to normal

Root cause analysis (RCA)

  • A brief interruption occurred on our internal load balancer between microservices, most likely an IP change or a broken peering connection; it lasted a very short time (estimated ~1 minute or less)
  • Our edge proxies were not yet properly configured to tolerate such interruptions, so when the internal load balancer blipped, the nginx proxies stopped serving traffic (see the configuration sketch below)
  • With the edge proxies down and their upstream connections lost, the entire API endpoint suffered an outage
  • We hit a similar issue on Aug 28 and deployed a hotfix at the time, but that fix was apparently either no longer in effect or did not cover all interruption cases
  • We have deployed a new strategy for handling a wider range of interruption cases
  • We will continue monitoring for this kind of issue over the next few days
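
As an illustration of the kind of edge-proxy configuration this refers to, the sketch below shows one common nginx pattern for tolerating a brief internal interruption: re-resolve the load balancer's address at request time and retry transient failures instead of failing immediately. The resolver address, hostname, and certificate paths are placeholders, not our production values, and this is a sketch of the approach rather than the exact change that was deployed.

    # Fragment for the server context of an nginx edge proxy (placeholder values).
    server {
        listen 443 ssl;
        server_name api.gridly.com;

        ssl_certificate     /etc/nginx/tls/api.gridly.com.crt;   # placeholder paths
        ssl_certificate_key /etc/nginx/tls/api.gridly.com.key;

        # Re-resolve internal names every 10s instead of only at config load.
        resolver 10.0.0.2 valid=10s;                             # example internal DNS resolver

        location / {
            # Using a variable forces nginx to resolve the name per request via the
            # resolver above, so an IP change on the internal LB is picked up quickly.
            set $internal_lb https://lb.internal.example;        # hypothetical LB hostname
            proxy_pass $internal_lb;

            proxy_connect_timeout 5s;
            proxy_read_timeout    30s;

            # If the name resolves to several addresses, retry the next one on
            # connection errors, timeouts, or 502/503/504 responses.
            proxy_next_upstream error timeout http_502 http_503 http_504;
            proxy_next_upstream_tries 3;
        }
    }
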
Posted Sep 07, 2021 - 22:34 UTC

Resolved
We are currently investigating a high rate of network connectivity failures to api.gridly.com. We have identified the cause of the issue and are working towards a resolution.
Posted Sep 07, 2021 - 19:00 UTC