Degraded API performance and connection timeout

Incident Report for Gridly

Postmortem

Timeline on 2021-08-29 UTC

05:13 AM - Degraded API performance and timeout connections
05:26 AM - Timeout connections are under control. API is back to normal
06:15 AM - Deployed hotfix and improvements to production cluster.

Root cause analysis (RCA)

Internal endpoint rotated IP addresses unexpectedly
Our edge proxy servers had timeout connections on networking between services because of expired IPs
Reload proxy layer for re-init connections to latest IPs
Deploy hotfixes & improvements on DNS caching layer for actively update changes from internal endpoints

Posted Aug 30, 2021 - 04:18 UTC

Resolved

We're experiencing an elevated level of API timeout. This issue has been resolved on infrastructure side.

Posted Aug 29, 2021 - 05:15 UTC