Degraded performance on API requests

Incident Report for Gridly

Postmortem

Impact

Major incident
Degraded performance on api.gridly.com; Gridly was running very slowly.

Timeline

2021-10-04 UTC

01:09 PM - Degraded performance on api.gridly.com
01:35 PM - Enabled maintenance mode for 45 minutes to upgrade the internal database.
02:36 PM - API is back to normal.

2021-10-05 UTC

01:50 AM - Partial outage on api.gridly.com; some internal services were down for 2-3 minutes.
02:40 AM - Deployed hotfix to production. API is back to normal.

Root cause analysis (RCA)

Since Sep 29, 2021, we had been seeing unexpected downtime in one of our internal services, the license service (plan, seat & subscription). At that time, we scaled out to improve high availability.
Our database was under pressure from high traffic. It kept working, but database operations and response times were very slow, which is why we experienced degraded performance on some API endpoints.
We scaled up and upgraded the hardware specification on the database side to help reduce the workload and the impact.
From performance insights and error tracking, we identified the root cause: blockers during task processing.
After identifying the root cause, we deployed a hotfix that optimizes some of the async logic.
Everything is back to normal. We will continue monitoring for this kind of issue over the next few days.
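
This report does not describe what the blockers were or what the hotfix changed. As an illustration only, one common form of this problem in async code is a blocking call made directly inside an async task, which stalls every other task sharing the same event loop. The sketch below assumes Python 3.9+ with asyncio; `slow_query`, `handler_blocking`, `handler_offloaded`, `run_concurrently` and `compare` are hypothetical names and are not taken from Gridly's code.

```python
import asyncio
import time


def slow_query(delay: float) -> str:
    """Hypothetical stand-in for a blocking call, e.g. a synchronous DB driver."""
    time.sleep(delay)
    return "row"


async def handler_blocking(delay: float) -> str:
    # Calls the blocking function directly: the event loop is stalled for
    # the whole duration, so no other task can make progress.
    return slow_query(delay)


async def handler_offloaded(delay: float) -> str:
    # Runs the blocking function in a worker thread; the event loop stays
    # free to serve other tasks in the meantime.
    return await asyncio.to_thread(slow_query, delay)


async def run_concurrently(handler, n: int = 5, delay: float = 0.1) -> float:
    """Run n handlers at once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(n)))
    return time.perf_counter() - start


def compare(n: int = 5, delay: float = 0.1) -> tuple[float, float]:
    """Return (blocking_seconds, offloaded_seconds) for n concurrent handlers."""
    blocking = asyncio.run(run_concurrently(handler_blocking, n, delay))
    offloaded = asyncio.run(run_concurrently(handler_offloaded, n, delay))
    return blocking, offloaded
```

Calling `compare()` should show the blocking variant taking roughly `n * delay` seconds because the calls run one after another, while the offloaded variant overlaps them and finishes in much less time. Offloading to threads is only one possible remedy; a natively asynchronous driver is another.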

Posted Oct 05, 2021 - 05:01 UTC

Resolved

An infrastructure workaround has been implemented and the service is operating normally. We have identified the cause of the issue and are working towards a resolution. We will provide a post-mortem shortly.

Posted Oct 04, 2021 - 15:19 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 04, 2021 - 14:36 UTC

Update

We will keep maintenance mode enabled for the next 15 minutes. We will provide updates as necessary.

Posted Oct 04, 2021 - 14:16 UTC

Update

We're enabling maintenance mode to work on internal services. This is unplanned maintenance and should take less than 30 minutes from now.

Posted Oct 04, 2021 - 13:35 UTC

Investigating

We are currently investigating this issue.

Posted Oct 04, 2021 - 13:09 UTC

This incident affected: API Requests.