19th March 2026 Outage
Generated by James Mooring via incident.io on March 19, 2026 3:06 PM. All timestamps are local to Australia/Brisbane.
Summary
Astalty is currently unavailable for all customers. We are investigating the issue and working on a resolution.
Incident Timeline
Time | Event |
|---|---|
2026-03-19 | |
14:20:55 | Incident reported by James Mooring. Severity: Critical. Status: Investigating. |
14:24:00 | Identified (custom timestamp "Identified at"). |
14:32:06 | Status changed from Investigating → Monitoring. Update from James Mooring: "We are seeing access being restored and we are monitoring the performance of the stall." |
14:37:40 | Incident resolved and closed. Status changed from Monitoring → Closed. Update from James Mooring: "This issue has been resolved." |
Root cause analysis
Root cause
A database migration executed during deployment became stuck and began timing out on a high-traffic table.
This resulted in a buildup of database connections, which progressively degraded performance and ultimately caused the application to become unavailable.
Contributing factors
The migration ran against a high-traffic table, and no safeguard prevented it from using the after and before column-positioning methods, which require a metadata lock on the table.
Maintenance mode messaging displayed outdated information due to previously stored values not being cleared
Technical analysis
Outage Cause
During deployment, a database migration attempted to modify a heavily used table using a column reordering operation. This type of change requires a metadata lock on the table.
Because the table was under active load, the lock could not be acquired promptly, causing the migration to stall and eventually time out.
While waiting for the lock, the migration held open database connections. These connections accumulated over time, eventually exhausting the available connection pool and significantly degrading database performance.
As a result, application servers were unable to process requests, leading to a full outage.
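The buildup described above can be illustrated with a toy model. This is a minimal sketch, not a reconstruction of the incident: the function name and every number in the usage example are hypothetical, and the model assumes that each incoming request takes one connection and holds it until the lock wait ends.

```python
from typing import Optional


def seconds_until_exhaustion(
    pool_size: int,
    requests_per_second: float,
    lock_wait_seconds: float,
) -> Optional[float]:
    """Return how long a connection pool takes to fill while requests are blocked.

    Toy model with hypothetical inputs: every request takes one connection
    and holds it until the lock wait ends. Returns None if the lock wait
    ends before the pool fills.
    """
    if requests_per_second <= 0:
        return None
    time_to_fill = pool_size / requests_per_second
    # The pool is only exhausted if it fills before the lock wait is over.
    return time_to_fill if time_to_fill <= lock_wait_seconds else None


# Hypothetical example: 100 connections, 20 requests/s, 60 s lock wait.
print(seconds_until_exhaustion(100, 20, 60))  # 5.0
```

Under these assumptions, a busy table can exhaust a pool within seconds, which is why a short lock wait limit on migrations matters more than the total migration time.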
Incorrect Maintenance Messaging
As a safeguard against the growing number of connection attempts, James put Astalty into maintenance mode. However, the stored maintenance data still referred to an old maintenance period on 26 January 2026, so the maintenance screen showed outdated details. This confused customers: we have planned maintenance coming up on Friday night, and some customers thought we would be unavailable until then.
Impact assessment
Customer impact
Most customers experienced extreme slowness during the incident window, and some were completely unable to access Astalty via the web or the native app
Duration of impact: approximately 11 minutes
Some customers were shown an incorrect maintenance message with outdated date and time information
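For reference, the intervals between the timeline entries can be computed directly. The sketch below uses only the timestamps from the incident timeline; the helper name is our own. The 11-minute figure appears to match the interval from the incident being reported to the move into Monitoring, while the interval to formal closure is longer.

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day HH:MM:SS timestamps as 'Xm YYs'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds:02d}s"


print(elapsed("14:20:55", "14:32:06"))  # reported -> monitoring: 11m 11s
print(elapsed("14:20:55", "14:37:40"))  # reported -> closed: 16m 45s
```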
System impact
Database performance degraded due to connection exhaustion
Application servers became unresponsive due to inability to acquire database connections
Business impact
Temporary disruption to all active customers
Increase in inbound customer queries due to incorrect maintenance messaging
Resolution steps
Resolution summary
The issue was resolved by rolling back the deployment and restoring normal database operation by clearing the backlog of connections.
Detailed steps
Identified the migration as the source of database contention by inspecting AWS
Rolled back the deployment to stop the migration process
Cleared active database connections to relieve pressure on the database
Monitored system recovery as database performance returned to normal
Verification
Database performance metrics returned to normal levels
Application responsiveness restored across all services
Internal verification confirmed successful request processing
Lessons learned
What went well
The issue was identified and resolved quickly (within 11 minutes)
The team responded immediately and took decisive action to restore service
No data loss occurred
Areas for improvement
Stricter reviews of database migrations that touch high-traffic tables
Additional training for the engineering team about database migrations that can cause locks
Maintenance mode messaging needs to be dynamically generated and validated and should not show past maintenance periods
Improved pre-deployment checks to ensure database column ordering methods cannot be used
Improved reporting so that we have health checks in place to make sure the application can be reached
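As an illustration of the health check item above, a reachability probe could look like the following. This is a minimal sketch under our own assumptions: the function names are hypothetical, no real endpoint is implied, and the probe treats any non-2xx response or network error as "unreachable".

```python
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code (raises on errors and timeouts)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def is_reachable(url: str, fetch_status: Callable[[str], int] = http_status) -> bool:
    """Return True only if the probe gets a 2xx response.

    fetch_status is injectable so the check can be exercised without
    network access. Network errors and timeouts count as unreachable.
    """
    try:
        return 200 <= fetch_status(url) < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Running a probe like this from outside the application's own infrastructure means an outage is detected even when the application servers cannot report on themselves.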
Key takeaways
Database column reordering should be forbidden
Customer-facing messaging must always be accurate, especially during incidents
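The messaging takeaway can be sketched as a rule: show stored maintenance details only while that window is actually in progress, and otherwise fall back to a generic message. The code below is an illustrative sketch, not the current implementation. The type, function name, and message wording are hypothetical, and the window times in the usage example are invented; only the 26 January 2026 date comes from this incident.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

GENERIC_MESSAGE = "We are currently unavailable and are working to restore access."


@dataclass
class MaintenanceWindow:
    starts_at: datetime
    ends_at: datetime


def maintenance_message(stored: Optional[MaintenanceWindow], now: datetime) -> str:
    """Build the maintenance-screen text from a stored window, if it is current.

    Past windows and windows that have not started yet are ignored, so an
    unplanned outage never displays dates from another maintenance period.
    """
    if stored is None or not (stored.starts_at <= now < stored.ends_at):
        return GENERIC_MESSAGE
    return f"Maintenance in progress. Expected to finish by {stored.ends_at:%Y-%m-%d %H:%M %Z}."


# A stale window from January must not be shown during a March outage.
stale = MaintenanceWindow(
    starts_at=datetime(2026, 1, 26, 20, 0, tzinfo=timezone.utc),  # hypothetical time
    ends_at=datetime(2026, 1, 26, 23, 0, tzinfo=timezone.utc),    # hypothetical time
)
print(maintenance_message(stale, datetime(2026, 3, 19, 4, 25, tzinfo=timezone.utc)))
```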
Follow-ups
Follow-up | Owner |
|---|---|
Ensure past maintenance periods are not shown on the maintenance screen | James Mooring |
Implement safeguards to ensure that database column ordering cannot be changed. | Nathan Wuiske |
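One possible form of the column ordering safeguard is a pre-deployment check that scans migration files for positioning calls and fails the build when it finds any. The sketch below makes several assumptions: migrations are text files that call methods literally named after or before using `->` or `.` method-call syntax, and the function names and file glob are hypothetical. Plain text matching can produce false positives (for example in comments), so flagged lines would still need a human look.

```python
import re
from pathlib import Path

# Assumed syntax: a method call such as `->after(` or `.before(`.
# Adjust the pattern to the migration framework actually in use.
REORDER_PATTERN = re.compile(r"(?:->|\.)\s*(?:after|before)\s*\(")


def find_reordering_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line with a positioning call."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if REORDER_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def check_migrations(directory: str, file_glob: str) -> dict[str, list[tuple[int, str]]]:
    """Scan matching files in a directory and return the offending lines per file."""
    results = {}
    for path in sorted(Path(directory).glob(file_glob)):
        hits = find_reordering_calls(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results
```

A deployment pipeline could call `check_migrations` on the migrations directory and abort when the returned dictionary is non-empty.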