19th March 2026 Outage

Generated by James Mooring via incident.io on March 19, 2026 3:06 PM. All timestamps are local to Australia/Brisbane. The original document can be found here.

Summary

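Astalty was unavailable or severely degraded for all customers for approximately 11 minutes on 19 March 2026. A database migration run during deployment stalled on a high-traffic table and exhausted the available database connections. Rolling back the deployment and clearing the connection backlog restored service.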

Incident Timeline

Time	Event
2026-03-19
14:20:55	Incident reported by James Mooring. Severity: Critical. Status: Investigating.
14:24:00	Custom timestamp "Identified at" occurred.
14:32:06	Status changed from Investigating → Monitoring. James Mooring shared an update: "We are seeing access being restored and we are monitoring the performance of the stall."
14:37:40	Incident resolved and closed. Status changed from Monitoring → Closed. James Mooring shared an update: "This issue has been resolved."

Root cause analysis

Root cause

A database migration executed during deployment became stuck and began timing out on a high-traffic table.

This resulted in a buildup of database connections, which progressively degraded performance and ultimately caused the application to become unavailable.

Contributing factors

The migration was executed on a high-traffic table, and there were no safeguards to prevent the use of the after and before column-positioning methods, which require a metadata lock on the table.
Maintenance mode messaging displayed outdated information due to previously stored values not being cleared

Technical analysis

Outage Cause

During deployment, a database migration attempted to modify a heavily used table using a column reordering operation. This type of change requires a metadata lock on the table.

Because the table was under active load, the lock could not be acquired promptly, causing the migration to stall and eventually time out.

While waiting for the lock, the migration held open database connections. These connections accumulated over time, eventually exhausting the available connection pool and significantly degrading database performance.

As a result, application servers were unable to process requests, leading to a full outage.
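One way to limit this failure mode is to make a schema change give up quickly when it cannot get its lock, rather than waiting while other work queues behind it. The sketch below is illustrative only. It assumes a MySQL-compatible database (the engine is not named in this report), where the session variable lock_wait_timeout bounds how long a statement waits for a metadata lock. The function name guarded_ddl, the table and column names, and the 5-second limit are all hypothetical.

```python
from typing import List


def guarded_ddl(ddl: str, lock_wait_seconds: int = 5) -> List[str]:
    """Return the statements to run, in order, for a fail-fast schema change.

    Assumption: a MySQL-compatible engine, where lock_wait_timeout limits
    how long a statement waits to acquire a metadata lock.
    """
    if lock_wait_seconds <= 0:
        raise ValueError("lock_wait_seconds must be positive")
    return [
        # Applies to the current session only, so both statements must be
        # sent over the same database connection.
        f"SET SESSION lock_wait_timeout = {int(lock_wait_seconds)}",
        ddl.strip().rstrip(";"),
    ]


# Hypothetical column reordering change on a hypothetical table.
statements = guarded_ddl("ALTER TABLE bookings MODIFY COLUMN notes TEXT AFTER title;")
for statement in statements:
    print(statement)
```

With a short timeout, a blocked migration fails within seconds and the deployment can be retried at a quieter time, instead of stalling under load.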

Incorrect Maintenance Messaging

As a safeguard against the growing number of connection attempts, James put Astalty into maintenance mode. However, the stored maintenance data still referred to an old maintenance period on 26 January 2026, so the maintenance screen showed outdated date and time information. This confused customers: because we have planned maintenance scheduled for Friday night, some customers thought Astalty would be unavailable until then.
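A minimal sketch of how the maintenance screen could avoid showing stale periods is given here. It is not Astalty's implementation. It assumes maintenance periods are stored with a start and end time, and it shows a period only while that period is actually in progress, falling back to a generic message otherwise. The names MaintenanceWindow, maintenance_message and GENERIC_MESSAGE are hypothetical, as are the hours used for the 26 January period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Australia/Brisbane is UTC+10 all year (no daylight saving).
BRISBANE = timezone(timedelta(hours=10))

GENERIC_MESSAGE = "Astalty is temporarily unavailable. We are working to restore access."


@dataclass(frozen=True)
class MaintenanceWindow:
    starts_at: datetime
    ends_at: datetime


def maintenance_message(windows: Iterable[MaintenanceWindow], now: datetime) -> str:
    """Mention a maintenance window only if it is in progress at `now`."""
    for window in windows:
        if window.starts_at <= now < window.ends_at:
            end = window.ends_at.strftime("%Y-%m-%d %H:%M")
            return f"Astalty is undergoing planned maintenance until {end}."
    # Past and future windows are ignored, so an unplanned outage never
    # displays dates that belong to a different maintenance period.
    return GENERIC_MESSAGE


# Hypothetical stored value left over from the January maintenance.
stale = MaintenanceWindow(
    starts_at=datetime(2026, 1, 26, 20, 0, tzinfo=BRISBANE),
    ends_at=datetime(2026, 1, 26, 23, 0, tzinfo=BRISBANE),
)
incident_time = datetime(2026, 3, 19, 14, 25, tzinfo=BRISBANE)
message = maintenance_message([stale], incident_time)
print(message)
```

Checking the window against the current time when the message is rendered means that a stored value which was never cleared cannot surface again later.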

Impact assessment

Customer impact

Most customers experienced extreme slowness during the incident window, and some were completely unable to access Astalty via the web or the native app
Duration of impact: approximately 11 minutes
Some customers were shown an incorrect maintenance message with outdated date and time information

System impact

Database performance degraded due to connection exhaustion
Application servers became unresponsive due to inability to acquire database connections

Business analysis

Temporary disruption to all active customers
Increase in inbound customer queries due to incorrect maintenance messaging

Resolution steps

Resolution summary

The issue was resolved by rolling back the deployment and clearing the backlog of database connections, which restored normal database operation.

Detailed steps

Identified the migration as the source of database contention by inspecting AWS
Rolled back the deployment to stop the migration process
Cleared active database connections to relieve pressure on the database
Monitored system recovery as database performance returned to normal

Verification

Database performance metrics returned to normal levels
Application responsiveness restored across all services
Internal verification confirmed successful request processing

Lessons learned

What went well

The issue was identified and resolved quickly (within 11 minutes)
The team responded immediately and took decisive action to restore service
No data loss occurred

Areas for improvement

Stricter reviews of database migrations that touch high-traffic tables
Additional training for the engineering team about database migrations that can cause locks
Maintenance mode messaging needs to be dynamically generated and validated and should not show past maintenance periods
Improved pre-deployment checks to ensure database column ordering methods cannot be used
Improved reporting so that we have health checks in place to make sure the application can be reached
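The reachability health check in the last item could look something like the sketch below. It is illustrative only: the URL, the number of attempts and the function names http_status and should_alert are assumptions, and a real monitor would also wait between probes and route the alert somewhere. The probe function is passed in so that the decision logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for `url`, or 0 if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except OSError:
        # Covers URLError, connection failures and timeouts.
        return 0


def should_alert(url: str, attempts: int = 3,
                 probe: Callable[[str], int] = http_status) -> bool:
    """Alert only when every one of `attempts` consecutive probes fails."""
    return all(not (200 <= probe(url) < 300) for _ in range(attempts))


# Hypothetical usage with stand-in probes instead of real network calls.
print(should_alert("https://example.invalid/health", probe=lambda url: 503))  # True
print(should_alert("https://example.invalid/health", probe=lambda url: 200))  # False
```

Requiring several consecutive failures before alerting trades a slightly slower signal for fewer false alarms from a single dropped request.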

Key takeaways

Database column reordering should be forbidden
Customer-facing messaging must always be accurate, especially during incidents
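The first takeaway could be enforced with an automated pre-deployment check rather than by review alone. The sketch below is a rough illustration, not the safeguard the team will implement. It assumes migrations are source files in which column positioning appears as chained calls such as ->after('name') or ->before('id'); that syntax, the sample migration and the name find_reordering are assumptions, and the pattern would need adjusting to the actual migration tooling.

```python
import re
from typing import List, Tuple

# Assumed call syntax for the after and before positioning methods.
REORDER_CALL = re.compile(r"->\s*(after|before)\s*\(")


def find_reordering(source: str) -> List[Tuple[int, str]]:
    """Return (line number, line text) for each column positioning call."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("//"):
            continue  # skip lines that are commented out
        if REORDER_CALL.search(line):
            hits.append((number, stripped))
    return hits


# Hypothetical migration body used only to demonstrate the check.
SAMPLE = """\
$table->string('nickname')->nullable();
$table->string('region')->after('name');
// $table->string('legacy')->before('id');
"""

hits = find_reordering(SAMPLE)
for number, text in hits:
    print(f"line {number}: column positioning is not allowed: {text}")
```

Run in CI over each changed migration file, a non-empty result would fail the build before the migration can reach production.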

Follow-ups

Follow-up	Owner
Ensure past maintenance periods are not shown on the maintenance screen	James Mooring
Implement safeguards to ensure that database column ordering cannot be changed	Nathan Wuiske