19th March 2026 Outage

Generated by James Mooring via incident.io on March 19, 2026 3:06 PM. All timestamps are local to Australia/Brisbane.

Summary

Astalty is currently unavailable for all customers. We are investigating the issue and working on a resolution.

Incident Timeline

Time      Event

2026-03-19

14:20:55  Incident reported by James Mooring. Severity: Critical. Status: Investigating.

14:24:00  Identified at (custom timestamp "Identified at" occurred).

14:32:06  Status changed from Investigating → Monitoring. James Mooring shared an update: "We are seeing access being restored and we are monitoring the performance of the stall."

14:37:40  Incident resolved and closed. Status changed from Monitoring → Closed. James Mooring shared an update: "This issue has been resolved."

Root cause analysis

Root cause

A database migration executed during deployment became stuck and began timing out on a high-traffic table.

This resulted in a buildup of database connections, which progressively degraded performance and ultimately caused the application to become unavailable.

Contributing factors

  • The migration was executed on a high-traffic table, and no safeguards were in place to prevent a migration from using the after and before column-positioning methods, which require a metadata lock on the table.

  • Maintenance mode messaging displayed outdated information because previously stored values had not been cleared.

Technical analysis

Outage Cause

During deployment, a database migration attempted to modify a heavily used table using a column reordering operation. This type of change requires a metadata lock on the table.

Because the table was under active load, the lock could not be acquired promptly, causing the migration to stall and eventually time out.

While waiting for the lock, the migration held open database connections. These connections accumulated over time, eventually exhausting the available connection pool and significantly degrading database performance.

As a result, application servers were unable to process requests, leading to a full outage.
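One possible safeguard against this failure mode is to make schema changes give up quickly when they cannot acquire their lock, rather than waiting under load. The sketch below assumes a MySQL-compatible database, where the lock_wait_timeout session variable bounds how long a statement waits for a metadata lock. The function name run_ddl_with_lock_timeout and the 5-second default are illustrative and do not describe our current tooling.

```python
from typing import Protocol


class Cursor(Protocol):
    """Minimal DB-API style cursor; only execute() is needed here."""

    def execute(self, sql: str) -> object: ...


def run_ddl_with_lock_timeout(cursor: Cursor, ddl: str, timeout_seconds: int = 5) -> None:
    """Run a schema change that fails fast if it cannot get its lock.

    Assumption: a MySQL-compatible database, where lock_wait_timeout limits
    how long a statement waits for a metadata lock.
    """
    if timeout_seconds < 1:
        raise ValueError("timeout_seconds must be at least 1")
    # Bound the wait so the migration errors out instead of stalling.
    cursor.execute(f"SET SESSION lock_wait_timeout = {int(timeout_seconds)}")
    cursor.execute(ddl)
```

With a bound like this, a migration that cannot get its lock would fail the deployment within seconds, which is easier to recover from than a stall.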

Incorrect Maintenance Messaging

As a safeguard against the growing number of connection attempts, James put Astalty into maintenance mode. However, our stored maintenance data still referred to an old maintenance period on 26 January 2026. This caused confusion among customers: we have planned maintenance coming up on Friday night, and some customers thought that we would be unavailable until then.
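A minimal sketch of how the maintenance message could be generated from the current time, so that a stored window is shown only while it is actually in progress. The names MaintenanceWindow, maintenance_message and GENERIC_MESSAGE are hypothetical, as is the message wording. This is not the existing implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence

# Hypothetical fallback text for unplanned maintenance mode (no dates shown).
GENERIC_MESSAGE = "We are temporarily unavailable and are working to restore access."


@dataclass(frozen=True)
class MaintenanceWindow:
    starts_at: datetime
    ends_at: datetime


def maintenance_message(windows: Sequence[MaintenanceWindow], now: datetime) -> str:
    """Build the maintenance text, ignoring past and future windows."""
    # Only windows that contain `now` may contribute dates to the message.
    active = [w for w in windows if w.starts_at <= now < w.ends_at]
    if not active:
        return GENERIC_MESSAGE
    ends_at = max(w.ends_at for w in active)
    return f"Scheduled maintenance is in progress until {ends_at:%Y-%m-%d %H:%M}."
```

Under this approach, a leftover January window would produce the generic message during a March incident rather than an outdated date.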

Impact assessment

Customer impact

  • Most customers experienced extreme slowness during the incident window, and some customers were completely unable to access Astalty via the web or the native app

  • Duration of impact: approximately 11 minutes

  • Some customers were shown an incorrect maintenance message with outdated date and time information
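As a cross-check on the stated duration, the intervals between timeline entries can be computed directly from the timestamps in the incident timeline. The 11-minute figure appears to match the interval from the initial report to the Monitoring status. That reading is an assumption, as the report does not state which endpoints were used.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the incident timeline (Australia/Brisbane local time).
reported = datetime.strptime("2026-03-19 14:20:55", FMT)
monitoring = datetime.strptime("2026-03-19 14:32:06", FMT)
closed = datetime.strptime("2026-03-19 14:37:40", FMT)

to_monitoring = monitoring - reported
to_closed = closed - reported

print(to_monitoring)  # 0:11:11
print(to_closed)      # 0:16:45
```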

System impact

  • Database performance degraded due to connection exhaustion

  • Application servers became unresponsive due to inability to acquire database connections

Business analysis

  • Temporary disruption to all active customers

  • Increase in inbound customer queries due to incorrect maintenance messaging

Resolution steps

Resolution summary

The issue was resolved by rolling back the deployment and clearing the backlog of connections, which restored normal database operation.

Detailed steps

  • Identified the migration as the source of database contention by inspecting AWS

  • Rolled back the deployment to stop the migration process

  • Cleared active database connections to relieve pressure on the database

  • Monitored system recovery as database performance returned to normal
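For illustration only, the connection-clearing step could be prepared along the following lines. This is not a record of the commands run during the incident. The sketch assumes a MySQL-compatible database whose SHOW PROCESSLIST rows (keys Id, State, Time) have already been fetched into dictionaries. The function name kill_statements and the 30-second threshold are hypothetical.

```python
from typing import Any, Iterable, List, Mapping

# State text MySQL reports for a thread blocked on a table metadata lock.
METADATA_LOCK_STATE = "Waiting for table metadata lock"


def kill_statements(processlist: Iterable[Mapping[str, Any]], min_seconds: int = 30) -> List[str]:
    """Build KILL statements for connections stuck on a metadata lock.

    Nothing is executed here; the output is meant to be reviewed first.
    """
    statements = []
    for row in processlist:
        stuck = row.get("State") == METADATA_LOCK_STATE
        # Time is the number of seconds the thread has been in its state.
        if stuck and int(row.get("Time") or 0) >= min_seconds:
            statements.append(f"KILL {int(row['Id'])}")
    return statements
```

Generating the statements without running them keeps a human in the loop, which matters when clearing connections on a production database.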

Verification

  • Database performance metrics returned to normal levels

  • Application responsiveness restored across all services

  • Internal verification confirmed successful request processing

Lessons learned

What went well

  • The issue was identified and resolved quickly (within 11 minutes)

  • The team responded immediately and took decisive action to restore service

  • No data loss occurred

Areas for improvement

  • Stricter reviews of database migrations on high-traffic tables

  • Additional training for the engineering team on database migrations that can cause locks

  • Maintenance mode messaging needs to be dynamically generated and validated, and should never show past maintenance periods

  • Improved pre-deployment checks to ensure database column ordering methods cannot be used

  • Improved monitoring, with health checks in place to confirm that the application can be reached
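The health check item above could start as small as the following sketch, which reports whether a URL answers with a success status within a timeout. The function name is_reachable and the 5-second default are assumptions. Which endpoint to probe, how often, and where to send alerts are left open.

```python
import http.client
import urllib.request


def is_reachable(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if `url` answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS failures and HTTP errors
        # (urllib raises HTTPError, a subclass of OSError, for 4xx/5xx).
        return False
```

A probe like this is most useful when run from outside the application's own infrastructure, so that it fails when customers cannot reach the service.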

Key takeaways

  • Database column reordering in migrations should be forbidden

  • Customer-facing messaging must always be accurate, especially during incidents

Follow-ups

Follow-up: Ensure past maintenance periods are not shown on the maintenance screen.
Owner: James Mooring

Follow-up: Implement safeguards to ensure that database column ordering cannot be changed.
Owner: Nathan Wuiske
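One way the column-ordering safeguard might look is a pre-deployment check that scans migration source for calls to the after and before methods. The sketch below is a starting point under stated assumptions, not the agreed implementation. It assumes migrations call these methods with arrow syntax (for example ->after('name')), and the function name find_column_ordering_calls is hypothetical. It matches text only, so a call inside a comment would also be flagged.

```python
import re
from typing import List, Sequence, Tuple


def find_column_ordering_calls(
    source: str, methods: Sequence[str] = ("after", "before")
) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that call a column-ordering method.

    Assumption: methods are invoked with arrow syntax, e.g. ->after('name').
    """
    names = "|".join(re.escape(m) for m in methods)
    pattern = re.compile(r"->\s*(?:" + names + r")\s*\(")
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((line_number, line.strip()))
    return hits
```

Run over each migration file in CI, a non-empty result could fail the build before the migration ever reaches a deployment.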