Makeswift's Incident on November 10, 2025
November 13th, 2025
Incident Report
On November 10th, 2025, Makeswift experienced a service disruption that impacted the API (api.makeswift.com) for approximately 1 hour and 22 minutes. During this period, the API responded with 500 errors, preventing a subset of customer websites and the Visual Builder from functioning as expected. We sincerely apologize for the disruption and the impact it had on your business operations. We recognize that Makeswift’s reliability is essential to your success, and we’re taking concrete steps to prevent this from happening again.
What Happened?
The service disruption was caused by memory exhaustion in one of our Redis instances, which the API depends on for caching.
In the week leading up to the incident, we had been rolling out various caching improvements to the Makeswift API. Historically, our cache had operated well below its maximum memory capacity, but these changes pushed the cache to its memory limit, where it exhibited unexpected behavior.
When Redis ran out of memory, API operations that interacted with the cache began failing, leading to widespread errors. Because we had an LRU eviction policy in place, this failure mode wasn’t expected: Redis should automatically evict the least recently used cache entries to make space for new ones. We’re continuing a detailed internal review to better understand why Redis failed to recover as expected under its LRU eviction policy. Our current understanding is that cache growth exceeded operational limits, triggering out-of-memory protection and preventing normal cache behavior.
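To make the expected behavior concrete, here is a minimal sketch of what an LRU eviction policy is supposed to do when a cache is full. It is a toy, in-memory model written for illustration only. The `LRUCache` class, its byte budget, and its method names are assumptions for this example and do not describe Redis internals (Redis uses an approximated LRU based on sampling) or Makeswift's code.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Toy byte-budgeted cache that evicts the least recently used entry."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # Ordered from least recently used (front) to most recently used (back).
        self._entries = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._entries.get(key)
        if value is not None:
            # A read counts as a use, so move the key to the back.
            self._entries.move_to_end(key)
        return value

    def set(self, key: str, value: bytes) -> None:
        if len(value) > self.max_bytes:
            raise MemoryError("value is larger than the whole cache")
        if key in self._entries:
            # Overwriting: release the old value's bytes first.
            self.used_bytes -= len(self._entries.pop(key))
        # Evict from the front until the new value fits.
        while self.used_bytes + len(value) > self.max_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[key] = value
        self.used_bytes += len(value)


cache = LRUCache(max_bytes=10)
cache.set("a", b"aaaa")
cache.set("b", b"bbbb")
cache.get("a")           # "a" is now more recently used than "b"
cache.set("c", b"cccc")  # does not fit, so "b" is evicted instead of failing
```

In this model a write to a full cache succeeds by discarding the entry that has gone unused the longest. During the incident, writes were rejected with an OOM error instead, which is the behavior still under review.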
Incident Response Timeline
All times below are in Eastern Time (ET) on November 10th, 2025.
- 5:28 PM: Our Redis instance reached maximum capacity; there were no OOM errors yet and the service was healthy.
- 6:11 PM: Started running performance tests to verify recently deployed changes.
- 6:17 PM: The API began responding with 500 errors due to “OOM command not allowed under OOM prevention.” errors from Redis.
- 6:55 PM: An alert that monitors the health of live pages fired, paging on-call engineers.
- 6:57 PM: Started investigating the cause of the problem.
- 7:13 PM: Identified “OOM command not allowed under OOM prevention.” errors as the probable cause.
- 7:39 PM: Redis cache was flushed, resulting in immediate recovery of the service.
Service was disrupted for approximately 1 hour and 22 minutes.
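The durations implied by the timeline can be checked with a few lines of arithmetic. The timestamps below are taken directly from the timeline above; the helper `at` and the variable names are introduced here only for illustration.

```python
from datetime import datetime


def at(clock: str) -> datetime:
    """Parse a timeline entry such as '6:17 PM' on the day of the incident."""
    return datetime.strptime(f"2025-11-10 {clock}", "%Y-%m-%d %I:%M %p")


cache_full = at("5:28 PM")     # Redis reached maximum capacity
errors_began = at("6:17 PM")   # API began responding with 500 errors
alert_fired = at("6:55 PM")    # on-call engineers were paged
cache_flushed = at("7:39 PM")  # service recovered

outage = cache_flushed - errors_began
time_to_detect = alert_fired - errors_began
time_to_recover = cache_flushed - alert_fired
at_capacity_before_errors = errors_began - cache_full

print(f"Total outage:        {outage}")                     # 1:22:00
print(f"Errors to page:      {time_to_detect}")             # 0:38:00
print(f"Page to recovery:    {time_to_recover}")            # 0:44:00
print(f"Full before errors:  {at_capacity_before_errors}")  # 0:49:00
```

Of the 82 minutes of disruption, 38 passed before anyone was paged, and the cache had been at capacity for 49 minutes before the first error.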
Short Term Resolution
- The Redis cache was cleared, immediately restoring the service to a healthy state.
- Rolled back changes that resulted in increased memory usage.
- Monitoring and alerting for Redis performance and errors were expanded to ensure faster detection of similar issues.
- Performance testing continues to validate caching behavior under higher load conditions.
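As an illustration of the kind of check that expanded Redis monitoring might include, the sketch below parses the text returned by the Redis `INFO memory` command and flags when memory use approaches the configured limit. The `used_memory` and `maxmemory` fields are standard `INFO` fields. The function names and the 85% threshold are assumptions made for this example, not a description of Makeswift's actual alerting, and managed Redis providers may expose these metrics differently.

```python
from typing import Dict, Optional


def parse_info(info_text: str) -> Dict[str, str]:
    """Turn 'key:value' lines from INFO output into a dict, skipping headers."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_utilization(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is configured."""
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    if limit == 0:  # a maxmemory of 0 means no limit is set
        return None
    return used / limit


def should_alert(info_text: str, threshold: float = 0.85) -> bool:
    """Hypothetical alert rule: fire once utilization reaches the threshold."""
    utilization = memory_utilization(info_text)
    return utilization is not None and utilization >= threshold


# Sample INFO output with made-up numbers, for illustration only.
sample = "# Memory\r\nused_memory:900\r\nmaxmemory:1000\r\nmaxmemory_policy:allkeys-lru\r\n"
print(memory_utilization(sample))  # 0.9
print(should_alert(sample))        # True
```

An alert on memory utilization fires while the service is still healthy, rather than waiting for a downstream symptom such as failing live pages.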
Long Term Mitigation
- We are enhancing observability across caching and API layers to identify potential issues before they affect customers.
- We are refining our incident response processes to ensure faster response when alerts trigger.
- We will update our API so that cache failures result in degraded performance as opposed to service disruption.
- We are reducing alerting noise so that new errors get a faster response, lowering our mean time to recovery (MTTR).
- We will update our staging environment to more closely match production, so that we can reliably identify performance issues before they impact customers.
- We are investing in a dedicated API architecture designed to serve live site traffic independently from the builder, reducing future risk of widespread disruption.
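One common way to make cache failures degrade performance rather than availability is to treat the cache as optional on every read and write. The sketch below shows that pattern under stated assumptions: the `Cache` protocol, `get_or_load`, and the `load` callback are hypothetical names for this example and are not Makeswift's API.

```python
import logging
from typing import Callable, Optional, Protocol

logger = logging.getLogger("cache")


class Cache(Protocol):
    """Minimal cache interface assumed for this sketch."""

    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: bytes) -> None: ...


def get_or_load(cache: Cache, key: str, load: Callable[[], bytes]) -> bytes:
    """Serve from cache when possible; never fail because the cache did."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except Exception:
        # A broken cache read is treated as a miss.
        logger.warning("cache read failed for %s", key, exc_info=True)

    value = load()  # slower path: fetch from the source of truth

    try:
        cache.set(key, value)
    except Exception:
        # A rejected write (for example an OOM error) only costs a future miss.
        logger.warning("cache write failed for %s", key, exc_info=True)
    return value


class FullCache:
    """Stand-in for a cache that rejects writes, as Redis did in the incident."""

    def get(self, key: str) -> Optional[bytes]:
        return None

    def set(self, key: str, value: bytes) -> None:
        raise RuntimeError("OOM command not allowed under OOM prevention.")


page = get_or_load(FullCache(), "page:home", lambda: b"<html>home</html>")
```

With this shape, a cache that rejects writes makes every request take the slower path, but requests still succeed. The trade-off is extra load on the source of truth while the cache is unhealthy, so the pattern is usually paired with alerting on the logged failures.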
We deeply apologize for the disruption this caused and sincerely appreciate your patience as we strengthen Makeswift’s reliability. If you have any questions or concerns, please contact us at support@makeswift.com.