Hello,
Since we migrated to the new clustered server farm solution back in November, the system has suffered gradual performance degradation. For the first few weeks, server monitoring (New Relic) showed a 1.0 Apdex score and very fast response times (under 100ms). In recent months, however, performance degraded to a 0.7 Apdex score and response times that were slow relative to the horsepower of the servers (around 500ms, spiking into the thousands of milliseconds).
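(For the curious: Apdex is an industry-standard score from 0 to 1, based on a target response-time threshold T. Roughly speaking, it is computed as:

    Apdex = (satisfied requests + tolerating requests / 2) / total requests

where "satisfied" means responses within T and "tolerating" means responses within 4T. So a 0.7 score means a significant share of requests were coming back slower than our target.)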
Long story short, we determined that some of the servers in the cluster had been locking up because of the backup process provided by our datacenter. Once we reported it, the datacenter reproduced the lockup issue and released a fix about a day later. Since then, system performance has jumped back up to a 1.0 Apdex score and very fast response times.
We are optimistic that the fixed backup process will now allow for proper, reliable performance and will no longer cause impact or potential outages from locked-up servers.
On another note, we also had a victory regarding the "Image Retrieval Error" conditions. We believe those are now resolved, and you can read all about the gritty details here, if you like geek-speak:
https://shrinktheweb.com/content/new-logging-output.html
All in all, these were the last two remaining issues we had been battling. With these fixes in place, the system currently has a 99.99999% success rate on captures, a 99.99999% success rate on deliveries, and should be back to 99.5% overall system uptime.
As always, we greatly appreciate the feedback and troubleshooting help from our users, and the patience of loyal users like you!
Best regards,
Brandon
https://shrinktheweb.com