Major Victory in Resolving Lingering Stability Issues

puravida

Hello,

Ever since we migrated to the new, clustered server farm solution back in November, the system has suffered gradual performance degradation. For the first few weeks, server monitoring (New Relic) showed a 1.0 Apdex score and very fast response times (under 100ms). In recent months, however, performance degraded to a 0.7 Apdex with slow response times relative to the horsepower of the servers: around 500ms, spiking into the thousands.

Long story short, we determined that some of the servers in the cluster were locking up because of the backup process provided by our datacenter. Once we reported it, they reproduced the lockup issue and released a fix about a day later. Since then, performance of the system has jumped back up to a 1.0 Apdex and very fast response times.

We are optimistic that the fixed backup process will now allow for proper, reliable performance and will no longer cause degradation or potential outages as a result of locked-up servers.

On another note, we had a victory regarding the "Image Retrieval Error" condition as well. We believe that to be resolved now, and you can read all about the gritty details here, if you like geek-speak:

https://shrinktheweb.com/content/new-logging-output.html

All in all, these were the last two remaining issues we had been battling. As a result of these fixes, the system currently has a 99.99999% success rate on captures, a 99.99999% success rate on deliveries, and should be back to 99.5% overall system uptime.

As always, we greatly appreciate the feedback and troubleshooting help from various users, as well as the patience of loyal users like you! :)

Best regards,

Brandon
https://shrinktheweb.com

puravida

[GEEK-SPEAK]

Update: It turns out that the elusive "Image Retrieval" errors were greatly reduced by the DNS caching solution deployed on all nodes in the server farm. However, they still occurred in roughly 1 in 1,000,000 requests. That's much better than the 1 in 50,000 or so that the errors reached at their peak, and even though 1 in 50,000 easily beats our uptime SLA by a long shot, these types of errors just eat away at my psyche! So, I had to dig deeper.

After finally finding the right place to put my debugging code, I learned that a second issue, the connection being "reset by peer", was causing this failure as well. Since it happens so infrequently, I doubt that our datacenter will do anything about these transient network issues, but I will report it to them anyway.

The good news is that this finally made me realize that even though I had put retry logic into the image retrieval function, I did not check whether the connection was still valid. Now, with a minor change, it is highly unlikely that this error will occur again. :)
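For the curious, the idea is roughly like the sketch below. This is a simplified illustration, not our actual production code, and fetchImage() is just a made-up name: the download is retried a few times, and the cURL handle is thrown away and re-created whenever an attempt fails at the connection level, instead of being blindly reused.

<?php
// Simplified sketch of the retry idea (illustration only, not the real code).
// Re-create the cURL handle on connection-level failures (e.g. "connection
// reset by peer") rather than retrying on a handle that may already be dead.
function fetchImage($url, $maxAttempts = 3)
{
    $ch = curl_init();

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        curl_setopt_array($ch, array(
            CURLOPT_URL            => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_TIMEOUT        => 30,
        ));

        $data = curl_exec($ch);

        if ($data !== false && curl_errno($ch) === 0) {
            curl_close($ch);
            return $data; // success
        }

        // Connection failed or was reset: discard the handle and start fresh.
        curl_close($ch);
        $ch = curl_init();
        usleep(250000 * $attempt); // brief back-off before the next attempt
    }

    curl_close($ch);
    return false; // caller logs the "Image Retrieval" error
}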

puravida

Arrrggghhh!!!

So the "Image Retrieval Error" error still occurs occasionally, sadly. I've got checks for the connection in there now, but it looks like once the connection is severed, that instance of the script cannot reconnect. Signs point to a possible problem with the OS, PHP, Curl, or a combination thereof.

None of the involved service providers will investigate an error that occurs 1 in 1,000,000 times, because that's a 99.9999% success rate, which is far beyond any SLA guarantee. Furthermore, it seems that the problem is not related to our datacenter, their upstream providers, or our storage provider.

So, it is either:

  • Random disconnects on routes between our datacenters and our storage provider, OR
  • The aforementioned possibility of a bug in software we do not control

Either one of those scenarios means I have few options to permanently stamp out this error.

I could go on a tear and try newer, bleeding-edge versions of PHP and/or cURL, but I'm not willing to go to such lengths for such an intermittent error. That would drive me more mad than having to accept the error!

I could also consider switching back from a cloud storage provider to a local storage cloud, but it would have to make financial sense (I'm not sure it does right now), AND it would mean migrating the more than 100,000,000 images currently stored. I've been through that once before, when we had a quarter as many images stored. What a nightmare. Hah.

For now, I have added a new error, "Connection to Storage Failed", which signals that we retried the connection several times but were unable to reconnect. If that turns out to always be the case when these disconnects occur, that error will likely replace "Image Retrieval Failed" (which will still be logged if the re-connection succeeds but the download still fails).
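Roughly, the distinction between the two errors works like the sketch below. The helper names (connectToStorage, downloadImage, logError) are placeholders for illustration, not the actual functions in our code:

<?php
// Illustrative sketch only; helper names are placeholders, not our real code.
// "Connection to Storage Failed" = we never managed to (re)connect at all.
// "Image Retrieval Failed"       = we connected, but the download still failed.
function retrieveFromStorage($key, $maxAttempts = 3)
{
    $everConnected = false;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $conn = connectToStorage();          // hypothetical helper
        if ($conn === false) {
            continue;                        // could not reconnect; try again
        }
        $everConnected = true;

        $image = downloadImage($conn, $key); // hypothetical helper
        if ($image !== false) {
            return $image;                   // success
        }
    }

    logError($everConnected ? 'Image Retrieval Failed'
                            : 'Connection to Storage Failed', $key);
    return false;
}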

Looking at it from a positive perspective, at least we're delivering at a 99.9999% success rate. That's quite impressive. I was just really striving for the unrealistic 100%. :)

Cheers,

Brandon

puravida

Hooray! I may have finally (accidentally) figured out this ongoing "Image Retrieval Error" situation. I was no longer focusing on this issue, but I have spent the last 45 days completely refactoring, overhauling, and improving ShrinkTheWeb's delivery script (xino/xian) code.

In doing so, I wrote an extensive testing guide covering many, many scenarios. One scenario that I didn't even expect to cause an issue would trigger this error consistently under very specific circumstances.

I had reached out to a user whose account seemed to generate 95% of these errors (which seemed odd to me), but they did not respond. If they had, I might have discovered this sooner.

[GEEK-SPEAK]
As it turns out, if a user makes a "Refresh On-Demand" request for a URL that we do not have in the system (it could have been deleted by retention or could just be a new request), and THEN comes back while the capture is still pending, the code will "think" the image exists and will try to download it, yielding the elusive error. Voila!
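In pseudo-PHP, the guard that closes this hole looks roughly like the sketch below. The status values and helper names (lookupRequest, respondQueued, queueCapture) are made up for illustration; the real xino/xian code differs:

<?php
// Illustrative sketch only; field and helper names are made up.
// The key point: if the capture is still pending, do NOT assume the image
// already exists in storage and try to download it.
$record = lookupRequest($url);                       // hypothetical lookup

if ($record !== false && $record['status'] === 'pending') {
    // Refresh On-Demand was requested but the capture hasn't finished yet.
    respondQueued($url);
} elseif ($record !== false && $record['status'] === 'captured') {
    $image = fetchImage($record['storage_key']);     // download from storage
    if ($image === false) {
        logError('Image Retrieval Failed', $url);
    }
} else {
    queueCapture($url);                              // brand-new request
    respondQueued($url);
}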

Another complication in all of this was that the Amazon S3 library I used was coded very well but, for some reason, did not properly return errors from the function call. So I was never really getting the feedback I needed, which made me go to great lengths just to get to the heart of the DNS/routing issues.

Now, with a simple modification to that S3 class, I am able to programmatically detect various scenarios. Finally, it works as expected!
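As a rough illustration of the kind of change involved (the getObject() signature and response fields shown here are assumptions about that library, not a copy of my modification), the idea is simply to surface the response code instead of treating every return as success:

<?php
// Illustration only; the method and response fields are assumptions about
// the S3 library, not the actual modification.
function s3GetObject($s3, $bucket, $key)
{
    $response = $s3->getObject($bucket, $key);

    // Anything other than a 200 becomes a distinct, detectable condition
    // (e.g. a 404 can be reported as a missing cached image and re-queued).
    if ($response === false || (int)$response->code !== 200) {
        return array(
            'ok'   => false,
            'code' => ($response === false) ? 0 : (int)$response->code,
        );
    }

    return array('ok' => true, 'body' => $response->body);
}

With something like that in place, the caller can tell "not found" apart from "connection failed" apart from "downloaded but failed", rather than lumping everything into one generic error.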

In short, the DNS/routing issues were 99% of the problem, but I'm hopeful that this last quirky, specific case accounts for the remaining 1%. It is a developer's dream to eliminate 100% of these types of pesky errors, and I so hope that this is it.

I hope to release my updated xino/xian delivery script code in the next few days, and that should nip this issue in the bud.

Ciao!

Brandon

puravida

Update: I just wanted to quickly update this thread. It has been almost a month since I deployed my latest "fix" for the "Image Retrieval Failed" error, and it appears to have been successful. There have been ZERO (0) errors of this type since then.

[GEEK-SPEAK]
So, the following reasons all played a part in the failures:

  • 3% due to direct, cached calls to Xian where the connection failed and was not reconnected
  • 80% due to DNS rate-limiting issues (public DNS)
  • 15% due to intermittent/transient connection or routing issues
  • 2% due to a timing issue or race condition in my Xino delivery script

Finally. Chalk one up for the team!

So now, the only related error I've seen logged is "Cached Image Missing. Refresh queued.", and it is good that we properly detect these now that I've fixed the S3 class error handling. I opened a ticket with Amazon Web Services regarding the occasional missing images, but they said that since it happens at about 1 in 10,000,000, they will not investigate, because that is far beyond their 99.9% SLA guarantee. They said that images should not go missing but that it could happen, and that timing issues on recent uploads (not yet fully replicated to all of their nodes) may cause that condition.

I now get about 15-20 notifications per day of broken connections to S3 storage, but the reconnect/retry logic must be working, since no errors have occurred as a result. So that's encouraging!

In any case, I think I can mark this one as a job well done.
