Potential for Temporarily Delayed New Captures

puravida
Jedi Warrior
Joined: 09/01/2007

For the past few months, we have been battling with Rackspace over an issue where networking does not provision properly for new cloud server instances.

Currently, we are seeing VERY HIGH failure rates for new instances. That means that if we get a large burst of new requests, we may not be able to scale to handle them as quickly as we normally would. The small bit of good news is that we saw this coming and began preparing: we have been "pulling all-nighters" for the past week to get as many generators online as possible. So, while we normally run about 35 generators and scale up to meet demand, we are currently running about 150 generators but cannot scale beyond that.

We are working diligently to resolve the issue and are already in the process of deploying 100 new "always on" generators with another one of our co-location providers. So we are aware that slowdowns may occur, but we hope to have the situation remedied within a few days.

GEEK-SPEAK
At first, it was a costly nuisance (we were charged for the occasional broken instance). However, it quickly escalated to a 20% failure rate, and this month it jumped to a 40% failure rate. Not only is that very costly (they continued to charge us while denying fault), but they also pointed the finger at one of our other vendors, the one that handles auto-scaling of our cloud instances. I showed Rackspace the following from the instance logs:

Bringing up interface eth0:
Determining IP information for eth0... failed.
[FAILED]
Bringing up interface eth1:
Determining IP information for eth1... failed.
[FAILED]
SIOCADDRT: Network is unreachable
SIOCADDRT: Network is unreachable

However, they insisted that the cloud scaling vendor used proprietary images and that those had to be ruled out first. After working with the other vendor's engineering team (at great cost of my time and theirs), we finally determined that it was not something within their control.
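
As an aside, for anyone debugging something similar: below is a quick sketch (illustrative only, not our actual tooling or anything Rackspace provided) that scans a saved console/boot log for the failure signatures shown above, so broken launches can be flagged automatically. The log path is whatever you pass on the command line; the patterns are taken straight from the excerpt.

import re
import sys

# Flag a failed launch by scanning a saved console/boot log for the
# network bring-up failure signatures shown in the excerpt above.
FAILURE_PATTERNS = (
    re.compile(r"Determining IP information for eth\d+\.\.\. failed\."),
    re.compile(r"SIOCADDRT: Network is unreachable"),
)

def launch_failed(log_text):
    """Return True if the log shows the DHCP/route failures from the excerpt."""
    return any(pattern.search(log_text) for pattern in FAILURE_PATTERNS)

if __name__ == "__main__":
    with open(sys.argv[1]) as log_file:   # e.g. a console log saved from the instance
        if launch_failed(log_file.read()):
            print("network provisioning failed on this instance")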

As of Thursday, we began experiencing 100% launch failures!! This is when we really began to panic. I spent another late night investigating the failed instances and provided all of my conclusive data to Rackspace. When I came online this morning, Rackspace had apologized and said that they now see evidence of the cause of the launch failures. They know where to look and asked me for just a little more investigation. That data has confirmed their suspicions about the culprit (a networking reset automation; possibly file corruption on launch), and they are working to address the issue.

We are hopeful that they resolve it soon, so that we can move on with our lives. ;) lol

Update: As I write this, I see that Rackspace has confirmed this to be a problem with a networking configuration file that may have been left on the image created by the cloud scaling vendor. So, "'round and 'round, we go", but at least they say it should be an easy fix, and I can notify the cloud scaling vendor so their other customers can avoid these headaches.

puravida
Jedi Warrior
Joined: 09/01/2007

Quick update: After a few more hours of back-and-forth troubleshooting, Rackspace and I have managed to isolate two root causes for the failures.

  1. My suspicion of a faulty switch port was correct. This was causing the intermittent failures and made it very difficult to pinpoint the other cause of failures (#2).
  2. The cloud scaling provider would occasionally produce a networking config file for a non-existent interface, specifically: /etc/sysconfig/network-scripts/ifcfg-eth0:1

    Whenever this occurred, the networking agent would fail to bring up network connectivity properly. As soon as that faulty config file was removed, the agent worked as expected (a rough cleanup sketch follows below this list).
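
For reference, here is a rough sketch of the kind of cleanup check I mean. It is only an illustration (not the vendor's or Rackspace's actual fix): it lists leftover ifcfg-* files under the sysconfig directory that describe interfaces the instance does not actually have, such as the ifcfg-eth0:1 alias above. Review the output before deleting anything.

import glob
import os

# List ifcfg-* files that describe interfaces this instance does not have,
# e.g. a stray alias config like ifcfg-eth0:1 left behind on the image.
# Paths follow the RHEL/CentOS sysconfig convention used above.
SCRIPTS_DIR = "/etc/sysconfig/network-scripts"

def suspect_ifcfg_files():
    suspects = []
    for path in glob.glob(os.path.join(SCRIPTS_DIR, "ifcfg-*")):
        name = os.path.basename(path)[len("ifcfg-"):]
        if name == "lo":
            continue
        # Alias configs like "eth0:1" never appear in /sys/class/net, and
        # neither do devices that simply are not present on the host.
        if ":" in name or not os.path.isdir(os.path.join("/sys/class/net", name)):
            suspects.append(path)
    return suspects

if __name__ == "__main__":
    for path in suspect_ifcfg_files():
        print("suspect config:", path)   # review first, then remove by hand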

So the good news is that we know what the problem is, and it was a doozie. ;)

The only thing left now is to wait for final verification that the switch port was the last remaining issue. Then, we will be able to make a new image that works 100% of the time. Finally! Yay!

puravida
Jedi Warrior
Joined: 09/01/2007

Update: After much ado, I believe that Rackspace's techs have finally resolved the faulty switch port issue. I was able to help them isolate which port, and now we are seeing 100% launch success. For something we had come to take for granted, I am thrilled to be able to expect it once again...

So, I will finish the previously interrupted deployment of the latest generator code (no impact to current users) and then I will re-open the floodgates for auto-scaling on demand.

Everything should be "back to normal" within the hour, except that the overhaul and upgrades to our system have each generator capturing three times (300% of) what it did before. So, each capture generator works as well as three (3) prior-version generators, AND we have about 100 "always on" (versus 35-50 before), plus plans to add another 100 within the next 30-45 days. Things are really going to be moving and grooving.
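
For the curious, here is the rough math behind that claim as a quick sketch, using only the figures quoted above (the exact counts will of course vary):

# Back-of-the-napkin capacity math, using only the figures quoted above.
old_always_on = 35            # low end of the old 35-50 "always on" range
new_always_on = 100           # "always on" generators today
planned_always_on = 200       # once the next ~100 come online
per_generator_multiplier = 3  # each new generator does the work of ~3 old ones

print(new_always_on * per_generator_multiplier / old_always_on)      # roughly 8.6x the old baseline
print(planned_always_on * per_generator_multiplier / old_always_on)  # roughly 17x once expanded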
