__/ [ Kelsey Bjarnason ] on Tuesday 07 March 2006 13:06 \__
> For the amusement of any who might care.
> Running a Dell box. 2850, I think - mind's going after another 9AM-4AM
> shift. In any case, redundant power supplies, dual processor, RAID SCSI
> drives, yadda yadda yadda. All set up to be maximally reliable, right?
> Thing's got three NICs in it. One for the net feed, one for LAN
> connections, one feeding out another LAN full of customers. All very
> wonderful. Been working, day and night, for a month as we test it all
> out. I'm set up as a customer, and there's about a dozen "real" customers
> on it, too.
> System's been in testing for a month. It's a rock.
> Until Saturday.
> Saturday, it went down for a short time. Heh. Could be a failure in the
> feed, who knows. Nothing obviously wrong, and it came back up shortly.
> Sunday... no, let's back up a step. Reliability is key, right? Keeping
> the customers online is key, right? Right. So to ensure we have maximal
> reliability, we have SNMP monitoring of the box happening, with automatic
> messaging to our cell phones should anything more than the most trivial
> issue arise.
> My cell has been silent since Friday.
> So Saturday it hiccups, or the feed does. No big deal. Recovers very
> shortly, that's the end of it. Not even enough, IIRC, to get a call on my
> cell.
> Come in Monday, and there's a slew of support calls. Only a dozen
> customers, and of those, only about six who have actually registered to
> use the thing. But there's about thirty voice mails, all to the effect
> "it's down".
> So it is. Down all day Sunday, all night, right up until I get in Monday.
> Hmm. Nothing's changed which should have caused this. What happens if I
> restart the firewall, though, manually? Back up and running. Woot! Now
> the customers can be happy while I figure out what happened.
> An hour later, more calls. Down again. Hmm. Restart the firewall, it's
> back up. Toss a cron job in to do this every 15 minutes or so.
> It stays up maybe half an hour. Then ten minutes. Then two. Then about
> 30 seconds per firewall restart.
> There's no reason on God's Green Earth why the firewall should be behaving
> this way; it hasn't changed in any significant way in a month.
> Long and short, this thing is *dead*. And we have no idea why. And it's
> now 5PM. And at 8:00 in the morning, we have a mass migration of 200
> customers coming onto the machine... which is the router and firewall
> feeding the whole blinking lot of them their DSL traffic.
> Isn't it lovely how life does that to you? When you *don't* need it to be
> rock-solid stable, it is. When you do need it to be, it dies. Exactly
> when dying is going to cause a large portion of your high speed customer
> base to scream blue murder.
> So we toss up a rinky-dink little POS hardware router really designed to
> handle a half-dozen LAN clients, tops, thinking how wonderful it's going
> to be trying to run the whole bloody lot of these folks off this little
> one lunger. It's enough to make you cringe.
> So around 2AM, when we'll inconvenience the fewest people, we put the real
> machine back online for some heavy testing. Two hours later, it's back up
> and running, everything's ship shape and Bristol fashion. Problem? This
> fancy, sexy, four-port-in-one-slot NIC we're using packed it in... but not
> enough to either completely fail to respond - hence the lack of SNMP
> alerts - nor enough to toss system messages about, say, the NIC going up
> and down. Nope, just enough to randomly, about every 100 packets or so,
> give up on the connection. Just enough to let the SNMP poll through, but
> that's about it.
> Solution? Rip a NIC out of whatever's handy - happens to be brand new -
> stick it in and get back online. Sigh. Welcome to Camp Chaos.
> Cheap-ass network adapter: $15.
> 3 pots of coffee for the net geek: $4.
> 2 bottles of aspirin for the manager: $9.
> A system that actually works: priceless.
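On the monitoring gap described above: a NIC that gives up roughly once
every 100 packets will still answer a single SNMP poll most of the time, so
a plain up/down check stays quiet. A minimal sketch of the alternative,
alerting on the loss rate across a batch of probes, might look like the
following. The function names, the 0.5% threshold, and the idea of feeding
it a list of probe results are all assumptions made for illustration, not
anything from the original setup.

```python
def loss_rate(results):
    """Return the fraction of failed probes.

    `results` is a sequence of booleans, one per probe:
    True if a reply came back, False if it did not.
    """
    results = list(results)
    if not results:
        raise ValueError("no probes recorded")
    failed = sum(1 for ok in results if not ok)
    return failed / len(results)


def should_alert(results, threshold=0.005):
    """Alert when the loss rate exceeds `threshold` (hypothetical default: 0.5%)."""
    return loss_rate(results) > threshold


if __name__ == "__main__":
    # One successful poll says nothing about intermittent loss.
    single_poll = [True]
    # One failure in a hundred probes, as in the story above.
    hundred_probes = [True] * 99 + [False]

    print(should_alert(single_poll))      # False
    print(should_alert(hundred_probes))   # True
```

How the probe results get collected (ping, interface error counters, SNMP
counters) is left open here; the point is only that the check looks at many
samples rather than one.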
Didn't the logs indicate why cron jobs have ground to a halt? These are
usually quite verbose. You can also take a look at /var/log. Did you truly
think it was software related? I guess I would have thought so too, but the
lesson to be learned is that cheap hardware should be treated with