
Re: Tales of a Net Geek

__/ [ Kelsey Bjarnason ] on Tuesday 07 March 2006 13:06 \__

> For the amusement of any who might care.
> Running a Dell box.  2850, I think - mind's going after another 9AM-4AM
> shift. In any case, redundant power supplies, dual processor, RAID SCSI
> drives, yadda yadda yadda.  All set up to be maximally reliable, right?
> Right.
> Thing's got three NICs in it.  One for the net feed, one for LAN
> connections, one feeding out another LAN full of customers.  All very
> wonderful.  Been working, day and night, for a month as we test it all
> out.  I'm set up as a customer, and there's about a dozen "real" customers
> on it, too.
> System's been in testing for a month.  It's a rock.
> Until Saturday.
> Saturday, it went down for a short time.  Heh.  Could be a failure in the
> feed, who knows.  Nothing obviously wrong, and it came back up shortly.
> Sunday... no, let's back up a step.  Reliability is key, right?  Keeping
> the customers online is key, right?  Right.  So to ensure we have maximal
> reliability, we have SNMP monitoring of the box happening, with automatic
> messaging to our cell phones should anything more than the most trivial
> issue arise.
> My cell has been silent since Friday.
> So Saturday it hiccups, or the feed does.  No big deal.  Recovers very
> shortly, that's the end of it.  Not even enough, IIRC, to get a call on my
> cell.
> Come in Monday, and there's a slew of support calls.  Only a dozen
> customers, and of those, only about six who have actually registered to
> use the thing.  But there's about thirty voice mails, all to the effect
> "it's down".
> So it is.  Down all day Sunday, all night, right up until I get in Monday
> morning.
> Hmm.  Nothing's changed which should have caused this.  What happens if I
> restart the firewall, though, manually?  Back up and running.  Woot!  Now
> the customers can be happy while I figure out what happened.
> An hour later, more calls.  Down again.  Hmm.  Restart the firewall, it's
> back up.  Toss a cron job in to do this every 15 minutes or so.
> Except...
> It stays up maybe half an hour.  Then ten minutes.  Then two.  Then about
> 30 seconds per firewall restart.
> There's no reason on God's Green Earth why the firewall should be behaving
> this way; it hasn't changed in any significant way in a month.
> Long and short, this thing is *dead*.  And we have no idea why.  And it's
> now 5PM.  And at 8:00 in the morning, we have a mass migration of 200
> customers coming onto the machine... which is the router and firewall
> feeding the whole blinking lot of them their DSL traffic.
> Isn't it lovely how life does that to you?  When you *don't* need it to be
> rock-solid stable, it is.  When you do need it to be, it dies.  Exactly
> when dying is going to cause a large portion of your high speed customer
> base to scream blue murder.
> So we toss up a rinky-dink little POS hardware router really designed to
> handle a half-dozen LAN clients, tops, thinking how wonderful it's going
> to be trying to run the whole bloody lot of these folks off this little
> one lunger.  It's enough to make you cringe.
> So around 2AM, when we'll inconvenience the fewest people, we put the real
> machine back online for some heavy testing.  Two hours later, it's back up
> and running, everything's ship shape and Bristol fashion.  Problem? This
> fancy, sexy, four-port-in-one-slot NIC we're using packed it in... but not
> enough to either completely fail to respond - hence the lack of SNMP
> alerts - nor enough to toss system messages about, say, the NIC going up
> and down.  Nope, just enough to randomly, about every 100 packets or so,
> give up on the connection.  Just enough to let the SNMP poll through, but
> that's about it.
> Solution?  Rip a NIC out of whatever's handy - happens to be brand new -
> stick it in and get back online.  Sigh.  Welcome to Camp Chaos.
> Cheap-ass network adapter: $15.
> 3 pots of coffee for the net geek: $4.
> 2 bottles of aspirin for the manager: $9.
> A system that actually works: priceless.

Didn't the logs indicate why the cron jobs ground to a halt? These are
usually quite verbose. You could also take a look under /var/log. Did you
truly think it was software-related? I guess I would have thought so too,
but the lesson to be learned is that cheap hardware should be treated with
disrespect. *smile*
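
As an aside, the failure you describe (roughly one packet in 100 dropped,
while the SNMP poll still gets through) is exactly what a single up/down
check tends to miss. Below is a minimal sketch in Python of alerting on the
loss rate over a burst of probes instead. The function names, the 0.5%
threshold and the probe callable are all hypothetical, not anything from
your setup; in practice the probe would wrap a ping or a small TCP connect.

```python
from typing import Callable, List, Sequence


def loss_rate(results: Sequence[bool]) -> float:
    """Return the fraction of failed probes.

    results[i] is True if probe i got a reply, False otherwise.
    """
    if not results:
        raise ValueError("need at least one probe result")
    return results.count(False) / len(results)


def should_alert(results: Sequence[bool], max_loss: float = 0.005) -> bool:
    """Alert when loss exceeds max_loss (hypothetical 0.5% default)."""
    return loss_rate(results) > max_loss


def run_probes(probe: Callable[[], bool], count: int = 500) -> List[bool]:
    """Call probe() count times and collect the outcomes.

    probe is an assumed callable returning True on a reply.
    """
    return [probe() for _ in range(count)]


# Simulated link that drops every 100th packet, as in the story above.
flaky = [(i % 100) != 99 for i in range(500)]

# A single successful poll looks healthy; the burst does not.
single_poll_alerts = should_alert([True])
burst_alerts = should_alert(flaky)
```

With one poll per interval the box looks fine nearly every time; over a
burst of 500 probes the same link shows about 1% loss and crosses the
threshold.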

Best wishes,

