Archived posting to the Leica Users Group, 2014/07/01


Subject: [Leica] Narrative about the extended LUG outage
From: richard at richardmanphoto.com (Richard Man)
Date: Tue, 1 Jul 2014 21:41:26 -0700
References: <53B3898D.2030005@mejac.palo-alto.ca.us>

This may help some of us:

"blah blah FIRE FIRE!! Blah blah, flux capacitors, blah Star drive
deflector shield on blah blah Abracadabra!" It works!
<https://www.google.com/search?safe=off&client=firefox-a&hs=aPt&rls=org.mozilla:en-US:official&channel=np&q=abracadabra&spell=1&sa=X&ei=3YyzU_74HsHHoAT6lIHABQ&ved=0CBwQBSgA>
:-)

Thanks Brian, being in the business, I actually do understand what you
wrote and went through similar things, except the server catching fire bits.
Ouch.



On Tue, Jul 1, 2014 at 9:24 PM, Brian Reid <reid at mejac.palo-alto.ca.us>
wrote:

> In case you care.
>
> Server computers that are engineered for reliability have two power
> supplies and two power cords. Power supplies are the most frequent
> component to fail in server computers, so having two of them makes it
> survive the outage of one.
>
> The server computer that had supported the LUG had two power supplies.
> They were stacked vertically, one on top of the other. Both power supplies
> had been running 24x7 for about 9 years, and their fans had sucked in a
> certain amount of lint. Lint is flammable. The bottom power supply failed,
> and the lint caught fire. The flame rose to the upper power supply and
> ignited its stored-up lint also. Like firestarters in a Franklin stove, the
> 20-second burst of flame was enough to ignite the various flammable items
> (including lint) in the main enclosure. The flash fire probably only lasted
> 40 or 50 seconds, but it was hot enough to destroy most of the solder
> traces that were near the power supplies on the circuit boards. There were
> various plastic tags on some of the cables, which added flammable material.
>
> You can go to the store and buy a laptop or a desktop computer, but you
> really can't go buy a server computer. Yes, this being Silicon Valley,
> there are stores around that sell server computers (Central Computer is the
> best of the lot) but buying a server computer at a retail store is like
> buying a bicycle at a department store. It's just not the same thing.
> Server computers are special-order, because there are so many variations on
> how they are built that no one can afford to keep good ones in inventory.
>
> The fire was on a Saturday morning, and I knew that the soonest I could
> even place an order for a replacement server was Monday, and even at
> rush-rush prices I wouldn't get it until Thursday. At the time a
> Saturday-to-Thursday outage seemed unconscionable. So I decided to move the
> LUG and its supporting software to the newest and emptiest of my half-dozen
> servers. It wasn't exactly a spare--it was running a few little things--but
> mostly it was idle.
>
> The LUG server had been running software from the era of its installation,
> about 2005. The new server was built with chips and components that the old
> software didn't understand, so I couldn't just restore the LUG server
> backups onto the new server. They wouldn't run. I had to get the new
> software working on the replacement server and then manually move over each
> piece.
>
> I made the mistake of believing the operating system documentation, which
> detailed a function called "system upgrade". It was supposed to work the
> way Mac or Windows updates work--you let it do its thing for a while, and
> then you reboot and all is well. After running the system upgrade, nothing
> worked any more, including the few services that had been on that machine.
> After asking the experts, I realized that I was going to have to wipe the
> machine, do a clean install, get all of the necessary apps installed, and
> then restore both sets of backups (LUG server and previous contents of that
> server) to the clean system.
>
> So far this is not a crazy plan. I've done things like it many times
> before, though the 9-year software update gap made for a few challenges.
>
> Once I got all of the apps installed and the backups restored, I
> immediately typed the command to turn it all on
>         /local/mailman/bin/mailmanctl start
> and nothing happened. The error log showed a preposterous, deeply hard to
> believe error message.
>
> The wise person's first step in debugging strange failures on computers is
> to type the error message into a search engine (I use Bing) to see if other
> people had asked about it. To my great astonishment, no one had. This never
> happens. Somebody else *always* has the same problem and has asked about 
> it.
>
> I then started reading the source code of Mailman, trying to see what
> circumstances would cause it to generate that message.  Mailman is written
> in a language called Python. When you are having trouble like this, a good
> step is to explore "version skew". Mailman Version XXX works only with
> Python Version YYY. The versions of Python that are extant just now are
> 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4.  This is an abnormally large spread of
> "current" versions, which usually means that the language developers have
> made incompatible changes and have to keep old versions around for apps
> that have come to depend on them.
>
> I tried all 6 of those Python versions. I got the same odd error in the
> 2.* versions, and absolute chaos in the 3.* versions. Since the version of
> Mailman that I wanted to use (2.1.18) failed the same way with all of the
> 2.* Python versions, I wiped the slate clean one last time and installed
> Python 2.7.
>
> Gonna have to find this problem the old-fashioned way.
>
> Many days pass as I read documentation, run tests, explore the software,
> use debuggers, create and read log files, all to no avail.
>
> Then I decided to instrument and log what was happening when
> Mailman/Python started up. Figuring out how much information to put in a
> log file is a black art. If you log too much, you will never find what you
> are looking for in the swamp of details. If you log too little, you
> probably won't log what you're looking for.
>
> After far too much time staring at the logs, I saw that Python was
> initializing from a library that was not listed in the Mailman
> documentation.
>
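The kind of startup logging described above might look roughly like the sketch below. It is an illustration only: it assumes Python 3 rather than the 2.7 used at the time, and the names module_origin and log_startup_imports are hypothetical, not part of Mailman. It logs the search path once and then one line per module of interest, which keeps the log small enough to read.

```python
import importlib
import logging
import sys

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("startup")


def module_origin(name):
    """Import `name` and return the file it was actually loaded from."""
    module = importlib.import_module(name)
    # Built-in modules such as `sys` have no __file__ attribute.
    return getattr(module, "__file__", None) or "(built-in)"


def log_startup_imports(names):
    """Log the search path once, then one line per module of interest."""
    log.info("sys.path = %s", sys.path)
    origins = {name: module_origin(name) for name in names}
    for name, origin in origins.items():
        log.info("%s loaded from %s", name, origin)
    return origins


if __name__ == "__main__":
    # Hypothetical choice of modules to watch.
    log_startup_imports(["email", "json"])
```

A module that reports a path outside the expected installation is the sort of clue the logs eventually gave here.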
> An aside: language systems like Python tend to be aggressive in how they
> find libraries. They look around and if they find something that looks like
> a library, they use it. I'm sure the Python designers (none of whom is
> named Monty) thought they were doing the world a favor by making it go out
> and find its own libraries. "Autoconfiguration" run amok. Bad idea.
>
> This library was obsolete. In the 9 years of not upgrading, the Mailman
> software had changed the place where it kept certain library functions, and
> both of them were present in the version I was trying to run. The "wipe
> clean and reinstall" function only wiped the directories that it knew
> about, and this obsolete directory was not on its list -- it had been
> retired years ago -- so it didn't get removed by the "wipe clean" function.
>
> If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, one
> of them would surely have deleted that newly-obsolete directory. But I
> didn't, so it was still there.
>
> When a complex computer system is using two different versions of the same
> library, with creation dates 7 years apart, it doesn't stand a chance of
> working.
>
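One way to spot this situation ahead of time is to list every directory on the search path that holds a copy of the same package. The sketch below is a simplified assumption about how that check could be done, not what was actually run on the LUG server: find_copies is a hypothetical helper, and it only looks for ordinary packages (a directory with an __init__.py), ignoring zip archives, single-file modules, and namespace packages.

```python
import os
import sys


def find_copies(package, search_path=None):
    """Return every directory on the search path that holds `package`.

    The first entry is the copy an import would normally pick up; any
    further entries are shadowed copies that may be stale.
    """
    search_path = sys.path if search_path is None else search_path
    copies = []
    for entry in search_path:
        # An empty entry in sys.path means the current directory.
        candidate = os.path.join(entry or os.curdir, package)
        if os.path.isfile(os.path.join(candidate, "__init__.py")):
            if candidate not in copies:
                copies.append(candidate)
    return copies


if __name__ == "__main__":
    for path in find_copies("email"):
        print(path)
```

If this prints more than one line for a package, the copies are worth comparing before trusting the install.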
> I typed the Unix command "rm -rf /local/mailman/Mailman/pythonlib/email"
> which got rid of the ancient and incompatible library
> and everything started working. Perfectly.
>
> There were hundreds of loose ends, and I spent the next week hunting them
> down, but it wasn't taking 18 hours a day and LUG mail was flowing while I
> did it.
>
> Thanks for listening.
> Brian Reid
> LUG Saloonkeeper and server wrangler
>
>
>
>
>
> _______________________________________________
> Leica Users Group.
> See http://leica-users.org/mailman/listinfo/lug for more information
>



-- 
// richard <http://www.richardmanphoto.com>
// http://facebook.com/richardmanphoto


In reply to: Message from reid at mejac.palo-alto.ca.us (Brian Reid) ([Leica] Narrative about the extended LUG outage)