Archived posting to the Leica Users Group, 2014/07/02

Subject: [Leica] Narrative about the extended LUG outage
From: pdzwig at summaventures.com (Peter Dzwig)
Date: Thu, 03 Jul 2014 00:24:02 +0100
References: <53B3898D.2030005@mejac.palo-alto.ca.us>

Congrats Brian. Sounds like a lot of work?

Was it "apt-get autoremove" that failed?

I have been finding that a pain on my systems too.

Thanks for all your hard work on our behalf. We are all very grateful to you
for keeping the LUG going.

Peter

On 02/07/2014 05:24, Brian Reid wrote:
> In case you care.
> 
> Server computers that are engineered for reliability have two power supplies
> and two power cords. Power supplies are the most frequent component to fail
> in server computers, so having two of them lets the server survive the
> failure of one.
> 
> The server computer that had supported the LUG had two power supplies. 
> They were
> stacked vertically, one on top of the other. Both power supplies had been
> running 24x7 for about 9 years, and their fans had sucked in a certain 
> amount of
> lint. Lint is flammable. The bottom power supply failed, and the lint 
> caught
> fire. The flame rose to the upper power supply and ignited its stored-up 
> lint
> also. Like firestarters in a Franklin stove, the 20-second burst of flame 
> was
> enough to ignite the various flammable items (including lint) in the main
> enclosure. The flash fire probably only lasted 40 or 50 seconds, but it 
> was hot
> enough to destroy most of the solder traces that were near the power 
> supplies on
> the circuit boards. There were various plastic tags on some of the cables, 
> which
> added flammable material.
> 
> You can go to the store and buy a laptop or a desktop computer, but you 
> really
> can't go buy a server computer. Yes, this being silicon valley, there are 
> stores
> around that sell server computers (Central Computer is the best of the 
> lot) but
> buying a server computer at a retail store is like buying a bicycle at a
> department store. It's just not the same thing. Server computers are
> special-order, because there are so many variations on how they are built 
> that
> no one can afford to keep good ones in inventory.
> 
> The fire was on a Saturday morning, and I knew that the soonest I could 
> even
> place an order for a replacement server was Monday, and even at rush-rush 
> prices
> I wouldn't get it until Thursday. At the time a Saturday-to-Thursday outage
> seemed unconscionable. So I decided to move the LUG and its supporting 
> software
> to the newest and emptiest of my half-dozen servers. It wasn't exactly a
> spare--it was running a few little things--but mostly it was idle.
> 
> The LUG server had been running software from the era of its installation,
> about 2005. The new server was built with chips and components that the old
> software didn't understand, so I couldn't just restore the LUG server backups
> onto the new server. They wouldn't run. I had to get the new software working
> on the replacement server and then manually move over each piece.
> 
> I made the mistake of believing the operating system documentation, which
> detailed a function called "system upgrade". It was supposed to work the way
> Mac or Windows updates work--you let it do its thing for a while, and then
> you reboot and all is well. After running the system upgrade, nothing worked
> any more, including the few services that had been on that machine. After
> asking the experts, I realized that I was going to have to wipe the machine,
> do a clean install, get all of the necessary apps installed, and then restore
> both sets of backups (LUG server and previous contents of that server) to the
> clean system.
> 
> So far this is not a crazy plan. I've done things like it many times 
> before,
> though the 9-year software update gap made for a few challenges.
> 
> Once I got all of the apps installed and the backups restored, I 
> immediately
> typed the command to turn it all on
>     /local/mailman/bin/mailmanctl start
> and nothing happened. The error log showed a preposterous, deeply hard to
> believe error message.
> 
> The wise person's first step in debugging strange failures on computers is 
> to
> type the error message into a search engine (I use Bing) to see if other 
> people
> had asked about it. To my great astonishment, no one had. This never 
> happens.
> Somebody else *always* has the same problem and has asked about it.
> 
> I then started reading the source code of Mailman, trying to see what
> circumstances would cause it to generate that message.  Mailman is written 
> in a
> language called Python. When you are having trouble like this, a good step 
> is to
> explore "version skew". Mailman Version XXX works only with Python Version 
> YYY.
> The versions of Python that are extant just now are 2.5, 2.6, 2.7, 3.2, 
> 3.3, and
> 3.4.  This is an abnormally large spread of "current" versions, which 
> usually
> means that the language developers have made incompatible changes and have 
> to
> keep old versions around for apps that have come to depend on them.
> 
> I tried all 6 of those Python versions. I got the same odd error in the 2.*
> versions, and absolute chaos in the 3.* versions. Since the version of 
> Mailman
> that I wanted to use (2.1.18) failed the same way with all of the 2.* 
> Python
> versions, I wiped the slate clean one last time and installed Python 2.7.
> 
> Gonna have to find this problem the old-fashioned way.
> 
> Many days pass as I read documentation, run tests, explore the software, 
> use
> debuggers, create and read log files, all to no avail.
> 
> Then I decided to instrument and log what was happening when Mailman/Python
> started up. Figuring out how much information to put in a log file is a 
> black
> art. If you log too much, you will never find what you are looking for in 
> the
> swamp of details. If you log too little, you probably won't log what you're
> looking for.
> 
> After far too much time staring at the logs, I saw that Python was
> initializing from a library that was not listed in the Mailman documentation.
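
One way to keep such a startup log small is to record only the modules of interest and the file each one was loaded from. The following is a minimal sketch in current Python 3 (not the Python 2.7 used on the LUG server); the function names and the "startup" logger name are hypothetical and are not part of Mailman.

```python
import logging
import sys


def loaded_module_files(prefix):
    """Return {module name: file path} for already-imported modules.

    Only modules named `prefix` or `prefix.something` are included, and
    modules without a __file__ (for example built-ins) are skipped.
    """
    result = {}
    for name, module in list(sys.modules.items()):
        if name == prefix or name.startswith(prefix + "."):
            path = getattr(module, "__file__", None)
            if path:
                result[name] = path
    return result


def log_loaded_modules(prefix, logger=None):
    """Write one log line per matching module and return the mapping."""
    logger = logger or logging.getLogger("startup")  # hypothetical logger name
    files = loaded_module_files(prefix)
    for name in sorted(files):
        logger.info("%s loaded from %s", name, files[name])
    return files
```

Calling `log_loaded_modules("email")` after the application has started would list every `email` submodule with its location, which makes an unexpected directory easy to spot.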
> 
> An aside: language systems like Python tend to be aggressive in how they 
> find
> libraries. They look around and if they find something that looks like a
> library, they use it. I'm sure the Python designers (none of whom is named
> Monty) thought they were doing the world a favor by making it go out and 
> find
> its own libraries. "Autoconfiguration" run amok. Bad idea.
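
To make the aside concrete: Python walks the directories in `sys.path` in order and uses the first match it finds. Below is a minimal sketch, again in current Python 3 rather than the 2.7 of the original story, of asking the interpreter which file it would load for a given top-level package. The helper name `module_origin` is hypothetical.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the location Python would load top-level module `name` from.

    Returns None when no module of that name is found on the search path.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Directories are searched in this order; the first match wins.
    for entry in sys.path:
        print("search path:", entry)
    print("email package loads from:", module_origin("email"))
```

If the printed location is not the directory you expected, an older copy earlier in the search path is probably shadowing the intended one.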
> 
> This library was obsolete. In the 9 years of not upgrading, the Mailman
> software had changed the place where it kept certain library functions, and
> both the old and the new locations were present in the version I was trying
> to run. The "wipe clean and reinstall" function only wiped the directories
> that it knew about, and this obsolete directory was not on its list -- it had
> been retired years ago -- so it didn't get removed by the "wipe clean"
> function.
> 
> If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, one 
> of
> them would surely have deleted that newly-obsolete directory. But I 
> didn't, so
> it was still there.
> 
> When a complex computer system is using two different versions of the same
> library, with creation dates 7 years apart, it doesn't stand a chance of 
> working.
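
A check for this situation can be sketched as a scan of the search directories for any package name that appears more than once. This is only an illustration under simplifying assumptions: it treats a "package" as a subdirectory containing `__init__.py`, ignores single-file modules, and the directory names in `demo` are invented for the example rather than taken from a real Mailman installation.

```python
import os
import tempfile


def find_duplicate_packages(search_paths):
    """Map each package name found in more than one directory to its locations.

    Locations are listed in search order, so the first one is the copy
    Python would actually import.
    """
    seen = {}
    for directory in search_paths:
        if not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            candidate = os.path.join(directory, entry)
            if os.path.isfile(os.path.join(candidate, "__init__.py")):
                seen.setdefault(entry, []).append(candidate)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


def demo():
    """Build two throwaway directories that both contain an 'email' package."""
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "pythonlib")      # illustrative name
        new = os.path.join(root, "site-packages")  # illustrative name
        for base in (old, new):
            os.makedirs(os.path.join(base, "email"))
            open(os.path.join(base, "email", "__init__.py"), "w").close()
        # A package present in only one place should not be reported.
        os.makedirs(os.path.join(new, "unique"))
        open(os.path.join(new, "unique", "__init__.py"), "w").close()
        return sorted(find_duplicate_packages([old, new]))
```

Passing `sys.path` as `search_paths` would apply the same scan to a live interpreter.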
> 
> I typed the Unix command "rm -rf /local/mailman/Mailman/pythonlib/email"
> which got rid of the ancient and incompatible library
> and everything started working. Perfectly.
> 
> There were hundreds of loose ends, and I spent the next week hunting them 
> down,
> but it wasn't taking 18 hours a day and LUG mail was flowing while I did 
> it.
> 
> Thanks for listening.
> Brian Reid
> LUG Saloonkeeper and server wrangler
> 
> _______________________________________________
> Leica Users Group.
> See http://leica-users.org/mailman/listinfo/lug for more information.
> 

-- 

===========================================================
Dr Peter Dzwig

In reply to: Message from reid at mejac.palo-alto.ca.us (Brian Reid) ([Leica] Narrative about the extended LUG outage)