Archived posting to the Leica Users Group, 2014/07/01
In case you care.

Server computers that are engineered for reliability have two power supplies and two power cords. Power supplies are the most frequent component to fail in server computers, so having two of them lets the server survive the failure of one. The server computer that had supported the LUG had two power supplies. They were stacked vertically, one on top of the other. Both power supplies had been running 24x7 for about 9 years, and their fans had sucked in a certain amount of lint. Lint is flammable. The bottom power supply failed, and the lint caught fire. The flame rose to the upper power supply and ignited its stored-up lint also. Like firestarters in a Franklin stove, the 20-second burst of flame was enough to ignite the various flammable items (including lint) in the main enclosure. The flash fire probably only lasted 40 or 50 seconds, but it was hot enough to destroy most of the solder traces that were near the power supplies on the circuit boards. There were various plastic tags on some of the cables, which added flammable material.

You can go to the store and buy a laptop or a desktop computer, but you really can't go buy a server computer. Yes, this being Silicon Valley, there are stores around that sell server computers (Central Computer is the best of the lot) but buying a server computer at a retail store is like buying a bicycle at a department store. It's just not the same thing. Server computers are special-order, because there are so many variations on how they are built that no one can afford to keep good ones in inventory.

The fire was on a Saturday morning, and I knew that the soonest I could even place an order for a replacement server was Monday, and even at rush-rush prices I wouldn't get it until Thursday. At the time a Saturday-to-Thursday outage seemed unconscionable.

So I decided to move the LUG and its supporting software to the newest and emptiest of my half-dozen servers. It wasn't exactly a spare--it was running a few little things--but mostly it was idle. The LUG server had been running software from the era of its installation, about 2005. The new server was built with chips and components that the old software didn't understand, so I couldn't just restore the LUG server backups onto the new server. They wouldn't run. I had to get the new software working on the replacement server and then manually move over each piece.

I made the mistake of believing the operating system documentation, which detailed a function called "system upgrade". It was supposed to work the way Mac or Windows updates work--you let it do its thing for a while, and then you reboot and all is well. After running the system upgrade, nothing worked any more, including the few services that had been on that machine. After asking the experts, I realized that I was going to have to wipe the machine, do a clean install, get all of the necessary apps installed, and then restore both sets of backups (LUG server and previous contents of that server) to the clean system.

So far this is not a crazy plan. I've done things like it many times before, though the 9-year software update gap made for a few challenges. Once I got all of the apps installed and the backups restored, I immediately typed the command to turn it all on:

    /local/mailman/bin/mailmanctl start

and nothing happened. The error log showed a preposterous, deeply hard to believe error message.

The wise person's first step in debugging strange failures on computers is to type the error message into a search engine (I use Bing) to see if other people have asked about it. To my great astonishment, no one had. This never happens. Somebody else *always* has the same problem and has asked about it.

I then started reading the source code of Mailman, trying to see what circumstances would cause it to generate that message. Mailman is written in a language called Python. When you are having trouble like this, a good step is to explore "version skew". Mailman Version XXX works only with Python Version YYY. The versions of Python that are extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4. This is an abnormally large spread of "current" versions, which usually means that the language developers have made incompatible changes and have to keep old versions around for apps that have come to depend on them.

I tried all 6 of those Python versions. I got the same odd error in the 2.* versions, and absolute chaos in the 3.* versions. Since the version of Mailman that I wanted to use (2.1.18) failed the same way with all of the 2.* Python versions, I wiped the slate clean one last time and installed Python 2.7. Gonna have to find this problem the old-fashioned way.

Many days pass as I read documentation, run tests, explore the software, use debuggers, create and read log files, all to no avail. Then I decided to instrument and log what was happening when Mailman/Python started up. Figuring out how much information to put in a log file is a black art. If you log too much, you will never find what you are looking for in the swamp of details. If you log too little, you probably won't log what you're looking for. After far too much time staring at the logs, I saw that Python was initializing from a library that was not listed in the Mailman documentation.

An aside: language systems like Python tend to be aggressive in how they find libraries. They look around and if they find something that looks like a library, they use it. I'm sure the Python designers (none of whom is named Monty) thought they were doing the world a favor by making it go out and find its own libraries. "Autoconfiguration" run amok. Bad idea.

This library was obsolete.

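For anyone who wants to check for this kind of thing on their own machine, here is a minimal sketch of the idea: walk Python's library search path and list every place where a copy of a given library is sitting. The function name find_package_copies is made up for this illustration; it is not part of Python or Mailman. The sketch looks only at ordinary directories on sys.path and ignores zip files, import hooks, and built-in modules, so treat its answer as a hint rather than proof. The "email" package is used purely as an example name.

```python
import os
import sys


def find_package_copies(name, search_path=None):
    """Return every copy of package or module `name` found on the search path.

    The first entry is the one a plain `import name` would normally pick up;
    any further entries are shadowed copies that may be stale.
    """
    if search_path is None:
        search_path = sys.path
    copies = []
    for entry in search_path:
        base = entry or os.getcwd()  # an empty entry means the current directory
        package_dir = os.path.join(base, name)
        module_file = os.path.join(base, name + ".py")
        if os.path.isfile(os.path.join(package_dir, "__init__.py")):
            candidate = package_dir
        elif os.path.isfile(module_file):
            candidate = module_file
        else:
            continue
        if candidate not in copies:  # sys.path can list a directory twice
            copies.append(candidate)
    return copies


if __name__ == "__main__":
    import email
    # Where the running interpreter actually got the library from.
    print("email was loaded from: %s" % email.__file__)
    # Every copy visible on the search path, in search order.
    for path in find_package_copies("email"):
        print("copy on the search path: %s" % path)
```

If the list has more than one entry, or the "loaded from" line points somewhere you did not expect, that is worth a closer look before spending days in the debugger.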
In the 9 years of not upgrading, the Mailman software had changed the place where it kept certain library functions, and both the old and the new locations were present in the version I was trying to run. The "wipe clean and reinstall" function only wiped the directories that it knew about, and this obsolete directory was not on its list -- it had been retired years ago -- so it didn't get removed by the "wipe clean" function. If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, one of them would surely have deleted that newly-obsolete directory. But I didn't, so it was still there.

When a complex computer system is using two different versions of the same library, with creation dates 7 years apart, it doesn't stand a chance of working. I typed the Unix command

    rm -rf /local/mailman/Mailman/pythonlib/email

which got rid of the ancient and incompatible library and everything started working. Perfectly.

There were hundreds of loose ends, and I spent the next week hunting them down, but it wasn't taking 18 hours a day and LUG mail was flowing while I did it.

Thanks for listening.

Brian Reid
LUG Saloonkeeper and server wrangler