Archived posting to the Leica Users Group, 2009/07/31
I get the sense that there are about half a dozen of us here on the LUG who are experienced IT professionals who have lived and breathed this disk/network/OS/reliability/make-it-work scenario for a long time. I wonder if we could work collectively to come up with a few specific recommendations.

I have spent about half of my professional career on the design and test of reliable systems, but since I spent several of those years working with NASA, I know that I am in the minor leagues as these things go. (Anytime you think you know a lot about reliability engineering or testing, go spend some time at Johnson Space Center or Jet Propulsion Labs and you will change your mind.)

Based entirely on my own experience, without consulting these equally talented folks, I have a few comments, in no particular order.

* Reliability is a whole-system issue, not a component issue. You have to design under the assumption that components will fail, but it's always better to use components that fail less often. Remember the "for want of a nail the kingdom was lost" folk tale. A disk is only as reliable as the means that you have for getting data off of it. And unless you built it yourself, you are dependent on the disk system vendor to make the design decisions for the parts of it that are not actually disks.

* Reliability is not cheap. You get what you pay for. I have worked with people who have spent their whole professional lives doing reliability engineering and reliability analysis. I spent 2 years of my life about 20 years ago redesigning the NASDAQ stock exchange network so that it would be more reliable, and I believe that engineers from Sun Microsystems did another redesign a dozen years later to get even more reliability. There were no "quick fixes". We had to re-think everything.

* I believe that it is not possible to get full reliability on a Windows system no matter what you do. If you store no data on the Windows system, using it only as a client to access a reliable server, then you can just replace it when it breaks, but trying to build reliability directly into a Windows system is fundamentally a lost cause.

* There are no magic brands at consumer price levels. In the consumer space, price is everything. If you want to buy Leica-grade components, you have to pay Leica-grade prices. If you think that some brand (e.g. Drobo) has been good for you, that just means that Drobo had at that time a very successful buyer and QA team. These things change, and in the consumer space there is no company that has rigid reviewed reliability standards for its products. You have to buy industrial-grade disks for that.

* Because reliability is a whole-system issue, there are whole-system vendors that have better reliability records than others. This comes from having engineers who know how to design more-reliable systems, purchasing agents who know how to buy more-reliable components, QA departments that know how to test for reliability, manufacturing departments that know how both to manufacture for reliability and to incorporate what they learned from the QA department to make it better.

* There are three fundamentally different philosophies of making things be reliable. I call them
  -- the Phone Company way
  -- the Internet way
  -- the NASA way
  I won't bore you by trying to define or describe them. But I use the Internet way in my personal life.

I'd better shut up now.

Brian Reid