On September 7, John Savage (with André DeHon of UPenn) received a collaborative research grant from the National Science Foundation to study coded computation and storage at the nanoscale.
A key challenge facing the semiconductor industry is coping with the high error rates that result from the decreasing size of chip features. Transient faults, along with permanent defects and stochastic assembly, make it difficult to implement traditional architectures. Research has been done on routing around defects and coping with large amounts of device variation. Little is known, however, about how to cope efficiently with high rates of transient errors during computation. This research will take a new, systematic approach to tolerating transient failures. The goal is to help the semiconductor industry better understand the dimensions of the nanoscale reliability problem. The research is also relevant to space-borne applications, where error control can serve as an alternative to radiation hardening.
This research takes a three-part approach to fault-tolerant computation. First, it exploits differential reliability; that is, it examines the use of a small number of reliable elements to oversee a large number of unreliable elements. Second, it draws on the success of coding theory to explore both special and general methods for encoding the inputs and outputs of a potentially faulty computation, paying particular attention to a seminal approach taken by Spielman in 1996. When a computation is encoded, faults at its encoded outputs can be detected and corrected. Third, it examines the use of small check computations followed by possible rollback, where most of the checking is done using unreliable elements. Allowing a computation to be repeated in time, rather than in space, reduces the overhead of fault-free computations. The design work is expected to have an immediate impact on practice, whereas the development of a general theory is expected to have a longer-term impact.
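The idea of a small check followed by rollback can be illustrated with a toy sketch. It is not the method developed under the grant: the faulty multiplier, the single-bit fault model, the residue check modulo 3, and every function name below are assumptions chosen only for illustration. The sketch also simplifies by treating the check itself as reliable, whereas the research considers checking done mostly by unreliable elements.

```python
import random


def unreliable_multiply(a, b, fault_rate, rng):
    """Hypothetical faulty unit: returns a * b, but with probability
    fault_rate one randomly chosen bit of the product is flipped."""
    result = a * b
    if rng.random() < fault_rate:
        width = max(result.bit_length(), 1)
        result ^= 1 << rng.randrange(width)
    return result


def residue_check(a, b, result, modulus=3):
    """Small check, assumed reliable here: the residue of the product
    must equal the product of the residues. A single flipped bit changes
    the result by +/- 2**k, which is never a multiple of an odd modulus,
    so every single-bit fault is caught."""
    return ((a % modulus) * (b % modulus)) % modulus == result % modulus


def checked_multiply(a, b, fault_rate=0.3, max_retries=50, seed=0):
    """Run the faulty unit, check the output, and roll back (recompute)
    whenever the check fails. Returns (product, attempts used)."""
    rng = random.Random(seed)
    for attempt in range(1, max_retries + 1):
        result = unreliable_multiply(a, b, fault_rate, rng)
        if residue_check(a, b, result):
            return result, attempt
    raise RuntimeError("no attempt passed the check")


if __name__ == "__main__":
    product, attempts = checked_multiply(1234, 5678)
    print(product, "after", attempts, "attempt(s)")
```

A fault-free run pays only for the small residue check, and the full multiplication is repeated only when a fault is detected. That is the sense in which repetition in time, rather than replication in space, keeps the overhead of fault-free computations low.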