The IBM Blue Gene/Q-based Sequoia supercomputer at the Lawrence Livermore National Laboratory (LLNL) is the second most powerful in the world, capable of delivering 20 petaflops of peak performance. Given the number of processes it can run, debugging a system such as Sequoia presents a major logistical challenge. However, using a piece of analysis software, researchers reported they were able to debug a program running more than one million MPI processes.
Developers attempting to scale up the Sequoia supercomputer repeatedly encountered hardware and software defects that led to application failures, particularly at the scale of 1,179,648 compute cores. Even a basic test program was difficult to debug at that scale, according to Livermore Computing scientist Dong Ahn.
“Finding and dealing with a ‘bug’ in one of these systems makes the same process on a PC look like child’s play; like looking for the red circle amongst a pile of blue squares,” wrote redOrbit.com’s Michael Harper.
However, researchers developed a lightweight program called the Stack Trace Analysis Tool (STAT), which gathers stack traces from the processes of a running parallel job and groups processes that behave alike so that outliers stand out. Using STAT, Ahn was able to pinpoint a specific MPI rank that was stuck in a system call. An engineer was then able to identify and correct a hardware defect.
“Replacing the component suddenly got the entire Sequoia system back to life,” Ahn said. “Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break.”
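The general idea behind stack trace analysis can be illustrated with a short Python sketch. This is not STAT's implementation, only a minimal illustration under stated assumptions: the function names `group_ranks_by_trace` and `smallest_groups` are hypothetical, and the sample traces are invented for the example. It groups ranks that report the same call stack and lists the rarest groups first, since a lone rank doing something different from its peers is often the place to start looking.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Map each distinct call stack to the sorted list of ranks that reported it."""
    groups = defaultdict(list)
    for rank, frames in traces.items():
        groups[tuple(frames)].append(rank)
    return {stack: sorted(ranks) for stack, ranks in groups.items()}


def smallest_groups(groups, limit=3):
    """Return the rarest call stacks first; ties are broken by rank list."""
    ordered = sorted(groups.items(), key=lambda item: (len(item[1]), item[1]))
    return ordered[:limit]


# Invented sample data: eight ranks, seven waiting in MPI_Waitall and
# one (rank 5) sitting in a system call instead.
traces = {rank: ["main", "exchange_halo", "MPI_Waitall"] for rank in range(8)}
traces[5] = ["main", "write_checkpoint", "write"]

groups = group_ranks_by_trace(traces)
for stack, ranks in smallest_groups(groups):
    print(f"{len(ranks)} rank(s) {ranks}: {' > '.join(stack)}")
```

With the sample data, rank 5 ends up alone in its own group and is listed first. At the scale of a million processes, a tool would also need to collect and merge traces efficiently across the machine, which this sketch does not attempt.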
Lessons from STAT
Sequoia is designed to handle physics and engineering questions for the National Nuclear Security Administration (NNSA), as well as support programs designed for nonproliferation, counterterrorism, climate change, energy, health and security. As a result, ensuring that it runs successfully is a top security concern.
“Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia,” said Kim Cupps, leader of the Livermore Computing Division at LLNL.
However, bug-free performance is often just as critical in other applications, and STAT also provides an example of the value of source code analysis in creating a secure, stable piece of software for any context. By using static analysis tools, developers can identify bugs in large volumes of code much more quickly than through conventional testing.
Software news brought to you by Klocwork Inc., dedicated to helping software developers create better code with every keystroke.