Tuesday, March 15, 2005

Key open-source programming tool due for overhaul

The entire realm of open-source software could get a performance boost if all goes well with a plan to overhaul a crucial programming tool called GCC.

Almost all open-source software is built with GCC, a compiler that converts a program's source code--the commands written by humans in high-level languages such as C--into the binary instructions a computer understands. The forthcoming GCC 4.0 includes a new foundation that will allow that translation to become more sophisticated, said Mark Mitchell, the GCC 4 release manager and "chief sourcerer" of a small company called CodeSourcery.

"The primary purpose of 4.0 was to build an optimization infrastructure that would allow the compiler to generate much better code," Mitchell said.

Compilers are rarely noticed outside the software development community, but GCC carries broad significance. For one thing, an improved GCC could boost performance for the open-source software realm--everything from Linux and Firefox to OpenOffice.org and Apache that collectively compete with proprietary products from Microsoft, IBM and others.

For another, GCC is a foundation for an entire philosophy of cooperative software development. It's not too much of a stretch to say GCC is as central an enabler to the free and open-source programming movements as a free press is to democracy.

GCC, which stands for GNU Compiler Collection, was one of the original projects in the GNU's Not Unix effort. Richard Stallman launched GNU and the accompanying Free Software Foundation in the 1980s to create a clone of Unix that's free from proprietary licensing constraints.

The first GCC version was released in 1987, and GCC 3.0 was released in 2001. A company called Cygnus Solutions, an open-source business pioneer acquired in 1999 by Linux seller Red Hat, funded much of the compiler's development.

But improving GCC isn't a simple matter, said Evans Data analyst Nicholas Petreley. There have been performance improvements that came from moving from GCC 3.3 to 3.4, but at the expense of backwards-compatibility: Some software that compiled fine with 3.3 broke with 3.4, Petreley said.

RedMonk analyst Stephen O'Grady added that updating GCC shouldn't compromise its ability to produce software that works on numerous processor types.

"If they can achieve the very difficult goal of not damaging that cross-platform compatibility and backwards-compatibility, and they can bake in some optimizations that really do speed up performance, the implications will be profound," O'Grady said.

What's coming in 4.0
GCC 4.0 will bring a foundation to which optimizations can be added. Those optimizations can take several forms, but in general, they'll give the compiler ways to look at an entire program.

For example, the current version of GCC can optimize small, local parts of a program. But one new optimization, called scalar replacement of aggregates, lets GCC find data structures that span a larger amount of source code. GCC then can break those objects apart so that object components can be stored directly in fast on-chip memory rather than in sluggish main memory.
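To make the idea of breaking an object apart concrete, here is a minimal sketch in C. The struct and function names are hypothetical and are not taken from GCC; the second function is only a hand-written picture of what the compiler might effectively produce, not its actual output.

```c
/* Hypothetical example type, invented for illustration. */
struct point {
    int x;
    int y;
};

/* As a programmer might write it: a local aggregate is filled in
   and then read back field by field. */
int sum_with_struct(int a, int b)
{
    struct point p;
    p.x = a * 2;
    p.y = b + 1;
    return p.x + p.y;
}

/* Roughly how the same function looks once the aggregate is replaced
   by scalars: each field becomes an independent variable that the
   compiler is free to keep in a processor register. */
int sum_with_scalars(int a, int b)
{
    int p_x = a * 2;
    int p_y = b + 1;
    return p_x + p_y;
}
```

Both functions return the same value for the same arguments; the difference is only in how much freedom the compiler has to keep the data out of main memory.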

"Optimization infrastructure is being built to give the compiler the ability to see the big picture," Mitchell said. The framework is called Tree SSA (static single assignment).

However, Mitchell said the optimization framework is only the first step. Next will come writing optimizations that plug into it. "There is not as much use of that infrastructure as there will be over time," Mitchell said.
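The "static single assignment" in the framework's name refers to a way of representing a program in which every variable is assigned exactly once. The sketch below imitates that form by hand in C; the function names and the numbered variables are invented for illustration, and GCC's internal representation is not C source code.

```c
/* Ordinary C: the variable x is assigned up to three times. */
int original(int a, int b)
{
    int x = a + b;
    x = x * 2;
    if (x > 10)
        x = x - 10;
    return x;
}

/* The same computation written in an SSA-like style: each new value
   gets its own name (x1, x2, ...), so every name has one definition. */
int ssa_style(int a, int b)
{
    const int x1 = a + b;
    const int x2 = x1 * 2;
    int x4;
    if (x2 > 10) {
        const int x3 = x2 - 10;
        x4 = x3;            /* value arriving from the "then" path */
    } else {
        x4 = x2;            /* value arriving from the fall-through path */
    }
    /* In SSA notation the merge above is written x4 = phi(x3, x2). */
    return x4;
}
```

Because each name has a single definition, an optimizer can trace where any value came from without scanning for later reassignments, which is part of what makes the form a convenient base for new optimizations.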

One optimization that likely will be introduced in GCC 4.1 is called autovectorization, said Richard Henderson, a Red Hat employee and GCC core programmer. That feature economizes processor operations by finding areas in software in which a single instruction can be applied to multiple data elements--something handy for everything from video games to supercomputing.
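A loop like the one below is the kind of code such a feature looks for. This is a sketch under assumptions: the function name is hypothetical, it uses C99 features, and whether any particular compiler version actually vectorizes it depends on the target processor and the options used.

```c
#include <stddef.h>

/* Adds two arrays element by element. Every iteration does the same
   operation on different data and no iteration depends on another,
   so a vectorizing compiler may process several elements with one
   instruction instead of one element at a time.
   The restrict qualifiers (C99) promise that the arrays do not overlap. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The source code stays an ordinary loop; the programmer does not write the multi-element instructions by hand.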

GCC 4.0 also introduces a security feature called Mudflap, which adds extra features to the compiled program that check for a class of vulnerabilities called buffer overruns, Mitchell said. Mudflap slows a program's performance, so it's expected to be used chiefly in test versions, then switched off for finished products.
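A buffer overrun happens when a program writes past the end of the memory set aside for an object. The sketch below shows, written out by hand, the general kind of bounds check that a checking tool adds on the programmer's behalf. The helper names are hypothetical, and this is not Mudflap's actual mechanism or interface.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the byte range
   [offset, offset + len) lies inside an object of object_size bytes,
   and 0 otherwise. Written to avoid overflow in offset + len. */
int access_in_bounds(size_t object_size, size_t offset, size_t len)
{
    return offset <= object_size && len <= object_size - offset;
}

/* Copies n bytes into dst only if they fit in dst_size bytes.
   Returns 0 on success, -1 if the copy would overrun the buffer. */
int checked_copy(char *dst, size_t dst_size, const char *src, size_t n)
{
    if (!access_in_bounds(dst_size, 0, n))
        return -1;          /* refuse instead of overrunning */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return 0;
}
```

Every such check costs extra instructions at run time, which is consistent with the expectation that this kind of checking is used in test builds and switched off for finished products.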

Also coming will be a preview of technology to compile programs written in Fortran 95, an updated version of a decades-old programming language still popular for scientific and technical tasks, Henderson said. And software written in the C++ programming language should run faster--"shockingly better" in a few cases, Henderson added.

GCC is a very general-purpose compiler. It can handle programs written in languages including C, C++, Java, Fortran, Pascal, Objective-C and Ada. It can generate software for processors including x86 models such as Pentium and Opteron, Sun's Sparc, Hewlett-Packard's PA-RISC, IBM's Power and mainframe processors, Intel's Itanium, MIPS, ARM, Hitachi's SuperH and Motorola's 68000 series.

"The promise of GCC has been portability and cross-platform support over speed," O'Grady said.

GCC has about 10 core programmers, Mitchell said. The commercialization and professionalization wave that arrived with Linux and other high-profile open-source projects has affected GCC.

"In terms of people writing the lion's share of code, most are doing it for a living at this point, in contrast to 10 years ago," Mitchell said. "A lot of the development work is very time-consuming and needs to have a long-term commitment. It's hard to do it during a two-week break during semesters."

CodeSourcery, with about a dozen employees, makes money by selling services around GCC and related low-level programming components such as the GNU C Library (glibc) of pre-written software components. For example, other companies pay CodeSourcery to support new operating systems or processors.

Other options
GCC isn't the only option available to programmers, of course. It's not even the only open-source compiler.

A start-up called PathScale offers an open-source compiler that's compatible with GCC 3.3. "Our company is trying to be the GCC alternative for people who care about high performance," said Len Rosenthal, vice president of marketing for PathScale.

PathScale's compiler is a version of the Open64 compiler released by Silicon Graphics as open-source software. It's in use at several national laboratories for supercomputing tasks, but Rosenthal said the compiler produces faster software even with general-purpose programs.

Rosenthal understands what PathScale is up against with GCC. "It's everywhere," he said. But PathScale still has a strong ambition: "Our goal is to be the default compiler on x86," the chip family that includes Intel's Pentium, he said.

A better-established GCC competitor is Intel, whose compilers are recognized as the gold standard for software running on x86 chips. James Reinders, director of marketing and business for Intel's software products division, proudly points out that the widely used MySQL open-source database uses Intel's compiler.

But in a curious twist, the very same compiler engineers at Intel also help with GCC. That's because GCC is a crucial tool to bring software to Intel's processors. For example, Intel helped adapt GCC so it could produce software for its Itanium processor, Reinders said.

"Obviously it's well-adopted," Reinders said. "GCC has a role in the community that it would be foolish to think it's not important."
