Parallelism on Its Own Feet: 90 Days as an NVidia Intern

Introduction

During the summer of 2019, I took an internship at NVidia's Austin campus, working with one of their CUDA compiler teams. Two teams of engineers there work on compiler-related projects: the compiler verification team and the compiler development team. The compiler verification team ensures the correctness and consistent behavior of the CUDA compiler toolchain across a large number of applications on all platforms. The development team is in charge of the in-driver CUDA assembler, publicly known as ptxas. I worked under the supervision of Brian Deitrich and was mentored by Rishkul Kulkarni.

The compilation of CUDA code (and shader code) involves a rather complicated pipeline. Take CUDA code as an example. The CUDA source code is first compiled and optimized by an LLVM-based front-end compiler, namely nvcc. The nvcc compiler treats CUDA code as if it were regular sequential code and produces an intermediate representation as output. This intermediate representation is referred to as ptx code in publicly available documentation. The code for graphics shaders is compiled to ptx in a similar fashion. ptx code is embedded directly as textual strings into binaries, forming what is publicly known as fatbins. When the program launches, ptx is Just-In-Time (JIT) compiled to machine code in the driver by an optimizing code generator (OCG). OCG is aware of both the target architecture and CUDA semantics: it is where synchronization is handled and where most of the CUDA-specific optimization happens. OCG is written mainly in traditional C++. The compiler development team, also known as the OCG team, is responsible for developing and maintaining the OCG compiler.
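
To make the pipeline concrete, here is a minimal sketch: a trivial CUDA kernel, with the stages it goes through summarized in the comments. The kernel and the stage descriptions are my own illustration; exact tool invocations and file layouts vary by toolkit version.

    // A trivial CUDA kernel, used only to illustrate the pipeline.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;   // each thread handles one element
    }
    // Offline: nvcc compiles the CUDA C++ above into ptx (a virtual ISA) and
    // embeds the ptx as text into the executable's fatbin.
    // At launch: the driver JIT-compiles the ptx into machine code for the GPU
    // that is actually present, through OCG / ptxas.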

During my internship at NVidia, I formulated and implemented a way to evaluate the performance of CUDA programs almost statically. This work opens up several possibilities for the team. It helps engineers debug performance issues by presenting a number of “suspect code segments” to focus on. It potentially allows the compiler to optimize programs and make trade-offs on a case-by-case basis. Moreover, it helps the graphics team identify the most time-consuming shaders among a large number of candidates.

The OCG Compiler

The OCG team occupies a unique and crucial technical position in the company: every piece of code that runs on the GPU inevitably has to go through OCG. This has many implications. First of all, the performance of the emitted assembly affects the performance of every application, and the OCG team very often has to weigh various trade-offs very carefully. Secondly, the OCG team works very closely with the architectural design teams. NVidia has a long history of hardware-software co-design: instead of trying to embed ever more complicated logic into the hardware, as Intel has been doing for decades, NVidia's architectural teams work with the OCG team to assign each responsibility to the place where it makes the most sense. As an example, CUDA is a parallel programming language in that it describes the behavior of a large number of CUDA threads, including how those threads are synchronized. The responsibility of ensuring correct synchronization of CUDA threads is split between two places: the GPU provides hardware synchronization primitives, and the OCG compiler figures out where to insert those primitives based on its knowledge of the entire code structure (although this design is not documented in any publicly available white paper, the information is widely known thanks to the developer community's persistent reverse engineering of the OCG compiler). This design allows the compiler to skip synchronization entirely for a small chunk of code, and to replace explicit synchronization with predication when the overhead of synchronizing outweighs the actual computation. This architecture-compiler co-design is in some sense the secret sauce of NVidia's success: it not only frees the architectural team from the burden of maintaining backward compatibility, but also allows existing ptx code to automatically take advantage of new features of more recent hardware when it is JIT-compiled by ptxas in a new driver.
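
As a toy illustration of that freedom (my own example, not anything taken from the OCG code-base), consider a kernel with a short divergent branch. For a branch body this small, the code generator can emit a predicated store rather than branching and explicitly re-synchronizing the warp afterwards.

    __global__ void clampNegatives(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Threads within the same warp may disagree on this condition. Instead
        // of a real branch plus reconvergence, the compiler may predicate the
        // single store below, since the branch body is so cheap.
        if (x[i] < 0.0f)
            x[i] = 0.0f;
    }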

On the other hand, this software-hardware co-design has an Achilles' heel: the responsibility of ensuring good performance is shifted largely onto the compiler. It is therefore hard to maintain uniformly good performance across the board, precisely because program performance depends on how effective the optimization phases are. Moreover, the compiler needs to set a large number of performance-related parameters (termed knobs internally (more like semi-internally: a simple strings on ptxas exposes almost all of the knobs)) based on its knowledge of the target hardware along with its understanding of the code in question. Unfortunately, the same set of knobs very seldom works for all programs. As a concrete example, one such knob controls register usage. Spilling a register increases latency by hundreds of cycles. However, since hardware registers are shared among threads, spilling potentially allows more threads to run in parallel. The increased parallelism could hide the extra latency and improve performance. As you would imagine, whether spilling is worthwhile depends on the program in question.
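
CUDA exposes a small piece of this trade-off to programmers through the documented __launch_bounds__ hint (and, at the toolchain level, through register limits such as ptxas' --maxrregcount option). The sketch below is purely illustrative, with arbitrary numbers: a tighter register budget may force spills, each costing hundreds of cycles, but it also lets more warps stay resident, and the extra parallelism may hide that latency.

    // Hint: at most 256 threads per block, and aim for at least 8 resident
    // blocks per multiprocessor. The compiler may spill registers to honor it.
    __global__ void __launch_bounds__(256, 8)
    heavyKernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];   // placeholder body
    }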

Instead of using the same set of parameters for all programs, it would make sense to let the compiler pick a set of knobs specifically for every program (if not every code section) it compiles. This turns out to be extremely hard: since CUDA programs are parallel programs running on a throughput-oriented architecture, merely predicting program performance is hard (believe me, the OCG team has spent a tremendous amount of effort trying to predict shader performance; I can't disclose the details, but it's fair to say most, if not all, of those efforts failed quite miserably). To make things worse, the performance of CUDA programs, being parallel programs, is largely determined by how well parallelism is exploited, which very often depends on what the input looks like. In short, the real challenge lies in the fact that good parameters have to be chosen by understanding the program's run-time behavior statically.

Reasoning Performance (Almost) Statically

As stated earlier, my challenge became to find a way to extract performance-related information statically. The key thing to notice here is the notion of “performance-related information”. It's not about cycle count. Trying to predict cycle count (or absolute run-time) without input data is a quest doomed to fail for obvious reasons: the actual cycle count depends strongly on the shape and size of the input. For the same reason, it's uninteresting to compiler engineers. What is interesting to compiler designers is the “constituents” of the overall cycle count. Answers to questions like “Is this loop latency bound or throughput bound?” and “Does this conditional branch contribute to the overall latency?” can significantly affect the register allocation strategy, among other knobs. Performance-related information also involves more than cycle counts. CUDA programs run on GPU stream processors in a SIMD style, and program performance depends greatly on the utilization rate of each SIMD instruction executed. Predicting utilization involves predicting divergence and synchronization. Thanks to the compiler-architecture co-design, the compiler determines (and is fully aware of) the synchronization structure of the program. This made my life much easier.
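
To give a flavor of the “latency bound or throughput bound?” question, here is a toy classifier. It is not the model I built at NVidia; the structure, the field names, and the single-issue assumption are all simplifications of my own.

    // Toy model: can the other resident warps cover the longest dependence
    // chain of one loop iteration with useful issue work?
    struct LoopProfile {
        int depChainLatency;   // cycles along the longest dependence chain per iteration
        int issueCycles;       // cycles one warp spends issuing an iteration
        int residentWarps;     // warps sharing the same stream processor
    };

    const char* classify(const LoopProfile& p) {
        int hideable = (p.residentWarps - 1) * p.issueCycles;
        return hideable >= p.depChainLatency ? "throughput bound" : "latency bound";
    }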

Without going into too much detail, what I proposed is a way to represent the control flow graph (CFG) of a program in an algebraic form, taking into account synchronization and divergence within a single stream processor (there is generally no communication whatsoever between stream processors). Basic blocks are the atomic “numerals” of this algebraic representation, and algebraic operators capture the control and synchronization structure of the CFG. I then derived several “performance-invariant” transformations that allow the algebraic form to be simplified. By annotating performance-related parameters on the operators and tracking these annotations during simplification, we can compute performance metrics for any given region of the program. The work carries “almost” in its title because the initial performance annotations need to be provided by the programmer (e.g., how likely a branch is taken). Those annotations do not need to be accurate for the algorithm to be effective. I implemented the work in C++ as an extra optimization pass in the OCG code-base.
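
The following sketch conveys the flavor of the idea, nothing more. It is not the actual OCG representation: the operator set, the annotations, and the cost rules are my own simplification, and it leaves out the synchronization and divergence modeling that makes the real problem interesting.

    #include <memory>
    #include <vector>

    // An algebraic view of a CFG region: basic blocks are the "numerals",
    // while operators (Seq, Branch, Loop) capture the control structure.
    struct Expr;
    using ExprPtr = std::shared_ptr<Expr>;

    struct Expr {
        enum Kind { Block, Seq, Branch, Loop } kind;
        double cost = 0.0;            // Block: annotated cost (e.g. issue cycles)
        double prob = 0.0;            // Branch: taken-probability annotation
        double trip = 0.0;            // Loop: expected trip-count annotation
        std::vector<ExprPtr> kids;    // operands
    };

    // One "performance-invariant" simplification: collapse a region into a
    // single number while propagating the annotations.
    double expectedCost(const ExprPtr& e) {
        switch (e->kind) {
        case Expr::Block:
            return e->cost;
        case Expr::Seq: {             // sequential composition: costs add up
            double c = 0.0;
            for (const auto& k : e->kids) c += expectedCost(k);
            return c;
        }
        case Expr::Branch:            // weight the two arms by the annotation
            return e->prob * expectedCost(e->kids[0]) +
                   (1.0 - e->prob) * expectedCost(e->kids[1]);
        case Expr::Loop:              // body cost scaled by expected trip count
            return e->trip * expectedCost(e->kids[0]);
        }
        return 0.0;
    }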

Many CMU courses helped me come up with the idea. 15-618 Parallel Architecture and Programming provided me with essential knowledge of GPU architecture, and its CUDA programming assignment taught me exactly what I was up against. 15-745 Optimizing Compilers for Modern Architectures discussed instruction scheduling and data flow analysis in detail; it turns out the CUDA architecture, being an in-order architecture, relies on instruction scheduling to achieve even reasonable performance. 15-652 Foundations of Programming Languages proved to be surprisingly helpful. Fundamentally, the algebraic representation treats each basic block as a continuation, and the representation and its transformations are just manipulations of a program written in continuation-passing style. The idea of tracking performance annotations is derived from the cost models used in the analysis of parallel programs. In my opinion, this serves as proof of how a well-formulated theory can provide insight into practice.

Internship Experience

The style of work I did during the summer is quite different from the daily work of the other engineers on the team, which became quite apparent after the first one or two Monday all-hands meetings. My work was more of a research project; it was both my and Brian's intention to pick a rather self-contained project as my intern project. Daily work on the team usually involves bug fixes and various feature requests. Most of the feature requests are purely “functional” in the sense that a correct, reasonably performant, straightforward implementation is sufficient. Once a month or so, an engineer may be put in charge of a project that requires a significant amount of design work and coordination with other teams. Needless to say, this type of change carries huge risk, since it may affect the performance of the generated code. Unfortunately, any performance regression (especially on commercial benchmarking suites such as 3DMark) may be picked up by the media as a design failure of the graphics hardware. This makes iterating on the OCG compiler especially hard. Still, let's just say there were several pretty ambitious projects going on when I left the team (unfortunately, I cannot disclose what they are).

The OCG code-base and workflow are much more “traditional” than those of most other tech companies. For example, the entire code base is managed by Perforce instead of git, and OCG code is written in pre-2011 C++. The coding guidelines forbid the use of STL containers, for compatibility with some nearly obsolete platforms, and explicitly ban destructors, which makes modern-style C++ resource management (the Resource Acquisition Is Initialization idiom) virtually impossible. The code-base is currently in the process of upgrading to C++11, and many of these restrictions are gradually being dropped.

The OCG code base is an epitome of “path dependence”[1] in software engineering. OCG was once part of a kernel-space driver, and dynamic memory management is especially tricky in kernel space: all class constructors and functions using dynamic allocation had to take an extra argument specifying which memory pool to allocate from. Later, when the compiler was factored out into user space, this memory allocation protocol remained part of the code, forcing future code to operate in a similar fashion. The design is prone to all sorts of memory management issues. Eventually a solution came along: set up a local memory pool for each pass; within the pass, keep allocating from the local pool without worrying about freeing anything; at the end of the pass, free the entire pool. This strategy works just fine for most passes (most analyses and transformations consume space proportional to code size). Unfortunately, it did not work for my purpose, and I ended up implementing my own reference-counting pointers to free myself from tedious memory management. The team is quite open to these new techniques, and it felt nice to be able to bring fresh ideas to a decade-old code-base.
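
For readers unfamiliar with the pattern, here is a minimal sketch of the per-pass pool protocol described above. The class and function names are made up; the real OCG interface (which, among other things, avoids destructors) looks nothing like this.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // A pass-local pool: allocate freely during the pass, release everything
    // in one shot when the pass ends.
    class PassPool {
        std::vector<void*> chunks_;
    public:
        void* alloc(std::size_t n) {
            void* p = std::malloc(n);
            chunks_.push_back(p);
            return p;
        }
        void releaseAll() {
            for (void* p : chunks_) std::free(p);
            chunks_.clear();
        }
    };

    // Every allocation inside the pass routes through the pool argument,
    // mirroring the extra-argument protocol inherited from the driver days.
    struct Node { int id; Node* next; };

    Node* makeNode(PassPool& pool, int id, Node* next) {
        Node* n = static_cast<Node*>(pool.alloc(sizeof(Node)));
        n->id = id;
        n->next = next;
        return n;
    }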

The OCG team is also quite different from other teams in the tech industry in terms of its personnel. Facebook has an average employee tenure of 2.6 years; the average OCG team member has worked on the project for at least 5 years. Half of the engineers on the team hold a Ph.D. in a field related to compiler technology. Honestly, I have mixed feelings about this. On the upside, the team is quite effective, in that every member understands the implications of various design choices; team discussions are often direct, to the point, and never lacking in creativity. On the other hand, it suggests the industry demand for highly skilled PL specialists is quite low, which made me question my own career choice as a PL specialist.

I had the opportunity to discuss with Brian whether a Ph.D. degree is worthwhile. Brian holds a Ph.D. from UIUC, where he worked with the IMPACT group. For Brian, the Ph.D. program was a game-changer: it allowed him to switch tracks from EE engineering to software engineering. He believes the benefit of a doctoral degree for a professional career lies mainly in opening up new possibilities: a doctoral degree qualifies the holder for almost any technical work in their field of interest, allowing them to choose work that feels worth the effort. I find this point of view honest and practical.

Parallelism on Its Own Feet

There is one thing that left me deeply unsatisfied during my summer at NVidia. It's not about the company or the job. It's the realization of how insufficient our toolbox feels when we try to reason about parallel programs. Nowadays multi-core parallelism is built into almost every computing device, yet we still treat parallel programs as if they were simply sequential programs extended with a few library calls. I believe it's time for parallel programs to stand on their own feet.

Take the idea of the CFG as an example. In OCG, we represent large-scale parallel programs using CFGs and struggle to prove properties of their parallel behavior (e.g., synchronization correctness) with them. The resulting algorithms are far from optimal and cryptic to understand. It's easy to see why: properties of sequential code can be tied directly to the value of the program counter, so to speak. However, there is no way to describe the state of a parallel program just by looking at the program counters of its threads: how instructions are interleaved across processors also affects the resulting program state. I would argue that the CFG, as an abstraction for parallel programs, is fundamentally lacking.

Our toolchain for parallel programs is also lacking. Debuggers in a parallel setting are confusing: what does single-stepping a parallel program even mean? To make things worse, debuggers are often useless in a parallel environment: simply attaching the debugger perturbs the timing of the events leading up to the crash. Both observations suggest that parallel programs are not a simple extension of sequential programs, and that trying to think about them sequentially simply won't work.

It’s time for parallel programs to move away from their sequential ancestors and stand on their own feet.

[1] Path dependence explains how the set of decisions one faces for any given circumstance is limited by the decisions one has made in the past or by the events that one has experienced, even though past circumstances may no longer be relevant. Dave Praeger, “Our Love Of Sewers: A Lesson in Path Dependence”, 15 June 2008.