GPU Gems 3 is now available for free online!
The CD content, including demos and content, is available on the web and for download.
You can also subscribe to our Developer News Feed to get notifications of new material on the site.
Takeshi Yamanouchi
SEGA Corporation
In this chapter, we take up integer stream processing on the GPU, which has been at best a difficult task to do on the GPU up to now.
Traditionally the GPU has been used almost exclusively for floating-point operations, because integer operations could only be done using the mantissa of floats; thus, processes that required bitwise logical operations were impossible. Another issue was that we had to render the output of the GPU to textures or pixel buffers before we could get to our results. Therefore, to process streaming data, we needed to write tricky graphics code.
However, with the advent of the new GeForce 8 Series GPU, several new extensions and functions have been introduced to GPU programming. First, the new integerprocessing features include not only the arithmetic operations but also the bitwise logical operations (such as AND and OR) and the right/left shift operations. Second, array parameters and the new texture-buffer object provide a flexible way of referring to integer-indexed tables. And finally, with the new "transform feedback mode," it is now possible to store our results without the need to render to textures or pixel buffers.
In this chapter we present an application of these new features by implementing Advanced Encryption Standard (AES) (NIST 2001) encryption and decryption on the GPU. Unlike previous attempts (Cook et al. 2005) and precisely thanks to these new features, our implementation shows slight performance gains over CPU implementations.
In addition, within the AES system, we consider several block-cipher modes of operation (Dworkin 2001) that demonstrate the practical cases where the GPU can be used to optimize performance through parallel processing, and other cases where it cannot.
For our implementation of the AES encryption system, we use the new OpenGL extensions provided by NVIDIA in the GeForce 8 Series (NVIDIA 2006, OpenGL.org 2007). In this section we briefly introduce some of these extensions.
Transform feedback mode is a new OpenGL mode of operation. In this mode the GPU can send its output to a buffer before the rasterization stage. This way we can store the output of the shaders into buffer objects as vertex attributes. Figure 36-1 illustrates the process.
Figure 36-1 The Transform Feedback Mode Processing Pipeline
Rendering in transform feedback mode is like regular rendering from buffer objects except for the following differences:
We use two of the new features introduced in the assembly instruction set.
First, when declaring a register, we can either specify its type, such as FLOAT or INT, or just leave it typeless. We can also specify a type for each executed instruction that declares how the instruction should interpret its operands. This is done by using the type indicators following the opcodes. These type indicators are .F for float, .S for signed int, and .U for unsigned int:
TEMP r0, r1, r2; // typeless registers
MOV.U r0, 0x3f800000; // r0 = 1.0f
MAD.F r1, r0, 2, -1; // r1 = 1.0f
MAD.S r2, r0, 2, -1; // r2 = 0x7effffff
Second, we can refer to tables using an integer index, array parameters, or one of the newly introduced texture-buffer objects.
INT TEMP r0, r1;
MOV.S r0.x, 10;
MOV.S r0, 0; // integer index
MOV.S r1.x, table[r0.x].x; // array parameter
TXF.S r1.x, r0.x, texture[1], BUFFER; // texture-buffer object
On current GeForce 8 Series GPUs, the texture-buffer-object approach generally yields better performance than using array parameters.
The AES algorithm is currently the standard block-cipher algorithm that has replaced the Data Encryption Standard (DES). Back in 1997 the National Institute of Standards and Technology (NIST) made a public call for new cipher algorithms that could replace the DES. A rough summary of the requirements made by NIST for the new AES were the following:
Finally in October 2000, the Rijndael algorithm was chosen as the basis for the new standard encryption algorithm (Hironobu 2001). The original Rijndael algorithm also supported both fixed-size and variable-size bit cipher blocks. However, currently the Federal Information Processing Standards specification for the AES algorithm supports only the fixed-size, 128-bit blocks.
The operation of the AES algorithm is shown in Figure 36-2. The encryption step uses a key that converts the data into an unreadable ciphertext, and then the decryption step uses the same key to convert the ciphertext back into the original data. This type of key is a symmetric key; other algorithms require a different key for encryption and decryption.
Figure 36-2 AES Cipher Operation
The precise steps involved in the algorithm can be seen in Figure 36-3. The process is relatively simple, but some brief cryptographic explanations are necessary to understand what is going on. In cryptography, algorithms such as AES are called product ciphers. For this class of ciphers, encryption is done in rounds, where each round's processing is accomplished using the same logic. Moreover, many of these product ciphers, including AES, change the cipher key at each round. Each of these round keys is determined by a key schedule, which is generated from the cipher key given by the user.
Figure 36-3 An Illustration of the AES Algorithm
Generally speaking, the strength of an encryption by product ciphers can be heightened by increasing the number of rounds used to process the data. The AES standard specifies that the number of rounds is determined by the length of the cipher key, as shown in Figure 36-4.
Figure 36-4 Key Length and the Number of Rounds
Now that we know what the AES algorithm is supposed to do, let's see what its implementation looks like as a vertex program. The code given throughout this chapter uses C-style macros and comments to improve readability of the assembly language. These—like those in the ROT8 macro shown in Listing 36-1—get filtered out using the standard C preprocessor.
!!NVvp4.0
/*
* aes.vpt -- AES encryption and decryption
* Author Yamanouchi_Takeshi@sega.co.jp
*/
// input, output
ATTRIB state_in = vertex.attrib[0];
OUTPUT state_out = result.texcoord[0];
// macros
INT TEMP _tmp0, _tmp1; // macro work
#define ROT8(_arg) \
SHL.U _tmp0, _arg, 8; \
SHR.U _arg, _arg, 24; \
OR _arg, _arg, _tmp0
In this application we expand the cipher key using the CPU and store the key schedule in the GPU program-local parameters. Besides the key schedule, other arguments and tables can also be stored directly in advance to improve performance, as shown in Listing 36-2.
// parameters
// cipher mode, 0:encrypt, 1:decrypt
INT PARAM mode = program.local[MODE];
// number of rounds
INT PARAM num_round = program.local[NUM_ROUND];
// key schedule for encryption
INT PARAM enc_key[15] = { program.local[ENC_KEY_BEGIN..ENC_KEY_END] };
// key schedule for decryption
INT PARAM dec_key[15] = { program.local[DEC_KEY_BEGIN..DEC_KEY_END] };
AES encryption operates over a two-dimensional array of bytes, called the state. During the input step, we slice our data into sequential blocks of 16 bytes and unpack it into 4x4 arrays that we push onto the GPU's registers. Finally, during the output step, we pack these 4x4 arrays back into sequential blocks of 16 bytes and stream the results back to the transform feedback buffer, as shown in Figure 36-5.
Figure 36-5 Input, State, Output, and Register Assignment for the Streamed Data
The code fragments in Listing 36-3 go into the details for these steps. First we define our packing and unpacking methods as macros. Then we have an unpack_state_in: subroutine, which unpacks the input into the current state matrix, and a pack_state_out: subroutine, which packs the current state matrix back to the output stream.
The registers for state_in and state_out consist of four components of 32-bit integers each, giving us the standard AES block size of 128 bits for processing.
// registers
INT TEMP s0, s1, s2, s3; // state
#define UNPACK(_res, _arg) \
SHR _res.w, _arg, 24; \
SHR _res.z, _arg, 16; \
SHR _res.y, _arg, 8; \
MOV.U _res.x, _arg; \
AND _res, _res, 0xff
#define PACK(_res, _arg) \
SHL _tmp0.w, _arg.w, 24; \
SHL _tmp0.z, _arg.z, 16; \
SHL _tmp0.y, _arg.y, 8; \
OR _tmp0.w, _tmp0.w, _tmp0.z; \
OR _tmp0.w, _tmp0.w, _tmp0.y; \
OR _res, _tmp0.w, _arg.x
unpack_state_in:
// arg0: round key
XOR tmp, state_in, arg0;
UNPACK(s0, tmp.x);
UNPACK(s1, tmp.y);
UNPACK(s2, tmp.z);
UNPACK(s3, tmp.w);
RET;
pack_state_out:
// arg0: round key
PACK(tmp.x, s0);
PACK(tmp.y, s1);
PACK(tmp.z, s2);
PACK(tmp.w, s3);
XOR state_out, tmp, arg0;
RET;
During the initialization stage, we do an AddRoundKey operation, which is an XOR operation on the state by the round key, as determined by the key schedule. In our case, we perform unpacking and round key addition together in the unpack_state_in: subroutine.
A round for the AES algorithm consists of four operations: the SubBytes operation, the ShiftRows operation, the MixColumns operation, and the previously mentioned AddRoundKey operation.
The SubBytes operation substitutes bytes independently, in a black-box fashion, using a nonlinear substitution table called the S-box, as shown in Figure 36-6.
Figure 36-6 The SubBytes Operation
The S-box is a uniform table calculated in advance and stored into the texture-buffer objects as texture[] (see the code in Listing 36-2). In our implementation we store the encryption table in texture[1] and the transformation table used for decryption in texture[2], as you can see in Listing 36-4.
sub_bytes_shift_rows:
// first row
TXF.U s0.x, s0.x, texture[1], BUFFER; // texture[1]: s_box
TXF.U s1.x, s1.x, texture[1], BUFFER;
TXF.U s2.x, s2.x, texture[1], BUFFER;
TXF.U s3.x, s3.x, texture[1], BUFFER;
The ShiftRows operation shifts the last three rows of the state cyclically, effectively scrambling row data, as shown in Figure 36-7.
Figure 36-7 The ShiftRows Operation
One simple optimization we can do is to combine the SubBytes and ShiftRows operations into a single subroutine that we call sub_bytes_shift_rows:. The continuation of the code from the previous section that combines both operations as one is shown in Listing 36-5.
// second row
TXF.U tmp, s0.y, texture[1], BUFFER;
TXF.U s0.y, s1.y, texture[1], BUFFER;
TXF.U s1.y, s2.y, texture[1], BUFFER;
TXF.U s2.y, s3.y, texture[1], BUFFER;
MOV.U s3.y, tmp;
// third row
TXF.U tmp, s0.z, texture[1], BUFFER;
TXF.U s0.z, s2.z, texture[1], BUFFER;
MOV.U s2.z, tmp;
TXF.U tmp, s1.z, texture[1], BUFFER;
TXF.U s1.z, s3.z, texture[1], BUFFER;
MOV.U s3.z, tmp;
// fourth row
TXF.U tmp, s3.w, texture[1], BUFFER;
TXF.U s3.w, s2.w, texture[1], BUFFER;
TXF.U s2.w, s1.w, texture[1], BUFFER;
TXF.U s1.w, s0.w, texture[1], BUFFER;
MOV.U s0.w, tmp;
RET;
The next step is the MixColumns operation, which has the purpose of scrambling the data of each column. This operation is done by performing a matrix multiplication upon each column vector, as shown in Figure 36-8.
Figure 36-8 The MixColumns Operation
Because the multiplication is over a finite field of the AES, we cannot use the usual MUL operation. The closure of the finite field is within a byte range. So we store the results of the multiplication by (0x2, 0x1, 0x1, 0x3) in texture[3]. The resulting bytes are packed into a 32-bit integer in little-endian form, such as 0x03010102.
The resulting transformation matrix is shifted by columns; therefore, the results are rotated by the ROT8(). . ROT24() macros. Finally, we add them together by XOR'ing and unpacking into the state, as seen in Listing 36-6.
mix_columns_add_round_key:
// arg0: round key
TXF.U mix0.x, s0.x, texture[3], BUFFER; // texture[3]: mix_col
TXF.U mix0.y, s1.x, texture[3], BUFFER;
TXF.U mix0.z, s2.x, texture[3], BUFFER;
TXF.U mix0.w, s3.x, texture[3], BUFFER;
TXF.U mix1.x, s0.y, texture[3], BUFFER;
. . .
TXF.U mix3.w, s3.w, texture[3], BUFFER;
ROT8(mix1);
ROT16(mix2);
ROT24(mix3);
XOR tmp, arg0, mix0; // add round key
XOR tmp, tmp, mix1;
XOR tmp, tmp, mix2;
XOR tmp, tmp, mix3;
UNPACK(s0, tmp.x);
UNPACK(s1, tmp.y);
UNPACK(s2, tmp.z);
UNPACK(s3, tmp.w);
RET;
During decryption, the multiplication results are stored in texture[4]. The coefficients are (0xe, 0x9, 0xd, 0xb).
In brief, all components of texture[] at this point are as follows:
This operation determines the current round key from the key schedule, where the register arg0 serves as the argument. As an optimization we can also combine the MixColumns and AddRoundKey operations into a single subroutine named mix_columns_add_round_key:.
The code in Listing 36-7 is the encryption loop. Note that the cost of control flow operations has significantly decreased in the GeForce 8 Series: so much so that unrolling the loop—which would have further complicated the code—is unnecessary to obtain good performance.
encrypt:
// input & first round
MOV.U arg0, enc_key[0];
CAL unpack_state_in;
// loop round
SUB.S cnt.x, num_round.x, 1;
MOV.S cnt.y, 1;
REP.S cnt.x;
CAL sub_bytes_shift_rows;
MOV.U arg0, enc_key[cnt.y];
CAL mix_columns_add_round_key;
ADD.S cnt.y, cnt.y, 1;
ENDREP;
// final round & output
CAL sub_bytes_shift_rows;
MOV.U arg0, enc_key[cnt.y];
CAL pack_state_out;
RET;
The final round has no MixColumns operation. The AddRoundKey operation is combined with packing into a single subroutine called pack_state_out: in a similar fashion to what we did for initialization.
Now that we have a working AES implementation, let us measure the performance of GPU-based encryption. The decryption is omitted because it performs the same as the encryption in the AES algorithm. Our tests were performed on a test machine with the following specifications:
We compared the performance of the vertex program in the transform feedback mode pipeline with that of the fragment program in the traditional rendering pipeline. The fragment program is the same code as the vertex program except that input/output registers were redefined appropriately, exploiting the GPU's unified architecture.
Our results were obtained by processing a plaintext of 128 MB filled with random numbers and averaging measurements from ten runs. As illustrated in Figure 36-9, the throughput for the vertex program is 53 MB/sec, whereas for the fragment program, the throughput is 95 MB/sec with a batch size of 1 MB. Our implementation spends most of its processing time in referencing tables—in other words, fetching textures.
Figure 36-9 Vertex Program vs. Fragment Program
The openssl command has benchmarks for determining the speed of their cipher algorithms on the CPU. We measured the speed of these CPU-based encryptions on the same test machine, and we got results of about 55 MB/sec using the same AES algorithm. Compared to the speed of the CPU-based implementations, our GPU-based encryption is thus almost the same speed for the vertex program, and about 1.7 times faster for the fragment program. The transform feedback mode makes it easy to write code for data streaming, but it does not perform as well for now.
There are some considerations to be made based on the mode of operation for our block ciphers that affect suitability for parallelism. In this section, we look at a few different modes and how they may affect suitability for parallel processing.
The encrypting operation of each block is called the electronic code book (ECB) mode. In ECB encryption, each block is operated upon independently, as shown in Figure 36-10.
Figure 36-10 ECB Mode Encryption and Decryption
This mode of operation makes it possible to compute ciphertexts in parallel, and it was the mode of operation used for all the tests in Section 36.4.
In ECB mode, the same bit blocks will always generate the same ciphertext. But although the ciphertext is not directly readable, there is a possibility that the content can be guessed based on the resulting pattern. Figure 36-11 shows a 256x256 texture encrypted in ECB mode, and although it cannot be directly read, there is an evident pattern in the cipher texture.
Figure 36-11 Plain Texture and Cipher Texture in ECB Mode
The cipher-block chaining (CBC) mode overcomes the ECB mode's weakness by chaining each cipher into the next block. This is done by chaining (XOR'ing) our current plaintext block with the previously encrypted block. The first block has no previous block, so we use an initial vector (IV) instead that is passed during initialization along with the symmetric key for decryption. See Figure 36-12.
Figure 36-12 CBC Mode Encryption and Decryption
In CBC mode, because plaintexts are scrambled before encryption, the ciphertext pattern never appears in the final result. So we cannot see the "AES" letters in Figure 36-13 as we could in Figure 36-11.
Figure 36-13 The Cipher Texture in CBC Mode
However, because CBC mode needs the ciphertexts of each previous step to process the next step, it is not possible to begin the encryption of a block until its previous block has been encrypted. So we can't hope for parallel processing during the encryption stage of this mode.
As we just mentioned, in CBC mode, encryption cannot be processed in parallel because it requires the results of each previous step. However, fortunately we can decrypt our results using parallel processing. This is because after encryption, we already know the states of all the previous ciphertext blocks and the IV needed for decryption.
The gain to be obtained here is that we often need to do more decryption than encryption, and many times decryption is required to have higher throughput speeds, such as when preprocessing and storing cipher textures encrypted in CBC mode into distribution files; these files can then be decrypted in parallel during loading without any noticeable delays.
Nevertheless, there are cases when fast encryption is highly desirable, such as for secure communications, where the parallel processing of the encryption stage would greatly speed up the overall performance. In these cases, counter (CTR) mode can be used. The CTR algorithm's procedure is shown in Figure 36-14. First, we make the input block by indexing the current block, called a counter block. We then encrypt it using the symmetric key, and XOR'ing with the original plaintext block.
Figure 36-14 CTR Mode Encryption and Decryption
The decryption for CTR can be done following the same steps. Thus we can encrypt and decrypt each cipher block independently, giving us the benefit of true parallelization.
Thanks to new capabilities added to the GPU, we can now truly perform integer stream processing. As a showcase, we presented an implementation for an AES system that relies heavily on integer processing.
Our implementation is OpenGL-based, but another way to program the latest generation of NVIDIA GPUs is to use CUDA (NVIDIA 2007). As future work, we plan to implement the same algorithm using CUDA and compare performance against the OpenGL implementation.
Another direction of future work is to find other novel areas besides AES encryption that could benefit from GPU parallel processing power.
Cook, Debra L., John Ioannidis, Angelos D. Keromytis, and Jake Luck. 2005. "CryptoGraphics: Secret Key Cryptography Using Graphics Cards." In RSA Conference, Cryptographer's Track (CT-RSA), pp. 334–350.
Dworkin, M. 2001. "Recommendation for Block Cipher Modes of Operation: Methods and Techniques." NIST Special Publication 800-38A. Available online at http://csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf.
Hironobu, Suzuki. 2001. "Memorandum of DES and AES." Available online (in Japanese only) at http://h2np.net/hironobu/docs/DES_and_AES.pdf.
NIST. 2001. "Advanced Encryption Standard (AES)." FIPS Publication 197. Available online at http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
NVIDIA Corporation. 2006. "NVIDIA OpenGL Extension Specifications for the GeForce 8 Series Architecture (G8x)." Available online at http://developer.download.nvidia.com/opengl/specs/g80specs.pdf.
NVIDIA Corporation. 2007. "NVIDIA Compute Unified Device Architecture." http://developer.nvidia.com/cuda/.
OpenGL.org. 2007. "OpenGL Extension Registry." http://www.opengl.org/registry/.