**GPU Gems 3** is now available for free online!

The CD content, including demos and content, is available on the web and for download.

You can also subscribe to our Developer News Feed to get notifications of new material on the site.

*Fan Zhang
The Chinese University of Hong Kong*

*Hanqiu Sun
The Chinese University of Hong Kong*

*Oskari Nyman
Helsinki University of Technology*

Shadow mapping (Williams 1978) has been used extensively in 3D games and applications for producing shadowing effects. However, shadow mapping has inherent aliasing problems, so using standard shadow mapping is not enough to produce high-quality shadow rendering in complex and large-scale scenes. In this chapter we present an advanced shadow-mapping technique that produces antialiased and real-time shadows for large-scale environments. We also show the implementation details of our technique on modern programmable GPUs.

The technique we present is called *parallel-split shadow maps* (PSSMs) (Zhang et al. 2007 and Zhang et
al. 2006). In this technique the view frustum is split into multiple depth layers using clip planes parallel to
the view plane, and an independent shadow map is rendered for each layer, as shown in
Figure 10-1.

Figure 10-1 Parallel-Split Shadow Maps in

The split scheme is motivated by the observation that points at different distances from the viewer need different shadow-map sampling densities. By splitting the view frustum into parts, we allow each shadow map to sample a smaller area, so that the sampling frequency in texture space is increased. With a better matching of sampling frequencies in view space and texture space, shadow aliasing errors are significantly reduced.

In comparison with other popular shadow-mapping techniques, such as *perspective shadow maps* (PSMs)
(Stamminger and Drettakis 2002), *light-space perspective shadow maps* (LiSPSMs) (Wimmer et al. 2004) and
*trapezoidal shadow maps* (TSMs) (Martin and Tan 2004), PSSMs provide an intuitive way to discretely
approximate the warped distribution of the shadow-map texels, but without mapping singularities and special
treatments.

The idea of using multiple shadow maps was introduced in Tadamura et al. 2001. It was further studied in Lloyd et al. 2006 and implemented as cascaded shadow mapping in Futuremark's benchmark application 3DMark 2006.

The two major problems with all these algorithms, including our PSSMs, are the following:

- How do we determine the split positions?
- How do we alleviate the performance drop caused by multiple rendering passes when generating shadow maps and rendering the scene shadows?

These two problems are handled well in PSSMs. We use a fast and robust scene-independent split strategy to adapt the view-driven resolution requirement, and we show how to take advantage of DirectX 9-level and DirectX 10-level hardware to improve performance.

In this chapter we use a directional light to illustrate the PSSMs technique for clarity, but the
implementation details are valid for point lights as well. The notation PSSM(*m*, *res*) denotes
the split scheme that splits the view frustum into *m* parts; *res* is the resolution for each
shadow map. Figure 10-2 shows the general
configuration for the split scheme with an overhead light, in which the view frustum *V* is split into
{*V _{i}* | 0

Figure 10-2 The Configuration for the Split Scheme with an Overhead Light

We apply the split-scheme PSSM(*m*, *res*) in the following steps:

- Split the view frustum
*V*into*m*parts {*V*} using split planes at {_{i}*C*}._{i} - Calculate the light's view-projection transformation matrix for each split part
*V*._{i} - Generate PSSMs {
*T*} with the resolution_{i}*res*for all split parts {*V*}._{i} - Synthesize shadows for the scene.

The main problem in step 1 is determining where the split planes are placed. Although users can adjust the
split positions according to an application-specific requirement, our implementations use the
*practical split scheme* proposed in Zhang et al. 2006.

Before we introduce this split scheme, we briefly review the aliasing problem in shadow mapping, which has been extensively discussed in Stamminger and Drettakis 2002, Wimmer et al. 2004, and Lloyd et al. 2006.

In Figure 10-3, the light beams passing through a
texel (with the size *ds*) fall on a surface of the object with the length *dz* in world space.
From the standard projection matrix in Direct3D, we know that the size of view beams *dp* (on the
normalized screen) projected from the surface is *dy*(*z* tan )^{-1}, where
2 is the field-of-view of the view frustum. From the local view of the surface in
Figure 10-3, we get the approximation *dy*
*dz* cos *f*/cos , where and stand for the angles between the surface
normal and vector to the screen, and the shadow-map plane, respectively.

Figure 10-3 Shadow Map Aliasing

The aliasing error for the small surface is then defined as

We usually decompose this aliasing representation *dp*/*ds* into two parts:
*perspective aliasing dz*/*zds* and *projection aliasing* cos /cos (note
that is a constant for a given view matrix). Shadow-map undersampling can occur when perspective
aliasing or projection aliasing becomes large.

Projection aliasing depends on local geometry details, so the reduction of this kind of aliasing requires an expensive scene analysis at each frame. On the other hand, perspective aliasing comes from the perspective foreshortening effect and can be reduced by warping the shadow plane, using a perspective projection transformation. More important, perspective aliasing is scene independent, so the reduction of perspective aliasing doesn't require complicated scene analysis.

The practical split scheme is based on the analysis of perspective aliasing.

In the practical split scheme, the split positions are determined by this equation:

where {
} and {
} are the split positions in the *logarithmic split scheme* and the *uniform split scheme*,
respectively. Theoretically, the logarithmic split scheme is designed to produce optimal distribution of
perspective aliasing over the whole depth range. On the other hand, the aliasing distribution in the uniform
split scheme is the same as in standard shadow maps.
Figure 10-4 visualizes the three split schemes in the
shadow-map space.

Figure 10-4 Visualization of the Practical Split Scheme in Shadow-Map Space

The theoretically optimal distribution of perspective aliasing makes *dz*/*zds* constant over the
entire depth range. As shown in Wimmer et al. 2004, the following simple derivation gives the optimal shadow-map
parameterization *s*(*z*):

where is a constant. From *s*(*f*) = 1, we have = ln(*f*/*n*).
The only nonlinear transform supported by current hardware is the perspective-projection transformation
(*s* = *A*/*z* + *B*), so in order to approximate the previous logarithmic mapping
from *z* to *s*, the logarithmic split scheme discretizes it at several depth layers z =
as in the following equation:

Because the logarithmic split scheme is designed to produce the theoretically even distribution of perspective
aliasing, the resolution allocated for each split should be 1/*m* of the overall texture resolution.
Substituting *s _{i}* =

The main drawback of the logarithmic split scheme is that, in practice, the lengths of split parts near the
viewer are too small, so few objects can be included in these split parts. Imagine the situation in the
split-scheme PSSM(3)—that is, the view frustum is split into three parts. With *f* = 1000 and *n*
= 1, the first split *V* _{0} and second split *V* _{1} occupy only 1 percent and
10 percent of the view distance. Oversampling usually occurs for parts near the viewer, and undersampling occurs
for parts farther from the viewer. As a result, in practice, the logarithmic split scheme is hard to use
directly.

The uniform split scheme simply places the split planes evenly along the view direction:

The aliasing distribution in the uniform split scheme is the same as that in standard shadow maps. Because
*s* = (*z*- *n*)/(*f* - *n*) in standard shadow maps, by ignoring the
projection aliasing, we have

This means the aliasing error in standard shadow maps increases hyperbolically as the object moves near the view plane. Like standard shadow mapping, the uniform split scheme results in undersampling for objects near the viewer and oversampling for distant objects.

Because the logarithmic and uniform split schemes can't produce appropriate sampling densities for the near and far depth layers simultaneously, the practical split scheme integrates logarithmic and uniform schemes to produce moderate sampling densities over the whole depth range. The weight adjusts the split positions according to practical requirements of the application. = 0.5 is the default setting in our current implementation. In Figure 10-4, the practical split scheme produces moderate sampling densities for the near and far split parts. An important note for Figure 10-4: the colorized sampling frequency in each split scheme is for schematic illustration only, not for precisely visualizing the aliasing distribution in theory. Readers interested in the (perspective) aliasing distributions in multiple depth layers can refer to Lloyd et al. 2006.

Before splitting the view frustum, we might adjust the camera's near and far planes so that the view frustum contains the visible objects as tightly as possible. This will minimize the amount of empty space in the view frustum, increasing the available shadow-map resolution.

As in standard shadow mapping, we need to know the light's view and projection matrices when generating the
shadow map. As we split the light's frustum *W* into multiple subfrusta {*W _{i}* }, we
will need to construct an independent projection matrix for each

We present two methods to calculate these matrices:

- The scene-independent method simply bounds the light's frustum
*W*to the split frustum_{i}*V*, as seen in Figure 10-5. Because information about the scene is not used, the usage of shadow-map resolution might not be optimal._{i}Figure 10-5 Calculation of the Light's Projection Matrix for the Split Part .

- The scene-dependent method increases the available texture resolution by focusing the light's frustum
*W*on the objects that potentially cast shadows to the split frustum_{i}*V*, as seen later in Figure 10-6._{i}Figure 10-6 Comparing Projection Methods

Before calculating the projection matrix for the subfrustum *W _{i}* , we must find the

To simplify the computations, we determine the axis-aligned bounding box of the split frustum in the light's
clip space, using the function `CreateAABB()`. Note that we set the bounding box's minimum z-value to 0
because we want the near plane position to remain unchanged. We want this position unchanged because we might
have shadow casters between the near plane of the split's bounding box and the light's near plane. This case is
illustrated by the "caster" object in Figure 10-5.

Afterward, we build a *crop* matrix to effectively "zoom in" the light's frustum *W _{i}*
to the bounding box computed by

```
// Build a matrix for cropping light's projection
// Given vectors are in light's clip space Matrix
Light::CalculateCropMatrix(Frustum splitFrustum)
{
Matrix lightViewProjMatrix = viewMatrix * projMatrix;
// Find boundaries in light's clip space
BoundingBox cropBB = CreateAABB(splitFrustum.AABB, lightViewProjMatrix);
// Use default near-plane value
cropBB.min.z = 0.0f;
// Create the crop matrix
float scaleX, scaleY, scaleZ;
float offsetX, offsetY, offsetZ;
scaleX = 2.0f / (cropBB.max.x - cropBB.min.x);
scaleY = 2.0f / (cropBB.max.y - cropBB.min.y);
offsetX = -0.5f * (cropBB.max.x + cropBB.min.x) * scaleX;
offsetY = -0.5f * (cropBB.max.y + cropBB.min.y) * scaleY;
scaleZ = 1.0f / (cropBB.max.z - cropBB.min.z);
offsetZ = -cropBB.min.z * scaleZ;
return Matrix(scaleX, 0.0f, 0.0f, 0.0f, 0.0f, scaleY, 0.0f, 0.0f, 0.0f, 0.0f,
scaleZ, 0.0f, offsetX, offsetY, offsetZ, 1.0f);
}
```

Finally, we use the light's view-projection transformation
`lightViewMatrix * lightProjMatrix * cropMatrix` when generating the shadow map texture
*T _{i}* .

Note that instead of computing a crop matrix in the light's clip space, we can calculate the split-specific projection matrix in other spaces, such as the light's view space. However, because both point and directional lights are converted into directional lights in the light's clip space, our method works the same way for directional and point lights. Furthermore, calculating the bounding box in the light's clip space is more intuitive.

Although using a scene-independent projection works well in most cases, we can further optimize the usage of the texture resolution by taking the scene geometries into account.

As we can see in Figure 10-6, a scene-independent
*W _{i}* may contain a large amount of empty space. Each

In the scene-dependent method, the near plane of *W _{i}* is moved to bound either

The final light's view-projection transformation matrix for the current split is still
`lightViewMatrix * lightProjMatrix * cropMatrix`.

```
Matrix Light::CalculateCropMatrix(ObjectList casters, ObjectList receivers,
Frustum frustum)
{
// Bounding boxes
BoundingBox receiverBB, casterBB, splitBB;
Matrix lightViewProjMatrix = viewMatrix * projMatrix;
// Merge all bounding boxes of casters into a bigger "casterBB".
for (int i = 0; i < casters.size(); i++)
{
BoundingBox bb = CreateAABB(casters[i]->AABB, lightViewProjMatrix);
casterBB.Union(bb);
}
// Merge all bounding boxes of receivers into a bigger "receiverBB".
for (int i = 0; i < receivers.size(); i++)
{
bb = CreateAABB(receivers[i]->AABB, lightViewProjMatrix);
receiverBB.Union(bb);
}
// Find the bounding box of the current split
// in the light's clip space.
splitBB = CreateAABB(splitFrustum.AABB, lightViewProjMatrix);
// Scene-dependent bounding volume
BoundingBox cropBB;
cropBB.min.x = Max(Max(casterBB.min.x, receiverBB.min.x), splitBB.min.x);
cropBB.max.x = Min(Min(casterBB.max.x, receiverBB.max.x), splitBB.max.x);
cropBB.min.y = Max(Max(casterBB.min.y, receiverBB.min.y), splitBB.min.y);
cropBB.max.y = Min(Min(casterBB.max.y, receiverBB.max.y), splitBB.max.y);
cropBB.min.z = Min(casterBB.min.z, splitBB.min.z);
cropBB.max.z = Min(receiverBB.max.z, splitBB.max.z);
// Create the crop matrix.
float scaleX, scaleY, scaleZ;
float offsetX, offsetY, offsetZ;
scaleX = 2.0f / (cropBB.max.x - cropBB.min.x);
scaleY = 2.0f / (cropBB.max.y - cropBB.min.y);
offsetX = -0.5f * (cropBB.max.x + cropBB.min.x) * scaleX;
offsetY = -0.5f * (cropBB.max.y + cropBB.min.y) * scaleY;
scaleZ = 1.0f / (cropBB.max.z – cropBB.min.z);
offsetZ = -cropBB.min.z * scaleZ;
return Matrix(scaleX, 0.0f, 0.0f, 0.0f, 0.0f, scaleY, 0.0f, 0.0f, 0.0f, 0.0f,
scaleZ, 0.0f, offsetX, offsetY, offsetZ, 1.0f);
}
```

Steps 3 and 4 can be implemented differently, depending on your system hardware. Because we use multiple shadow maps in PSSMs, extra rendering passes may be needed when we generate shadow maps and synthesize shadows.

To reduce this burden on rendering performance, we could use a hardware-specific approach. In the next section we present three hardware-specific methods for generating PSSMs and synthesizing shadows:

- The multipass method (without hardware acceleration): This method doesn't use hardware acceleration; multiple rendering passes are required for generating shadow maps (step 3) and for synthesizing shadows (step 4).
- Partially accelerated on DirectX 9-level hardware (for example, Windows XP and an NVIDIA GeForce 6800 GPU): In this method, we remove extra rendering passes for synthesizing scene shadows.
- Fully accelerated on DirectX 10-level hardware (for example, Windows Vista and a GeForce 8800 GPU): Here we alleviate the burden of extra rendering passes for generating shadow maps. To achieve this goal, we describe two different approaches: geometry shader cloning and instancing.

In all three methods, the implementations for splitting the view frustum and calculating the light's transformation matrices are the same as described earlier.

To make it easier to understand our three hardware-specific implementations, we visualize the rendering pipelines of different implementations, as in Figure 10-7.

Figure 10-7 Visualization of Rendering Pipelines

In the first implementation of PSSMs, we simply render each shadow map, with shadows in each split part, sequentially.

Once we have the light's view-projection transformation matrix
`lightViewMatrix * lightProjMatrix * cropMatrix` for the current split part, we can render the associated
shadow-map texture. The procedure is the same as for standard shadow mapping: we render all shadow casters to
the depth buffer. Our implementation uses a 32-bit floating-point (R32F) texture format.

In our multipass implementation, instead of storing PSSMs in several independent shadow-map textures, we reuse a single shadow-map texture in each rendering pass, as shown in Listing 10-3.

```
for (int i = 0; i < numSplits; i++)
{
// Compute frustum of current split part.
splitFrustum = camera->CalculateFrustum(splitPos[i], splitPos[i + 1]);
casters = light->FindCasters(splitFrustum);
// Compute light's transformation matrix for current split part.
cropMatrix = light->CalculateCropMatrix(receivers, casters, splitFrustum);
splitViewProjMatrix = light->viewMatrix * light->projMatrix * cropMatrix;
// Texture matrix for current split part
textureMatrix = splitViewProjMatrix * texScaleBiasMatrix;
// Render current shadow map.
ActivateShadowMap();
RenderObjects(casters, splitViewProjMatrix);
DeactivateShadowMap();
// Render shadows for current split part.
SetDepthRange(splitPos[i], splitPos[i + 1]);
SetShaderParam(textureMatrix);
RenderObjects(receivers, camera->viewMatrix * camera->projMatrix);
}
```

With PSSMs generated in the preceding step, shadows can now be synthesized into the scene. In the multipass method, we need to synthesize the shadows immediately after rendering the shadow map.

For optimal performance, we should render the splits from front to back. However, some special considerations are needed. We need to adjust the camera's near and far planes to the respective split positions to provide clipping, because the objects in one split may also overlap other splits. However, as we modify the near and far planes, we are also changing the range of values written to the depth buffer. This would make the depth tests function incorrectly, but we can avoid this problem by using a different depth range in the viewport. Simply convert the split-plane positions from view space to clip space and use them as the new depth minimum and maximum.

After this we can render the shadow receivers of the scene while sampling the shadow map in the standard manner. This means that implementing the multipass method doesn't require the use of shaders. Figure 10-8 compares SSM(2Kx2K) and multipass PSSM(3; 1Kx1K). However, as the method's name implies, multiple rendering passes will always be performed for both generating shadow maps and rendering scene shadows.

Figure 10-8 Comparison of SSM and Multipass PSSM

When implementing the simple multipass method as presented, we need to use multiple rendering passes to synthesize shadows in the scene. However, with programmable GPUs, we can perform this step in a single rendering pass.

We can create a pixel shader to find and sample the correct shadow map for each fragment. The operation for
determining which split a fragment is contained in is simple because the split planes are parallel to the view
plane. We can simply use the depth value—or, as in our implementation, use the view space *z* coordinate
of a fragment—to determine the correct shadow map.

Because we render the scene shadows in a single pass, we need to have access to all the shadow maps simultaneously. Unlike our previous implementation, this method requires that shadow maps be stored separately, so we create an independent texture for each shadow map.

Note that the number of texture samplers supported by the hardware will limit the number of shadow maps we can use. We could, alternatively, pack multiple shadow maps in a single texture; however, most DirectX 9-level hardware supports eight texture samplers, which should be more than enough.

Also note that we need to use the "border color" texture address mode, because sometimes shadow maps are
sampled outside the texture-coordinate range. This could occur when we use a scene-dependent projection, because
the light's subfrustum *W _{i}* does not necessarily cover all the receivers.

When we render the scene shadows, we use a custom vertex and pixel shader, as shown in Listing 10-4. In the vertex shader, we transform each vertex as usual. However, we also calculate (1) the view-space position of the vertex, because we will use it for determining the shadow maps associated with fragments; and (2) the position of the vertex in the texture space of each shadow map. We determine the position of the vertex by multiplying the vertex with a textureMatrix. It is the same approach we use for standard shadow mapping, only now we use it for all shadow maps. We output all the transformed coordinates into texture-coordinate registers for the pixel shader to use.

In the pixel shader, we loop through all splits and test whether the view-space distance of the fragment is less than the end position of the split. After determining the split index, we sample the associated shadow map with the associated texture coordinates and perform a standard shadow-map depth test.

```
sampler2D shadowMapSampler[numSplits];
void VS_RenderShadows(in float4 pos : POSITION, // Object-space coordinates
out float4 posOut : POSITION, // Clip-space coordinates
out float4 texCoord[numSplits + 1] : TEXCOORD) // Texture coordinates
{
// Calculate world position.
float4 posWorld = mul(pos, worldMatrix);
// Transform vertex.
posOut = mul(posWorld, viewProjMatrix);
// Store view-space position in the first texture
// coordinate register.
texCoord[0] = mul(posWorld, viewMatrix);
// Store shadow-map coordinates in the remaining
// texture coordinate registers.
for (int i = 0; i < numSplits; i++)
{
texCoord[i + 1] = mul(posWorld, textureMatrix[i]);
}
}
float4 PS_RenderShadows(float4 texCoord[numSplits + 1] : TEXCOORD) : COLOR
{
float light = 1.0f;
// Fetch view-space distance from first texture coordinate register.
float distance = texCoord[0].z;
for (int i = 0; i < numSplits; i++)
{
if (distance < splitPlane[i])
{
float depth = texCoord[i + 1].z / texCoord[i + 1].w;
float depthSM = tex2Dproj(shadowMapSampler[i], texCoord[i + 1]);
// Standard depth comparison
light = (depth < depthSM) ? 1.0f : 0.0f;
break;
}
}
return light;
}
```

We implemented this partially accelerated method in both DirectX 9 and OpenGL. In DirectX 9, we implemented
*percentage-closer filtering* (PCF) (Reeves et al. 1987) in the pixel shader (approximately 25
instructions), and in OpenGL we used hardware-accelerated PCF (one instruction). We compared the performance to
the multipass method when we increased the number of objects in the scene. In all tests, the number of splits
was four.

From the results shown in Figure 10-9, we can see that the DirectX 9-level acceleration increases performance. But when PCF is implemented in the pixel shader, the shaders become more complex. This increase in the shader complexity causes performance to decrease, but this becomes less significant as the scene complexity increases.

Figure 10-9 Performance Results with GeForce 6800 Ultra () and GeForce 8800 GTS ()

Although rendering scene shadows is accelerated on DirectX 9-level hardware, multiple rendering passes are still required for generating PSSMs. In other words, DirectX 9 provides us a partial acceleration of PSSMs. On the other hand, we can accelerate both generating PSSMs and rendering scene shadows on DirectX 10-level hardware.

The Direct3D 10 pipeline introduces the *geometry shader* (GS) as a new shader stage between vertex and
pixel shaders. Geometry shaders are executed once for each primitive, taking the vertices of the primitive as
input and outputting a number of new primitives. They also provide access to primitive adjacency information and
have an instruction set similar to that of vertex shaders (including support for texture sampling).

Another new feature in Direct3D 10 is the render target array, which is similar to Direct3D 9's multiple render targets (MRTs). But while MRTs only enable separate pixel shader output for each render target, the render target arrays (with the help of geometry shaders) can output entirely different geometries to each render target. This is done in the GS output, where we can specify a render target index for each primitive separately.

These features enable us to dynamically "clone" the geometry into different render targets with different transformations. We can now improve the performance of generating PSSMs because each shadow caster needs to be submitted only once.

First we need to create a texture array for the shadow maps. As shown in Listing 10-5, we use a 32-bit
floating-point texel format as usual, but we define it as "typeless" because it will need to work both as a
depth buffer and as a texture. The `ArraySize` parameter is set to the number of splits, so we have one
texture for each. Also, we must set the `BindFlags` to allow us to bind the texture array to both a
depth-stencil view and a shader resource view.

Resource views, another new feature in Direct3D 10, allow a single resource to be interpreted in different ways. In our case, we need to access the texture array both as a renderable depth-stencil surface and as a shader-accessible texture, so we must create views for both.

As shown in Listing 10-6, creating the shader resource view is simple. We set the view dimension to a texture 2D array and set the texel format to a single-channel floating-point color. Otherwise, we use the same settings as when we create the texture array.

The depth-stencil view is created in a similar manner, but the texel format must be set to a depth-buffer-compatible format, as shown in Listing 10-7.

Now we have an array of shadow maps that we can set as depth-stencil targets and bind as sampleable textures. Note that we don't need to create a render target, because we are interested in rendering only depth values, not color information.

We still have to create the needed shaders, which will be explained in the next two sub-sections.

```
D3D10_TEXTURE2D_DESC DescTex = {};
DescTex.Width = shadowMapSize;
DescTex.Height = shadowMapSize;
DescTex.ArraySize = numSplits;
DescTex.Format = DXGI_FORMAT_R32_TYPELESS;
DescTex.Usage = D3D10_USAGE_DEFAULT;
DescTex.BindFlags = D3D10_BIND_DEPTH_STENCIL | D3D10_BIND_SHADER_RESOURCE;
DescTex.MipLevels = 1;
DescTex.SampleDesc.Count = 1;
device->CreateTexture2D(...)
```

```
D3D10_SHADER_RESOURCE_VIEW_DESC DescSRV = {};
DescSRV.Format = DXGI_FORMAT_R32_FLOAT;
DescSRV.ViewDimension = D3D10_SRV_DIMENSION_TEXTURE2DARRAY;
DescSRV.Texture2DArray.ArraySize = numSplits;
DescSRV.Texture2DArray.MipLevels = 1;
device->CreateShaderResourceView(...)
```

```
D3D10_DEPTH_STENCIL_VIEW_DESC DescDSV = {};
DescDSV.Format = DXGI_FORMAT_D32_FLOAT;
DescDSV.ViewDimension = D3D10_DSV_DIMENSION_TEXTURE2DARRAY;
DescDSV.Texture2DArray.ArraySize = numSplits;
device->CreateDepthStencilView(...);
```

The rendering pipeline looks similar to the DirectX 9 version. The major difference is that we render shadow maps only once. Also, we now need to store the crop matrices of each split into an array, because we will need them all in the shaders.

More important, after finding the potential shadow casters for a split, we need each caster to keep track of
two variables: `firstSplit` and `lastSplit`. These variables determine the range of the split
indices into which the shadow caster needs to be rendered, as shown in
Figure 10-10. Note that we need to store only the
first and last indices, because a continuous bounding shape will always overlap a continuous range of split
parts.

Figure 10-10 The Shadow Caster Straddles the Split Parts from to

As we begin rendering the shadow maps, we need to set a new viewport and change the render targets. This is
shown in Listing 10-8, where the variable pDSV points to the previously created depth-stencil view. Because we
will only render depth information, we must set the number of render targets to 0 and the render target pointer
to `NULL`.

```
D3D10_VIEWPORT vp;
vp.Width = shadowMapSize;
vp.Height = shadowMapSize;
vp.MinDepth = 0;
vp.MaxDepth = 1;
vp.TopLeftX = 0;
vp.TopLeftY = 0;
device->RSSetViewports(1, &vp);
device->OMSetRenderTargets(0, NULL, pDSV);
```

Then we draw each shadow caster once, but in a special rendering loop. In this loop, we must update the shader
constants `firstSplit` and `lastSplit` with corresponding values from the shadow caster. Because
these variables are "perinstance" data, they should be handled like, for example, an object's world matrix.

In the following, we present two different approaches for dynamically cloning the geometry for different splits.

In this first method, we use the geometry shader to clone submitted triangles into different render targets. This technique is similar to the single-pass cube map technique presented in the Microsoft DirectX SDK. We first explain the details here, and then we discuss the performance advantages.

With this technique, the vertex shader simply transforms each vertex with the world matrix, the light's view matrix, and the light's projection matrix. This transformation is common for all shadow casters, regardless of which shadow maps they must be rendered into. The split-specific transformations will be applied in the geometry shader instead, as visualized in Figure 10-11.

Figure 10-11 The GPU Rendering Pipeline in the Geometry Shader Cloning Method

In other words, we now use the geometry shader to clone triangles into the split-specific render targets and then transform these triangles with the corresponding crop matrices as well.

The geometry shader code is shown in Listing 10-9, where we loop through each split to be rendered—that is,
from `firstSplit` to `lastSplit`. Inside this loop, we transform each vertex with the split's
corresponding crop matrix. We also set a render target index, and then finally we output the transformed
vertices as a new triangle. Note that the render target index is in fact specified per vertex, but only the
value in the leading (first) vertex is relevant.

```
// Geometry shader output structure
struct GS_OUT
{
float4 pos : SV_POSITION;
uint RTIndex : SV_RenderTargetArrayIndex;
};
// Geometry shader
[maxvertexcount(NUMSPLITS * 3)]
void GS_RenderShadowMap(triangle VS_OUT In[3],
inout TriangleStream<GS_OUT> triStream)
{
// For each split to render
for(int split = firstSplit; split <= lastSplit; split++)
{
GS_OUT Out;
// Set render target index.
Out.RTIndex = split;
// For each vertex of triangle
for(int vertex = 0; vertex < 3; vertex++)
{
// Transform vertex with split-specific crop matrix.
Out.pos = mul(In[vertex].pos, cropMatrix[split]);
// Append vertex to stream
triStream.Append(Out);
}
// Mark end of triangle
triStream.RestartStrip();
}
}
```

A pixel shader is not needed at this stage because we aren't interested in drawing color information. So, it
can simply be set to `NULL`.

For the sake of clarity, the implementation we presented uses separate transformations for all matrices. By
premultiplying view/projection/crop matrices together, we can implement the split-scheme PSSM(*m*) using
this method such that the number of transformations is (1 + *m*) per vertex. That is, every vertex is
transformed with the world matrix (one time) and then with the view/projection/crop matrices (*m* times).
Standard shadow-map rendering with a similar setup would use (*m* + *m*) transformations. However,
if we premultiply the world matrices as well, the number of transformations is the same (*m*) for both
methods. The advantage of the geometry shader cloning method is that it reduces the API overhead—the overhead of
extra draw calls, render target switches, and so on is removed because we submit each shadow caster only once.

The second approach for generating shadow maps is possible with the improved instancing support in Direct3D 10.
Vertex shaders can now acquire the index of an instance being rendered through the semantic
`SV_InstanceID`.

This means we can use instancing to clone our geometry for each split, and determine the split index from the instance index. We can then perform the crop matrix transformation in the vertex shader, so that the only task left for the geometry shader is setting the render target index, as shown in Listing 10-10 and visualized in Figure 10-12.

Figure 10-12 The GPU Rendering Pipeline in the Instancing Method

To render with instancing, we need to call the function `DrawIndexedInstanced()` with the number of
instances set to `lastSplit` - `firstSplit` + 1. This is the only change needed on the CPU side;
it is not necessary to set up any extra vertex streams.

```
struct VS_IN
{
float4 pos : POSITION;
uint instance : SV_InstanceID;
};
struct VS_OUT
{
float4 pos : POSITION;
uint split : TEXTURE0;
};
VS_OUT VS_RenderShadowMap(VS_IN In)
{
VS_OUT Out;
// Transform with world/view/projection.
Out.pos = mul(In.pos, ...);
// Determine split index from instance ID.
Out.split = firstSplit + In.instance;
// Transform with split-specific crop matrix.
Out.pos = mul(Out.pos, cropMatrix[Out.split]);
return Out;
}
[maxvertexcount(3)]
void GS_RenderShadowMap(triangle VS_OUT In[3],
inout TriangleStream<GS_OUT> triStream)
{
GS_OUT Out;
// Set render target index.
Out.RTIndex = In[0].split;
// Pass triangle through.
Out.pos = In[0].pos;
triStream.Append(Out);
Out.pos = In[1].pos;
triStream.Append(Out);
Out.pos = In[2].pos;
triStream.Append(Out);
triStream.RestartStrip();
}
```

Using this method, we again remove the extra API overhead. However, the number of transformations is the same as with standard shadow-map rendering. Still, the instancing method may be faster than using geometry shader cloning because expanding large amounts of data with the geometry shader may be expensive.

We can also use the cloning and instancing methods together, so that both are used to generate geometry for a portion of the splits. For example, in the split-scheme PSSM(4), instancing could be used for two splits and geometry shader cloning for another two splits.

The process of rendering shadows is nearly the same as in DirectX 9. The main difference is in the sampling of
the texture array. To do things the traditional way, we can use the function `SampleLevel()` to sample
from a given texture of the array.

Using this function is straightforward, but note that the second parameter is a `float3`, where the
first two floats are for the UV coordinates and the third float is for the texture index. The mipmap level is
defined in the function's third parameter.

After sampling the correct texture with this function, we would perform the depth comparison as usual. However, this is not the optimal method, especially if we wish to use PCF. We present a better approach next, which uses cube maps.

HLSL 4 introduces a new function called `SampleCmpLevelZero()`, which performs sampling and comparison
simultaneously. It can also be used with a linear comparison filter to get PCF, which means the function will
take four samples, compare each separately, and then bilinearly filter the results.

Unfortunately, `SampleCmpLevelZero()` cannot currently be used with Texture2DArrays. It can, however, be
used with TextureCubes. TextureCubes are normally used for cube maps, but they are essentially a texture array
with six textures, one for each face of the cube.

This means we can use a TextureCube for up to six splits. Using a TextureCube requires only a few changes in the setup, as shown in Listing 10-11. Rendering shadow maps works the same way as before.

Sampling is slightly complicated by the fact that a cube map is not accessed with standard texture coordinates,
but instead with a vector pointing to a face of the cube. This cube is a unit cube, and it is centered in the
origin. The faces are ordered so that from array index 0 to 5, the faces are located at +*x*, -*x*,
+*y*, -*y*, +*z*, and -*z*, respectively. From this we can determine the correct
direction vector for accessing each face, as shown in Listing 10-12.

However, the approach shown in Listing 10-12 is not necessary, because we can separate the different cases into three lookup arrays, as shown in Listing 10-13, which slightly increases the pixel shader's performance.

Another issue is that border color addressing mode cannot be used, because the borders are simply mapped onto another face of the cube map. However, we can simulate the border color addressing mode by avoiding sampling altogether if the coordinates are outside the valid range of texture coordinates. This is also shown in Listing 10-13.

In our tests, using `SampleCmpLevelZero()` with a TextureCube slightly improved our performance compared
to using a Texture2DArray with four-tap PCF implemented in the pixel shader (approximately 25 instructions).
However, with a TextureCube, six textures will always be allocated even if fewer are needed, often resulting in
wasted memory.

```
// When creating the texture array
DescTex.ArraySize = 6;
DescTex.MiscFlags = D3D10_RESOURCE_MISC_TEXTURECUBE;
. . .
// When creating the shader resource view
DescSRV.ViewDimension = D3D10_SRV_DIMENSION_TEXTURECUBE;
DescSRV.TextureCube.MipLevels = 1;
```

```
float3 cubeCoord;
if (split == 0)
cubeCoord = float3(0.5, 0.5 - pos.y, 0.5 - pos.x);
else if (split == 1)
cubeCoord = float3(-0.5, 0.5 - pos.y, pos.x - 0.5);
else if (split == 2)
cubeCoord = float3(pos.x - 0.5, 0.5, pos.y - 0.5);
else if (split == 3)
cubeCoord = float3(pos.x - 0.5, -0.5, 0.5 - pos.y);
else if (split == 4)
cubeCoord = float3(pos.x - 0.5, 0.5 - pos.y, 0.5);
else if (split == 5)
cubeCoord = float3(0.5 - pos.x, 0.5 - pos.y, -0.5);
```

```
static const float3 offset[6] = {
float3(0.5, 0.5, 0.5), float3(-0.5, 0.5, -0.5),
float3(-0.5, 0.5, -0.5), float3(-0.5, -0.5, 0.5),
float3(-0.5, 0.5, 0.5), float3(0.5, 0.5, -0.5)};
static const float3 mulX[6] = {
float3(0, 0, -1), float3(0, 0, 1), float3(1, 0, 0),
float3(1, 0, 0), float3(1, 0, 0), float3(-1, 0, 0)};
static const float3 mulY[6] = {
float3(0, -1, 0), float3(0, -1, 0), float3(0, 0, 1),
float3(0, 0, -1), float3(0, -1, 0), float3(0, -1, 0)};
SamplerComparisonState shadowMapSampler
{
ComparisonFunc = Less;
Filter = COMPARISON_MIN_MAG_LINEAR_MIP_POINT;
};
float4 PS_RenderShadows(PS_INPUT In) : SV_Target
{
float light = 1.0f;
for(int split = 0; split < numSplits; split++)
{
if(In.distance > splitEnd[split])
{
float4 texpos = In.texturePos[split];
texpos.xyz /= texpos.w;
// Cube map vector lookup
float3 cubeCoord = offset[split] +
mulX[split] * texpos.x +
mulY[split] * texpos.y;
// Don't sample outside the border.
if(min(pos.x, pos.y) > 0 && max(pos.x, pos.y) < 1)
{
light = shadowMapCube.SampleCmpLevelZero(shadowMapSampler,
cubeCoord, texpos.z);
}
break;
}
}
return light;
}
```

To improve the shadow quality with PSSMs, use the following optimizations:

**Filtering Techniques.***Percentage-closer filtering*(PCF) helps antialiasing in shadow mapping by performing multiple depth tests for each pixel, which makes the shadow boundaries look considerably smoother.*Variance shadow maps*(VSMs) (Donnelly and Lauritzen 2006) treat shadow map texels as a distribution of depths and represent this distribution in a way that can be linearly filtered. Chebyshev's inequality is used to approximate the occlusion over an arbitrary filter region.**Packing of Textures.**Instead of storing PSSMs into multiple textures in hardware-accelerated implementations, users may pack them into a single texture. This is a possible solution when few texture samplers are available. In DirectX 10, texture arrays can be utilized instead.**Warping Algorithms.**An interesting application of the split scheme is applying warping algorithms (for example, PSMs, LiSPSMs, and TSMs) separately in split parts.**Varying Texture Resolution.**All shadow maps in our current implementation of PSSMs have the same resolution. To reduce the required consumption of textures, users can adjust the size of each shadow map at different split layers.**Linearized Depth Distribution**(Brabec et al. 2002). Using a linear depth metric instead of standard projected*z*increases the precision of discrete depth quantization for point lights.

The screen shots in Figures 10-13 through
10-16 from *Dawnspire: Prelude*—displayed
throughout the remainder of the chapter—were rendered using the PSSM(3; 1Kx1K) scheme (multipass method) at an
image size of 1280x1024. The performance is measured on a PC configured with an AMD Athlon64 X2 3800 CPU, 1 GB
DDR RAM, and a GeForce 7800 GTX GPU. The number of visible triangles processed in these figures ranges from 165
K to 319 K. In our experiments, both PSSM(3) and PSSM(4) achieve a good trade-off between the performance and
visual qualities (more than 31 frames/sec and 27 frames/sec in the two schemes, respectively).

Figure 10-13 Screen Shot from

Figure 10-14 From

Figure 10-15 From

Figure 10-16 From

In many applications, especially for complex and large-scale scenes, rendering high-quality shadows is crucial for the users' perception of realism. Even without hardware acceleration, it is worth the cost of multipass rendering to dramatically improve shadow quality.

The basic idea of parallel-split shadow maps is intuitive and simple to implement, as with our multipass method. We can quickly determine the split positions using the practical split scheme without complicated scene analysis. However, without using hardware acceleration, the performance drop caused by multiple rendering passes prevents this technique from being extensively used in mass-market applications.

In this chapter, we present two hardware-accelerated methods: partially accelerated implementation on DirectX 9-level GPUs and fully accelerated implementation on DirectX 10-level GPUs. The extra rendering passes required for synthesizing shadows are avoided in the partially accelerated implementation. In the DirectX 10 rendering pipeline, our fully accelerated implementation reduces the cost of extra rendering passes for rendering both shadow maps and scene shadows.

There are many other ways to improve performance and shadow quality in our PSSMs technique, but one thing is certain: With the rapid evolution of GPUs, PSSMs are a promising approach for high-quality shadow rendering.

Demos with full source code can be found on the book's accompanying DVD. All three implementation methods presented in this chapter are included.

Brabec, Stefan, Thomas Annen, and Hans-Peter Seidel. 2002. "Practical Shadow Mapping."
*Journal of Graphical Tools* 7(4), pp. 9–18.

Donnelly, William, and Andrew Lauritzen. 2006. "Variance Shadow Maps." In
*Proceedings of the Symposium on Interactive 3D Graphics and Games 2006*, pp. 161–165.

Lloyd, Brandon, David Tuft, Sung-Eui Yoon, and Dinesh Manocha. 2006. "Warping and Partitioning for Low Error
Shadow Maps." In *Proceedings of the Eurographics Symposium on Rendering 2006*, pp. 215–226.

Martin, Tobias, and Tiow-Seng Tan. 2004. "Anti-aliasing and Continuity with Trapezoidal Shadow Maps." In
*Proceedings of the Eurographics Symposium on Rendering 2004*, pp. 153–160.

Reeves, William, David Salesin, and Robert Cook. 1987. "Rendering Antialiased Shadows with Depth Maps." In
*Computer Graphics (Proceedings of SIGGRAPH 1987)* 21(3), pp. 283–291.

Stamminger, Marc, and George Drettakis. 2002. "Perspective Shadow Maps." In
*ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002)* 21(3), pp. 557–562.

Tadamura, Katsumi, Xueying Qin, Guofang Jiao, and Eihachiro Nakamae. 2001. "Rendering Optimal Solar Shadows
with Plural Sunlight Depth Buffers." *The Visual Computer* 17(2), pp. 76–90.

Williams, Lance. 1978. "Casting Curved Shadows on Curved Surfaces." In
*Computer Graphics (Proceedings of SIGGRAPH 1978)* 12(3), pp. 270–274.

Wimmer, Michael, Daniel Scherzer, and Werner Purgathofer. 2004. "Light Space Perspective Shadow Maps." In
*Proceedings of the Eurographics Symposium on Rendering 2004*, pp. 143–152.

Zhang, Fan, Hanqiu Sun, Leilei Xu, and Kit-Lun Lee. 2006. "Parallel-Split Shadow Maps for Large-Scale Virtual
Environments." In
*Proceedings of ACM International Conference on Virtual Reality Continuum and Its Applications 2006*, pp.
311–318.

Zhang, Fan, Hanqiu Sun, Leilei Xu, and Kit-Lun Lee. 2007. "Hardware-Accelerated Parallel-Split Shadow Maps."
*International Journal of Image and Graphics*. In press.

*All screen shots are from* Dawnspire: Prelude
*(http://www.dawnspire.com), courtesy of
Silent Grove Studios. Thanks to Anders Hammervald
(anders@hammervald.com) for his sincere help with preparing the
images. The models used in the illustrative figures are downloaded from
http://www.planetquake.com.*

- Contributors
- Foreword
- Part I: Geometry
- Chapter 1. Generating Complex Procedural Terrains Using the GPU
- Chapter 2. Animated Crowd Rendering
- Chapter 3. DirectX 10 Blend Shapes: Breaking the Limits
- Chapter 4. Next-Generation SpeedTree Rendering
- Chapter 5. Generic Adaptive Mesh Refinement
- Chapter 6. GPU-Generated Procedural Wind Animations for Trees
- Chapter 7. Point-Based Visualization of Metaballs on a GPU
- Part II: Light and Shadows
- Chapter 8. Summed-Area Variance Shadow Maps
- Chapter 9. Interactive Cinematic Relighting with Global Illumination
- Chapter 10. Parallel-Split Shadow Maps on Programmable GPUs
- Chapter 11. Efficient and Robust Shadow Volumes Using Hierarchical Occlusion Culling and Geometry Shaders
- Chapter 12. High-Quality Ambient Occlusion
- Chapter 13. Volumetric Light Scattering as a Post-Process
- Part III: Rendering
- Chapter 14. Advanced Techniques for Realistic Real-Time Skin Rendering
- Chapter 15. Playable Universal Capture
- Chapter 16. Vegetation Procedural Animation and Shading in Crysis
- Chapter 17. Robust Multiple Specular Reflections and Refractions
- Chapter 18. Relaxed Cone Stepping for Relief Mapping
- Chapter 19. Deferred Shading in Tabula Rasa
- Chapter 20. GPU-Based Importance Sampling
- Part IV: Image Effects
- Chapter 21. True Impostors
- Chapter 22. Baking Normal Maps on the GPU
- Chapter 23. High-Speed, Off-Screen Particles
- Chapter 24. The Importance of Being Linear
- Chapter 25. Rendering Vector Art on the GPU
- Chapter 26. Object Detection by Color: Using the GPU for Real-Time Video Image Processing
- Chapter 27. Motion Blur as a Post-Processing Effect
- Chapter 28. Practical Post-Process Depth of Field
- Part V: Physics Simulation
- Chapter 29. Real-Time Rigid Body Simulation on GPUs
- Chapter 30. Real-Time Simulation and Rendering of 3D Fluids
- Chapter 31. Fast N-Body Simulation with CUDA
- Chapter 32. Broad-Phase Collision Detection with CUDA
- Chapter 33. LCP Algorithms for Collision Detection Using CUDA
- Chapter 34. Signed Distance Fields Using Single-Pass GPU Scan Conversion of Tetrahedra
- Chapter 35. Fast Virus Signature Matching on the GPU
- Part VI: GPU Computing
- Chapter 36. AES Encryption and Decryption on the GPU
- Chapter 37. Efficient Random Number Generation and Application Using CUDA
- Chapter 38. Imaging Earth's Subsurface Using CUDA
- Chapter 39. Parallel Prefix Sum (Scan) with CUDA
- Chapter 40. Incremental Computation of the Gaussian
- Chapter 41. Using the Geometry Shader for Compact and Variable-Length GPU Feedback
- Preface