In the previous chapters, we learned how to represent scenes and objects with triangle meshes. It is now time to learn how we transform these representations into images that will be consumed by users. As mentioned in Chapter 1, the technical word for this process is "rendering." We have also mentioned the only rendering algorithm we will use in this book, the "graphics pipeline," and its particular implementation, "WebGL," several times. In this chapter, we will discuss what they are and how they work.
First, let us specify precisely what a rendering algorithm is. A rendering algorithm takes as input (1) a description of a 3D scene and (2) a description of how the scene is viewed by an observer. It then outputs an image that depicts the scene from the aforementioned viewpoint. Metaphorically, a rendering algorithm is thus the inner workings of a camera that takes snapshots of the scene.
Let us take a closer look at the output. We learned in Chapter 2 that, most of the time, the images a rendering algorithm produces are raster images. A raster image is made of a number of pixels. Hence, the main task of a rendering algorithm is to figure out the colors of all the pixels in the output image. We also learned that the description of the scene defines an implicit continuous image from which we must construct the raster image. A continuous image contains much more information than a raster image can ever store, so a rendering algorithm practically "summarizes" the continuous image into a raster one.
Now, looking closer at the input, a scene consists of a number of objects. We have decided that each of these objects is represented by meshes, which are in turn made out of primitives. We can think of these primitives as emitting light, which has color, and this color should be recorded by a rendering algorithm in the output image if the observer can see the primitive from the input viewpoint. There are several reasons why a primitive, or some part of it, may not be recorded by a rendering algorithm. First, the primitive may be out of the viewpoint's "field of view." For example, a primitive that is behind the camera would not be seen by it. Second, the primitive might be behind another primitive from the camera's point of view. As a result, rendering involves figuring out, for each pixel, which primitive we should take the color from.
In computer graphics, there are two main ways to organize computation to answer the question of "which primitive contributes to which pixel." This results in two types of rendering algorithms: image-order rendering algorithms and object-order rendering algorithms. Simplified pseudocode for the algorithms is given in Figure 7.1.
Figure 7.1: Simplified pseudocode for an image-order rendering algorithm and an object-order rendering algorithm.
The two algorithms are very similar: two for loops with the same loop body. The only difference between them is the order of the loops. Image-order algorithms' outer loops cycle through the pixels, while those for object-order algorithms cycle through the primitives. For the pseudocode in Figure 7.1, the two algorithms are equivalent. However, more advanced versions of the algorithms have more sophisticated ways of pruning useless pixel-primitive pairs to make computation faster.
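To make the difference concrete, the following JavaScript sketch spells out the two loop orderings of Figure 7.1. The `scene.primitives` and `image.pixels` collections and the `contribute` helper are hypothetical stand-ins for the per-pair work of deciding whether the primitive is visible at the pixel and updating the pixel's color accordingly.

```javascript
// Image-order rendering: the outer loop cycles through the pixels.
function renderImageOrder(scene, image) {
  for (const pixel of image.pixels) {
    for (const primitive of scene.primitives) {
      contribute(primitive, pixel, image);  // hypothetical per-pair work
    }
  }
}

// Object-order rendering: the outer loop cycles through the primitives.
function renderObjectOrder(scene, image) {
  for (const primitive of scene.primitives) {
    for (const pixel of image.pixels) {
      contribute(primitive, pixel, image);  // same loop body, different order
    }
  }
}
```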
One of the most well-known image-order rendering algorithms is ray casting. The algorithm posits that the color of a pixel is the color of the light that travels along the ray that passes through the center of the pixel in the image plane. (Recall Figure 2.3.) So, to figure out the color, it finds the first primitive that intersects the ray and takes the color from that primitive. A more sophisticated version of ray casting called path tracing takes into account that primitives can also reflect light that comes from light sources or other primitives, and so it can spawn more rays from the intersection point until it finds a light source. Path tracing can generate high-quality renderings that look photo-realistic, and it is the basis of rendering algorithms used in CG-animated movies and special effects shots in live-action films. Image-order rendering algorithms are usually implemented entirely in software, making it difficult for them to meet the stringent speed requirements of interactive applications.
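As a rough sketch of the idea, the core of a ray caster could look like the code below. The `rayThroughPixel` and `intersect` helpers are hypothetical: the former constructs the ray through the pixel's center, and the latter returns the distance to the intersection between a ray and a primitive, or `null` if they do not intersect.

```javascript
// A sketch of ray casting for a single pixel: find the first primitive the ray
// hits and take the pixel's color from it.
function castRay(camera, pixel, primitives, backgroundColor) {
  const ray = rayThroughPixel(camera, pixel);   // hypothetical helper
  let closestPrimitive = null;
  let closestDistance = Infinity;
  for (const primitive of primitives) {
    const distance = intersect(ray, primitive); // hypothetical helper
    if (distance !== null && distance < closestDistance) {
      closestDistance = distance;
      closestPrimitive = primitive;
    }
  }
  // If the ray hits nothing, the pixel gets the background color.
  return closestPrimitive !== null ? closestPrimitive.color : backgroundColor;
}
```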
The "graphics pipeline" is the most well-known example of object-order rendering algorithms, and we will describe it in more details in the next section. In constrast to ray casting and path tracing, it has hardware that is designed specifically to run it: the graphics processing units (GPUs).1 Hardware acceleration makes it possible to render complex scenes in real-time, enabling applications such as computer games and interactive visualization.
Implementations of the graphics pipeline usually come in the form of Application Programming Interfaces (APIs), which are libraries of functions and data structures that programmers can use to build their own applications. Well-known implementations include DirectX, Vulkan, and Metal. In this book, we will focus on OpenGL and its web-based offshoot, WebGL. These implementations all rely on GPUs to perform almost all of their operations. As a result, they can be regarded as interfaces to the GPU that enable 3D rendering.
The graphics pipeline takes as input a specification of a scene. Here, unlike what we discussed previously, the specification of the camera and the camera's viewpoint is implicitly included in the scene specification, and we will discuss how this inclusion is accomplished momentarily. The graphics pipeline assumes that objects in the scene are represented by meshes. Recall from Chapter 6 that a mesh is a collection of vertices and a list of primitives that are constructed from these vertices. A scene can also have image data called "textures," but we will not discuss them until Chapter XXX. The graphics pipeline must then produce a raster image of the scene. The memory that stores the output image is called the framebuffer. It is a special area of GPU memory whose color values are displayed on the monitor connected to the GPU.
The graphics pipeline converts meshes into a raster image with the following 5-step process.
1. Vertex processing
2. Primitive assembly
3. Rasterization
4. Fragment processing
5. Raster operation
An overview of what each step does is shown in Figure 7.2. Let us now discuss each of these steps in more detail.
The main goal of this step is to process vertex attributes so that they are easy for later steps to work on. The most important attribute to process is the position of the vertex. At the end of vertex processing, all positions must be transformed so that they become normalized device coordinates (NDCs). NDCs are coordinates in the normalized device coordinate system: a 3D coordinate system such that anything outside a certain volume is not visible in the output image. Different implementations of the graphics pipeline have different definitions of the "visible volume." For OpenGL and WebGL, it is the cube $\{ (x,y,z) : -1 \leq x,y,z \leq 1\}$. For DirectX, it is the box $\{ (x,y,z) : -1 \leq x, y \leq 1, 0 \leq z \leq 1\}$.
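As a small illustration of the two conventions, the following sketch checks whether a point, given as a hypothetical object `{x, y, z}` holding its NDCs, lies inside each visible volume.

```javascript
// Visible volume of OpenGL and WebGL: the cube [-1, 1]^3.
function isInsideVisibleVolumeGL(p) {
  return -1 <= p.x && p.x <= 1 && -1 <= p.y && p.y <= 1 && -1 <= p.z && p.z <= 1;
}

// Visible volume of DirectX: the box [-1, 1] x [-1, 1] x [0, 1].
function isInsideVisibleVolumeDirectX(p) {
  return -1 <= p.x && p.x <= 1 && -1 <= p.y && p.y <= 1 && 0 <= p.z && p.z <= 1;
}
```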
We said earlier that the graphics pipeline assumes that the viewpoint is implicitly included in the scene description. What we meant by this is that, in modern implementations of the graphics pipeline, it is the responsibility of the programmer to customize the vertex processing step so that the transformation of vertex positions into NDCs takes into account the viewpoint and the camera. This customization is often done by writing a piece of code called the vertex shader, which can be thought of as a part of the scene's specification. We will discuss how to write vertex shaders to implement two frequently used types of cameras in Chapter XXX. For now, just remember that simulating cameras can be done inside a vertex shader, and it is our responsibility to do so.
The primitive assembly step processes information about primitives. Mainly, it does the following.
1. It assembles vertices into primitives (points, line segments, or triangles) according to the mesh's primitive list.
2. It clips primitives against the boundary of the visible volume so that parts lying outside the volume are discarded.
3. It performs face culling: primitives that face away from the viewer can be discarded because they are typically not visible. (A sketch of the orientation test used for this purpose is given below.)
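The sketch below illustrates the orientation test commonly used for face culling. It assumes the widespread OpenGL/WebGL default in which a triangle whose vertices appear in counter-clockwise order on the screen is front-facing; the vertices are hypothetical objects `{x, y}` holding the x- and y-components of their NDCs.

```javascript
// Twice the signed area of a triangle projected onto the xy-plane.
function signedArea(v0, v1, v2) {
  return (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
}

// A negative signed area means the vertices appear in clockwise order,
// i.e., the triangle faces away from the viewer and can be culled.
function isBackFacing(v0, v1, v2) {
  return signedArea(v0, v1, v2) < 0;
}
```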
Modern implementations of the graphics pipeline also support customization of the primitive assembly step by allowing the programmer to write tessellation shaders and geometry shaders. However, we will not cover these aspects of the graphics pipeline in this book because they are not currently supported by WebGL.
Primitives that survive face culling (meaning that they are potentially visible in the output image if not occluded by other primitives) are sliced into rectangular areas, each the size of a pixel. This process is called rasterization, and each generated rectangular area is called a fragment. A fragment can be thought of as a "potential pixel," whose visibility will be determined in a later step of the pipeline. Unlike the previous two steps, this step is often not customizable.
An important function of the rasterization step is the interpolation of vertex attributes. Recall from Chapter 5 and Chapter 6 that attributes such as colors and vertex normals are interpolated when the graphics pipeline needs to fill in the space between the vertices. It is through rasterization that this interpolation is performed. We will discuss the specifics of the interpolation later when we discuss OpenGL and WebGL.
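As a preview, the sketch below interpolates a single scalar attribute inside a triangle using barycentric coordinates; for simplicity, it ignores the perspective correction that real implementations apply. The vertices and the point `p` are hypothetical objects `{x, y}`, and `a0`, `a1`, `a2` are the attribute values at the three vertices.

```javascript
// Interpolate an attribute at point p inside the triangle (v0, v1, v2).
function interpolateAttribute(p, v0, v1, v2, a0, a1, a2) {
  // Twice the signed area of the whole triangle.
  const area = (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
  // Barycentric coordinates of p: each weight is the area of the sub-triangle
  // opposite a vertex divided by the area of the whole triangle.
  const w0 = ((v1.x - p.x) * (v2.y - p.y) - (v2.x - p.x) * (v1.y - p.y)) / area;
  const w1 = ((v2.x - p.x) * (v0.y - p.y) - (v0.x - p.x) * (v2.y - p.y)) / area;
  const w2 = 1 - w0 - w1;
  return w0 * a0 + w1 * a1 + w2 * a2;
}
```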
This step processes all the fragments generated by the last step with the following two goals:
1. computing a color for each fragment, and
2. discarding fragments that should not contribute to the output image.
Recall from Section 7.5.2 that the "scissor test" is one such discard mechanism: it rejects fragments that fall outside a rectangular region of the screen specified by the programmer.
The fragment processing step can produce multiple fragments that occupy the same pixel location. The raster operation step processes these fragments into a single color value. The step can be divided into two important substeps.
1. Culling tests, such as the depth test, which discard fragments that should not be visible, for example because they are occluded by other fragments that are closer to the viewer.
2. Blending, which combines the color of each surviving fragment with the color already stored at the corresponding pixel location in the framebuffer.
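The following sketch shows what the two substeps might look like for a single fragment, assuming a hypothetical `framebuffer` object with per-pixel `colorBuffer` and `depthBuffer` arrays and the convention that smaller depth values are closer to the viewer.

```javascript
function rasterOperation(fragment, framebuffer) {
  const i = fragment.pixelIndex;  // hypothetical index of the fragment's pixel

  // Substep 1: culling tests. The depth test discards the fragment if a
  // closer fragment has already been written to this pixel.
  if (fragment.depth >= framebuffer.depthBuffer[i]) {
    return;
  }
  framebuffer.depthBuffer[i] = fragment.depth;

  // Substep 2: blending. Here we simply overwrite the stored color; other
  // blending modes would mix fragment.color with the stored color instead.
  framebuffer.colorBuffer[i] = fragment.color;
}
```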
Different implementations of the graphics pipeline may differ in how they implement the raster operation step. Some may have more culling tests or blending operations, or they may add entirely new operations altogether.
OpenGL stands for "Open Graphics Library." It was originally developed by Silicon Graphics, a well-known tech company that unfortunately went bankrupt in 2009. An organization called the Khronos Group has been responsible for the development and maintenance of OpenGL since 2006.
"OpenGL" refers to the library that is written in the C programming language and intended to be used to develop native applications for personal computers. A spinoff library called OpenGL for Embedded Systems (abbreviated as "OpenGL ES") was released in 2003. It targets less powerful devices such as video game consoles, tablets, and smartphones.
WebGL is a spinoff of OpenGL ES for web applications. It is packaged as a Javascript library and is generally available in modern web browsers such as Chrome, Firefox, Safari, and Edge without the need to install any extra software. In this book, we will use WebGL version 2.0, which has been supported by all the aforementioned browsers since 2022.
Now, let us discuss details of the graphics pipeline's implementation that are specific to WebGL.
There are two parts to writing WebGL programs. The first part is using Javascript to call functions in the WebGL API. This includes preparing an area in the web page that WebGL will draw the primitives to, sending data to the GPU, altering various settings of the graphics pipeline, and then asking WebGL to draw the primitives we want to see. We will discuss how to use these functions to create the simplest computer graphics programs in the next chapter.
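As a small taste of this first part, the sketch below obtains a WebGL 2.0 context and asks it to clear the drawing area. It assumes the web page contains a `<canvas>` element with the id `webgl-canvas`.

```javascript
const canvas = document.getElementById("webgl-canvas");
const gl = canvas.getContext("webgl2");      // request a WebGL 2.0 context
if (!gl) {
  throw new Error("WebGL 2.0 is not supported by this browser.");
}
gl.clearColor(0.0, 0.0, 0.0, 1.0);           // set the clear color to opaque black
gl.clear(gl.COLOR_BUFFER_BIT);               // clear the drawing area's color values
```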
The second part is customizing the graphics pipeline by writing GLSL programs. Here, a GLSL program is an object that contains, as its constituent parts, at least two shaders that WebGL 2.0 allows you to write: the vertex shader (Section 7.2.1) and the fragment shader (Section 7.2.4). Unlike the Javascript code of the first part, which runs on the CPU, shader code is compiled into machine code that runs on the GPU. Shaders are written in a programming language called the OpenGL Shading Language, abbreviated as "GLSL." Like WebGL, there are multiple versions of the GLSL language. In this book, we will use GLSL ES version 3.0, which is the latest version supported by WebGL 2.0. The syntax of the GLSL language is similar to languages in the C family, and we will discuss it in more detail in Chapter XXX.
Now, we mentioned earlier that modern implementations of the graphics pipeline allow the user to customize the pipeline to their needs. This implies that there are default behaviors to which the implementation falls back if the user does not do any customization. However, for WebGL 2.0, customizing the graphics pipeline with a GLSL program is a requirement if one wants to do anything more complicated than clearing the screen. As a result, if we want to draw any primitive at all, we must write two shaders in GLSL, combine them into a GLSL program, and have WebGL execute it for us.
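A sketch of how the two shaders are combined into a GLSL program is given below; it assumes `vertexShaderSource` and `fragmentShaderSource` are strings containing GLSL code.

```javascript
function createGlslProgram(gl, vertexShaderSource, fragmentShaderSource) {
  function compile(type, source) {
    const shader = gl.createShader(type);
    gl.shaderSource(shader, source);
    gl.compileShader(shader);
    if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
      throw new Error(gl.getShaderInfoLog(shader));
    }
    return shader;
  }
  const program = gl.createProgram();
  gl.attachShader(program, compile(gl.VERTEX_SHADER, vertexShaderSource));
  gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fragmentShaderSource));
  gl.linkProgram(program);
  if (!gl.getProgramParameter(program, gl.LINK_STATUS)) {
    throw new Error(gl.getProgramInfoLog(program));
  }
  return program;
}
```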
Recall that a vertex shader is a piece of code that customizes the vertex processing step of the graphics pipeline. Moreover, the main output of the vertex processing step is the NDCs corresponding to the vertex positions. WebGL has its own convention on how the NDCs are computed, and we must understand it in order for us to write effective shaders.
The main data that WebGL expects a vertex shader to output is the positions of the vertices in clip space, and these positions are referred to as clip space coordinates. Clip space coordinates are homogeneous coordinates (Section 4.3 of Chapter 4), which means that they have 4 components: $x$, $y$, $z$, and $w$. WebGL converts clip space coordinates to NDCs by itself, and the user cannot change this part.
The process by which WebGL computes NDCs from clip space coordinates is called the perspective divide. Suppose that the clip space coordinates that the vertex shader outputs are $P_{\mrm{clip}} = (x_{\mrm{clip}}, y_{\mrm{clip}}, z_{\mrm{clip}}, w_{\mrm{clip}})$. WebGL would compute the NDCs, denoted by $P_{\mrm{ndc}} = (x_{\mrm{ndc}}, y_{\mrm{ndc}}, z_{\mrm{ndc}})$, by dividing the $x$-, $y$-, and $z$-coordinates by the $w$-coordinate: \begin{align*} \begin{bmatrix} x_{\mrm{ndc}} \\ y_{\mrm{ndc}} \\ z_{\mrm{ndc}} \end{bmatrix} = \begin{bmatrix} x_{\mrm{clip}} / w_{\mrm{clip}} \\ y_{\mrm{clip}} / w_{\mrm{clip}} \\ z_{\mrm{clip}} / w_{\mrm{clip}} \end{bmatrix}. \end{align*} After the NDCs are computed, they are then used in primitive assembly.
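In code, the perspective divide amounts to the small sketch below, where the clip space coordinates are a hypothetical object `{x, y, z, w}`.

```javascript
function perspectiveDivide(clip) {
  return {
    x: clip.x / clip.w,
    y: clip.y / clip.w,
    z: clip.z / clip.w
  };
}

// For example, the clip space coordinates (2, -1, 0, 2) become the NDCs (1, -0.5, 0).
```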
A question that might come to the reader's mind right now is how we are supposed to write vertex shaders so that they output the right clip space coordinates. The answer is that there is a standard way to do it, and computer graphics applications rarely deviate from this method. It involves multiplying the coordinates by matrices, and the technical term for this is "transforming" them. There are three main types of transformations. One is used to put objects into a scene (modeling transformation), one is used to set up the camera's viewpoint (view transformation), and the last is used to project 3D objects to a 2D plane (projective transformation or projection). We will learn about the first two types of transformations in Chapter XXX and Chapter XXX, and we will learn about projections in Chapter XXX.
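To give a flavor of what this looks like, below is a sketch of the core of such a vertex shader, written in GLSL ES 3.0 and stored in a Javascript string; the attribute and uniform names are our own choices, not fixed by WebGL.

```javascript
const vertexShaderSource = `#version 300 es
in vec3 a_position;          // vertex position in the object's own space
uniform mat4 u_model;        // modeling transformation
uniform mat4 u_view;         // view transformation
uniform mat4 u_projection;   // projection

void main() {
  // Clip space coordinates = projection * view * model * position.
  gl_Position = u_projection * u_view * u_model * vec4(a_position, 1.0);
}`;
```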
However, for the next few chapters, we will not yet have the tools to implement the "standard transformation pipeline" above. For now, we will write vertex shaders so that they directly output clip space coordinates. Moreover, we will always set the $w$-coordinate to $1$ so that the NDCs are equal to the $xyz$-components of the clip space coordinates. Recall that WebGL considers any vertex whose NDCs are outside the cube $\{(x,y,z) : -1 \leq x,y,z \leq 1\}$ to be invisible to the camera. As a result, we must make sure that all primitives we want to show have positions in the $[-1,1]$ range. Other than the clip space coordinates of the vertex positions, the vertex shader can output other vertex attributes, which are typically referred to as varying variables; we will learn how to use them in Chapter XXX. These varying variables are interpolated by the rasterization step and then passed to the fragment shader.
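As a sketch of the kind of vertex shader we will write in the next few chapters, the GLSL code below (stored in a Javascript string) outputs the vertex position directly as clip space coordinates with $w = 1$ and passes a color attribute along as a varying variable. The attribute and varying names are our own choices.

```javascript
const simpleVertexShaderSource = `#version 300 es
in vec3 a_position;   // vertex position, assumed to be in the [-1, 1] range
in vec4 a_color;      // vertex color attribute
out vec4 v_color;     // varying variable, interpolated during rasterization

void main() {
  v_color = a_color;
  gl_Position = vec4(a_position, 1.0);  // w = 1, so the NDCs equal a_position
}`;
```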
Recall again that the rasterization step chops primitives into square areas, each the size of a pixel, called fragments. To each fragment, it attaches interpolated values of the varying variables. Then, the graphics pipeline transitions to the fragment processing step, invoking the fragment shader on each fragment to process it.
The fragment shader receives as input the interpolated varying variables. It also has access to image data called "textures" (Chapter XXX) that can be set up by Javascript code. It must then process this information and perform one of the two actions previously discussed in Section 7.2.4.
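A sketch of a fragment shader that can perform either of the two actions is given below (again stored in a Javascript string); the varying and output variable names are our own choices.

```javascript
const simpleFragmentShaderSource = `#version 300 es
precision highp float;

in vec4 v_color;      // interpolated varying variable from the vertex shader
out vec4 fragColor;   // the color that the fragment shader outputs

void main() {
  if (v_color.a < 0.5) {
    discard;            // action 1: discard the fragment
  }
  fragColor = v_color;  // action 2: output a color for the fragment
}`;
```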
Because the fragment shader decides the fragment colors, it plays a central role in determining how objects appear in images. Hence, it must take into account information related to appearance, including the object's colors, textures, and how the object's surface reflects or emits light. It must also take into account the intensity of light and the direction from which light comes to illuminate the object. We will learn how to model light interaction in Chapter XXX and Chapter XXX.
While GPUs are designed to run the graphics pipeline, modern GPUs can run many other types of algorithms. Currently, GPUs are the main workhorse of artificial intelligence because they can execute and train neural networks quickly.