# Orthogonal Projection & Parallel Computing

Since I started studying machine learning, I have touched on many exciting fields of computer science and mathematics. I find them enjoyable to learn, even outside of the context of machine learning.

In this article, I will talk about some ideas I have on applying linear algebra to speed up certain kind of mathematical CUDA kernels. Checkout this GitHub project for a sample implementation.

1. What's the goal?
2. Idea
3. Theory
4. Implementation
5. Performance Testing Result
6. Challenge
7. Conclusion

## I. What's the goal?

The goal of this article is to find a way to speed up some complicated scientific computation on GPU. General purpose GPU computing is used in many fields nowadays: machine learning, bioinformatics, physics, etc. However, GPU programming is very different from CPU programming. Let's take a look.

### 1.1 GPU vs. CPU

Let me start by asking this question: which one of the following implementation would you prefer?

// Implementation 1
for (size_t i{0}; i < 4; ++i) ++arr[i];

// Implementation 2
++arr;
++arr;
++arr;
++arr;

Most of the people would choose the first one. When programming for CPU, the first one is for sure better. However, on a GPU function, the first implementation is about 30% slower than the first one in reality (75% in theory), because of warp divergence.

The reason for such a big difference, is that CPUs and GPUs are built for different purposes: CPUs are built to handle complicated logic, GPUs are built for mass parallel computing. CPU handles logical decisions all the time (e.g., register access, cache management, pointer access, etc.); GPU is not good at logic, but it's excellent at executing a single task on a large amount of data at the same time. You can think of CPU as a Ferrari, and GPU as a mega-bus.

At the time of this writing, the dominant GPU hardwares for scientific computing mostly come from NVIDIA, and the parallel computing platform on NVIDIA GPU is called CUDA. The programming language for CUDA is an extension to C++, with the capability to control parallel computing on the GPU. Because CUDA is so popular in the world of scientific computing, in this article we will only use CUDA as our platform.

Before we move on, let's deal with the fact that the word "CPU" and "GPU" is only one letter away, that makes them hard to distinguish. For the rest of this article, I will use terms in CUDA to identify them: the host refers to the CPU and its memory, while the device refers to the GPU and its memory.

### 1.2 CUDA Kernel

This article is not meant to be an introduction to CUDA programming, so I will only cover what's required to understand our problem and solution. If you're interested learning CUDA programming, take a look at NVIDIA's tutorial, there are plenty of good resource on their website.

We call functions executed on the device a kernel. Kernels are very sensitive to poor-written program, because they are executed on a light weight computing unit called Streaming Multi-core Processor (SM).

Here is an example of a simple CUDA program:

#include <iostream>
#include <cstdio>

// Define a kernel to print "Hello!" and  some thread information.
__global__ void SayHello() {
printf("Hello! (b_idx: %d, t_idx: %d)\n", blockIdx.x threadIdx.x);
}

int main(int argc, const char* argv[]) {
std::cout << "Starting GPU printing." << std::endl;
// Call SayHello on 2 blocks, each block contains 3 threads.
SayHello<<<2, 3>>>();
std::cout << "GPU printing finished." << std::endl;
return 0;
}

In this example, SayHello is our kernel. It doesn't take any parameters, and returns void. In fact, all kernels must return void. The built-in macro defined keyword __global__ makes it a kernel function, which means it will be called on the host, and executed on the device. blockIdx and threadIdx are of type dim3, each contains a 3-dimensional "position" information of current block and thread, they are also built-in for each thread.

The output of the program will be something like this (the order may be different):

Starting GPU printing.
Hello! (b_idx: 0, t_idx: 1)
Hello! (b_idx: 0, t_idx: 0)
Hello! (b_idx: 1, t_idx: 2)
Hello! (b_idx: 0, t_idx: 0)
Hello! (b_idx: 1, t_idx: 1)
Hello! (b_idx: 1, t_idx: 2)
GPU printing finished.

One of the important task in CUDA programming is kernel optimization. There are mainly two areas you can focus on for optimization: arithmetic operation, and memory access operation. In this article, we will be focusing on optimizing arithmetic operation. However, memory access on the device will influence our testing result, so we will mention it if necessary. For now, just remember that we only care about arithmetic operations inside of our device functions.

### 1.3 Sample Problem

Now, let's get our hands dirty: define a problem and start working on it. Suppose we have a function: $$f(x)=\frac{1}{2\pi}+\frac{cos^3(x-\pi)}{3\pi}+\frac{(x-\pi)^2}{9\pi}sin^3(x-\pi),x\in[0,2\pi]$$

Here's a plot of f, and we're interested in its integral, the orange highlighted area:

If you pay attention, you can notice some minor curvatures on this plot's curve, but the actual plot is smoother. Because of the limited computing power of a web browser, I had to choose a limited number of samples for x in order to generate this plot live, and make it look fine on all screen sizes. Please ignore these random noise, this applies to all function plots on this page.

Assume f(x) is a PDF (probability density function) of some continuous random variable x, you can verify that \int_{0}^{2\pi}f(x)dx = 1. We also have a couple gigabytes (or teribytes) of data, as a matrix A\in\R^(m\times2), where A_(i,j)\in[0,2\pi]. Now our goal is to compute CDF (cumulative distribution function) Pr(A_(i,0) < x<=A_(i,1)), \forall i\in\Z^\star, and write the result to an array b\in\R^m.

By doing a little bit calculus (with WolframAlpha), we get:

\begin{align} F(x) = \int f(x)dx \approx -\frac{1}{9\pi}[(-0.75x^2+4.71x-5.9)cos(x)+(0.083x^2-0.52x+0.8)cos(3x)\\-4.5x+1.5x\times sin(x)-0.056x\times sin(3x)-2.46sin(x)+0.42sin(3x)] \end{align}

Since Pr(A_(i,0) < x<=A_(i,1)) = F(A_(i,1)) - F(A_(i,0)), we will have to compute F(x) twice on our kernel. Below is a sample implementation on device side:

// __device__ functions can be called from a kernel.
__device__ float Integral(float const val)
{
// __sinf and __cosf are intrinsics functions can be called only from
// device, they are highly optimized for device execution.
return (-0.3537f) * ((-0.75f * val * val + 4.71f * val - 5.9f) *
__cosf(val) + (0.083 * val * val - 0.52f * val + 0.8f) *
__cosf(3.0f * val) - 4.5f * val + 1.5f * val * __sinf(val) -
0.056f * val * __sinf(3.0f * val) - 2.46f * __sinf(val) +
0.42f * __sinf(3.0f * val));
}

__global__ void CDF(const float * const data_in,
float * const data_out,
size_t const size)
{
size_t const idx{threadIdx.x + blockIdx.x * blockDim.x};
size_t const idx_2{idx * 2};
if (idx_2 < size) {
float const x1{data_in[idx_2]};
float const x2{data_in[idx_2 + 1]};

float const res1{Integral(x1)};
float const res2{Integral(x2)};

data_out[idx] = res2 - res1;
}
}

Here is an image of a SM unit on NVIDIA's Pascal architecture. __sinf and __cosf are both executed by SFUs (special function units). Simply by counting the number of cores and SFUs, one can guess that those trigonometry function calls are slower than FPOs (floating point operations). We will see some testing results later.

The above device function CDF consists of 29 FPOs and 6 special function calls. When calculating integral, it's doubled to 58 FPOs and 12 special functions calls, which is pretty heavy for a kernel.

Is there a way to approximate this complicated function with a simple polynomial? If so, would the alternative polynomial implementation make it faster? The rest of the article will be dedicated to answer these questions.

Although this samples question is about converting a arbitrary function into a polynomial, the method we are about to discuss works for any kind of function conversion (with some limitations, we will discuss them later). For example, a complicated polynomial may be converted to a simple trig function, so that although there is an overhead for trig function calls, the kernel will be faster because the amount of FPOs avoided.

## II. Idea

By looking at the graph of F, we can guess there is a polynomial q \in P_n(x) that can approximate \F pretty well. The question now is how can we find this polynomial? The most straight forward way would be using gradient based methods.

P_n(x) denotes the set of all polynomials with degree less than or equal to n.

We can look at our function approximation problem as a linear regression problem.

Generate a vector v = (x_0,\cdots,x_m), x_i = \frac{2i\pi}{m}, where m\in\N. Then calculate F(x_j), \forall x_j \in v, then write them to an array y \in \R^m. After that, we create a matrix U\in\R^(m\times(n+1)), where U_(\ast,0) = 1 and U_(i, j) = (x_i)^j, \forall i, j > 0.

Now, we can use U as our "training data", y as our "training output", train a linear regression model L and purposely make it overfit (by not using any regularization). After the training, we can pull the coefficients in L, which should be the standard coefficients for the approximated polynomial with our given degree n.

I wrote an implementation example with Python for this kind of gradient based method. Although gradient based method is able to approximate a function fairly well,, it has some pitfalls.

First, you need to manually choose a "training sample size" m as how many training data you want to generate. The larger m is, the more accurate the approximation turns out to be. However, a large m will result in more training time, and the appropriate choice of m depends on the source function itself and its domain. Also, as the degree of target polynomial grows, our input matrix U can get larger and larger, our training algorithm would have a time and space complexity of O(m \times n). That's not good.

### 2.2 Orthogonal Projection

Let's find another solution, which doesn't require artificially generated data, also guarantees to find the most optimal function. We will see how to do this by utilizing orthogonal projection in linear algebra.

## III. Theory

Many undergraduate level linear algebra courses use a matrix heavy approach to introduce the topic, for a more abstract view, I found Sheldon Axler's Linear Algebra Done Right an excellent textbook material. In this chapter I will present some basic definitions from this book, and how to use them to solve our problem.

\G denotes "\R or \C", which means "this set can be replaced by either the set of all real numbers, or the set of all complex numbers".

### 3.1 Vector Space

First, let's define what is a vector space: a vector space is a set \V along with an addition on \V and scalar multiplication on \V such that the following properties hold:

1. commutativity: u + v = v + u \forall u, v \in \V.
2. associativity: (u + v) + w = v + (u + w) and (ab)v = a(bv) \forall u, v, w \in \V \forall a, b \in \G.
3. additive identity: \exists 0 \in \V \forall v \in \V: v + 0 = 0.
4. additive inverse: \exists w \in \V \forall v \in \V: v + w = 0.
5. multiplicative identity: 1v = v \forall v \in \V.
6. distributive properties: a(u+v)=au+av and (a+b)v=av+bv \forall a, b \in \G \forall u, v \in \V.

Notice that "addition on \V and scalar multiplication on \V" means a definition of addition and multiplication. Why do we need to define them? Addition and multiplication are well defined for real and complex numbers, but we can also have other things in our vector space: lists, functions, in fact any object from a field.

For our application, we define a vector space \W to be the set of all continuous real-valued functions. You can verify that \W does satisfy all the requirements for a vector space.

### 3.2 Basis

A basis of a vector space \V is a list of linearly independent vectors that spans \V. Every finite dimensional vector space has infinitely many choice of bases (except {0}), any infinite dimensional vector spaces does not have a basis. For example, P_n(x) is a finite dimensional vector space (you can verify), it has dimension dim \V = n + 1, one of its bases is (1, x, \cdots , x^n); the set of all continuous real-valued functions is an infinite dimensional vector space, and it does not have a basis, because there is no list can span the entire vector space (since lists have finite amount of elements).

Let's say we have a vector space \V, and we have basis v_1, \cdots, v_n. From the definition that v_1, \cdots, v_n spans \V, we can prove that for every vector v in \V, there exists a_1, \cdots, a_n \in \G such that v = a_1v_1 + \cdots + a_n v_n. In fact, a_1v_1 + \cdots + a_n v_n is a unique linear combination of v_1, \cdots, v_n for v, because v_1, \cdots, v_n are linearly independent. This is a very important property of basis, and we will utilize it heavily later.

Basis plays a crucial role in our application, it also defines some limitations. The fact infinite dimensional vector spaces don't have any basis, tells us our target function has to be defined within a finite dimensional vector space. We will explore more about this at later sections.

### 3.3 Inner Product Space

In the last section we defined what is a vector space, now let's go a step further to take a look at inner product space. An inner product space is a vector space with a defined inner product. So every inner product space is a vector space, but not all vector spaces are inner product space (lack of defined inner product).

So what is an inner product? One common misconception is that the dot product x*y=\sum_(i=1)^n x_iy_i is inner product. Although it is one of (infinitely) many inner products, not all inner products are dot product. Dot product is a special inner product called Euclidean inner product on vector spaces over \R; over \C, it is defined as x*y=\sum_(i=1)^n x_i \overline{y_i}. In fact, an inner product can be any function that satisfies some specific properties.

An inner product \langle\cdot,\cdot\rangle is a function that takes an ordered pair of vectors on a vector space \V, and maps it to \G, which also satisfies following properties:

1. positivity: \langle v, v\rangle >= 0 \forall v \in \V.
2. definiteness: \langle v, v\rangle = 0 \iff v = 0 \forall v \in \V.
3. additivity in first slot: \langle u + v, w\rangle = \langle u, w\rangle + \langle v, w\rangle \forall u, v, w \in \V.
4. homogeneity in first slot: \langle \lambda u, v\rangle = \lambda \langle u, v\rangle \forall u, v \in \V \forall \lambda \in \G.
5. conjugate symmetry: \langle u, v\rangle = \overline{\langle v, u\rangle} \forall u,v \in \V.

You can derive some other useful properties of inner product from above definition. (c) also gives you additivity in second slot; (d) and (f) gives you \langle u, \lambda v\rangle = \overline{\lambda} \langle u, v\rangle, etc. Now, as a practice you can verify Euclidean inner product is indeed a proper defined inner product.

On top of our vector space W, we turn it into an inner product space by defining \langle f, g \rangle = \int_{0}^{2\pi} f(x)g(x) dx. You can verify this definition satisfies all six requirements of inner product.

Inner product also defines a metric, every inner product space is also a metric space. For those of you more familiar with Hilbert space, inner product space is just a Hilbert space without the requirement of completeness. All inner product spaces we use in this article are also Hilbert spaces, but since completeness is not important to our application, inner product space properties should be good enough.

### 3.4 Norm

On top of inner product, we can define another property: norm. A simple way to think of norm is "length", norm represents some kind of "length" of the vector. One of the most common form of norm is l^2 norm, defined as the positive square root of the inner product of a vector with itself: \norm{v} = \sqrt{\langle v, v \rangle}. If the inner product is defined as the dot product, l^2 norm directly represents the length of vectors in \R^3 (and vectors in higher dimension which are hard to visually imagine) - just like how we calculate the length of a line in geometry.

From now on, all reference to norm in this article will be l^2 norm by default.

In the last section we mentioned that inner product defines a metric. Let's use inner product space \V = \R^3 with dot product as our example. If we have u, v \in \V, then the distance between them \norm{u - v} = \sqrt{\langle u-v, u-v \rangle}, expanding this we get \sqrt{(u_1-v_1)^2+(u_2-v_2)^2+(u_3-v_3)^2}, which is exactly the distance between two points on a three dimensional plane.

Defining dot product as inner product makes sense intuitively, because our metric matches what we expect in geometry. How do other definitions of inner product make sense? What would our metric be?

Let's look at our inner product space \W. The distance between two functions f, g \in \W is \norm{f - g} = \sqrt{\langle f-g, f-g \rangle}. Applying our inner product \int_{0}^{2\pi}f(x)g(x)dx, we have \norm{f - g} = \sqrt{\int_{0}^{2\pi} (f(x)-g(x))^2 dx}. From this equation, we can clearly see the metric defined by this inner product is root-square distance between f and g on domain [0, 2\pi], minimizing the value of norm{f - g} is mathematically equivalent of training a linear regression with least-square as cost function.

Back to our original problem, now we know how to formally define it: given function f \in \W, find a function g \in \U that is the closest approximation of f, with \U being a subspace of \W. Since we have a defined metric, all we have to do now is to find a way to minimize \norm{f - g}. The reset of this chapter will be dedicated to solve this problem.

### 3.5 Orthogonality

Let's say we have an inner product space \V = \R^2, with its inner product defined as dot product \langle u, v \rangle = u_1v_1+u_2v_2. What's the geometric meaning of this definition? Doing a bit of math, you can find out \langle u, v \rangle = \norm{u}\norm{v}cos(\theta), with \theta being the angle between u and v. So if \theta = \frac{\pi}{2} + k\pi \forall k \in \Z^\star, which means u is perpendicular to v, then \langle u, v \rangle = 0.

The word "perpendicular" is used to describe a property between two vectors; to describe a set of vectors that are pairwise perpendicular, we use the word "orthogonal".

In linear algebra, we say a set of vectors are orthogonal, if their pairwise inner products are 0 (\langle v_i,v_j \rangle = 0 \forall i \ne j).

Here are two examples of orthogonal vectors:

1. (2, 0, 0), (0, -1, 0), (0, 0, 3) is an orthogonal list of vectors in R^3, with inner product defined as dot product.
2. \frac{1}{\sqrt{2}}, \sqrt{\frac{3}{2}}x, -\frac{\sqrt{10}}{4} + \frac{3\sqrt{10}}{4}x^2 is an orthogonal list of functions in P_2(x), with inner product defined as \langle f, g \rangle = \int_{-1}^{1} f(x)g(x) dx. You can verify their pairwise inner products are 0.

### 3.6 Orthonormal Basis

In this section, we will talk about a special kind of basis, you can have a pretty good guess of what it is just by looking at the name.

An orthonormal basis is a basis consists an orthogonal list of vectors in an inner product space, with each vector has norm 1.

Here are two examples of orthonormal basis:

1. List of functions in example (b) from section 3.6 is an orthonormal basis of its inner product space. You can verify each of them has norm 1.
2. The standard basis (1, 0, 0), (0, 1, 0), (0, 0, 1) for R^3 is also an orthonormal basis, if inner product is defined as dot product.
3. 1/\sqrt{2\pi}, -\sqrt{\frac{3}{2\pi}} + \sqrt{\frac{3}{2\pi^3}}x is an orthonormal basis of P_1(x), with inner product defined as \langle f, g \rangle = \int_{0}^{2\pi} f(x)g(x) dx.

Why is orthonormal basis so important? Simply speaking, it makes many tasks a lot easier, one of them is projection. We will talk about projection soon, but before we do that, let's take a look at how to find an orthonormal basis for an inner product space.

### 3.7 Gram-Schmidt Procedure

Gram-Schmidt Procedure is an algorithm for converting a linearly independent list of vectors to an orthonormal list of vectors, and it makes sure the spanning space does not change. It was popularized by Danish mathematician Jørgen Gram (1850-1916) and German mathematician Erhard Schmidt (1876-1959).

Applying this algorithm on a basis of an inner product space will produce an orthonormal basis, because vectors in a basis are linearly independent, and the algorithm guarantees the spanning space does not change.

Gram-Schmidt Procedure:

Suppose v_1, \cdots , v_m is a linearly independent list of vectors in V. Let e_1 = \frac{v_1}{\norm{v_1}}. For j = 2, \cdots, m, define e_j inductively by

\begin{align} e_j = \frac{v_j - \langle v_j, e_1 \rangle e_1 - \cdots - \langle v_j, e_{j-1} \rangle e_{j-1}}{ \| v_j - \langle v_j, e_1 \rangle e_1 - \cdots - \langle v_j, e_{j-1} \rangle e_{j-1} \|} \end{align}

Then e_1, \cdots, e_m is an orthonormal list of vectors in V such that

\begin{align} span(v_1, \cdots, v_j) = span(e_1, \cdots, e_j), j = 1, \cdots, m \end{align}

Proof of Gram-Schmidt Procedure is lengthy, we won't go through it here. Intuitively, you can think of it as this: divide the first vector v_1 by its length, to get the first vector with unit norm e_1. Then on top of e_1, "decompose" the second vector v_2 to get another vector e_2 that's orthogonal to e_1, and divide e_2 by its length to make its norm 1, and so on. We use different vectors from the original list to make sure the spanning space does not change.

Applying this algorithm by hand is painful, especially when inner product is defined as integrals. So I wrote an Orthogonal Projection Calculator with Python. To get a list of its functionalities, simply download opc.py and run (make sure you have Python 3 and SymPy installed):

$python opc.py --help Make sure the Python version you're using is Python3. If you're sure you have Python3 and SymPy installed and the above command does not work for you, simply replace "python" with "python3" and try again. Now, let's find the orthonormal basis of P_6(x) with inner product defined as \langle f, g \rangle = \int_{-2\pi}^{3.5} f(x)g(x) dx using the calculator: $ python opc.py --orth-basis -2pi 3.5 6

The output will be something looks like this:

Orthonormal basis for inner product space of polynomials with degree 6, with inner product defined as
<f, g> = INTEGRATE f(x) * g(x) dx FROM -6.283185307179586 TO 3.5:

e1 = 0.3197126793588508
e2 = 0.15753691290748492 + 0.11320619759029203x
e3 = -0.2706616166519101 + 0.12473193974070092x + 0.04481625403056662x^2
e4 = -0.31227301452970124 - 0.1544232178048233x + 0.07542779614582375x^2 + 0.018067498883682663x^3
e5 = 0.09606656104589661 - 0.33935867491295924x - 0.06515869985164832x^2 + 0.040797180933917876x^3 + 0.007329224688826187x^4
e6 = 0.36754927122252556 + 0.0018270000230479434x - 0.25058703094572193x^2 - 0.021530454717053192x^3 + 0.020746148224762773x^4 + 0.0029816409523642845x^5
e7 = 0.11218360504319752 + 0.4906731193553822x - 0.07607781794486312x^2 - 0.15516757696621702x^3 - 0.0043500063723821295x^4 + 0.010143438648957032x^5 + 0.0012148476810352493x^6

Program finished execution.

Because of limited floating point precision on computers, this is only an approximation of the real orthonormal basis. However, this approximation is good enough for our purpose. If we verify their norm and orthogonality, the errors we get are usually very small, small enough that we can just ignore them.

Below is a plot of these polynomials on [-2\pi, 3.5], which is the domain of our inner product:

Here is a plot of the same polynomials on a expanded domain [-4\pi, 7]:

You can see that these orthonormal functions are unorganized up until the domain of our integral. That's because our requirement of orthonormal on that domain. We do not have control of these functions outside of this domain, nor do we have interest to do so. We only need to control what's inside of this domain.

Going back to our inner product space \W, we can find an orthonormal basis of P_6(x) with \langle f, g \rangle = \int_{0}^{2\pi} f(x)g(x) dx is:

e1 = 0.3989422804014327
e2 = -0.690988298942671 + 0.21994840679077274x
e3 = 0.8920620580763848 - 0.851856516525517x + 0.13557717541007888x^2
e4 = -1.055502061411253 + 2.015860446207437x - 0.8020853864933684x^2 + 0.08510390269479524x^3
e5 = 1.1968268412036078 - 3.809618156051251x + 2.728438023726422x^2 - 0.675491286393618x^3 + 0.0537538886225198x^4
e6 = -1.3231418572096527 + 6.317537009644209x - 7.038270702793955x^2 + 2.9871348617952975x^3 - 0.5348444388057381x^4 + 0.03404925448854949x^5
e7 = 1.438406849742725 - 9.615041523749234x + 15.302813853957973x^2 - 9.742073866560409x^3 + 2.9071860207987497x^4 - 0.40716986245449227x^5 + 0.021601032088794797x^6

Notice how this graph is very similar to the other one, but they are plots of orthonormal polynomials on different domains.

In the next section, we will take a look at how to use this orthonormal basis to solve our problem.

### 3.8 Orthogonal Projection

Like we concluded in section 3.5, the core piece of solving our puzzle is to find \min_{g \in \U} \norm{f-g} for given f \in \W. How can we minimize this distance?

Before we look at minimizing distance between functions, let's take a look at minimizing distance from a geometry aspect.

Let's say we have a point v = (x_1, y_1) on a two dimensional plane, and a line L that's represented by f(x) = ax, a \in \R. What's the point on L that's the closest to v? The answer is simple: draw a perpendicular line Q to L from v, the intersection of Q and L is the closest approximation of v on L. Let's call this intersection point u.

In this case, u is the orthogonal projection of v on L.

The definition of orthogonal projection is more generalized, you can use it on any inner product space:

Suppose \U is a finite-dimensional subspace of \V. The orthogonal projection of \V onto \U is the operator \P_\U \in \mathcal{L}(\V) defined as follows: For v \in \V, write v = u + w, where u \in \U and w \in \U^{\bot}. Then P_{\U}v = u.

\mathcal{L}(\V) denotes a linear operator on \V, a linear operator is a linear map which its codomain is its domain. \U^{\bot} denotes the orthogonal complement of \U, \U^{\bot} = \{v \in \V: \langle v, u \rangle = 0 \forall u \in \U \}. Notice the \P here denotes a linear map instead of a polynomial, it stands for "projection".

Intuitively, this definition means that every vector in v \in \V can be written as a unique combination of a vector in u \in \U and w \in \U^{\bot}, and u is the orthogonal projection of v on \U. You can also prove that P_U is a linear map, and there exists one and only one w \in \U^{\bot} that satisfies this definition. This proof requires knowledge of direct sum and linear map, so we won't go through them here.

If we have e_1, \cdots, e_m as an orthonormal basis of \U, finding \P_\U v is easy:

\begin{align} P_U v = \langle v, e_1 \rangle e_1 + \cdots + \langle v, e_m \rangle e_m \end{align}

To prove this equation is indeed an orthogonal projection, all we have to do is to make sure \langle w, u \rangle = 0 with w = v - u. To prove that, we need to prove \langle w, e_j \rangle = 0 \forall j = 1, \cdots, m, because u is a linear combination of e_1, \cdots, e_m

\begin{align} \langle w, e_j \rangle & = \langle v - u, e_j \rangle \tag{1} \\ & = \langle v - P_U v, e_j \rangle \tag{2} \\ & = \langle v - \langle v, e_1 \rangle e_1 - \cdots - \langle v, e_m \rangle e_m, e_j \rangle \tag{3} \\ & = \langle v, e_j \rangle - \langle \langle v, e_1 \rangle e_1 + \cdots + \langle v, e_m \rangle e_m , e_j \rangle \tag{4} \\ & = \langle v, e_j \rangle - \langle \langle v, e_j \rangle e_j , e_j \rangle \tag{5} \\ & = \langle v, e_j \rangle - \langle v, e_j \rangle \langle e_j , e_j \rangle \tag{6} \\ & = \langle v, e_j \rangle - \langle v, e_j \rangle \tag{7} \\ & = 0 \tag{8} \end{align}

We get from equation (3) to (4) using additivity in first slot of inner product; (4) to (5) using the fact that e_1, \cdots, e_m is an orthonormal basis, so their pairwise inner product is 0; (5) to (6) using homogeneity in first slot of inner product. You can find all of these definitions and properties from previous sections.

From a geometry point of view, you can think of each \langle v, e_j \rangle represents the "length" of v's projection on a "axis", and e_j represents the direction of this "axis" (because e_j has unit norm).

### 3.9 Applying Theory to Our Problem

We now have an inner product space \W with \langle f, g \rangle = \int_{0}^{2\pi}f(x)g(x)dx, a function F(x) \in \W we want to project from, a subspace \U = P_6(x) we want to project onto, and an orthonormal basis e_1, \cdots, e_m for U.

That's all we need for an orthogonal projection, our projected function will be the closest estimation of F(x) in a form of a degree 6 polynomial.

With some help from Orthogonal Projection Calculator, we can easily do this by a command line script:

python opc.py --approximate "-((-0.75*x**2 ... /(9*pi)" 0 2*pi 6 Below is the output: ------ Started Calculating Approximation ------ f(x) = -((-0.75*x^2 + 4.71239*x - 5.9022)*cos(x) + (0.0833333*x^2 - 0.523599*x + 0.803949)*cos(3*x) - 4.5*x + 1.5*x*sin(x) - 0.0555556*x*sin(3*x) - 2.46239*sin(x) + 0.424533*sin(3*x))/(9*pi) Orthonormal basis: e1 = 0.3989422804014327 e2 = -0.690988298942671 + 0.21994840679077274x e3 = 0.8920620580763848 - 0.851856516525517x + 0.13557717541007888x^2 e4 = -1.055502061411253 + 2.015860446207437x - 0.8020853864933684x^2 + 0.08510390269479524x^3 e5 = 1.1968268412036078 - 3.809618156051251x + 2.728438023726422x^2 - 0.675491286393618x^3 + 0.0537538886225198x^4 e6 = -1.3231418572096527 + 6.317537009644209x - 7.038270702793955x^2 + 2.9871348617952975x^3 - 0.5348444388057381x^4 + 0.03404925448854949x^5 e7 = 1.438406849742725 - 9.615041523749234x + 15.302813853957973x^2 - 9.742073866560409x^3 + 2.9071860207987497x^4 - 0.40716986245449227x^5 + 0.021601032088794797x^6 Projecting onto span(e1) --- 13.18s f1(x) = 0.6047933779077055 Projecting onto span(e1,...,e2) --- 15.28s f2(x) = 0.02602144120700545 + 0.18422882929757195x Projecting onto span(e1,...,e3) --- 12.70s f3(x) = 0.1376107349893199 + 0.07766890308803763x + 0.016959538991755004x^2 Projecting onto span(e1,...,e4) --- 14.53s f4(x) = 0.2246919837824912 - 0.08864403126454293x + 0.08313335299758272x^2 - 0.007021259734847447x^3 Projecting onto span(e1,...,e5) --- 22.05s f5(x) = 0.20019404249547249 - 0.01066466223648254x + 0.027284743817578543x^2 + 0.006805423711959009x^3 - 0.00110029250856299x^4 Projecting onto span(e1,...,e6) --- 16.53s f6(x) = 0.18479146267191837 + 0.0628772392226019x - 0.05464715617399502x^2 + 0.04157840206029123x^3 - 0.007326370322518855x^4 + 0.0003963644240674133x^5 Projecting onto span(e1,...,e7) --- 17.36s f7(x) = 0.15362699065053426 + 0.27119638940248647x - 0.3861973808447284x^2 + 0.2526498304944526x^3 - 0.07031336000652001x^4 + 0.009218092288255106x^5 - 0.00046800719857853306x^6 Duration: 111.64s ------ Finished Calculating Approximation ------ Program finished execution. From the output, we can see another advantages of using orthogonal projection instead of gradient based methods. For gradient based methods, if you set an upper limit for the degree of polynomial, you can only get one approximated function with that certain highest degree. However, since Gram-Schmidt Procedure is an inductive algorithm, we get the orthonormal bases for all the subspaces of U; when applying orthogonal projection, we get the most optimized function on every single subspace for free! If we want to do the same thing for gradient based method, we have to train the whole model again just to get another function, even if our new degree is less than the original degree. Notice although Gram-Schmidt Procedure is inductive, but the coefficients for approximated polynomials are not inductive. That's why for gradient based method, we can't get coefficients of the degree n polynomial from the degree n + 1 polynomial. Let's now plot these functions and see how well they matches with the original function. We can see that after f_3(x), most functions seems to be doing pretty well for approximating F(x). Although we originally picked a polynomial with degree 6, after plotting all the approximations out, we find a polynomial with degree 4 (highlighted in green) may just do the job. For reasons we will discuss later, we'd rather choose a lower degree polynomial than a higher one. Again, this is where orthogonal projection comes in handier than gradient based methods: functions in lower dimensional subspaces come for free. Here is a wider domain for reference, you can see these functions are not good estimates of F(x) outside of our original domain [0, 2\pi]: Now that we have found our desired function f_5(x), we will talk about how to implement it in an efficient way on our CUDA kernel in the next chapter. ## IV. Implementation This chapter may not be applicable for all kinds of function subspaces, but since we're using polynomials in this article, let's take a look at how to implement it efficiently on a CUDA kernel. In this chapter we will talk about some standard ways to represent a polynomial, and their different implementations. In the next chapter, we will see performance testing result of different implementations. In application, there may be more optimized implementation you can factor your function into, but that depends on the actual function and requires some creativity. ### 4.1 Polynomial by Coefficients The most straight forward representation would be using standard coefficients, this representation has three possible implementations on a CUDA kernel. We will use our polynomial f_5(x) as an example. \begin{align} f_5(x) \approx 0.2 - 0.011x + 0.027x^2 + 0.0068x^3 - 0.0011x^4 \end{align} The first implementation is to expand the equation: float const res{0.2f - 0.011f * x + 0.027f * x * x + 0.0068 * x * x * x - 0.0011 * x * x * x * x}; As you can see this implementation is inefficient. If we use n to represent the degree of polynomial, although this implementation has space complexity of O(1), its time complexity is O(n^2). The precise number of FPOs is \frac{n^2+3n}{2}. This is really bad. To solve the bad run time, let's inductively calculate the power of x's: float const x2{x * x}; float const x3{x2 * x}; float const x4{x3 * x}; float const res{0.2f - 0.011f * x + 0.027f * x2 + 0.0068 * x3 - 0.0011 * x4}; There are 3n - 1 number of FPOs in this implementation so the time complexity becomes O(n). However, notice that we had to store the power of x's in local variables, this is not good for a kernel. On the host, because of highly optimized register access and caching behavior, extra local variables in an ordinary function usually won't yield performance decrease. For the device, this is a big deal, because there are very limited amount of registers on a SM assigned to each block. This implementation has space complexity of O(n), and as we will see later, its performance is worse than alternative implementations. So far, seems like there is a trade off between time and space complexity. Is there a way we can have the best of both worlds? Let's take a look now. For each polynomial expressed in the form \begin{align} f(x) = a_0+a_1x+a_2x^2+\cdots+a_nx^n \end{align} we can convert it to another form \begin{align} f(x) = a_1'x^{k_1}(1 + \frac{a_2'}{a_1'}x^{k_2}(\cdots(1 + \frac{a_m'}{a_m' - 1}x^{k_m})\cdots)) \end{align} where a_j' is the jth nonzero coefficient, and k_j is the power of the x corresponding to the original coefficient a_j'. If the integration domain of your inner product definition is large, your approximated polynomials will likely have some coefficients being zeros. Since coefficients are constants, we can calculate our new coefficients beforehand and put them in the equation. Using Orthogonal Projection Calculator, we can calculate the nested form of a polynomial printed in code format:  python opc.py --print "0.2 - 0.011x + 0.027x^2 + 0.0068x^3 - 0.0011x^4" -nested -code
> 0.2 - 0.011 * x * (1 - 2.4545 * x * (1 + 0.2519 * x * (1 - 0.1618 * x)))

Our implementation then becomes:

float const res{0.2f - 0.011f * x * (1.0f - 2.4545f * x * (1.0f + 0.2519f * x * (1.0f - 0.1618f * x)))};

This implementation has the same number of FPOs as the last one - 3n - 1, so the time compelxity is O(n), and the space complexity is O(1). So far, this is our best solution.

### 4.2 Polynomial by Roots

Another way to represent a polynomial is using roots.

According to the fundamental theorem of algebra, every univariate polynomial with degree n can be split into n linear factors on the complex plane. That is:

\begin{align} \forall p \in P(x) \exists c, r_1, \cdots, r_n \in C: p(x) = c(x - r_1)(x - r_2)\cdots(x - r_n) \end{align}

where r_1, \cdots, r_n are the complex roots of p..

For example, our polynomial f_5(x) can be factorized into:

\begin{align} f_5(x) \approx -0.0011(x - 9.0517)(x + 3.8958)(x - (0.5146 + 2.2124i))(x - (0.5146 - 2.2124i)) \end{align}

This is the irreducible factorization of f_5(x) over C, because you cannot possibly factorize it any further with complex numbers.

It looks like the implementation of this form has 2n arithmetic operations and a space complexity of O(1). Unfortunately, since this factorization is only guaranteed to work on complex plane, so in worst case each addition and subtraction costs 2 FPOs, and each multiplication costs 6 FPOs.

However, that won't be a problem for us. Looking at the equation, there are only two complex numbers that are the complex conjugate of each other. In fact, this is guaranteed to happen. According to complex conjugate root theorem, complex roots of univariate polynomials with real coefficients always appear in pairs (each other's complex conjugate). We also know that x can only be real numbers. With these limitations, we can rewrite this factorization into:

\begin{align} f_5(x) & \approx -0.0011(x - 9.0517)(x + 3.8958)((x - 0.5146)^2 + 2.2124^2) \\ & \approx -0.0011(x - 9.0517)(x + 3.8958)(x(x - 0.5146) + 5.1595) \end{align}

This is the irreducible factorization of f_5(x) over R, you cannot factorize it any further with real numbers.

Below is the implementation of this factorization:

float const res{-0.011f * (x - 9.0517f) * (x + 3.8958f) * (x * (x - 0.5146f) + 5.1595f)};

As you can see, this implementation only costs 8 FPOs, comparing to 11 FPOs for the nested coefficients form. Although 3 FPOs doesn't seem like a lot for host, but it does make a difference on a CUDA kernel, as we will see in the next chapter.

Looking back, we have successfully reduced 29 FPOs and 6 special function calls to only 8 FPOs!

In the next chapter we will put our work in test, see how much this can speed up our CUDA kernel.

In application, based on the actual function, there are different ways to implement a function on a CUDA kernel. You should adjust your implementation according to the situation instead of sticking with a "standard implementation".

### 4.3 Numerical Stability

Like we stated before, we prefer lower degree polynomials rather than higher degree ones. One of the reason is the higher the degree, the more FPOs it costs. The other reason is that although higher degree polynomials are more accurate predictions in theory, because of the limited precision of floating point numbers on the computer, the benefit is averaged out by the fact higher degree polynomials are numerically less stable. Blindly increase the complexity of your function is not a good idea, if you do, make sure the function does not add extra computational cost and it's relatively stable numerically.

Below is an example of incorrect result in Python caused by precision loss:

# Python 3.6.1 on macOS with GCC 4.2.1

4.8/3 is 1.6
> False

4.8/3
> 1.5999999999999999

Numerical stability is an important subject of study in numerical analysis. Another example of this is the Jordan form of a matrix in linear algebra.

In linear algebra, it's always easier to deal with matrices with many zeros than the ones full of nonzero numbers. So for every matrix, there exists a form with different basis, that is an upper triangular block diagonal matrix. Not only matrices in this form has many zeros, but also it has many interesting properties related to the eigenspaces (both geometric and algebraic). Unfortunately, comparing to other form of matrices, implementations of Jordan form tend to be numerically unstable, because it's very hard to distinguish if two eigenvalues are the same. Because of this reason, Jordan form is usually avoided in practice.

## V. Performance Testing Result

Now is the time to put our theory to test, see if it actually helps for speeding up our kernels. The machine I used for testing is a custom built PC, with the following configurations:

• GPU: NVIDIA Titan X (Pascal) 12 GB
• CPU: Intel Core i7-6850K, 6-core 3.8 GHz
• RAM: 64 GB 2400 Mhz DDR4
• OS: Ubuntu 16.04 LTS
• GPU driver version: 375.39
• CUDA version: 8.0.44

I will only use nvprof for performance measuring, since it's the most accurate way to do such a thing. To get the raw nvprof output, checkout this text file on the project's GitHub repository.

### 5.1 SFU vs. FPO

First, let's see if special function calls are really slower than FPOs. To test pure arithmetic operation, I avoided memory access (except local registers) in the testing kernels. Each kernel contains some amount of FPO, or special function calls. We test kernels with different number of arithmetic operations to see how does the number of FPOs and special function calls on each thread affects the performance. Below is a sample implementation.

__global__ void FPO_3() {
// Contains 3 floating point operations
float const res{1.0f * 1.5f + 2.0f * 8.0f};
}

__global__ void SFU_3() {
// Contains 3 special function calls
float const res{__sinf(__cosf(__sinf(2.0f)))};
}

Plotting out the the relationship between the number of arithmetic operations and execution duration of each kernel from our testing result, we get:

From the plot above, we can see special function calls are indeed slower than FPOs. Because of cores on a SM are very powerful, the duration to calculate arbitrary number of FPOs is almost linear; however, because of the limited amount of SFUs on an SM, duration for special function calls grows in close to linear factor.

In our original function, we had 29 FPOs and 6 special function calls; in our approximated function, we only have 8 FPOs. In theory, just the special function calls alone in the original function, takes almost twice as much time as the second function.

### 5.2 Original vs. Approximated Function

Now let's take a look at our how our actual kernels perform.

To simulate computationally more intensive situations, I implemented 4 kernels for each variation of implementation (including the original function), with each kernel containing different amount of repeated device function call. The implementation of theses kernels is in this folder on the project repository.

Below is the plot of testing result.

Obviously, our approximated functions are a lot faster than the original function. We can also see that the more computationally intensive the original kernel is, the more advantage alternative implementations have. Let's take a closer look at how different implementations of our polynomial perform:

As we expected, expressing polynomial by roots is the most efficient implementation, this advantage also increases along with the amount of computation each kernel performs.

For the single integral (2 device function calls) case, the most optimized solution (polynomial by roots) is 24.4% faster than the original implementation; for the case where we're calculating 4 integrals (8 device function calls), it's almost 242.4% faster! This number will only increase if more calculation is performed on each kernel.

## VI. Challenge

As we saw in the last section, our method turns out to work well. However, like we stated at the beginning of this article, this is not a generic algorithm that works in any situation. In this chapter we will discuss the challenges of applying this algorithm.

### 6.1 Limited Set of Functions

Because we're using a basis of an inner product space, the only possible combination of functions in the basis is linear combination. This limits how we want to express our function. For example, if we have (1, x, sin(x)) as our original basis, it's only possible to get functions in the form of a_0 + a_1x + a_2sin(x) where a_0, \cdots, a_2 \in G. If we want more flexible combinations like a_0 + a_1x + a_2x^2 + a_3sin^2(x) + a_4sin(x^2), we will have to manually expand our basis with (x^2, sin^2(x), sin(x^2)).

### 6.2 Arithmetic Operation Intensity

As we already saw from the testing result, the more computationally intensive the kernel is, the more performance benefit our kernels gain. Sometimes the original kernel is already very simple, so it's not worth the time to go through all of this to gain only marginal performance improvement; on the other hand, if the original kernel is very complicated, it's hard to find a simple good approximation, it's possible that the approximations cost more than the original function.

Coming up with an appropriate basis for the target inner product space to project on is not always easy. For this case we discussed in the article, this task is simple and straight forward, but depending on the application, creativity and very mature math skills may be required.

Another factor to consider here is memory access time. If the kernel's memory access time takes way more than arithmetic operation, no matter how optimized the kernel is in terms of arithmetic operation, it's only able to gain marginal performance benefits.

### 6.3 Hardware

Hardware also plays an important factor in how well our approximation performs. Different hardwares have different strength and weaknesses, some newer GPUs are very optimized for special function calls, older ones are less efficient in that regard. In fact, the GPU we're using here is a relatively new one at the time of this writing, it's NVIDIA's best consumer facing deep learning card in 2016, so special function calls already perform better comparing to some older GPUs.

I also tested this algorithm on a NVIDIA GT 750M on my MacBook Pro, the performance gain is greater than this Titan X Pascal we used. However, for some reason it's hard to product consistent testing results, so I did not use that as a metric in the article.

My best guess for the cause of this inconsistency would be performance throttling because of the increased temperature, since MacBook Pro uses very limited air cooling system, but it's hard to find proofs for my claim. I use water cooling for my Titan X, its maximum temperature is usually below 45 degree Celsius even under long period of heavy use, so throttling was not a problem during performance testing.

## VII. Conclusion

After working through our problem and doing some performance testing, we can see that orthogonal projection is a reliable way to approximate complicated mathematic functions with a simpler one, which can dramatically speed up parallel computing on certain platforms.

Application of this method is beyond what we discussed in this article, this is only a proof of concept. With creative thinking and mature math skills, one can easily implement this idea in a more complicated situation to gain great performance benefits.