# Optimizing the backward solve for a sparse lower triangular linear system

This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with `gcc 8.3.0`

and compiler flags `-O3 -march=native`

(which is `-march=skylake`

on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.

For timing I used google/benchmark with this source file. It defines `benchBacksolveBaseline`

which benchmarks the implementation given in the question and `benchBacksolveOptimized`

which benchmarks the proposed "optimized" implementations. There is also `benchFillRhs`

which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that `benchFillRhs`

takes should be subtracted.

## 1. Iterating strictly backwards

The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:

```
for (int i=n-1; i>=0; --i) {
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
x[i] -= Lx[j] * x[Li[j]];
}
}
```

This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measureable improvement:

```
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2734 ns 5120000
benchBacksolveBaseline 17412 ns 17421 ns 829630
benchBacksolveOptimized 16046 ns 16040 ns 853333
```

I assume this is caused by somehow more predictable cache access, but I did not look into it much further.

## 2. Less loads/stores in inner loop

As A is lower triangular, we have `i < Li[j]`

. Therefore we know that `x[Li[j]]`

will not change due to the changes to `x[i]`

in the inner loop. We can put this knowledge into our implementation by using a temporary variable:

```
for (int i=n-1; i>=0; --i) {
double xi_temp = x[i];
for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
xi_temp -= Lx[j] * x[Li[j]];
}
x[i] = xi_temp;
}
```

This makes `gcc 8.3.0`

move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:

```
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2737 ns 2740 ns 5120000
benchBacksolveBaseline 17410 ns 17418 ns 814545
benchBacksolveOptimized 15155 ns 15147 ns 887129
```

## 3. Unroll the loop

While `clang`

already starts unrolling the loop after the first suggested code change, `gcc 8.3.0`

still has not. So let's give that a try by additionally passing `-funroll-loops`

.

```
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2733 ns 2734 ns 5120000
benchBacksolveBaseline 15079 ns 15081 ns 953191
benchBacksolveOptimized 14392 ns 14385 ns 963441
```

Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like `gcc`

might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag `-funroll-loops`

again and implement the unrolling with something like this:

```
for (int i=n-1; i>=0; --i) {
const int col_begin = Lp[i];
const int col_end = Lp[i+1];
const bool is_col_nnz_odd = (col_end - col_begin) & 1;
double xi_temp = x[i];
int j = col_end - 1;
if (is_col_nnz_odd) {
xi_temp -= Lx[j] * x[Li[j]];
--j;
}
for (; j >= col_begin; j -= 2) {
xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
Lx[j - 1] * x[Li[j - 1]];
}
x[i] = xi_temp;
}
```

With that I measure:

```
------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------
benchFillRhs 2728 ns 2729 ns 5090909
benchBacksolveBaseline 17451 ns 17449 ns 822018
benchBacksolveOptimized 13440 ns 13443 ns 1018182
```

## Other algorithms

All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can have significant problems with memory traffic. At least for matrix factorizations, there are more sophisticated methods, that operate on dense submatrices that are assembled from the sparse structure. Examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think that such methods will also apply this idea to layout and use dense matrix operations for lower triangular backwards solves (for example for Cholesky-type factorizations). So it might be worth to look into those kind of methods, if you are not forced to stick to the simple method that works on the sparse structure directly. See for example this survey by Davis.