Most efficient way to build block / banded SparseArray

This is definitely faster. It took me some time to figure out the combinatorics and there might still be some potential for improvement within the compiled functions.

getColumnIndices = Compile[{{m, _Integer}, {n, _Integer}, {i, _Integer}},
   Transpose[Partition[
     Partition[
      Join[
       Table[k, {k, 1, i n}],
       Flatten[Table[Table[k, {j, 1, n}], {k, i n + 1, i n + n}]],
       Table[k, {k, i n + n + 1, m n}]
       ], {1}], {n}]],
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

getValues = Compile[{{k1row, _Real, 1}, {k2, _Real, 2}, {i, _Integer}},
   Block[{A},
    A = Join[
      Table[Compile`GetElement[k1row, k], {l, 1, Length[k2]}, {k, 1, i}],
      k2,
      Table[Compile`GetElement[k1row, k], {l, 1, Length[k2]}, {k, i + 2, Length[k1row]}],
      2
      ];
    Do[A[[k, k + i]] += Compile`GetElement[k1row, i + 1], {k, 1, Length[k2]}];
    A
    ],
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

makeKMat3[k1_, k2_] := With[{m = Length[k1], n = Length[k2]},
  SparseArray @@ {Automatic, {m n, m n}, 0, {1, {
      Range[0, (m n ) (m + n - 1), m + n - 1],
      Flatten[getColumnIndices[m, n, Range[0, m - 1]], 2]
      },
     Flatten[getValues[k1, k2, Range[0, m - 1]]]
     }}
  ]

Just to give you an idea of the timings:

m = 100;
n = 6;
k1 = RandomReal[{-1, 1}, {m, m}];
k2 = RandomReal[{-1, 1}, {n, n}];
A = makeKMat2[k1, k2]; // RepeatedTiming
B = makeKMat3[k1, k2]; // RepeatedTiming
Max[Abs[A - B]]

{0.11, Null}

{0.000481, Null}

0.

Addendum

The idea for the "higher dimensional" case is similar to the idea above. I exploit that the "ColumnIndices" of the resulting matrices (when partitioned into column indices per row) stay rectangular so that column indices from the diagonal matrices can be joined to it from the left and right. Some extra index list (diagidx) is needed for book keeping of those column indices of the blocks that belong to the diagonal entries. All in all, only operations on the nonzero values and the column indices are performed. No SparseArrays are built intermediately: Even building a SparseArray from row pointers, column indices and nonzero values has still some considerable overhead, probably because there is a consistency checker involved in the backend. That's a pitty since I know that I produce consistent data. This issue is also closely related to this post by Szabolcs.

getValues2 = Compile[{{k1row, _Real, 1}, {blockvals, _Real, 2}, {diagidx, _Integer, 1}, {i, _Integer}},
   Block[{A},
    A = Join[
      Table[Compile`GetElement[k1row, k], {l, 1, Dimensions[blockvals][[1]]}, {k, 1, i}],
      blockvals,
      Table[Compile`GetElement[k1row, k], {l, 1, Dimensions[blockvals][[1]]}, {k, i + 2, Length[k1row]}],
      2];
    Do[A[[k, Compile`GetElement[diagidx, k] + i]] += Compile`GetElement[k1row, i + 1], {k, 1, Length[blockvals]}];
    A
    ],
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

getColumnIndices2 = Compile[{
    {blockci, _Integer, 3}, {diagci, _Integer, 
     3}, {m, _Integer}, {n, _Integer}, {i, _Integer}
    },
   If[i > 0,
    If[i < m - 1,
     Join[diagci[[All, 1 ;; i]], blockci + i n, diagci[[All, i + 1 ;; m - 1]] + (n), 2],
     Join[diagci[[All, 1 ;; i]], blockci + i n, 2]
     ],
    Join[blockci + i n, diagci[[All, i + 1 ;; m - 1]] + (n), 2]
    ],
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   CompilationTarget -> "C",
   RuntimeOptions -> "Speed"
   ];

toSparseArrayData[b_?MatrixQ] := {
  Partition[SparseArray[b]["ColumnIndices"], Dimensions[b][[2]]],
  b,
  Dimensions[b][[2]],
  Range[Dimensions[b][[2]]]
  }

toSparseArray[X_] := 
 With[{d1 = Dimensions[X[[1]]][[1]], d2 = Dimensions[X[[1]]][[2]]},
  SparseArray @@ {Automatic, {d1, d1}, 0,
    {1, {Range[0, d1 d2, d2], Flatten[X[[1]], 1]}, Flatten[X[[2]]]}}
  ]

iteration[X_, a_] := With[{
   m = Length[a],
   blockci = X[[1]],
   blockvals = X[[2]],
   n = X[[3]],
   diagidx = X[[4]]
   },
  With[{ran = Range[0, m - 1]},
   {
    Join @@ getColumnIndices2[blockci, Transpose[Partition[Partition[Range[(m - 1) n], 1], n]], m, n, ran],
    Join @@ getValues2[a, blockvals, diagidx, ran],
    m n,
    Join @@ Outer[Plus, ran, diagidx]
    }
   ]
  ]

makeKMat3ND[ks : {__List}] := 
 toSparseArray[Fold[iteration, toSparseArrayData[ks[[1]]], Rest[ks]]]

Usage example and timing test:

SeedRandom[123];
ks = Table[RandomReal[{-1, 1}, {RandomInteger[{3, 8}]}[[{1, 1}]]], 6];
A = makeKMat2ND[Reverse@ks]; // AbsoluteTiming
B = makeKMat3ND[ks]; // AbsoluteTiming
Max[Abs[A - B]]

{24.0011, Null}

{0.038756, Null}

8.88178*10^-16

This is the sparsity pattern of the resulting matrix:

enter image description here


Not faster, but maybe easier to implement

makeKMat2[k1_, k2_] := Module[{A, m, n},
  m = Length[k1];
  n = Length[k2];
  A = Map[SparseArray[Band[{1, 1}] -> #, {n, n}, 0.] &, k1, {2}];
  With[{B = SparseArray@k2}, Do[A[[i, i]] += B, {i, 1, m}]];
  ArrayFlatten[A]
  ]

Your implementation needs about 0.0048s; mine needs 0.0051s on my machine.

(b3m2a1 comment)
Note that this has a simple extension to the N-dimensional case:

makeKMat2ND[ks : {__List}] := Fold[makeKMat2, ks]

This is slightly faster and certainly more compact than the uncompiled implementation in Henrik's answer:

makeKMat[k1_?SquareMatrixQ, k2_?SquareMatrixQ] := Module[{m = Length[k1], n = Length[k2]},
    KroneckerProduct[k1, IdentityMatrix[n, SparseArray]] + 
    KroneckerProduct[IdentityMatrix[m, SparseArray], k2]]

and as b3m2a1 notes, the extension can be done as

makeKMatND[ks : {__?SquareMatrixQ}] := Fold[makeKMat[#2, #1] &, ks]

Using the examples in the OP:

MatrixPlot[makeKMat[ConstantArray[1, {9, 9}], ConstantArray[2, {6, 6}]]]

matrix plot of 2D case

AbsoluteTiming[tst =
               makeKMatND[{ConstantArray[4, {5, 5}], ConstantArray[3, {4, 4}],
                           ConstantArray[2, {3, 3}], ConstantArray[1, {2, 2}]}];][[1]]
   0.000406914

MatrixPlot[tst, MaxPlotPoints -> Infinity]

matrix plot of 4D case