Time-efficient matrix elements grouping and summing

Partition[test, {2,3}] is quite slow in this case because it has to rearrange the elements in the data vector that represents the entries of a packed array in the backend:

Flatten[test] == Flatten[Partition[test, {2, 3}]]

False

Using Span (;;) as follows employs 6 monotonically increasing read operations; in this specific case, these operations are faster than using Partition:

n = 24000;
test = RandomInteger[1, {n, n}];

a = Total[Partition[test, {2, 3}], {3, 4}]; // 
  AbsoluteTiming // First
b = Sum[test[[i ;; ;; 2, j ;; ;; 3]], {i, 1, 2}, {j, 1, 3}]; // 
  AbsoluteTiming // First
a == b

245.89

117.943

True

However, this performance advantage seems to decay when the matrix test becomes bigger (so swapping is required). E.g., for $n = 4800$, method b is aboutten times faster thana`, but for $n = 24000$, it's only a factor of 4.6 and here it has degraded to a factor of 2 or so...

SparseArray method

Have I said already that I love SparseArrays?

AbsoluteTiming[
  c = Dot[
    KroneckerProduct[
     IdentityMatrix[n/2, SparseArray], 
     ConstantArray[1, {1, 2}]
     ],
    Dot[
     test,
     KroneckerProduct[
      IdentityMatrix[n/3, SparseArray], 
      ConstantArray[1, {3, 1}]
      ]
     ]
    ]
  ][[1]]
a == c

76.3822

True

The story goes on...

A combination of the SparseArray method from above with a CompiledFunction):

cf = Compile[{{x, _Integer, 1}, {k, _Integer}},
   Table[
    Sum[Compile`GetElement[x, i + j], {j, 1, k}],
    {i, 0, Length[x] - 1, k}],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];
d = KroneckerProduct[
     IdentityMatrix[n/2, SparseArray], 
     ConstantArray[1, {1, 2}]
     ].cf[test, 3]; // AbsoluteTiming // First
a == d

33.5677

True