How to efficiently compute the partial trace of a matrix?

Looks like you can just use TensorContract/TensorProduct:

TensorContract[
    TensorProduct[Array[a,{2,2}],Array[b,{2,2}],Array[c,{2,2}]],
    {3,4}
] //ArrayFlatten //TeXForm

$\tiny \begin{pmatrix} a(1,1) b(1,1) c(1,1)+a(1,1) b(2,2) c(1,1) & a(1,1) b(1,1) c(1,2)+a(1,1) b(2,2) c(1,2) & a(1,2) b(1,1) c(1,1)+a(1,2) b(2,2) c(1,1) & a(1,2) b(1,1) c(1,2)+a(1,2) b(2,2) c(1,2) \\ a(1,1) b(1,1) c(2,1)+a(1,1) b(2,2) c(2,1) & a(1,1) b(1,1) c(2,2)+a(1,1) b(2,2) c(2,2) & a(1,2) b(1,1) c(2,1)+a(1,2) b(2,2) c(2,1) & a(1,2) b(1,1) c(2,2)+a(1,2) b(2,2) c(2,2) \\ a(2,1) b(1,1) c(1,1)+a(2,1) b(2,2) c(1,1) & a(2,1) b(1,1) c(1,2)+a(2,1) b(2,2) c(1,2) & a(2,2) b(1,1) c(1,1)+a(2,2) b(2,2) c(1,1) & a(2,2) b(1,1) c(1,2)+a(2,2) b(2,2) c(1,2) \\ a(2,1) b(1,1) c(2,1)+a(2,1) b(2,2) c(2,1) & a(2,1) b(1,1) c(2,2)+a(2,1) b(2,2) c(2,2) & a(2,2) b(1,1) c(2,1)+a(2,2) b(2,2) c(2,1) & a(2,2) b(1,1) c(2,2)+a(2,2) b(2,2) c(2,2) \\ \end{pmatrix}$

You can look at Ways to compute inner products of tensors and in particular this answer for an efficient version of this approach.

The OP wanted a method to convert a KroneckerProduct representation to a TensorProduct representation so that TensorContract could be used. For this particular example, you could use Nest and Partition to do this. Here I use this method on a random 1000 x 1000 matrix:

TensorContract[
    Nest[Partition[#, {10, 10}]&, RandomReal[1, {1000, 1000}], 2],
    {3,4}
]

Another possibility is to use ArrayReshape, although a little massaging is necessary for this approach:

r1 = Transpose[
    ArrayReshape[data, {10,10,10,10,10,10}],
    {1,3,5,2,4,6}
]; //AbsoluteTiming
r2 = Nest[Partition[#, {10,10}]&, data, 2]; //AbsoluteTiming
r1===r2

{0.005594, Null}

{0.0286, Null}

True

To convert back you would use ArrayFlatten or Flatten.

To wrap it all up into a function:

partialTrace[matrix_, lengths_, indicesToKeep_] := With[{
   indicesToTrace = Complement[Range@Length@lengths, indicesToKeep]
   },
  With[{
    matrixInTPForm = Transpose[
      ArrayReshape[matrix, Join[lengths, lengths]],
      Join @@ Transpose@Partition[Range[2 Length@lengths], 2]
      ]
    },
   Flatten[
    TensorContract[matrixInTPForm,
     {2 # - 1, 2 #} & /@ indicesToTrace
     ],
    Transpose@
     Partition[Range[2 Length@lengths - 2 Length@indicesToTrace], 2]
    ]
   ]
  ]

A less general solution for the case of spin chains and sparse arrays

I faced the problem of finding reduced density matrices for large spin systems, where the Hamiltonian is represented as a $2^N$ by $2^N$ matrix and state vectors are represented as vectors of length $2^N$. You could also have $D$ states per site for a state space of $D^N$, but I'll use $D=2$ as an example.

I found a way to tackle the problem using SparseArrays and Kronecker products. This method is very fast, but I have a feeling working with TensorProduct and TensorContract from the get-go is still faster. This is just a second way to tackle the problem.

The idea is that we can evaluate the density matrix elements as follows: \begin{align*} \langle n|\rho_A|m\rangle &= \langle n|\operatorname{tr}_B\left( \rho_{AB}\right) |m\rangle\\ &=\operatorname{tr}_B\left(\langle n|\rho_{ab}|m\rangle\right)\\ &=\operatorname{tr}_{AB}\left(\rho_{ab}|m\rangle\langle n|\right) \end{align*}

So, if you have $D=2$ and $N=$nsites, the partial trace of a single site after tracing over everything else can be calculated as follows:

id = SparseArray@IdentityMatrix[2]; 
(* The four matrices |0><0|, |1><1|, |1><0| and |0><1|. *)

projector[0, 0] = SparseArray@{{1, 0}, {0, 0}};
projector[1, 1] = SparseArray@{{0, 0}, {0, 1}};
projector[1, 0] = SparseArray@{{0, 0}, {1, 0}};
projector[0, 1] = Transpose[projector[1, 0]];
(* Sparse array kronecker product *)

myProjector[s1_, s2_, site_] := 
   KroneckerProduct @@ 
    Table[If[i == site, projector[s1, s2], id], {i, 1, nsites}];
myPartialTrace[rho_, site_] := 
  Table[Tr[rho.myProjector[s1, s2, site]], {s1, 0, 1}, {s2, 0, 1}];

The sparse kronecker product is calculated in no time at all (even for N=15), so all of the time is being spend inside the trace and matrix multiplication routines.

Here is a full sample implementation showing that I get the same results as partialTrace. I threw in an AbsoluteTiming call, but because I convert the large matrix into a dense matrix with Normal before passing it to partialTrace, the time comparisons aren't fair (I would be comparing a sparse problem to a dense problem).

partialTrace[matrix_, lengths_, indicesToKeep_] := 
  With[{indicesToTrace = 
     Complement[Range@Length@lengths, indicesToKeep]}, 
   With[{matrixInTPForm = 
      Transpose[ArrayReshape[matrix, Join[lengths, lengths]], 
       Join @@ Transpose@Partition[Range[2 Length@lengths], 2]]}, 
    Flatten[TensorContract[
      matrixInTPForm, {2 # - 1, 2 #} & /@ indicesToTrace], 
     Transpose@
      Partition[Range[2 Length@lengths - 2 Length@indicesToTrace], 
       2]]]];

id = SparseArray@IdentityMatrix[2]; 
(* The four matrices |0><0|, |1><1|, |1><0| and |0><1|. *)

projector[0, 0] = SparseArray@{{1, 0}, {0, 0}};
 projector[1, 1] = SparseArray@{{0, 0}, {0, 1}};
projector[1, 0] = SparseArray@{{0, 0}, {1, 0}};
projector[0, 1] = Transpose[projector[1, 0]];

timings = Table[
   nupspins = Floor[nsites/2];
   Clear[myProjector];
   (* Sparse array kronecker product *)

   myProjector[s1_, s2_, site_] := 
    KroneckerProduct @@ 
     Table[If[i == site, projector[s1, s2], id], {i, 1, nsites}];
   myPartialTrace[rho_, site_] := 
    Table[Tr[rho.myProjector[s1, s2, site]], {s1, 0, 1}, {s2, 0, 1}];

   (* Create an interesting state to work with. 
   This one is a random superposition of all states such that the \
total number of up spins is 1. Sites are indexed in binary, 
   so basis states with nsites=10 and nupspins=
   5 would include 0b0000011111, 0b1010101010, etc. *)

   normalize[vec_] := vec/Sqrt[vec.vec];
   myArray = 
    normalize[
     SparseArray[
      ReleaseHold[
       Table[If[Total[IntegerDigits[i, 2]] == nupspins, 
         i + 1 -> RandomReal[], HoldForm[Sequence[]]], {i, 0, 
         2^nsites - 1}]], {2^nsites}]];

   {AbsoluteTiming[
     partialTrace[Normal@Outer[Times, myArray, myArray], 
      Table[2, {i, 1, nsites}], {1}]],
    AbsoluteTiming[
     myPartialTrace[Outer[Times, myArray, myArray], 1]]}
   , {nsites, 2, 14}];

(* ensure that for all the calculated partial traces, the difference \
is essentially zero. *)
And @@ 
 Thread[Chop[Flatten[timings[[All, 1, 2]] - timings[[All, 2, 2]]]] == 
   0]