Can someone boost my code even more?

Exploitation of low-rank structure

The ordering of summation/dot products is crucial here. As aooiiii pointed out, mat2 has a low-rank tensor product structure. So by changing the order of summation/dotting operations, we can make sure that this beast is never assembled explicitly. A good rule of thumb is to sum intermediate results as early as possible. This reduces the number of flops and, often more importantly, the amount of memory that has to be shoved around by the machine. As a simple example, consider the sum over all entries of the outer product of two vectors x = {x1, x2, x3} and y = {y1, y2, y3}: first forming the outer product requires $9 = 3 \times 3$ multiplications, and summing all entries requires $8 = 3 \times 3 - 1$ additions.

 Total[KroneckerProduct[x, y], 2]

x1 y1 + x2 y1 + x3 y1 + x1 y2 + x2 y2 + x3 y2 + x1 y3 + x2 y3 + x3 y3

However first summing the vectors and then multiplying requires only $4 = 2 \times (3-1)$ additions and one multiplication:

 Total[x] Total[y]

(x1 + x2 + x3) (y1 + y2 + y3)

For vectors of length $n$, this would be $2 n^2 -1$ floating point operations in the first case vs. $2 (n -1) +1$ in the second case. Moreover, the intermediate matrix requires $n^2$ additional units of memory while storing $x$ and $y$ can be done with only $2 n$ units of memory.
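To see this effect in actual timings (a quick check I am adding here; the numbers will of course depend on the machine), compare both variants for, say, $n = 2000$:

With[{n = 2000},
 Module[{xr = RandomReal[{-1, 1}, n], yr = RandomReal[{-1, 1}, n]},
  (* variant 1: assembles the full n x n outer product before summing *)
  Print[First@RepeatedTiming[Total[KroneckerProduct[xr, yr], 2]]];
  (* variant 2: sums first and multiplies once; the matrix is never built *)
  Print[First@RepeatedTiming[Total[xr] Total[yr]]];
  ]]

The second variant should be faster by orders of magnitude, simply because it never touches $n^2$ units of memory.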

Side note: In the "old days", before FMA (fused multiply-add) instructions took over, CPUs had separate circuits for addition and multiplication. On such machines, multiplication was more expensive than addition, and thus this optimization was particularly striking. (My current computer, a Haswell (2014), still has a pure addition circuit, so those days are not that old...)

Code

Further speed-up can be obtained by using packed arrays throughout and by replacing all occurrences of Table in high-level code with either vectorized operations or compiled code.

This part of the code needs to be executed only once:

Needs["NumericalDifferentialEquationAnalysis`"];
nof = 30;
a = 1.;
b = 1.;
{xi, wix} = Transpose[Developer`ToPackedArray[GaussianQuadratureWeights[nof, 0, a]]];
{yi, wiy} = Transpose[Developer`ToPackedArray[GaussianQuadratureWeights[nof, 0, b]]];
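As a quick sanity check (my addition, not part of the original timing): unpacking anywhere downstream silently costs performance, so it is worth verifying that the quadrature data is indeed packed:

(* all four should return True *)
Developer`PackedArrayQ /@ {xi, wix, yi, wiy}

During development, evaluating On["Packing"] makes Mathematica emit a message whenever an array gets unpacked.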

First@RepeatedTiming[
  Module[{m = N[mVec], n = N[nVec], u, v},
    (* u[[i, m]] = m^2 Sin[m π xi[[i]]/a]; v[[i, n]] = Sin[n π yi[[i]]/b] *)
    u = Sin[KroneckerProduct[xi, m (N[Pi]/a)]].DiagonalMatrix[SparseArray[m^2]];
    v = Sin[KroneckerProduct[yi, n (N[Pi]/b)]];
    (* U and V absorb the quadrature weights; they are reused whenever D11 changes *)
    U = Transpose[MapThread[KroneckerProduct, {u, wix u}], {3, 1, 2}];
    V = MapThread[KroneckerProduct, {wiy v, v}];
    ];
  ]

0.000164

This part of the code has to be evaluated whenever D11 changes:

First@RepeatedTiming[

cf = Block[{i},
    With[{code = D11[x, y] /. y -> Compile`GetElement[Y, i]},
     (* inline the current definition of D11 into the compiled body *)
     Compile[{{x, _Real}, {Y, _Real, 1}},
      Table[code, {i, 1, Length[Y]}],
      RuntimeAttributes -> {Listable},
      Parallelization -> True,
      RuntimeOptions -> "Speed"
      ]
     ]
    ];

  result = ArrayReshape[
    Transpose[
     Dot[U, (2. π^4/a^4 ) cf[xi, yi], V],
     {1, 3, 2, 4}
     ],
    {dim, dim}
    ];

  ]

0.00065

On my system, roughly 40% of this timing is due to the compilation of cf. Notice that the first argument of cf is a scalar, so inserting a vector (or any other rectangular array) as in cf[xi, yi] will call cf in a threadable way (using OpenMP parallelization, IIRC). This is the sole purpose of the option Parallelization -> True; it does nothing without RuntimeAttributes -> {Listable} or if cf is not called in such a threadable way. From what OP told me, it became clear that the function D11 changes frequently, so cf has to be compiled quite often. This is why compiling to C is not a good idea here (the C compiler needs much more time).
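To see how that 40% splits out on your own machine (a quick measurement I am adding here; it assumes U, V, and a compiled cf from above), one can time the evaluation alone, excluding the Compile call:

First@RepeatedTiming[
  ArrayReshape[
   Transpose[Dot[U, (2. π^4/a^4) cf[xi, yi], V], {1, 3, 2, 4}],
   {dim, dim}];
  ]

The difference between this and the full timing above is essentially the cost of compiling cf.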

Finally, checking the relative error of result:

Max[Abs[D11Mat - result]]/Max[Abs[D11Mat]]

4.95633*10^-16

Explanation attempt

Well, the code might look mysterious, so I'll try to explain how I wrote it. Maybe that will help OP or others the next time they stumble into a similar problem.

The main problem here is this beast, which is the flattened version of a rank-$6$ tensor:

W = Flatten@Table[
   m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/a] Sin[(n π y)/b] Sin[(q π y)/b],
   {m, mVec}, {n, nVec}, {p, mVec}, {q, nVec}, {x, xi}, {y, yi}
   ];

The first step is to observe that the indices m, p, and x "belong together"; likewise, we put n, q, and y into a second group. Now we can write W as an outer product of the following two arrays:

W1 = Table[ 
  m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/a], 
  {m, mVec}, {p, mVec}, {x, xi}
  ];
W2 = Table[
  Sin[(n π y)/b] Sin[(q π y)/b], 
  {n, nVec}, {q, nVec}, {y, yi}
  ];

Check:

Max[Abs[W - Flatten[KroneckerProduct[W1, W2]]]]

2.84217*10^-14

Next observation: up to transposition, W1 and W2 can also be obtained as lists of outer products (of things that can themselves be constructed by outer products and the Listable attribute of Sin):

u = Sin[KroneckerProduct[xi, m (N[Pi]/a)]].DiagonalMatrix[SparseArray[m^2]];
v = Sin[KroneckerProduct[yi, n (N[Pi]/b)]];

Max[Abs[Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}] - W1]]
Max[Abs[Transpose[MapThread[KroneckerProduct, {v, v}], {3, 1, 2}] - W2]]

7.10543*10^-14

8.88178*10^-16

From reverse engineering OP's code (easier said than done), I knew that the result is a linear combination of W1, W2, wix, wiy, and the following matrix:

A = (2 π^4)/a^4 Outer[D11, xi, yi];

The latter is basically the array mat1, but not flattened out. It was clear that the function D11 was inefficient, so I compiled it (in a threadable way) into the function cf, so that we can also obtain A this way:

A = (2 π^4)/a^4 cf[xi, yi];

Next, I looked at the dimensions of these arrays:

Dimensions[A]
Dimensions[W1]
Dimensions[W2]
Dimensions[wix]
Dimensions[wiy]

{30, 30}

{10, 10, 30}

{10, 10, 30}

{30}

{30}

So there were only a few possibilities left for Dotting these things together. Bearing in mind that u and wix belong to xi and that v and wiy belong to yi, I guessed this one:

intermediateresult = Dot[
   Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}],
   DiagonalMatrix[wix],
   A,
   DiagonalMatrix[wiy],
   MapThread[KroneckerProduct, {v, v}]
   ];

I was pretty sure that all the right numbers were already contained in intermediateresult, but probably in the wrong order (which could be fixed with Transpose later). To check my guess, I computed the relative error of the flattened and sorted arrays:

(Max[Abs[Sort[Flatten[D11Mat]] - Sort[Flatten[intermediateresult]]]])/Max[Abs[D11Mat]]

3.71724*10^-16

Bingo. Then I checked the dimensions:

Dimensions[intermediateresult]
Dimensions[D11Mat]

{10, 10, 10, 10}

{100, 100}

From the way D11Mat was constructed, I was convinced that up to a transposition, intermediateresult is just an ArrayReshaped version of D11Mat. Being lazy, I just let Mathematica try all permutations:

Table[
  perm -> 
   Max[Abs[ArrayReshape[
       Transpose[intermediateresult, perm], {dim, dim}] - D11Mat]],
  {perm, Permutations[Range[4]]}
  ]

{{1, 2, 3, 4} -> 6.01299*10^7, {1, 2, 4, 3} -> 6.01299*10^7, {1, 3, 2, 4} -> 2.23517*10^-8, ...}

Then I just picked the one with the smallest error (which was {1,3,2,4}). So our result can be constructed like this:

result = ArrayReshape[
   Transpose[
    Dot[
     Transpose[MapThread[KroneckerProduct, {u, u}], {3, 1, 2}],
     DiagonalMatrix[wix],
     A,
     DiagonalMatrix[wiy],
     MapThread[KroneckerProduct, {v, v}]
     ],
    {1, 3, 2, 4}
    ],
   {dim, dim}];

Of course, one should confirm this by a couple of randomized tests before one proceeds.
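Such a randomized test could, for instance, look like the following sketch (my addition, with small random stand-ins for u, v, wix, wiy, and A): it assembles the contraction by a brute-force sum of outer products and compares against the Dot-based formula.

Module[{d = 3, k = 4, u, v, wx, wy, A, ref, fast},
 u = RandomReal[{-1, 1}, {k, d}];
 v = RandomReal[{-1, 1}, {k, d}];
 wx = RandomReal[{0, 1}, k];
 wy = RandomReal[{0, 1}, k];
 A = RandomReal[{-1, 1}, {k, k}];
 (* brute force: weighted sum of rank-4 outer products over both quadrature indices *)
 ref = Sum[
   wx[[i]] wy[[j]] A[[i, j]] TensorProduct[u[[i]], u[[i]], v[[j]], v[[j]]],
   {i, k}, {j, k}];
 (* fast version: the same contraction expressed through Dot *)
 fast = Dot[
   Transpose[MapThread[KroneckerProduct, {u, wx u}], {3, 1, 2}],
   A,
   MapThread[KroneckerProduct, {wy v, v}]];
 Max[Abs[ref - fast]]
 ]

The result should be at the level of machine precision; varying d and k a bit gives additional confidence in the index gymnastics.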

The rest is only about a couple of local optimizations. Multiplication with a DiagonalMatrix can usually be replaced by threaded multiplication. Knowing that, I searched for places to stuff the weights wix and wiy and found this possibility:

result = ArrayReshape[
   Transpose[
    Dot[
     Transpose[MapThread[KroneckerProduct, {u, wix u}], {3, 1, 2}],
     A,
     MapThread[KroneckerProduct, {wiy v, v}]
     ],
    {1, 3, 2, 4}
    ],
   {dim, dim}];
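In case that rule looks like magic, here is a tiny generic illustration of it (my addition, independent of the problem at hand):

mat = RandomReal[{-1, 1}, {4, 4}];
wvec = RandomReal[{-1, 1}, 4];
(* multiplying by the diagonal matrix from the left scales the rows... *)
Max[Abs[DiagonalMatrix[wvec].mat - wvec mat]]
(* ...and from the right it scales the columns *)
Max[Abs[mat.DiagonalMatrix[wvec] - (wvec # &) /@ mat]]

Both differences vanish, and the threaded forms never materialize the diagonal matrices. Stuffing wix into {u, wix u} and wiy into {wiy v, v} achieves exactly this.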

Then I realized that the first and third factors of the Dot-product can be recycled; this is why I stored them in U and V. Replacing A by (2 π^4)/a^4 cf[xi, yi] then led to the piece of code above.

Addendum

Using MapThread is actually suboptimal, and it can be improved upon with a CompiledFunction:

cg = Compile[{{u, _Real, 1}, {w, _Real}},
   Block[{ui},
    Table[
     ui = w Compile`GetElement[u, i];
     Table[ui Compile`GetElement[u, j], {j, 1, Length[u]}]
     , {i, 1, Length[u]}]
    ]
   ,
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];

And now

v = RandomReal[{-1, 1}, {1000, 10}];
w = RandomReal[{-1, 1}, {1000}];
V = w MapThread[KroneckerProduct, {v, v}]; // RepeatedTiming // First
V2 = cg[v, w]; // RepeatedTiming // First

0.0023

0.00025

But the MapThreads have to be run only once, and they are already very fast for the array sizes in the problem. Moreover, for those sizes, cg is only about twice as fast as MapThread. So there is probably no point in optimizing this away.
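If you want to check that factor of two at the sizes that actually occur in the problem (30 quadrature points, 10 modes), the same comparison can be rerun as follows (my addition; timings are machine-dependent):

v = RandomReal[{-1, 1}, {30, 10}];
w = RandomReal[{-1, 1}, {30}];
V = w MapThread[KroneckerProduct, {v, v}]; // RepeatedTiming // First
V2 = cg[v, w]; // RepeatedTiming // First

At these sizes, both variants should finish in well under a millisecond.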


I managed to achieve a 20-fold performance boost with the following ideas. First, the elements of your 6-dimensional intermediate array A[m, n, p, q, x, y] can be decomposed into pairwise products of X[m, p, x] and Y[n, q, y] - a square-root reduction in trigonometric computations. Then X and Y can be combined into A via the heavily optimized functions Outer and Transpose.

cf = Compile[{{x1, _Real, 1}, {y1, _Real, 1}, {m1, _Real, 1},
   {n1, _Real, 1}, {p1, _Real, 1}, {q1, _Real, 1},
   {a, _Real}, {b, _Real}, {nof, _Integer}},
  Module[{X, Y},
   X = Table[
     m^2 p^2 Sin[(m π x)/a] Sin[(p π x)/a],
     {m, m1}, {p, p1}, {x, x1}];
   Y = Table[
     Sin[(n π y)/b] Sin[(q π y)/b],
     {n, n1}, {q, q1}, {y, y1}];
   Partition[#, nof^2] &@
    Flatten@Transpose[Outer[Times, X, Y], {1, 3, 5, 2, 4, 6}]
   ]
  ]

cf[xi, yi, mVec, nVec, mVec, nVec, a, b, nof]; // RepeatedTiming

That said, I expect @Roman's DST-based approach to be orders of magnitude faster.