How to keep only some of the results of NestList

Due to insistent public demand:

If, in a sequence of iterates $\{x,f(x),f(f(x)),\dots\}$, one only needs every $k$-th iterate (say, for $k=3$, you want $\{x,f(f(f(x))),f(f(f(f(f(f(x)))))),\dots\}$), then one can cleverly combine Nest[] and NestList[] like so:

NestList[Nest[f, #, k] &, start, n]

which yields a list of $n+1$ elements: the zeroth, $k$-th, $2k$-th, $\dots$, $nk$-th iterates.
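As a quick sanity check, here is a minimal sketch with $k=3$ and $n=2$, assuming f and x are symbols with no definitions attached. The comparison against taking every third element of the full NestList is only there for illustration; it is not part of the method above.

```mathematica
(* every 3rd iterate, computed directly *)
viaNest = NestList[Nest[f, #, 3] &, x, 2]
(* {x, f[f[f[x]]], f[f[f[f[f[f[x]]]]]]} *)

(* the same thing by building all 3*2 iterates and keeping positions 1, 4, 7 *)
viaPart = NestList[f, x, 3*2][[1 ;; -1 ;; 3]];

viaNest === viaPart
(* True *)
```

The Part version stores every intermediate iterate before discarding most of them, which is what the combined Nest/NestList form avoids.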

Fold[f[#1] &, x, Range[#]] & /@ Range[0, 9, 3]
(* or *)
Nest[f, x, #] & /@ Range[0, 9, 3]
(* both give: *)
{x, f[f[f[x]]], f[f[f[f[f[f[x]]]]]], f[f[f[f[f[f[f[f[f[x]]]]]]]]]}

EDIT: a variation on the second method:

nestSkip[f_, x_, stepsize_Integer, numsteps_Integer] := 
   Nest[f, x, stepsize #] & /@ Range[0, numsteps]
(* examples: *)
nestSkip[g, y, 2, 2]
(* ==>  {y, g[g[y]], g[g[g[g[y]]]]} *)
nestSkip[# + 5 &, 2, 3, 3]
(* ==> {2, 17, 32, 47}  *)
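One thing worth keeping in mind with the Map-over-Range variants: each list element is recomputed from the starting value, so the step function is applied more often than in the NestList[Nest[...]] form. A small sketch that counts evaluations (g and calls are hypothetical names introduced only for this illustration):

```mathematica
calls = 0;
(* hypothetical step function that counts how often it is evaluated *)
g[v_] := (calls++; v + 1)

Nest[g, 0, #] & /@ Range[0, 9, 3]
(* {0, 3, 6, 9} *)
calls
(* 18, i.e. 0 + 3 + 6 + 9 applications *)

calls = 0;
NestList[Nest[g, #, 3] &, 0, 3]
(* {0, 3, 6, 9} *)
calls
(* 9, because each element continues from the previous one *)
```

For short lists the difference is negligible, but the number of applications grows quadratically with the number of kept iterates in the first form and only linearly in the second.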

J. M.'s method is elegant, but in some cases it is not optimal, because the inner Nest[f, #, k] & does not compile the same way as an explicit series of function calls. In favorable cases, expanding the inner operation in advance can give very large performance gains.

jmNest[f_, k_, n_, start_] :=
  NestList[Nest[f, #, k] &, start, n]

wNest[f_, k_, n_, start_] :=
  NestList[Evaluate @ Nest[f, #, k] &, start, n]

A favorable test case:

jmNest[1 + # &, 200, 1*^5, 10`] // RepeatedTiming // First
wNest[1 + # &, 200, 1*^5, 10`]  // RepeatedTiming // First


Here we get more than two orders of magnitude improvement, and it is easy to understand why: 200 applications of 1 + # & reduce to 200 + # &.
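The reduction can be inspected directly by building the inner function on its own. This is only a sketch of what Evaluate does inside Function (which otherwise holds its body); the names expanded and held are hypothetical.

```mathematica
(* with Evaluate, the body is expanded once, symbolically, before the function is used *)
expanded = Evaluate @ Nest[1 + # &, #, 200] &
(* 200 + #1 & *)

(* without Evaluate, the body stays as an unevaluated Nest call *)
held = Nest[1 + # &, #, 200] &;

(* both give the same value; only the amount of work per call differs *)
expanded[10] === held[10]
(* True *)
```

Note that this expansion happens with a symbolic slot, so it presumably only helps when f can act on symbolic arguments and the nested result simplifies or at least stays small.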

In a less trivial case such as three applications of Sin, which does not reduce to a simpler formula, we still have an improvement:

jmNest[Sin, 3, 1*^6, 10`] // RepeatedTiming // First
wNest[Sin, 3, 1*^6, 10`]  // RepeatedTiming // First


There are, of course, pathological cases as well, where the symbolic expansion is far uglier than the sequential application:

jmNest[3 + # + Sqrt[# + 7] &, 15, 30, 10`] // RepeatedTiming // First
wNest[3 + # + Sqrt[# + 7] &, 15, 30, 10`]  // RepeatedTiming // First


I cannot think of a simple way to intelligently select between the original and pre-expansion methods that does not introduce significant overhead. The potential difference in speed is large enough that ParallelTry may be worth that overhead if unused cores are available.
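A rough sketch of what such a ParallelTry race might look like, assuming at least two parallel subkernels can be launched. raceNest is a hypothetical helper name, not something tested for speed here, and shipping a long result list back from a subkernel adds its own cost.

```mathematica
jmNest[f_, k_, n_, start_] := NestList[Nest[f, #, k] &, start, n]
wNest[f_, k_, n_, start_] := NestList[Evaluate @ Nest[f, #, k] &, start, n]

(* make both definitions known to the parallel subkernels *)
DistributeDefinitions[jmNest, wNest];

(* hypothetical helper: run both variants on separate subkernels
   and keep whichever result is received first *)
raceNest[f_, k_, n_, start_] :=
  ParallelTry[Function[method, method[f, k, n, start]], {jmNest, wNest}]

raceNest[1 + # &, 200, 1*^5, 10`] // Length
```

Both variants compute the same list, so it does not matter which one wins; the race only bounds the cost by the faster of the two plus the parallel overhead.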