Dataset and Select with a SmoothKernelDistribution

Short, Short Version

A General::stop message will cause a dataset query to fail if some other explicitly quieted message occurs three or more times. For example:

Dataset[1][Quiet[Do[Message[f::ivar, #], 2]; #, f::ivar] &]
(* 1                -- success: only two quieted f::ivar messages generated *)

Dataset[1][Quiet[Do[Message[f::ivar, #], 3]; #, f::ivar] &]
(* Failure[...]     -- third f::ivar message triggers a General::stop and failure *)

Short Version

The behaviour we see is due to an unfortunate combination of circumstances:

  1. Computing the PDF of a SmoothKernelDistribution generates some messages, which are suppressed internally using Quiet.
  2. Distribution results are routinely cached so that the messages in question are only generated during the first evaluation.
  3. FailureAction -> "Abort" is the default setting for Dataset queries and it causes queries to fail if any unexpected messages are generated. A message is unexpected if it has not been explicitly quieted by code executed within the body of the query.
  4. PDF[...] generates so many Part::partd messages that the system issues a General::stop message. Normally, a General::stop receives special treatment and is quieted along with the messages that generated it, but the hypersensitive FailureAction machinery classifies it as an unexpected message and fails the query.

Work-around

A work-around is to turn off the special message handling for the Dataset query:

ds[Select[func] /* Length, FailureAction -> None]
(* 397 *)

Is It A Bug?

Yes, I think so. Given that in normal circumstances General::stop is only generated after some other message has appeared for the third time, the bug would be corrected if General::stop were always presumed to be quieted by GeneralUtilities`MessageQuietedQ (see discussion below).
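
As a reminder of the "third time" rule mentioned above, here is a small sketch outside of any Dataset query. It reuses the f::ivar message from the earlier examples; the loop count of 5 is an arbitrary choice for illustration:

Do[Message[f::ivar, i], {i, 5}]
(*
  f::ivar is printed for i = 1, 2 and 3, then General::stop is printed once;
  the f::ivar messages for i = 4 and 5 are suppressed
*)

The General::stop is a side effect of the repeated message rather than an independent problem, which is why it seems reasonable to treat it as quieted whenever the message that caused it is quieted.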



Spelunking Report (current as of version 11.0.1)

... be prepared for too much complicated trivia ...

Let's begin by finding a minimal set of steps to reproduce the behaviour:

d = SmoothKernelDistribution[{{1.,2.}}];

Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)

PDF[d, {1., 2.}]
(* 0.390901 *)

Dataset[{1., 2.}][PDF[d, #]&]
(* 0.390901 *)

d = SmoothKernelDistribution[{{1.,2.}}];

Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)

Notice how the bad behaviour disappears after performing a non-query evaluation and how it reappears after reassigning a newly created distribution to d. (Bonus trivia: using the operator form PDF[d] always fails, as it appears to defeat the cache.)

PDF Evaluation Generates Messages

By blocking the action of Quiet, we can observe that PDF[d, {1., 2.}] generates messages that would normally be quieted:

d = SmoothKernelDistribution[{{1.,2.}}];
Block[{Quiet}, PDF[d, {1., 2.}]]

(*
  >> Part::partd: Part specification $x[[1]] is longer than depth of object.
  >> Part::partd: Part specification $x[[2]] is longer than depth of object.
  >> Part::partd: Part specification $x[[1]] is longer than depth of object.
  >> General::stop: Further output of Part::partd will be suppressed during this calculation.

  0.390901
*)

... but only the first time:

Block[{Quiet}, PDF[d, {1., 2.}]]
(* 0.390901 *)

We can use an alternative technique to view considerably more details about the generated messages and their evaluation contexts:

d = SmoothKernelDistribution[{{1.,2.}}];
Internal`HandlerBlock[{"Message",Print[Internal`QuietStatus[]]&}, PDF[d,  {1., 2.}]]

(*
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458}},Check->None}
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458}},Check->None}
...
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458},{Part::partd,1458},{General::stop,1458}},Check->None}
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458},{Part::partd,1458},{General::stop,1458}},Check->None}

0.390901
*)

We will shortly see that the Stack values from Internal`QuietStatus are relevant.

Distribution Calculations Use A Cache

The presence of a cache is strongly hinted at by the fact that the messages appear only during the first evaluation of PDF, and reappear if we regenerate the distribution itself. We can confirm this guess by tracing calls to StoreDataDistributionExpression:

On[Statistics`DataDistributionUtilities`StoreDataDistributionExpression]

System`Dump`DeactivateReadProtected[{DataDistribution, PDF}
, Print["#### define"]; d = SmoothKernelDistribution[{{1.,2.}}]
; Print["#### pdf1"];   PDF[d,{1.,2.}]
; Print["#### pdf2"];   PDF[d,{1.,2.}]
]

Off[]

(* Output:
#### define
   ... messages showing calls to StoreDataDistributionExpression ...
#### pdf1
   ... messages showing more calls to StoreDataDistributionExpression ...
#### pdf2
   ... no messages! ...
*)

Observe how the initial distribution generation tucked some values into a cache, as did the first PDF evaluation. But the second PDF evaluation did not store into the cache at all.

FailureAction Treats Messages As Query Failure

Now, we can move on to FailureAction in dataset queries. As noted earlier, the default action is to cause a query to fail should any messages be generated:

Dataset[1][(Message[f::ivar, #]; #) &]
(* Failure[...] *)

But quieted messages do not (normally) cause failure:

Dataset[1][Quiet[Message[f::ivar, #]; #] &]
(* 1 *)

We can also turn off the failure processing:

Dataset[1][(Message[f::ivar, #]; #) &, FailureAction -> None]
(* 
    >> f: 1 is not a valid variable
    1
*)

Unexpected Messages Trigger FailureAction

But if the previous example shows that quieted messages do not trigger FailureAction, what is so special about the quieted messages we observed for this:

d = SmoothKernelDistribution[{{1.,2.}}];
Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)

To find the answer, we need to closely inspect a voluminous trace of the evaluation (not reproduced here). Deep in the bowels of that trace, we find that the relevant functions are:

Needs["GeneralUtilities`"]
{ EvaluateChecked, GeneralUtilities`Failure`PackagePrivate`CheckedHandler
, MessageStackID, MessageQuietedQ
} // Scan[PrintDefinitionsLocal]

In short, EvaluateChecked begins by setting up an environment which will catch and handle any messages. MessageStackID is used to identify the outermost stack boundary of that environment. Should a message appear, CheckedHandler will inspect the message to determine if it should be ignored. Any message that is not MessageQuietedQ will cause a failure. MessageQuietedQ uses the Internal`QuietStatus[] output we saw earlier to see if the message has been explicitly quieted by code within the bounded stack environment.

Putting It All Together

In the case at hand, PDF[...] explicitly quiets the message Part::partd. But it does not explicitly quiet General::stop.

And so, to finally reach the end of our long-winded shaggy dog story... it is the General::stop message that causes the query to fail. That message only appears the first time the PDF function is generated because the function is cached on subsequent attempts.
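
If that diagnosis is right, explicitly quieting General::stop alongside the offending message inside the query body should, in principle, keep FailureAction from seeing an unexpected message. The following is an untested sketch based on the toy example from the top of this answer. Whether MessageQuietedQ accepts General::stop when it is listed this way is an assumption, not something verified here:

(* assumption: listing General::stop explicitly marks it as quieted within the query *)
Dataset[1][Quiet[Do[Message[f::ivar, #], 3]; #, {f::ivar, General::stop}] &]

(* the same idea applied to the case at hand, again as an untested assumption *)
d = SmoothKernelDistribution[{{1., 2.}}];
Dataset[{1., 2.}][Quiet[PDF[d, #], {Part::partd, General::stop}] &]

If this does not behave as hoped in your version, FailureAction -> None remains the more dependable work-around.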


Workaround with Efficiency Improvement

Another approach, which also makes your code more efficient, is to create the PDF of the SmoothKernelDistribution only once.

When written as

func = PDF[kd, {#"a", #"b"}] >= kernelProbability &;
ds[Select[func] /* Length]

kd is resolved into its PDF for each row in the Dataset because, as Attributes tells us, Function has attribute HoldAll. Here resolving kd into its PDF is cheap enough that you don't really notice the repeated work.

However, we can resolve the PDF of kd only once and then evaluate it repeatedly in the Query.

ClearAll[x, y];
pdf[x_, y_] = PDF[kd, {x, y}];
func1 = pdf[#"a", #"b"] >= kernelProbability &;
ds[Select[func1] /* Length]

pdf is Set to the resolved PDF of kd outside of the Select. Now for each row the resolved PDF (pdf) is evaluated. The overhead of resolving it for each row is removed. This not only makes your code more efficient but bypasses the bug as well.
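
For anyone who wants to try this without the original data, here is a self-contained sketch. The sample points, the threshold of 0.05 and the sample size are all made up for illustration; they merely stand in for the ds, kd and kernelProbability from the question:

(* hypothetical stand-ins for the data in the question *)
SeedRandom[1];
pts = RandomVariate[NormalDistribution[], {500, 2}];
kd = SmoothKernelDistribution[pts];
kernelProbability = 0.05;
ds = Dataset[AssociationThread[{"a", "b"}, #] & /@ pts];

(* resolve the PDF once, outside the query *)
ClearAll[x, y, pdf];
pdf[x_, y_] = PDF[kd, {x, y}];

(* each row now only evaluates the already-resolved expression *)
func1 = pdf[#"a", #"b"] >= kernelProbability &;
AbsoluteTiming[ds[Select[func1] /* Length]]

AbsoluteTiming is included only so you can compare the timing against other formulations on your own data.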

In general, be aware of the cost of your expression when doing row-by-row evaluations.

Hope this helps.