Emulate user-defined scalar function in a way which doesn't prevent parallelism

You cannot really safely achieve exactly what you want in SQL Server today, i.e. in a single statement and with parallel execution, within the restrictions laid out in the question (as I perceive them).

So my simple answer is no. The rest of this answer is mostly a discussion of why that is, in case it is of interest.

It is possible to get a parallel plan, as noted in the question, but there are two main varieties, neither of which are suitable for your needs:

  1. A correlated nested loops join, with a round-robin distribute streams on the top level. Given that a single row is guaranteed to come from Params for a specific session_id value, the inner side will run on a single thread, even though it is marked with the parallelism icon. This is why the apparently-parallel plan 3 does not perform as well; it is in fact serial.

  2. The other alternative is for independent parallelism on the inner side of the nested loops join. Independent here means that threads are started up on the inner side, and not merely the same thread(s) as are executing the outer side of the nested loops join. SQL Server only supports independent inner-side nested loops parallelism when there is guaranteed to be one outer-side row and there are no correlated join parameters (plan 2).

So, we have a choice of a nominally parallel plan that runs serially (one inner-side thread) with the desired correlated values; or an inner-side parallel plan that has to scan because it has no parameters to seek with. (Aside: it really ought to be allowed to drive inner-side parallelism using exactly one set of correlated parameters, but it has never been implemented, probably for good reason).

A natural question then is: why do we need correlated parameters at all? Why can SQL Server not simply seek directly to the scalar values provided by e.g. a subquery?

Well, SQL Server can only 'index seek' using simple scalar references, e.g. a constant, variable, column, or expression reference (so a scalar function result can also qualify). A subquery (or other similar construction) is simply too complex (and potentially unsafe) to push into the storage engine whole. So, separate query plan operators are required. This in turn requires correlation, which means no parallelism of the sort you want.

All in all, there really is no better solution currently than methods like assigning the lookup values to variables and then using those in the function parameters in a separate statement.

Now, you may have specific local considerations that mean caching the current values of the year and month in SESSION_CONTEXT is worthwhile, e.g.:

SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
    CONVERT(integer, SESSION_CONTEXT(N'experiment_year')), 
    CONVERT(integer, SESSION_CONTEXT(N'experiment_month'))
) AS FGSD
GROUP BY FGSD.calculated_number;

But this falls into the category of a workaround.
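For completeness, those SESSION_CONTEXT values would be set beforehand, once per session, using sys.sp_set_session_context. A minimal sketch, reusing the fn_GetExperimentYear / fn_GetExperimentMonth function names from the question:

```sql
-- Cache the lookup values for the current session (sketch).
-- @session_id and the two scalar UDFs are as defined in the question.
DECLARE @session_id int = 123;  -- hypothetical value

DECLARE @experiment_year  int = dbo.fn_GetExperimentYear(@session_id);
DECLARE @experiment_month int = dbo.fn_GetExperimentMonth(@session_id);

EXEC sys.sp_set_session_context @key = N'experiment_year',  @value = @experiment_year;
EXEC sys.sp_set_session_context @key = N'experiment_month', @value = @experiment_month;
```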

On the other hand, if aggregation performance is of primary importance, you could consider sticking with inline functions and creating a columnstore index (primary or secondary) on the table. You may find the benefits of columnstore storage, batch mode processing, and aggregate pushdown provide greater benefits than a row-mode parallel seek anyway.
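As a sketch of that alternative (the base table and column names below are assumptions, since the underlying table definition is not shown here):

```sql
-- Secondary (nonclustered) columnstore index on the hypothetical base
-- table behind dbo.f_GetSharedData, covering the filter and aggregate
-- columns so batch mode and aggregate pushdown become possible.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SharedData
ON dbo.SharedData (experiment_year, experiment_month, calculated_number);
```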

But beware of scalar T-SQL functions, especially with columnstore storage, since it is easy to end up with the function being evaluated per-row in a separate row-mode Filter. It is generally quite tricky to guarantee the number of times SQL Server will choose to evaluate scalars, and better not to try.


As far as I know the plan shape that you want isn't possible with just T-SQL. It seems like you want the original plan shape (query 0 plan) with the subqueries from your functions being applied as filters directly against the clustered index scan. You'll never get a query plan like that if you don't use local variables to hold the return values of the scalar functions. The filtering will instead be implemented as a nested loop join. There are three different ways (from a parallelism point of view) that the loop join can be implemented:

  1. The entire plan is serial. This isn't acceptable to you. This is the plan that you get for query 1.
  2. The loop join runs in serial. I believe in this case the inner side can run in parallel, but it's not possible to pass any predicates down to it. So most of the work will be done in parallel, but you're scanning the whole table and the partial aggregate is much more expensive than before. This is the plan that you get for query 2.
  3. The loop join runs in parallel. With parallel nested loop joins the inner side of the loop runs in serial but you can have up to DOP threads running on the inner side at once. Your outer result set will just have a single row, so your parallel plan will effectively be serial. This is the plan that you get for query 3.

Those are the only possible plan shapes that I'm aware of. You can get some others if you use a temp table but none of them solve your fundamental problem if you want query performance to be just as good as it was for query 0.

You can achieve equivalent query performance by using the scalar UDFs to assign return values to local variables and using those local variables in your query. You can wrap that code in a stored procedure or a multi-statement UDF to avoid maintainability issues. For example:

DECLARE @experiment_year int = dbo.fn_GetExperimentYear(@session_id);
DECLARE @experiment_month int = dbo.fn_GetExperimentMonth(@session_id);

SELECT
    calculated_number,
    COUNT(*)
FROM dbo.f_GetSharedData(@experiment_year, @experiment_month)
GROUP BY
    calculated_number;
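Wrapped in a stored procedure, as suggested above, that might look like the following sketch (the procedure name is invented here):

```sql
CREATE PROCEDURE dbo.GetSharedDataAggregate  -- hypothetical name
    @session_id int
AS
BEGIN
    SET NOCOUNT ON;

    -- Assign the UDF results outside of the main query so that the
    -- query itself remains eligible for parallelism.
    DECLARE @experiment_year  int = dbo.fn_GetExperimentYear(@session_id);
    DECLARE @experiment_month int = dbo.fn_GetExperimentMonth(@session_id);

    SELECT
        calculated_number,
        COUNT(*) AS row_count
    FROM dbo.f_GetSharedData(@experiment_year, @experiment_month)
    GROUP BY
        calculated_number;
END;
```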

The scalar UDFs have been moved outside of the query that you wish to be eligible for parallelism. The query plan that I get appears to be the one that you want:

[screenshot: parallel query plan]

Both approaches have disadvantages if you need to use this result set in other queries. You cannot directly join to a stored procedure; you'd have to save the results to a temp table, which has its own set of problems. You can join to an MS-TVF, but in SQL Server 2016 you may see cardinality estimate issues. SQL Server 2017 offers interleaved execution for MS-TVFs, which could solve the problem entirely.
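For the temp table route, the usual pattern is INSERT ... EXEC, sketched below (dbo.GetSharedDataAggregate stands in for whatever stored procedure wraps the query; the name and result-set shape are assumptions):

```sql
-- Capture the procedure's result set so it can be joined to later.
CREATE TABLE #shared_data_agg
(
    calculated_number int    NOT NULL,
    row_count         bigint NOT NULL
);

INSERT INTO #shared_data_agg (calculated_number, row_count)
EXEC dbo.GetSharedDataAggregate @session_id = 123;  -- hypothetical id
```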

Just to clear up a few things: T-SQL scalar UDFs always prohibit parallelism, and Microsoft has not said that FROID will be available in SQL Server 2017.


This can most likely be done using SQLCLR. One benefit of SQLCLR Scalar UDFs is that they do not prevent parallelism, provided that they do not do any data access (and they sometimes also need to be marked as "deterministic"). So how do you make use of something that requires no data access when the operation itself requires data access?

Well, because the dbo.Params table is expected to:

  1. generally never have more than 2000 rows in it,
  2. rarely change structure, and
  3. (currently) need only two INT columns,

it is feasible to cache the three columns (session_id, experiment_year, and experiment_month) into a static collection (e.g. a Dictionary, perhaps) that is populated out-of-process and read by the Scalar UDFs that get the experiment_year and experiment_month values. What I mean by "out-of-process" is: you can have a completely separate SQLCLR Scalar UDF or Stored Procedure that can do data access and reads from the dbo.Params table to populate the static collection. That UDF or Stored Procedure would be executed prior to using the UDFs that get the "year" and "month" values; that way, the UDFs that get the "year" and "month" values aren't doing any DB data access.

The UDF or Stored Procedure that reads the data can first check whether the collection has 0 entries: if so, populate it; else, skip the work. You can even keep track of the time that it was populated and, if it has been over X minutes (or something like that), clear and re-populate even if there are entries in the collection. Being able to skip the population helps, since the routine will need to be executed frequently to ensure that the collection is always populated for the two main UDFs to get their values from.
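On the T-SQL side, the calling pattern described above might look like this sketch (every object name here is hypothetical; the populate procedure and the two getter functions would be the SQLCLR objects discussed in this answer):

```sql
-- Step 1: refresh the static cache (this SQLCLR procedure does the
-- data access against dbo.Params).
EXEC dbo.Params_PopulateCache;

-- Step 2: the getter UDFs only read the in-memory cache (no data
-- access), so they should not inhibit parallelism in the main query.
DECLARE @session_id int = 123;  -- hypothetical value

SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
    dbo.Params_GetExperimentYear(@session_id),
    dbo.Params_GetExperimentMonth(@session_id)
) AS FGSD
GROUP BY FGSD.calculated_number;
```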

The main concern is when SQL Server decides to unload the App Domain for whatever reason (or it is triggered by something using DBCC FREESYSTEMCACHE('ALL');). You don't want to risk the collection being cleared out between the execution of the "populate" UDF or Stored Procedure and the UDFs that get the "year" and "month" values. To guard against that, you can have a check at the very beginning of those two UDFs that throws an exception if the collection is empty, since it is better to raise an error than to successfully return false results.

Of course, the concern noted above assumes that the desire is to have the Assembly marked as SAFE. If the Assembly can be marked as EXTERNAL_ACCESS, then it is possible to have a static constructor execute the method that reads the data and populates the collection, so that you only ever need to manually execute that to refresh the rows, but they would always be populated (because the static class constructor always runs when the class is loaded, which happens whenever a method in this class is executed after a restart or the App Domain is unloaded). This requires using a regular connection and not the in-process Context Connection (which is not available to static constructors, hence the need for EXTERNAL_ACCESS).

Please note: in order not to be required to mark the Assembly as UNSAFE, you need to mark any static class variables as readonly. This means, at the very least, the collection. This is not a problem, since readonly collections can still have items added to or removed from them; the variable just can't be assigned a new value outside of the constructor or initial load. Tracking the time that the collection was loaded for the purpose of expiring it after X minutes is trickier, since a static readonly DateTime class variable cannot be changed outside of the constructor or initial load. To get around this restriction, you need to use a static, readonly collection containing a single item that is the DateTime value, so that it can be removed and re-added upon a refresh.