Calculate a 40-day moving average with respect to a field

It is not really clear from the question what the role of the call_type_id column is. I will ignore it until you clarify.

Without window functions

Here is a simple variant that doesn't use window functions at all.

Make sure that there is an index on (call_dt_key, aes_raw).
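For instance (the index name is my own choice):

```sql
-- Multicolumn index so both the per-day aggregation and the
-- 40-day range scans can be served from the index alone
CREATE INDEX test_aes_dt_raw_idx ON test_aes (call_dt_key, aes_raw);
```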

CTE_Dates returns a list of all distinct dates in the table and calculates the average for each day. This average_current_day is needed for the first day. The server scans the whole index anyway, so calculating this average is cheap.

Then, for each distinct day I use a self-join to calculate the average for 40 previous days. This will return NULL for the first day, which is replaced with average_current_day in the main query.

You don't have to use a CTE here; it just makes the query easier to read.

WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,call_dt_key - INTERVAL '41 day' AS dt_from
        ,call_dt_key - INTERVAL '1 day' AS dt_to
        ,AVG(test_aes.aes_raw) AS average_current_day
    FROM test_aes
    GROUP BY call_dt_key
)
SELECT
    CTE_Dates.call_dt_key
    ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
    CTE_Dates
    LEFT JOIN LATERAL
    (
        SELECT AVG(test_aes.aes_raw) AS average_40
        FROM test_aes
        WHERE
                test_aes.call_dt_key >= CTE_Dates.dt_from
            AND test_aes.call_dt_key <= CTE_Dates.dt_to
    ) AS prev40 ON true
ORDER BY call_dt_key;

Result

|                call_dt_key |         average_40 |
|----------------------------|--------------------|
|  January, 01 2016 00:00:00 |                 15 |
|  January, 05 2016 00:00:00 |                 15 |
|  January, 10 2016 00:00:00 | 15.333333333333334 |
|  January, 15 2016 00:00:00 | 15.294117647058824 |
|  January, 16 2016 00:00:00 |               15.5 |
|  January, 20 2016 00:00:00 | 15.652173913043478 |
|  January, 21 2016 00:00:00 |               15.6 |
|  January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
|    March, 04 2016 00:00:00 |                 15 |
|    March, 18 2016 00:00:00 |                 15 |

Here is SQL Fiddle.

With the recommended index this solution should not be too bad.


There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support RANGE with a frame of a specified size, while SQL Server doesn't at the moment, so a solution for Postgres is likely to be a bit simpler.

The key part would be:

AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)

To calculate the moving average using these window functions you'd likely have to fill the gaps in dates first, so that the table has at least one row for each day (with NULL values for aes_raw in these dummy rows).
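A minimal sketch of such gap-filling, assuming the date range is taken from the table itself:

```sql
-- One row per calendar day between the first and last call;
-- aes_raw is NULL on days without calls (the dummy rows)
SELECT d.dt::date AS call_dt_key, t.aes_raw
FROM  (SELECT min(call_dt_key) AS lo, max(call_dt_key) AS hi
       FROM   test_aes) b
CROSS  JOIN LATERAL generate_series(b.lo, b.hi, interval '1 day') AS d(dt)
LEFT   JOIN test_aes t ON t.call_dt_key = d.dt::date;
```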

...

As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the RANGE clause in Postgres still has limitations similar to SQL Server. Docs say:

The value PRECEDING and value FOLLOWING cases are currently only allowed in ROWS mode.

So, this method with the RANGE above would not work for you even if you used Postgres 9.5.


Using window functions

You can use the approaches outlined in the SQL Server question above. For example: group your data into daily sums, add rows for missing days, calculate the moving SUM and COUNT using OVER with ROWS, and then calculate the moving average.

Something along these lines:

WITH
CTE_Dates
AS
(
    SELECT
        call_dt_key
        ,SUM(test_aes.aes_raw) AS sum_daily
        ,COUNT(*) AS cnt_daily
        ,AVG(test_aes.aes_raw) AS avg_daily
        ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
    FROM test_aes
    GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
    SELECT
        CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
        ,avg_daily
        ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
        /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END) 
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
    FROM
        CTE_Dates
        INNER JOIN LATERAL
            generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day') 
            AS all_dates(dt) ON true
)
SELECT
    final_dt
    ,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;

Result is the same as in the first variant. See SQL Fiddle.

Again, this could be written with inlined sub-queries without CTEs.

It is worth checking the performance of the different variants on real data.


The big bounty makes the currently accepted answer seem exemplary, but I am not entirely happy with several details. Hence, I added this answer.

Table definition

You should have provided an actual table definition to make this easier.

Judging from the sample data, call_dt_tm is of type timestamp with time zone (timestamptz). The column call_dt_key is not fully functionally dependent on it, since the matching date depends on the time zone. But if you define a time zone (an actual zone name, not just an offset; beware of DST!), the date can easily and reliably be derived from a timestamptz and should not be stored redundantly. To get it right, use an expression like:

(call_dt_tm AT TIME ZONE 'Asia/Hong_Kong')::date  -- use your time zone name

Details:

  • Ignoring timezones altogether in Rails and PostgreSQL

You might add a MATERIALIZED VIEW with the derived date column for ease of use ...
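A sketch of such a view, assuming the base table exposes the call_dt_tm column discussed above (the view name is hypothetical; adjust the time zone):

```sql
-- Materialized view with the date derived once, in a fixed time zone
CREATE MATERIALIZED VIEW mv_calls AS
SELECT *
     , (call_dt_tm AT TIME ZONE 'Asia/Hong_Kong')::date AS call_date
FROM   test_aes;
```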

For the purpose of this question I'll stick to your given table.

40 days

Both the question and the accepted answer count 41 days instead of the required 40: the lower and the upper bound are both included, resulting in a (rather common) off-by-one error.

Consequently, I get different results in two rows below.

date, interval, timestamp

Subtracting an interval from a date produces a timestamp (like in call_dt_key - INTERVAL '41 day'). For the purpose of this query it is more efficient to subtract an integer, producing another date (like call_dt_key - 41).
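To illustrate the difference in result types:

```sql
SELECT DATE '2016-02-15' - 41;                  -- date: 2016-01-05
SELECT DATE '2016-02-15' - INTERVAL '41 days';  -- timestamp: 2016-01-05 00:00:00
```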

Not possible with a RANGE clause

@Vladimir suggested (now fixed) a solution with the RANGE clause in the frame definition of window functions in Postgres 9.5.

In fact, nothing has changed between Postgres 9.4 and 9.5 in this respect, not even the text in the manual. Frame definitions of window functions only allow RANGE UNBOUNDED PRECEDING and RANGE UNBOUNDED FOLLOWING, not RANGE with values.

Answer

Of course, you can use a CTE to compute daily sum / count / avg on the fly. But your table ...

stores the information about user calls in a call center

This kind of information does not change later. So compute daily aggregates once in a materialized view and build on that.

CREATE MATERIALIZED VIEW mv_test_aes AS
SELECT call_dt_key       AS day
     , sum(aes_raw)::int AS day_sum
     , count(*)::int     AS day_ct
FROM   test_aes
WHERE  call_dt_key < (now() AT TIME ZONE 'Asia/Hong_Kong')::date  -- see above
GROUP  BY call_dt_key
ORDER  BY call_dt_key;

The current day is always missing, but that's a feature. Results would be incorrect before the day is over.

The MV needs to be refreshed once per day, before you run your query or the latest day(s) are missing.
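For example, from a daily scheduled job (cron, pg_cron, etc.):

```sql
-- Run once per day before querying;
-- CONCURRENTLY would additionally require a UNIQUE index on (day)
REFRESH MATERIALIZED VIEW mv_test_aes;
```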

An index on the underlying table, like the one recommended in the accepted answer, is not necessary for this, since the whole table is read anyway:

CREATE INDEX test_aes_day_val ON test_aes (call_dt_key, aes_raw);

You might build a smarter materialized view manually and only incrementally add new days instead of recreating everything with standard MVs. But that's beyond the scope of the question ...

I strongly suggest an index on the MV, though:

CREATE INDEX foo ON mv_test_aes (day, day_sum, day_ct);

I only appended day_sum and day_ct hoping for index-only scans. If you don't see those in your queries, you don't need the columns in the index.
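One way to check is to inspect the plan of the join against the MV (a sketch; look for "Index Only Scan using foo on mv_test_aes" in the output):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.day, t1.day_sum, t1.day_ct
FROM   mv_test_aes t
LEFT   JOIN mv_test_aes t1 ON t1.day <  t.day
                          AND t1.day >= t.day - 40;
```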

SELECT t.day
     , round(COALESCE(sum(t1.day_sum) * 1.0 / sum(t1.day_ct)  -- * 1.0 to avoid int division
                         , t.day_sum  * 1.0 /      t.day_ct), 4) AS avg_40days
FROM   mv_test_aes t
LEFT   JOIN mv_test_aes t1 ON t1.day <  t.day
                          AND t1.day >= t.day - 40  -- not 41
GROUP  BY t.day, t.day_sum, t.day_ct
ORDER  BY t.day;

Result:

day        | avg_40days
-----------+------------
2016-01-01 | 15.0000
2016-01-05 | 15.0000
2016-01-10 | 15.3333
2016-01-15 | 15.2941
2016-01-16 | 15.5000
2016-01-20 | 15.6522
2016-01-21 | 15.6000
2016-01-31 | 15.5556
2016-02-01 | 15.5172
2016-02-10 | 15.4839
2016-02-15 | 15.5556  -- correct result (15.6522 above)
2016-02-26 | 15.0000  -- correct result (15.3333 above)
2016-03-04 | 15.0000
2016-03-18 | 15.0000

SQL Fiddle.

If you run this often, I would wrap the whole shebang into a MV to avoid repeated computation.
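A sketch of such a wrapper, reusing the query above verbatim (the view name is my choice):

```sql
CREATE MATERIALIZED VIEW mv_avg_40days AS
SELECT t.day
     , round(COALESCE(sum(t1.day_sum) * 1.0 / sum(t1.day_ct)
                         , t.day_sum  * 1.0 /      t.day_ct), 4) AS avg_40days
FROM   mv_test_aes t
LEFT   JOIN mv_test_aes t1 ON t1.day <  t.day
                          AND t1.day >= t.day - 40
GROUP  BY t.day, t.day_sum, t.day_ct
ORDER  BY t.day;
```

Refresh it together with mv_test_aes once per day.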

A solution with window functions and a frame clause ROWS BETWEEN .... would be possible, too. But your example data suggests that you don't have values for most of the days in the range (many more gaps than islands), so I don't expect it to be faster. Related: