Calculate Total Visits

There are a lot of questions and articles about packing time intervals. For example, Packing Intervals by Itzik Ben-Gan.

You can pack your intervals for the given user. Once packed, there will be no overlaps, so you can simply sum up the durations of the packed intervals.
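
For illustration, here is a minimal sketch of one way to pack the intervals with window functions, assuming a table #Items(CustID, StartDate, EndDate) holding date values and SQL Server 2012 or later (see Itzik's article for a full treatment):

WITH C AS
(
    SELECT CustID, StartDate, EndDate,
           -- highest end date among the customer's earlier intervals
           MAX(EndDate) OVER (PARTITION BY CustID
                              ORDER BY StartDate, EndDate
                              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PrevMaxEnd
    FROM #Items
),
G AS
(
    SELECT CustID, StartDate, EndDate,
           -- start a new group whenever an interval does not overlap the previous ones
           SUM(CASE WHEN StartDate <= PrevMaxEnd THEN 0 ELSE 1 END)
               OVER (PARTITION BY CustID
                     ORDER BY StartDate, EndDate
                     ROWS UNBOUNDED PRECEDING) AS Grp
    FROM C
),
Packed AS
(
    SELECT CustID, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate
    FROM G
    GROUP BY CustID, Grp
)
SELECT CustID,
       SUM(DATEDIFF(day, StartDate, EndDate) + 1) AS TotalDays
FROM Packed
GROUP BY CustID;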


If your intervals are dates without times, I'd use a Calendar table. This table simply holds a list of dates spanning several decades. If you do not have one, create it:

CREATE TABLE [dbo].[Calendar](
    [dt] [date] NOT NULL,
CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED 
(
    [dt] ASC
));

There are many ways to populate such a table.

For example, 100K rows (~270 years) from 1900-01-01:

INSERT INTO dbo.Calendar (dt)
SELECT TOP (100000) 
    DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '19000101') AS dt
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);

See also Why are numbers tables "invaluable"?

Once you have a Calendar table, here is how to use it.

Each original row is joined with the Calendar table to return as many rows as there are dates between StartDate and EndDate.

Then we count distinct dates, which removes overlapping dates.

SELECT COUNT(DISTINCT CA.dt) AS TotalCount
FROM
    #Items AS T
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= T.StartDate
            AND dbo.Calendar.dt <= T.EndDate
    ) AS CA
WHERE T.CustID = 11205
;

Result

TotalCount
7

I strongly agree that a Numbers and a Calendar table are very useful, and that this problem can be simplified a lot with a Calendar table.

I'll suggest another solution though, one that needs neither a Calendar table nor windowed aggregates (as some of the answers in the linked post by Itzik do). It may not be the most efficient in all cases (or it may be the worst in all cases!), but I don't think it hurts to test it.

It works by first finding the start and end dates that do not overlap with other intervals, then placing them in two separate sets (one of start dates, one of end dates) so row numbers can be assigned, and finally matching the 1st start date with the 1st end date, the 2nd with the 2nd, etc.:

WITH 
  start_dates AS
    ( SELECT CustID, StartDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY StartDate)
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate 
            )
      GROUP BY CustID, StartDate
    ),
  end_dates AS
    ( SELECT CustID, EndDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID 
                                     ORDER BY EndDate) 
      FROM items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate 
            )
      GROUP BY CustID, EndDate
    )
SELECT s.CustID, 
       Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
FROM start_dates AS s
  JOIN end_dates AS e
    ON  s.CustID = e.CustID
    AND s.Rn = e.Rn 
GROUP BY s.CustID ;

Two indexes, on (CustID, StartDate, EndDate) and on (CustID, EndDate, StartDate), would be useful for improving the performance of the query.
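
For example (assuming the data lives in a permanent Items table; the index names are only illustrative):

CREATE INDEX IX_Items_CustID_StartDate_EndDate
    ON dbo.Items (CustID, StartDate, EndDate);

CREATE INDEX IX_Items_CustID_EndDate_StartDate
    ON dbo.Items (CustID, EndDate, StartDate);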

An advantage over the Calendar table (perhaps the only one) is that it can easily be adapted to work with datetime values and to measure the length of the "packed intervals" at different precisions, larger (weeks, years) or smaller (hours, minutes, seconds, milliseconds, etc.), not only counting dates. A Calendar table at minute or second precision would be quite big, and (cross) joining it to a big table would be quite an interesting experience, but probably not the most efficient one.
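
As a sketch of that adaptation: the pairing of start and end rows stays exactly the same, and only the final SELECT changes, for example to minutes (assuming datetime columns; whether you keep a "+ 1" depends on whether your intervals are closed or half-open):

SELECT s.CustID,
       Result = SUM( DATEDIFF(minute, s.StartDate, e.EndDate) )
FROM start_dates AS s
  JOIN end_dates AS e
    ON  s.CustID = e.CustID
    AND s.Rn = e.Rn
GROUP BY s.CustID ;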

(thanks to Vladimir Baranov): It is rather difficult to compare the performance of the methods properly, because their performance likely depends on the data distribution:

  • How long the intervals are - the shorter the intervals, the better a Calendar table performs, because long intervals produce a lot of intermediate rows.
  • How often the intervals overlap - mostly non-overlapping intervals vs. most intervals covering the same range. I think the performance of Itzik's solution depends on that.

There could be other ways to skew the data, and it is hard to tell how the efficiency of the various methods would be affected.
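
If you do want to test, a hypothetical data generator along these lines lets you control both knobs - @AvgLen for interval length and @Spread for how tightly a customer's intervals cluster (all names and numbers here are made up):

DECLARE @Rows int = 100000, @AvgLen int = 5, @Spread int = 60;

INSERT INTO #Items (CustID, StartDate, EndDate)
SELECT d.CustID,
       d.StartDate,
       -- random length between 0 and @AvgLen - 1 days
       EndDate = DATEADD(day, ABS(CHECKSUM(NEWID())) % @AvgLen, d.StartDate)
FROM
(
    SELECT TOP (@Rows)
           CustID    = ABS(CHECKSUM(NEWID())) % 1000,
           -- random start within a @Spread-day window
           StartDate = DATEADD(day, ABS(CHECKSUM(NEWID())) % @Spread, CAST('20150101' AS date))
    FROM sys.all_objects AS s1
    CROSS JOIN sys.all_objects AS s2
) AS d;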


This first query builds non-overlapping Start_Date and End_Date ranges.

Note:

  • Your sample (id = 0) is mixed with a sample from Ypercube (id = 1)
  • This solution may not scale well with a huge amount of data per id or a huge number of ids. It has the advantage of not requiring a numbers table, but with a large dataset a numbers table will very likely give better performance.

Query:

SELECT DISTINCT its.id
    , Start_Date = its.Start_Date 
    , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    --, x1=itmax.End_Date, x2=itmin.Start_Date, x3=its.End_Date
FROM @Items its
OUTER APPLY (
    SELECT Start_Date = MAX(End_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
) itmin
OUTER APPLY (
    SELECT End_Date = MIN(Start_Date) FROM @Items std
    WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
) itmax;

Output:

id  | Start_Date                    | End_Date                      
0   | 2015-01-23 00:00:00.0000000   | 2015-01-23 00:00:00.0000000   => 1
0   | 2015-01-24 00:00:00.0000000   | 2015-01-27 00:00:00.0000000   => 4
0   | 2015-01-29 00:00:00.0000000   | 2015-01-30 00:00:00.0000000   => 2
1   | 2016-01-20 00:00:00.0000000   | 2016-01-22 00:00:00.0000000   => 3
1   | 2016-01-23 00:00:00.0000000   | 2016-01-24 00:00:00.0000000   => 2
1   | 2016-01-25 00:00:00.0000000   | 2016-01-29 00:00:00.0000000   => 5

If you use these Start_Date and End_Date values with DATEDIFF:

SELECT DATEDIFF(day
    , its.Start_Date 
    , COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
) + 1
...

Output (with duplicates) is:

  • 1, 4 and 2 for id 0 (your sample => SUM=7)
  • 3, 2 and 5 for id 1 (Ypercube sample => SUM=10)

You then only need to put everything together with a SUM and GROUP BY:

SELECT id 
    , Days = SUM(
        DATEDIFF(day, Start_Date, End_Date)+1
    )
FROM (
    SELECT DISTINCT its.id
        , Start_Date = its.Start_Date
        , End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
    FROM @Items its
    OUTER APPLY (
        SELECT Start_Date = MAX(End_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
    ) itmin
    OUTER APPLY (
        SELECT End_Date = MIN(Start_Date) FROM @Items std
        WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
    ) itmax
) as d
GROUP BY id;

Output:

id  Days
0   7
1   10

Data used with 2 different ids:
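
(The declaration of @Items is not shown; a plausible definition, inferred from the column names and the datetime2-style output above, would be:)

DECLARE @Items TABLE
(
    id         int NOT NULL,
    Item_ID    int NOT NULL,
    Start_Date datetime2(7) NOT NULL,
    End_Date   datetime2(7) NOT NULL
);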

INSERT INTO @Items
    (id, Item_ID, Start_Date, End_Date)
VALUES 
    (0, 20009, '2015-01-23', '2015-01-26'),
    (0, 20010, '2015-01-24', '2015-01-24'),
    (0, 20011, '2015-01-23', '2015-01-26'),
    (0, 20012, '2015-01-23', '2015-01-27'),
    (0, 20013, '2015-01-23', '2015-01-27'),
    (0, 20014, '2015-01-29', '2015-01-30'),

    (1, 20009, '2016-01-20', '2016-01-24'),
    (1, 20010, '2016-01-23', '2016-01-26'),
    (1, 20011, '2016-01-25', '2016-01-29');