Split date rows by new year

Here's a tidyverse-based solution. It's similar to Lennyy's, but with fewer condition checks, and no times get added to the dates (a tibble might display them, but only as 00:00:00). I've added ungroup() because, judging by your comment under Lennyy's solution, you have a grouping variable somewhere. It can be removed if you don't:

library(dplyr)
library(lubridate)
library(purrr)

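test %>% 
    ungroup() %>% # This isn't necessary if there are no groupings.
    split(seq_len(nrow(test))) %>% # One single-row data frame per row, kept in row order.
    map_dfr(function(df){
        # Rows that stay within one calendar year pass through unchanged.
        if (year(df$from_date) == year(df$to_date)) return(df)
        # Otherwise split at 1 January of to_date's year.
        bind_rows(mutate(df, to_date = rollback(floor_date(to_date, "year"))),
                  mutate(df, from_date = floor_date(to_date, "year")))
    })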

#### OUTPUT ####

  ID Disease Pass Payment  from_date    to_date
1 10       P   US     110 2008-01-09 2008-12-31
2 10       P   US     110 2009-01-01 2009-01-08
3 10       P   US     110 2009-01-09 2009-12-31
4 10       P   US     110 2010-01-01 2010-01-08
5 10       P   US     115 2010-01-09 2010-12-31
6 10       P   US     115 2011-01-01 2011-01-08
7 12       D   EN     240 2008-01-01 2008-12-31
8 12       P   EN     255 2013-12-31 2013-12-31
9 12       P   EN     255 2014-01-01 2014-12-30

To explain: the data frame is split into a list of single-row data frames. map_dfr then runs the anonymous function on each of them and binds the results back together. Rows whose from_date and to_date fall in the same year are returned unchanged. For the others, I floor to_date to the first day of its year, and then either roll it back to the last day of the previous month (31 December) for the new to_date in the first row, or leave it as it is for the new from_date in the second row. Each row is split at most once, so this assumes no range spans more than one New Year.
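To see the two date helpers in isolation, here is a minimal sketch on a single date. The variable names are only illustrative and are not part of the solution above.

```r
library(lubridate)

# Hypothetical end date that falls just after a New Year.
to_date <- as.Date("2009-01-08")

# First day of to_date's year: becomes the new from_date of the second row.
start_of_year <- floor_date(to_date, "year")

# Last day of the previous month, i.e. 31 December of the year before:
# becomes the new to_date of the first row.
end_of_prev <- rollback(start_of_year)

start_of_year  # "2009-01-01"
end_of_prev    # "2008-12-31"
```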


Using from_date and to_date, we can create a daily date sequence with seq.Date, split that sequence by year, and take the min and max of each year. apply, separate_rows and separate then turn those pairs into the final result.

cr_date <- function(d1, d2){
    # Every day from d1 to d2, inclusive.
    sequence_date <- seq.Date(as.Date(d1), as.Date(d2), by='day')
    # One "first|last" string per calendar year.
    lst_dates <- lapply(split(sequence_date, lubridate::year(sequence_date)),
                        function(x) paste0(min(x), '|', max(x)))
    result <- paste0(lst_dates, collapse = ';')
    return(result)
}

#Test
#cr_date(as.Date('2008-01-09'),as.Date('2009-01-08'))
test$flag <- apply(test, 1, function(x) cr_date(x['from_date'], x['to_date']))

library(tidyr)
library(dplyr) # provides %>% and mutate_at
separate_rows(test, flag, sep=';') %>% 
  separate(flag, into = c('from_date_new','to_date_new'), '\\|') %>% 
  mutate_at(vars('from_date_new','to_date_new'), list(~as.Date(.)))


    ID Disease Pass Payment  from_date    to_date from_date_new to_date_new
  1 10       P   US     110 2008-01-09 2009-01-08    2008-01-09  2008-12-31
  2 10       P   US     110 2008-01-09 2009-01-08    2009-01-01  2009-01-08
  3 10       P   US     110 2009-01-09 2010-01-08    2009-01-09  2009-12-31
  4 10       P   US     110 2009-01-09 2010-01-08    2010-01-01  2010-01-08
  5 10       P   US     115 2010-01-09 2011-01-08    2010-01-09  2010-12-31
  6 10       P   US     115 2010-01-09 2011-01-08    2011-01-01  2011-01-08
  7 12       D   EN     240 2008-01-01 2008-12-31    2008-01-01  2008-12-31
  8 12       P   EN     255 2013-12-31 2014-12-30    2013-12-31  2013-12-31
  9 12       P   EN     255 2013-12-31 2014-12-30    2014-01-01  2014-12-30
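For a single pair of dates, the intermediate string that gets split by separate_rows and separate looks like the one below. This is a base-R sketch of the same split-by-year idea, using format() for the year instead of lubridate; the dates come from the first row of the output and the variable names are only illustrative.

```r
# Daily sequence for one illustrative range that crosses a New Year.
d <- seq(as.Date("2008-01-09"), as.Date("2009-01-08"), by = "day")

# Group the days by calendar year, then keep each year's first and last day.
by_year <- split(d, format(d, "%Y"))
pieces  <- vapply(by_year,
                  function(x) paste(min(x), max(x), sep = "|"),
                  character(1))

# Join the per-year pieces with ";", one piece per output row.
result <- paste(pieces, collapse = ";")
result
# "2008-01-09|2008-12-31;2009-01-01|2009-01-08"
```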

This uses only base R.

First, note that only dates with no times are used, so we should be using Date class, not POSIXct. The latter can needlessly introduce time zone errors unless you are very careful. We therefore assume that we start out with test2, which contains Date class data; the Note at the end shows the input used and how to convert it to Date class if it is already POSIXct.

Given test2, we add from_year, to_year and eoy (the last day of from_date's year) columns, giving test3. We then iterate over the rows: if the years are the same we return the row as is, and if not we return the two split rows. This gives a list of one- and two-row data frames, which we rbind together.

test3 <- transform(test2, 
  from_year = format(from_date, "%Y"),
  to_year = format(to_date, "%Y"),
  eoy = as.Date(sub("-.*", "-12-31", from_date)))

nr <- nrow(test2)
do.call("rbind", lapply(1:nr, function(i) with(test3[i, ],
  if (from_year == to_year) test2[i, ]
  else data.frame(ID, Disease, Pass, Payment, 
      from_date = c(from_date, eoy+1),
      to_date = c(eoy, to_date)))
))
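The eoy column is the piece that does the splitting, so here is a small base-R sketch of it on its own. The two dates are taken from the output above and the variable names are only illustrative.

```r
# Two illustrative start dates, both Date class as this answer assumes.
from_date <- as.Date(c("2008-01-09", "2013-12-31"))

# Replace everything after the year with "-12-31" to get the last day of that year.
eoy <- as.Date(sub("-.*", "-12-31", from_date))

eoy      # "2008-12-31" "2013-12-31": end of the first piece
eoy + 1  # "2009-01-01" "2014-01-01": start of the second piece
```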

Note

Assumed input in reproducible form. As noted above it uses Date class.

test2 <- transform(test, 
  from_date = as.Date(from_date),
  to_date = as.Date(to_date))
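
The original test data frame is not reproduced here. The sketch below is an input consistent with the output tables above, reconstructed from their from_date and to_date columns. It is an assumption rather than the asker's actual data; in particular, the column types (numeric ID and Payment, Date class dates) are guesses.

```r
# Assumed input, reconstructed from the printed output; not the asker's original object.
test <- data.frame(
  ID        = c(10, 10, 10, 12, 12),
  Disease   = c("P", "P", "P", "D", "P"),
  Pass      = c("US", "US", "US", "EN", "EN"),
  Payment   = c(110, 110, 115, 240, 255),
  from_date = as.Date(c("2008-01-09", "2009-01-09", "2010-01-09",
                        "2008-01-01", "2013-12-31")),
  to_date   = as.Date(c("2009-01-08", "2010-01-08", "2011-01-08",
                        "2008-12-31", "2014-12-30")),
  stringsAsFactors = FALSE
)

str(test)
```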
