R data frame organisation

This is certainly a matter of opinion (I totally agree with @MattB). Data frames are convenient for many statistical analyses, but you often have to transform them to fit your purpose.

Your case shows a data frame in "wide form". I see no convenient way to add more facts about rowers in that form, so I would transform it to "long form". In the long form each rower in each race gets their own row. Since the rowers seem to be your "object of interest" (your cases), that would probably make things easier. The question "which races did rower 4 take part in?" can be answered easily in that form.
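
As a minimal sketch of what that transformation does, base R alone is enough. It assumes the data frame has time, race and boat columns followed by one column per seat; the column names, the seat order and the name wide are hypothetical, not taken from the question:

```r
# Toy wide-form data: one row per boat per race (seat order is assumed).
wide <- data.frame(
  time  = c(204.98, 204.91),
  race  = c(1, 1),
  boat  = c(1, 2),
  seat1 = c(1, 3),
  seat2 = c(2, 4),
  seat3 = c(5, 7),
  seat4 = c(6, 8)
)

seat_cols <- paste0("seat", 1:4)

# Stack the seat columns: one row per rower per boat per race.
long <- data.frame(
  time  = rep(wide$time, times = length(seat_cols)),
  race  = rep(wide$race, times = length(seat_cols)),
  boat  = rep(wide$boat, times = length(seat_cols)),
  seat  = rep(seat_cols, each = nrow(wide)),
  rower = unlist(wide[seat_cols], use.names = FALSE)
)

# "Which races did rower 4 take part in?" is now a simple row filter.
rower4 <- long[long$rower == 4, c("race", "boat", "time")]
```

With more races in the data, the same filter returns one row per race that rower 4 took part in.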


This is going to be a matter of opinion and will depend in part on what sort of questions you will want to ask of this dataset. For example, the question "which races did rower 4 take part in?" is not easily answered with the format above.

For that reason I would lean towards:

  • A table of races, much like you have, but without the seat* columns;
  • A table of rowers, where additional details (name, weight, etc.) can be kept; and
  • A table linking the two, with one row per rower per race.

This would avoid most redundancy and allow most questions (that I can think of!) to be answered relatively straightforwardly. You can always have a function (using, e.g., dcast) to recreate the form you show above for human-readability.
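
A rough sketch of that three-table layout in base R might look like the following. The table names (races, rowers, crews), the name column and the toy values are assumptions made for illustration only:

```r
# Table of races: one row per boat per race, without the seat columns.
races <- data.frame(
  race = c(1, 1, 2, 2),
  boat = c(1, 2, 1, 2),
  time = c(204.98, 204.91, 202.49, 207.40)
)

# Table of rowers: one row per rower; weight etc. would be further columns.
rowers <- data.frame(rower = 1:8, name = paste("Rower", 1:8))

# Link table: one row per rower per race.
crews <- data.frame(
  race  = rep(c(1, 1, 2, 2), each = 4),
  boat  = rep(c(1, 2, 1, 2), each = 4),
  rower = c(1, 2, 5, 6,  3, 4, 7, 8,  2, 4, 5, 7,  1, 3, 6, 8)
)

# Join the link table to the races, then summarise per rower.
results     <- merge(crews, races, by = c("race", "boat"))
mean_time   <- aggregate(time ~ rower, data = results, FUN = mean)
summary_tbl <- merge(rowers, mean_time, by = "rower")
```

Each fact is stored once, and questions about rowers or races become a merge followed by a filter or an aggregate.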


To create a table of events vs. rowers, melt the data into the long form m and then cast it back into the appropriate wide form. There is no reason you can't keep the data in multiple forms, so it is not really necessary to choose a single best form. You can always regenerate them if new data comes in. Which form is of interest depends on what you want to do with it, but the code below gives you three forms:

  1. the original wide form df,
  2. the long form m which could be useful for regression, boxplots, etc. e.g.

    lm(time ~ factor(rower) + 0, m)
    boxplot(time ~ boat, m)
    
  3. the revised wide form df2.

If there are rower-specific attributes, those could be stored in a separate data frame with one row per rower and one column per attribute. Depending on what you want to do, that data frame could be merged with m using merge, for example if you want to use those attributes in a regression.
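
As an illustration of that merge, here is a small base R sketch. The rowers data frame, its weight_kg column and all of the values are made up for the example, and m is built by hand so that the snippet runs on its own:

```r
# Long form: one row per rower per boat per race (toy values, seat labels assumed).
m <- data.frame(
  time  = rep(c(204.98, 204.91), each = 4),
  race  = 1,
  boat  = rep(c(1, 2), each = 4),
  seat  = rep(paste0("seat", 1:4), times = 2),
  rower = c(1, 2, 5, 6, 3, 4, 7, 8)
)

# Hypothetical attribute table: one row per rower, one column per attribute.
rowers <- data.frame(
  rower     = 1:8,
  weight_kg = c(82, 85, 79, 88, 84, 81, 86, 83)
)

# Attach the attributes to every row of the long form.
m_attr <- merge(m, rowers, by = "rower")

# The merged attribute can then be used as a regressor (meaningless on toy data).
fit <- lm(time ~ weight_kg, data = m_attr)
```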

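library(data.table)

# df is the original wide form; columns 1:3 are the id columns (time, race and boat below)
m <- melt(as.data.table(df), id.vars = 1:3, value.name = "rower")
# one column per rower, holding the rower number where present and NA otherwise
df2 <- dcast(data = m, time + race + boat ~ rower, value.var = "rower")
setkey(df2, boat, race) # sort by boat, then race
df2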

giving:

      time race boat  1  2  3  4  5  6  7  8
 1: 204.98    1    1  1  2 NA NA  5  6 NA NA
 2: 202.49    2    1 NA  2 NA  4  5 NA  7 NA
 3: 202.27    3    1 NA  2  3 NA NA  6  7 NA
 4: 206.48    4    1  1  2 NA NA NA NA  7  8
 5: 204.85    5    1 NA  2 NA  4 NA  6 NA  8
 6: 204.93    6    1 NA  2  3 NA  5 NA NA  8
 7: 204.91    1    2 NA NA  3  4 NA NA  7  8
 8: 207.40    2    2  1 NA  3 NA NA  6 NA  8
 9: 207.62    3    2  1 NA NA  4  5 NA NA  8
10: 203.41    4    2 NA NA  3  4  5  6 NA NA
11: 205.04    5    2  1 NA  3 NA  5 NA  7 NA
12: 204.96    6    2  1 NA NA  4 NA  6  7 NA

Alternately, with dplyr/tidyr:

library(dplyr)
library(tidyr)

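# long form: every column except the first three becomes a seat/rower pair
m <- df %>%
  pivot_longer(-(1:3), names_to = "seat", values_to = "rower")
# revised wide form: id_cols is passed by name, as recent tidyr versions expect
df2 <- m %>%
  pivot_wider(id_cols = 1:3, names_from = rower, values_from = rower, names_sort = TRUE)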

Tags:

R

Dataframe