Removal of constant columns in R

This Q&A is a popular Google search result, but the first answer is a bit slow for a large matrix, and I do not have enough reputation to comment on it, so I am posting a new answer.

For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.

df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
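
The one-liner above targets a matrix. For a data frame, here is a minimal sketch of the same min = max test applied column by column with vapply, so no matrix coercion happens (drop_constant_cols is a hypothetical helper name, not from the answer):

# Hypothetical helper (sketch): same max == min test, per column of a data
# frame, without coercing to a matrix. Assumes numeric (or otherwise
# ordered) columns.
drop_constant_cols <- function(d) {
  keep <- !vapply(d,
                  function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE),
                  logical(1))
  d[, keep, drop = FALSE]
}

drop_constant_cols(data.frame(x = 1:5, y = rep(1, 5)))  # keeps only x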

Here is the benchmark. Compared to the first answer, more than 90% of the runtime is saved; it is also faster than the approach from the second comment on the question.

ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method

time1
#   user  system elapsed 
# 22.267   0.194  22.626 
time2
#   user  system elapsed 
#  2.073   0.077   2.155 
time3
#   user  system elapsed 
#  6.702   0.060   6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
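
One edge case worth noting (my addition, not part of the benchmark above): with na.rm = TRUE, an all-NA numeric column reduces to max = min = -Inf, so the min/max test flags it as constant and drops it, with warnings; the variance test returns NA for such a column instead:

df_na <- data.frame(x = 1:5, y = NA_real_)
sapply(df_na, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))
#     x     y
# FALSE  TRUE   (plus "no non-missing arguments" warnings for y)
sapply(df_na, var, na.rm = TRUE) != 0
#    x    y
# TRUE   NA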

The problem here is that the column's variance is equal to zero. You can check which columns of a data frame are constant this way, for example:

df <- data.frame(x=1:5, y=rep(1,5))
df
#   x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1

# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y" 

So if you want to exclude these columns, you can use:

df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]

EDIT: In fact it is simpler to use apply instead. Something like this:

df[,apply(df, 2, var, na.rm=TRUE) != 0]
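
A caveat worth adding (my note, not from the original answer): apply coerces a data frame to a matrix first, so a single character column turns the whole matrix into character and var fails; sapply works column by column and avoids this. A minimal sketch:

df_mixed <- data.frame(x = 1:5, y = rep(1, 5), z = letters[1:5],
                       stringsAsFactors = FALSE)

# apply(df_mixed, 2, var) would fail: as.matrix() makes every column character.
# sapply() keeps each column's own type:
zero_var <- sapply(df_mixed, function(v) is.numeric(v) && var(v, na.rm = TRUE) == 0)
df_mixed[, !zero_var, drop = FALSE]  # drops y; character columns are kept as-is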

The janitor package has the function remove_constant, which deletes constant columns. Let's create some synthetic data for illustration:

library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat

This is test_dat:

> test_dat
   A  B C
1  1  1 A
2  1  2 B
3  1  3 C
4  1  4 D
5  1  5 E
6  1  6 F
7  1  7 G
8  1  8 H
9  1  9 I
10 1 10 J

Then the function remove_constant deletes the constant column:

remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)

Either of the above two calls gives:

    B C
1   1 A
2   2 B
3   3 C
4   4 D
5   5 E
6   6 F
7   7 G
8   8 H
9   9 I
10 10 J

NOTE: use the argument na.rm = TRUE to make sure that a column containing only a single value plus NAs is also deleted. For example,

test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA

This is the test_dat_with_NA we get:

    A  B C
1   1  1 A
2  NA  2 B
3   1  3 C
4  NA  4 D
5   1  5 E
6  NA  6 F
7   1  7 G
8  NA  8 H
9   1  9 I
10 NA 10 J

then the call

remove_constant(test_dat_with_NA)

does not delete column A:

    A  B C
1   1  1 A
2  NA  2 B
3   1  3 C
4  NA  4 D
5   1  5 E
6  NA  6 F
7   1  7 G
8  NA  8 H
9   1  9 I
10 NA 10 J

while the call

remove_constant(test_dat_with_NA, na.rm= TRUE)

does delete column A, which contains only the value 1 and NAs:

    B C
1   1 A
2   2 B
3   3 C
4   4 D
5   5 E
6   6 F
7   7 G
8   8 H
9   9 I
10 10 J
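
Since remove_constant takes the data as its first argument, it is also pipe-friendly, for example:

library(janitor)
library(magrittr)

test_dat_with_NA %>%
  remove_constant(na.rm = TRUE)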

This Q&A is a popular Google search result, but the accepted answer is a bit slow for a large matrix, and @raymkchow's version is slow with NAs, so I propose a new version using exponential search and the power of data.table.

This is a function I implemented in the dataPreparation package.

First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:

library(data.table)
library(dataPreparation)

ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)

Then benchmark all approaches:

time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method

The results are the following:

time1 # Variance approach
#   user  system elapsed 
#   2.55    1.45    4.07
time2 # Min = max approach
#   user  system elapsed 
#  2.72      1.5    4.22
time3 # length(unique()) approach
#   user  system elapsed 
#    6.7    2.75    9.53
time4 # Exponential search approach
#   user  system elapsed 
#   0.39    0.07    0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE

dataPreparation::which_are_constant is about 10 times faster than the other approaches.

And the more rows you have, the bigger the speed advantage.
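
For reference, here is a minimal usage sketch (assuming, as in the benchmark above, that which_are_constant returns the indices of the constant columns):

library(data.table)
library(dataPreparation)

dt <- data.table(A = 1, B = 1:10, C = LETTERS[1:10])

const_idx <- which_are_constant(dt, verbose = FALSE)  # indices of constant columns
names(dt)[const_idx]                                  # "A"
dt_clean <- dt[, -const_idx, with = FALSE]            # drop them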
