Label Encoder functionality in R?

Create your vector of data:

colors <- c("red", "red", "blue", "green")

Create a factor:

factors <- factor(colors)

Convert the factor to numbers:

as.numeric(factors)

Output: (note that this is in alphabetical order)

# [1] 3 3 1 2

You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)

rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4

See ?factor.


If I correctly understand what do you want:

# function which returns function which will encode vectors with values  of 'vec' 
label_encoder = function(vec){
    levels = sort(unique(vec))
    function(x){
        match(x, levels)
    }
}

colors = c("red", "red", "blue", "green")

color_encoder = label_encoder(colors) # create encoder

encoded_colors = color_encoder(colors) # encode colors
encoded_colors

new_colors = c("blue", "green", "green")  # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors

other_colors = c("blue", "green", "green", "yellow") 
color_encoder(other_colors) # NA's are introduced

# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result

# dealing with multiple columns

# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
    matrix(
        sample(c("red", "blue", "green",  "yellow"), 12, replace = TRUE),
        ncol = 3)
)
color_dataframe

# encode each column
for (column in colnames(color_dataframe)){
    color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe

Try CatEncoders package. It replicates the Python sklearn.preprocessing functionality.

# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)

# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values

# [1]  3  3 NA


# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA   

I would add the functionality of reporting the non-matching labels with a warning.

PS: It also has the OneHotEncoder function.

Tags:

R