Spark and Not Serializable DateTimeFormatter

Another approach is to mark the DateTimeFormatter as @transient. This tells the JVM/Spark that the field is not to be serialized with the task closure, and is instead constructed from scratch on each executor. For something that is cheap to construct once per executor, like a DateTimeFormatter, this is a good approach.
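
In Scala this is usually paired with lazy val, so the skipped field is rebuilt on first use after deserialization. A minimal sketch of that pattern; the class name LogParser, the sample pattern string, and the parse helper are illustrative, not from the original:

    import java.time.LocalDateTime
    import java.time.format.DateTimeFormatter

    class LogParser extends Serializable {
      // @transient: the closure serializer skips this field;
      // lazy val: the formatter is re-created on first use inside each executor JVM
      @transient lazy val dtFormatter: DateTimeFormatter =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

      def parse(line: String): LocalDateTime =
        LocalDateTime.parse(line, dtFormatter)
    }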

Here's an article that describes this in more detail.

You can avoid serialization in two ways:

  1. Assuming its value can be constant, place the formatter in an object (making it "static"). The static value is then resolved locally within each worker, instead of the driver serializing it and sending it to the workers:

    import java.time.format.DateTimeFormatter

    object MyUtils {
      val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
    }

    import MyUtils._
    logs.flatMap(fileContent => {
      // dtFormatter can be used safely here: each executor resolves the
      // object's field locally, so nothing needs to be serialized
    })
    
  2. Instantiate it per record inside the anonymous function. This carries some performance penalty (the instantiation happens over and over, once per record), so only use this option if the first can't be applied:

    import java.time.format.DateTimeFormatter

    logs.flatMap(fileContent => {
      // constructed anew for every element this closure processes
      val dtFormatter = DateTimeFormatter.ofPattern("<some non-ISO pattern>")
      // use dtFormatter here
    })