Extracting `Seq[(String,String,String)]` from spark DataFrame

Well, it doesn't claim that it is a tuple. It claims it is a `struct`, which maps to `Row`:

import org.apache.spark.sql.Row
import spark.implicits._  // needed for toDF and the $"..." column syntax (assuming a SparkSession named spark)

case class Feature(lemma: String, pos_tag: String, ne_tag: String)
case class Record(id: Long, content_processed: Seq[Feature])

val df = Seq(
  Record(1L, Seq(
    Feature("ancient", "jj", "o"),
    Feature("olympia_greece", "nn", "location")
  ))
).toDF

val content = df.select($"content_processed").rdd.map(_.getSeq[Row](0))

You'll find the exact mapping rules in the Spark SQL programming guide.

Since `Row` is not exactly a pretty structure, you'll probably want to map it to something more useful:

content.map(_.map {
  case Row(lemma: String, pos_tag: String, ne_tag: String) => 
    (lemma, pos_tag, ne_tag)
})

or:

content.map(_.map ( row => (
  row.getAs[String]("lemma"),
  row.getAs[String]("pos_tag"),
  row.getAs[String]("ne_tag")
)))

Finally, a slightly more concise approach with Datasets:

df.as[Record].rdd.map(_.content_processed)

or

df.select($"content_processed").as[Seq[(String, String, String)]]

although this seems to be slightly buggy at this moment.

There is an important difference between the first approach (`Row.getAs`) and the second one (`Dataset.as`). The former extracts objects as `Any` and applies `asInstanceOf`; the latter uses encoders to transform between internal types and the desired representation.
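The practical consequence is easy to demonstrate even outside Spark. A minimal sketch (plain Scala, no Spark classes) of the `getAs`-style behavior: the value is held as `Any` and cast back on access, so a wrong type annotation compiles fine and fails only at runtime:

```scala
// Sketch of how Row-style access behaves: values are stored as Any
// and cast back with asInstanceOf when you read them.
val cell: Any = "ancient"

// The cast succeeds when the requested type matches the stored value.
val lemma = cell.asInstanceOf[String]

// A wrong type annotation still compiles; it only throws
// ClassCastException when the cast is actually executed.
val failed =
  try { Some(cell.asInstanceOf[java.lang.Integer]) }
  catch { case _: ClassCastException => None }

println(lemma)   // ancient
println(failed)  // None
```

With `Dataset.as[T]`, by contrast, an `Encoder[T]` has to be resolved against the schema first, so a mismatch between the column types and `T` surfaces during analysis rather than deep inside an RDD action.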