Splitting strings in Apache Spark using Scala

So... In Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset). RDDs provide functionality similar to Scala's collection types.

val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]

val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]

val yourRdd = splitRdd.flatMap( arr => {
  val title = arr( 0 )
  val text = arr( 1 )
  val words = text.split( " " )
  words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]
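
To make the transformation concrete, here is the same per-line logic applied to a single made-up sample line (plain Scala, no Spark needed; the line content is purely for illustration):

val sampleLine = "My Title\thello spark world"  // hypothetical input line
val Array(sampleTitle, sampleText) = sampleLine.split("\t")
val samplePairs = sampleText.split(" ").map(word => (word, sampleTitle))
// samplePairs: Array((hello,My Title), (spark,My Title), (world,My Title))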

// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
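
Keep in mind that foreach runs on the executors, so on a real cluster the println output lands in the executor logs rather than on the driver console. For a quick look at a few records from the driver, something like this works (10 is an arbitrary sample size):

yourRdd.take(10).foreach { case (word, title) => println(s"{ $word, $title }") }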

// If you want to count (note: this counts non-unique words):
val countRdd = yourRdd
  .groupBy( { case ( word, title ) => title } )  // group by title
  .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
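
If you only need the per-title counts, a common alternative is to map each record to (title, 1) and reduce by key, which avoids materialising the full group for every title. This is just an equivalent sketch of the same count:

val countRdd2 = yourRdd
  .map { case (word, title) => (title, 1) }  // one 1 per word occurrence
  .reduceByKey(_ + _)                        // sum the 1s per title
// RDD[ ( String, Int ) ]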

Here is how the same problem can be solved with the newer DataFrame API. First, read the data using "\t" as the delimiter:

import org.apache.spark.sql.functions.{count, explode, split}
import spark.implicits._  // enables the $"columnName" syntax

val df = spark.read
  .option("delimiter", "\t")
  .option("header", false)
  .csv("s3n://file.txt")
  .toDF("title", "text")
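
Since no schema is supplied, both columns come back as strings; a quick df.printSchema() confirms the column names set by toDF and should print something like:

df.printSchema()
// root
//  |-- title: string (nullable = true)
//  |-- text: string (nullable = true)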

Then, split the text column on space and explode to get one word per row.

val df2 = df.select($"title", explode(split($"text", " ")).as("words"))
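
A quick show() is handy here to check that each word now sits on its own row (5 rows and truncate = false are just for readability):

df2.show(5, truncate = false)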

Finally, group by the title column and count the number of words for each title.

val countDf = df2.groupBy($"title").agg(count($"words"))
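
If you prefer a readable name for the aggregate column and want the busiest titles first, an equivalent sketch looks like this (the word_count name is just a suggestion):

val countDf2 = df2
  .groupBy($"title")
  .agg(count($"words").as("word_count"))
  .orderBy($"word_count".desc)
countDf2.show()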