How to select all columns of a DataFrame in a join - Spark-Scala

With an alias:

    first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
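As a minimal, self-contained sketch of that approach (the session setup, sample rows, and the value/extra columns are assumptions for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("alias-join").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; only "id" is shared between the two frames.
    val first_df  = Seq((1, "a"), (2, "b")).toDF("id", "value")
    val second_df = Seq((1, "x")).toDF("id", "extra")

    // Alias the left frame, join on "id", then pull back only its columns.
    first_df.alias("fst")
      .join(second_df, Seq("id"), "left_outer")
      .select("fst.*")
      .show()  // columns: id, value; second_df's "extra" is dropped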

Suppose you:

  1. Want to use the DataFrame syntax.
  2. Want to select all columns from df1 but only a couple from df2.
  3. Find it cumbersome to list the df1 columns out explicitly because there are so many of them.

Then, you might do the following:

    // Every column of df1, plus only the two fields we need from df2.
    val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
    df1.join(df2, df1("key") === df2("key")).select(selectColumns: _*)
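To see the pattern end to end, here is a sketch with made-up data (key, field1, and field2 come from the snippet above; colA, colB, and field3 are invented, and a live SparkSession with spark.implicits._ imported is assumed):

    // Hypothetical frames: df1 is the "wide" one, df2 has fields we mostly ignore.
    val df1 = Seq((1, "a", "b")).toDF("key", "colA", "colB")
    val df2 = Seq((1, "x", "y", "z")).toDF("key", "field1", "field2", "field3")

    val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))

    df1.join(df2, df1("key") === df2("key"))
      .select(selectColumns: _*)
      .show()  // columns: key, colA, colB, field1, field2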

We can also do it with a left semi join. A left semi join keeps only the rows of the left DataFrame that have a match in the right one, and the result contains only the left DataFrame's columns, so this variant fits when you need nothing from df2.

Here we join the two DataFrames df1 and df2 on column col1:

    df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
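A small sketch of those semantics with made-up rows (the payload/other columns are illustrative; note that df2's columns cannot be selected afterwards, because the semi join never carries them):

    // Hypothetical data: only rows of df1 whose col1 has a match in df2 survive.
    val df1 = Seq((1, "keep"), (2, "drop")).toDF("col1", "payload")
    val df2 = Seq((1, "anything")).toDF("col1", "other")

    df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi").show()
    // Result: the single row (1, "keep"); df2's columns never appear.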