Spark union column order

The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:

Return a new DataFrame containing union of rows in this and another frame.

This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

Also as standard in SQL, this function resolves columns by position (not by name).

Since Spark 2.3 you can use unionByName to union two DataFrames, which resolves the columns by name instead of by position.
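
To make the difference concrete, here is a small sketch of the two resolution rules. It is a plain-Python model, not Spark code: a "frame" is a hypothetical (column_names, rows) tuple, and the helpers union_by_position and union_by_name are made-up names that only imitate what df1.union(df2) and df1.unionByName(df2) do with column order. Type coercion and other Spark behaviour are deliberately left out.

```python
# Plain-Python model of positional vs. name-based union; no Spark session needed.
# A "frame" is a hypothetical stand-in for a DataFrame: (column_names, rows).


def union_by_position(left, right):
    """Models DataFrame.union: keep left's column names, append right's rows unchanged."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("union requires the same number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Models DataFrame.unionByName: reorder right's values to follow left's column names."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if set(left_cols) != set(right_cols):
        raise ValueError("unionByName requires matching column names")
    # For each column of the left frame, find where it sits in the right frame.
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


df1 = (["id", "name"], [(1, "alice")])
df2 = (["name", "id"], [("bob", 2)])  # same columns, different order

positional = union_by_position(df1, df2)
by_name = union_by_name(df1, df2)

print(positional)  # (['id', 'name'], [(1, 'alice'), ('bob', 2)])
print(by_name)     # (['id', 'name'], [(1, 'alice'), (2, 'bob')])
```

In the positional result the second row ends up with "bob" under id and 2 under name, which is exactly the kind of silent mix-up the question describes. The by-name result lines the values up correctly.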


In Spark, union is not done on the metadata of the columns, and the data is not shuffled around to line the columns up as you might expect. Instead, the union is done on column positions: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into account before doing the union. Unlike SQL, Oracle, or other RDBMSs, the underlying files in Spark are physical files. Hope that answers your question.