PySpark: create a dictionary from data in two columns

You can simply collect the two columns and build the dictionary with a comprehension on the driver (avoid naming it dict, which shadows the built-in; note the column is zip_code, matching the DataFrame below):

zip_to_dma = {row['zip_code']: row['dma'] for row in df.collect()}
print(zip_to_dma)
#{'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}
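
If the DataFrame has more columns than the two you need, it can be cheaper to select just those two before collecting. As a minimal sketch (assuming the same df with zip_code and dma columns), collectAsMap() on the underlying RDD builds the dictionary in one step, since a two-field Row behaves like a (key, value) tuple:

# Sketch: pull only the two relevant columns to the driver as a dict.
# Like collect(), this is only appropriate for small lookup tables.
zip_to_dma = df.select('zip_code', 'dma').rdd.collectAsMap()
print(zip_to_dma)
#{'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}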

You can avoid a udf here by using pyspark.sql.functions.struct and pyspark.sql.functions.to_json (Spark 2.1 and above):

import pyspark.sql.functions as f
from pyspark.sql import Row

data = [
    Row(zip_code='58542', dma='MIN'),
    Row(zip_code='58701', dma='MIN'),
    Row(zip_code='57632', dma='MIN'),
    Row(zip_code='58734', dma='MIN')
]

df = spark.createDataFrame(data)

df.withColumn("json", f.to_json(f.struct("dma", "zip_code"))).show(truncate=False)
#+---+--------+--------------------------------+
#|dma|zip_code|json                            |
#+---+--------+--------------------------------+
#|MIN|58542   |{"dma":"MIN","zip_code":"58542"}|
#|MIN|58701   |{"dma":"MIN","zip_code":"58701"}|
#|MIN|57632   |{"dma":"MIN","zip_code":"57632"}|
#|MIN|58734   |{"dma":"MIN","zip_code":"58734"}|
#+---+--------+--------------------------------+
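
A related variation: if you want each row serialized as a one-entry JSON object keyed by the zip code, rather than a struct with named fields, to_json can also be applied to a map column. This is a sketch; as far as I know, to_json over MapType needs a newer Spark (2.4+), whereas the struct version above works from 2.1:

# Sketch (Spark 2.4+): per-row JSON keyed by zip_code,
# producing values like {"58542":"MIN"}
df.withColumn(
    "json", f.to_json(f.create_map("zip_code", "dma"))
).show(truncate=False)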

If you instead want the zip_code to be the key, you can create a MapType column directly using pyspark.sql.functions.create_map (naming the column "map" rather than "json", since it holds a map, not a JSON string):

df.withColumn("map", f.create_map("zip_code", "dma")).show(truncate=False)
#+---+--------+-----------------+
#|dma|zip_code|map              |
#+---+--------+-----------------+
#|MIN|58542   |Map(58542 -> MIN)|
#|MIN|58701   |Map(58701 -> MIN)|
#|MIN|57632   |Map(57632 -> MIN)|
#|MIN|58734   |Map(58734 -> MIN)|
#+---+--------+-----------------+
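
One more option, sketched under the assumption of Spark 2.4+ (for map_from_entries): build the whole mapping as a single map column on the cluster, then bring back just that one value. Like collect(), this funnels every pair into one row, so it only makes sense for small lookup tables.

import pyspark.sql.functions as f

# Sketch (Spark 2.4+): collect all (zip_code, dma) pairs into an array
# of structs, convert it to one map column, and fetch the single row.
row = df.agg(
    f.map_from_entries(
        f.collect_list(f.struct("zip_code", "dma"))
    ).alias("zip_to_dma")
).first()
print(row["zip_to_dma"])
#e.g. {'58542': 'MIN', '58701': 'MIN', '57632': 'MIN', '58734': 'MIN'}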
