Multiple Rows with null values to columns
Problem Description:
I’m using Pyspark and I have the next issue:
This is my input:
ID | Date 1 | Date 2 |
---|---|---|
1 | null | 2/1/2022 |
1 | 1/1/2022 | null |
My desire output would be like this:
ID | Date 1 | Date 2 |
---|---|---|
1 | 1/1/2022 | 2/1/2022 |
I try joining the table to itselft and do a select distinct
Solution – 1
So you would like to aggregate and have one line per ID while ignoring null values?
Here’s my answer:
from pyspark.sql.functions import first
df = df.groupBy("ID").agg(
first("Date 1", ignorenulls=True).alias("Date 1"),
first("Date 2", ignorenulls=True).alias("Date 2"),
)
df.show(truncate=False)
This will however only return one line per ID.
If you have multiple non-null choices for Date1 and Date2, then it is better to use max
and min
or any other deterministic functions.