Multiple Rows with null values to columns

Multiple Rows with null values to columns

Problem Description:

I’m using Pyspark and I have the next issue:

This is my input:

IDDate 1Date 2

My desire output would be like this:

IDDate 1Date 2

I try joining the table to itselft and do a select distinct

Solution – 1

So you would like to aggregate and have one line per ID while ignoring null values?

Here’s my answer:

from pyspark.sql.functions import first
df = df.groupBy("ID").agg(
    first("Date 1", ignorenulls=True).alias("Date 1"),
    first("Date 2", ignorenulls=True).alias("Date 2"),

This will however only return one line per ID.
If you have multiple non-null choices for Date1 and Date2, then it is better to use max and min or any other deterministic functions.

Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.