How to apply alphabetic and numeric validation rules to database columns in PySpark?

Problem Description:

I have a database with an emp table that has the columns ID, NAME, YEAR, AGE, DEPT.
I want to print "pass" if the NAME column contains alphabetic characters only, else "fail";
"pass" if YEAR is in dd-mm-yyyy format, else "fail";
and "pass" if the AGE column contains integers only, else "fail".

Is it also possible to move the whole process above into a single function?

Solution – 1

Each part of your question can be handled with a simple check.

name: use a regular expression with the rlike() function.

date: cast the date string with to_date() and check whether the result is valid.

age: cast the column to integer and check whether the result is valid.

Note that when a cast is not valid, PySpark returns null.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

schema = ['age', 'name', 'date']
data = [
    ("1", "A1", '30-12-2022'),
    ("2", "Aa", '36-11-2022'),
    ("3", "Aa", '2022-10-12'),
    ("4a", "Aa", '30-11-2022'),
    ("5", "Aa", '30-11-2022'),
]
df = spark.createDataFrame(data = data, schema = schema)
(
    df
    .filter(F.col('name').rlike("^[a-zA-Z]+$"))
    .filter(F.to_date(F.col('date'), 'dd-MM-yyyy').isNotNull())
    .filter(F.col('age').cast('int').isNotNull())
).show()

+---+----+----------+
|age|name|      date|
+---+----+----------+
|  5|  Aa|30-11-2022|
+---+----+----------+