How to apply alphabetic and numeric validation rules to database columns in PySpark?
Problem Description:
I have a database with an emp table containing ID, NAME, YEAR, AGE, and DEPT columns.
I want to print "pass" if the NAME column contains alphabetic characters only, else "fail";
"pass" if YEAR is in dd-mm-yyyy format, else "fail";
and "pass" if the AGE column contains integers only, else "fail".
Is it also possible to move this whole process into one function?
Solution – 1
For each part of your question, you can use a trick:
name: use a regular expression with the rlike() function.
date: cast the date string to a date and check whether the result is valid.
age: cast to an integer and check whether the result is valid.
Note that when a cast fails, PySpark returns null, so checking isNotNull() tells you whether the value was valid. To answer the last part of your question, the three filters below can simply be chained inside a single function that takes the DataFrame and returns the validated rows.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

schema = ['age', 'name', 'date']
data = [
    ("1", "A1", '30-12-2022'),
    ("2", "Aa", '36-11-2022'),
    ("3", "Aa", '2022-10-12'),
    ("4a", "Aa", '30-11-2022'),
    ("5", "Aa", '30-11-2022'),
]
df = spark.createDataFrame(data=data, schema=schema)
(
df
.filter(F.col('name').rlike("^[a-zA-Z]+$"))
.filter(F.to_date(F.col('date'), 'dd-MM-yyyy').isNotNull())
.filter(F.col('age').cast('int').isNotNull())
).show()
+---+----+----------+
|age|name| date|
+---+----+----------+
| 5| Aa|30-11-2022|
+---+----+----------+
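To see the rules themselves in isolation (and to get the "pass"/"fail" output your question asks for), here is a minimal plain-Python sketch of the same three checks combined into one function. The helper name validate_row is my own, not part of the Spark answer above; it mirrors the rlike pattern, the dd-MM-yyyy date parse, and the integer cast.

```python
import re
from datetime import datetime

def validate_row(name, date, age):
    """Return 'pass'/'fail' per column, mirroring the three PySpark filters."""
    results = {}
    # name: alphabetic characters only, same as rlike("^[a-zA-Z]+$")
    results['name'] = 'pass' if re.fullmatch(r"[a-zA-Z]+", name) else 'fail'
    # date: must parse in dd-mm-yyyy format, same as to_date(..., 'dd-MM-yyyy')
    try:
        datetime.strptime(date, "%d-%m-%Y")
        results['date'] = 'pass'
    except ValueError:
        results['date'] = 'fail'
    # age: must cast to an integer, same as cast('int').isNotNull()
    try:
        int(age)
        results['age'] = 'pass'
    except ValueError:
        results['age'] = 'fail'
    return results

print(validate_row("Aa", "30-11-2022", "5"))
# all three rules pass for this row, as in the Spark output above
```

In Spark you would do the equivalent by wrapping the three .filter() calls from the snippet above inside one function that accepts the DataFrame and returns the filtered result.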