My spark rumblings - mysunahara/mysunahara.github.io GitHub Wiki

Welcome to the mysunahara.github.io wiki!

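Prepend a string to a DataFrame column using the Spark SQL function regexp_replace.

    from pyspark.sql.functions import regexp_replace

    def prependD(df, x):
        # "^" matches the start of the string, so "D" is inserted in front of each value
        df1 = df.withColumn(x, regexp_replace(df[x], "^", "D"))
        return df1

usage:

    df = prependD(df, "ADMTG_DGNS_CD")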
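The `^` anchor matches the empty position at the start of a string, so replacing it inserts the prefix instead of overwriting anything. A minimal sketch of that behavior in plain Python follows, together with an alternative Spark version built on `concat` and `lit`. The helper names `prepend_prefix` and `prepend_with_concat` are made up for illustration, and Python's `re` stands in for Spark's Java regex engine here.

```python
import re


def prepend_prefix(value, prefix="D"):
    # Plain-Python mirror of regexp_replace(col, "^", "D"):
    # "^" matches the empty position at the start, so the prefix is inserted.
    if value is None:
        return None  # Spark also leaves null values as null
    return re.sub("^", prefix, value)


def prepend_with_concat(df, column, prefix="D"):
    # Alternative Spark version without a regex (assumes a string column).
    # pyspark is imported lazily so prepend_prefix works without Spark installed.
    from pyspark.sql.functions import concat, lit
    return df.withColumn(column, concat(lit(prefix), df[column]))


print(prepend_prefix("4101"))  # D4101
```

Both Spark variants leave null cells as null, so neither needs a special case for missing values.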

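===============================================================

Flag rows whose code starts with D410 or D412 using a UDF.

    import re
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    def isICDCodePresent(cellValue):
        # True when the value starts with D410 or D412
        return bool(re.search("^D(410|412)", cellValue))

usage:

    isICDCodePresentUDF = udf(isICDCodePresent, BooleanType())
    df = df.withColumn("mi", isICDCodePresentUDF(df["ADMTG_DGNS_CD"]))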
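Spark passes SQL nulls to a Python UDF as `None`, and `re.search` raises a `TypeError` on `None`, so a column with missing values would make the job fail with the function as written. A null-safe variant could look like the sketch below. The names `is_icd_code_present_safe` and `add_mi_flag`, and the choice to treat nulls as `False`, are assumptions and not part of the original notes.

```python
import re

ICD_PATTERN = "^D(410|412)"


def is_icd_code_present_safe(cell_value):
    # Treat missing values as "not present" instead of raising TypeError
    if cell_value is None:
        return False
    return bool(re.search(ICD_PATTERN, cell_value))


def add_mi_flag(df, column="ADMTG_DGNS_CD"):
    # Wrap the predicate in a UDF. pyspark is imported lazily so the
    # predicate itself can be unit-tested without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType
    flag_udf = udf(is_icd_code_present_safe, BooleanType())
    return df.withColumn("mi", flag_udf(df[column]))
```

Keeping the predicate a plain Python function means it can be checked with ordinary assertions before it is ever shipped to executors.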

====================================More than 1 argument for UDF======

    # The UDF below works with more than one argument
    def moreThanTwoArgs(col1, col2):
        # True if either column contains 276 or 715
        return bool(re.search("(276|715)", col1) or re.search("(276|715)", col2))

    comparatorUDF = udf(moreThanTwoArgs, BooleanType())
    df = df.withColumn("mi", comparatorUDF(df["ICD_DGNS_CD1"], df["ICD_DGNS_CD2"]))
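The same idea can be extended to any number of columns if the Python function accepts variable arguments. The sketch below uses the hypothetical names `any_code_matches` and `flag_any_column`, and it skips null cells rather than failing on them, which is an assumption about the desired behavior.

```python
import re

CODE_PATTERN = "(276|715)"


def any_code_matches(*cell_values):
    # True if at least one non-null cell contains 276 or 715
    return any(
        value is not None and re.search(CODE_PATTERN, value) is not None
        for value in cell_values
    )


def flag_any_column(df, columns, new_column="mi"):
    # Pass one Column per name; the UDF receives them as positional arguments.
    # pyspark is imported lazily so any_code_matches is testable without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType
    matcher = udf(any_code_matches, BooleanType())
    return df.withColumn(new_column, matcher(*[df[c] for c in columns]))
```

With this helper, the two-column case would read `df = flag_any_column(df, ["ICD_DGNS_CD1", "ICD_DGNS_CD2"])`.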

====== Usage of when, otherwise

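    from pyspark.sql.functions import col, when

    # sc is an existing SparkContext
    df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
          .toDF(["a", "b", "c"]))
    # mi is 0 where a == 1, otherwise 9
    df1 = df.withColumn("mi", when(col("a") == 1, 0).otherwise(9))
    # mi is True where a or b matches the regex "1", otherwise False
    df2 = df.withColumn("mi", when(col("a").rlike("1") | col("b").rlike("1"), True).otherwise(False))
    df1.show()
    df2.show()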
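To check the `mi` column of `df1` and `df2` by hand, the two `when(...).otherwise(...)` expressions can be mirrored row by row in plain Python. This is only a sketch for reasoning about the results: the helper names are hypothetical, and `re.search` on `str(value)` stands in for `rlike`, assuming Spark casts the integer columns to strings before matching.

```python
import re

rows = [(1, 2, 3), (2, 1, 2), (3, 4, 5)]


def mi_equals_one(a):
    # Mirrors when(col("a") == 1, 0).otherwise(9)
    return 0 if a == 1 else 9


def mi_rlike_one(a, b):
    # Mirrors the rlike condition on columns a and b:
    # True if either value, as text, contains the character "1"
    return bool(re.search("1", str(a)) or re.search("1", str(b)))


df1_mi = [mi_equals_one(a) for a, b, c in rows]
df2_mi = [mi_rlike_one(a, b) for a, b, c in rows]
print(df1_mi)  # [0, 9, 9]
print(df2_mi)  # [True, True, False]
```

Note that `rlike("1")` is a regex search, not an equality test, so a value such as 10 or 21 would also match.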