Spark UDF with a DataFrame as input

January 11, 2017, at 02:30 AM

I have to develop a Spark script in Python that checks some logs and verifies whether a user's IP has changed country between two events. I have a CSV file, saved on HDFS, with IP ranges and their associated countries, like this:

startIp, endIp, country
0.0.0.0, 10.0.0.0, Italy
10.0.0.1, 20.0.0.0, England
20.0.0.1, 30.0.0.0, Germany

And a CSV log file:

userId, timestamp, ip, event
1, 02-01-17 20:45:18, 10.5.10.3, login
24, 02-01-17 20:46:34, 54.23.16.56, login

I load both files as Spark DataFrames, and I've already modified the one containing the logs with a lag function, adding a previousIp column. My idea is to replace ip and previousIp with the associated country, so that I can compare them with something like dataFrame.filter("previousIp != ip"). My question is: is there a way to do that in Spark? Something like:

dataFrame = dataFrame.select("userId", udfConvert("ip",countryDataFrame).alias("ip"), udfConvert("previousIp",countryDataFrame).alias("previousIp"),...)

so as to get a DataFrame like this:

userId, timestamp, ip, event, previousIp
1, 02-01-17 20:45:18, England, login, Italy

If not, how can I solve my problem? Thank you.
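For reference, a UDF cannot take a DataFrame as an argument, so one possible direction is to turn the small country table into a plain Python lookup that a UDF could call on each IP. The sketch below shows only that lookup in plain Python, assuming the ranges are IPv4, inclusive, and non-overlapping. The helper names (ip_to_int, build_lookup, ip_to_country) are hypothetical and not part of Spark.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string such as '10.5.10.3' to an integer."""
    return int(ipaddress.IPv4Address(ip.strip()))


def build_lookup(rows):
    """Sort (startIp, endIp, country) rows by start address.

    Returns the list of start addresses (for binary search) and the
    sorted list of (start, end, country) tuples.
    """
    ranges = sorted((ip_to_int(s), ip_to_int(e), c.strip()) for s, e, c in rows)
    starts = [r[0] for r in ranges]
    return starts, ranges


def ip_to_country(ip, starts, ranges):
    """Return the country whose range contains ip, or None if none matches."""
    if ip is None:  # e.g. the first event of a user has no previousIp
        return None
    value = ip_to_int(ip)
    # Index of the last range whose start is <= value
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        return ranges[i][2]
    return None


# Sample rows copied from the country file in the question
country_rows = [
    ("0.0.0.0", "10.0.0.0", "Italy"),
    ("10.0.0.1", "20.0.0.0", "England"),
    ("20.0.0.1", "30.0.0.0", "Germany"),
]
starts, ranges = build_lookup(country_rows)

print(ip_to_country("10.5.10.3", starts, ranges))    # England
print(ip_to_country("54.23.16.56", starts, ranges))  # None (no range covers it)
```

In PySpark, a function like this could presumably be wrapped with pyspark.sql.functions.udf after collecting the country rows to the driver and sharing them through SparkContext.broadcast. That is only reasonable if the country table is small. Another option is to store the IPs as integers and join the two DataFrames on a between condition instead of using a UDF.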
