Column operations on Spark RDDs in Python


I have an RDD with many columns (e.g. hundreds), and most of my operations are on columns; for instance, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create my RDD from CSV files:

datardd = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this gives me an RDD like the one below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758

I need to create a new column or variable, calculatedvalue = 2ndcol + 19thcol, and create a new RDD:

123, 523, 534, ..., 893, calculatedvalue
536, 98, 1623, ..., 98472, calculatedvalue
537, 89, 83640, ..., 9265, calculatedvalue
7297, 98364, 9, ..., 735, calculatedvalue
......
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

A map is enough:

rdd = sc.parallelize([(1, 2, 3, 4), (4, 5, 6, 7)])

# replace the indices with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))

newrdd.collect()  # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
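
Applied to the CSV-based RDD from the question, a minimal sketch could look like the one below. Note that split(",") yields strings, so the two columns are converted with float() before adding; the indices 1 and 18 (the 2nd and 19th columns, 0-based) and the numeric conversion are assumptions about the data, not part of the original answer.

# each element of datardd is a list of strings after the split,
# so convert the two columns to numbers before adding them
newrdd = datardd.map(lambda x: x + [float(x[1]) + float(x[18])])

newrdd.take(5)  # inspect the first few rows with the new column appended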
