Column operations on Spark RDDs in Python


I have an RDD with many columns (e.g. hundreds), and most of my operations are on columns; for instance, I need to create many intermediate variables from different columns.

What is the most efficient way to do this?

I create my RDD from CSV files:

datardd = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))

For example, this gives me an RDD like the one below:

123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758

I need to create a new column or variable, calculatedvalue = 2ndcol + 19thcol, and create a new RDD:

123, 523, 534, ..., 893, calculatedvalue
536, 98, 1623, ..., 98472, calculatedvalue
537, 89, 83640, ..., 9265, calculatedvalue
7297, 98364, 9, ..., 735, calculatedvalue
......
29, 94, 956, ..., 758, calculatedvalue

What is the best way of doing this?

A map is enough:

rdd = sc.parallelize([(1, 2, 3, 4), (4, 5, 6, 7)])

# replace the indices with yours
newrdd = rdd.map(lambda x: x + (x[1] + x[2],))

newrdd.collect()  # [(1, 2, 3, 4, 5), (4, 5, 6, 7, 11)]
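
Applied to the CSV-based RDD from the question, a minimal sketch could look like the one below. Note that split(",") yields strings, so the two columns are converted with float() before adding; the indices 1 and 18 (the 2nd and 19th columns, 0-based) and the numeric conversion are assumptions about the data, not part of the original answer.

# each element of datardd is a list of strings after the split,
# so convert the two columns to numbers before adding them
newrdd = datardd.map(lambda x: x + [float(x[1]) + float(x[18])])

newrdd.take(5)  # inspect the first few rows with the new column appended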
