Column operation on Spark RDDs in Python
I have an RDD with many columns (e.g. hundreds), and most of my operations are on those columns; e.g., I need to create many intermediate variables from different columns.
What is the most efficient way to do this?
I create the RDD from a CSV file:
dataRDD = sc.textFile("/...path/*.csv").map(lambda line: line.split(","))
For example, this gives me an RDD like the one below:
123, 523, 534, ..., 893
536, 98, 1623, ..., 98472
537, 89, 83640, ..., 9265
7297, 98364, 9, ..., 735
......
29, 94, 956, ..., 758
I need to create a new column or variable, calculatedvalue = 2ndCol + 19thCol, and build a new RDD from it:
123, 523, 534, ..., 893, calculatedvalue
536, 98, 1623, ..., 98472, calculatedvalue
537, 89, 83640, ..., 9265, calculatedvalue
7297, 98364, 9, ..., 735, calculatedvalue
......
29, 94, 956, ..., 758, calculatedvalue
What is the best way of doing this?
A map is enough:
rdd = sc.parallelize([(1, 2, 3, 4), (4, 5, 6, 7)])

# Replace the indices with the ones you need
newRDD = rdd.map(lambda x: x + (x[1] + x[3],))

newRDD.collect()
# [(1, 2, 3, 4, 6), (4, 5, 6, 7, 12)]
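Applied to the CSV-based RDD from the question, the same pattern works once the string fields are cast to numbers, since split(",") leaves every field as a string. A minimal sketch, assuming every field is numeric and using zero-based indices 1 and 18 for the 2nd and 19th columns (the float cast and the variable names are illustrative, not from the original post):

# sc is the SparkContext already used above
# Assumption: every CSV field parses as a number
dataRDD = sc.textFile("/...path/*.csv") \
    .map(lambda line: [float(v) for v in line.split(",")])

# Append calculatedvalue = 2nd column + 19th column (zero-based indices 1 and 18)
withCalc = dataRDD.map(lambda row: row + [row[1] + row[18]])

Because each row is a Python list here rather than a tuple, the new value is appended with list concatenation; the idea is the same as in the tuple example above.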