scala - Efficient grouping by key and StatCounter
I am aggregating values by key as shown below, using Apache Spark and Scala. The code keeps appending values to a List. Is there a more efficient way to get from the key-value pairs to a StatCounter per key?
    val predictorRawKey = predictorRaw
      .map { x =>
        val param = x._1
        val value: Double = x._2.toDouble
        (param, value)
      }
      .mapValues(num => List(num))
      .reduceByKey((l1, l2) => l1 ::: l2)
      .map { x => (x._1, StatCounter(x._2.iterator)) }
For starters, you shouldn't use reduceByKey just to group values. It is more efficient to omit map-side aggregation and use groupByKey directly.
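For comparison, here is a minimal sketch of that groupByKey variant (assuming, as in the question, that predictorRawKey holds string pairs whose values parse as doubles; the variable name grouped is illustrative):

    import org.apache.spark.util.StatCounter

    // Group all values per key, then build one StatCounter from each group.
    // Simpler than concatenating Lists, but it still materializes every
    // group in memory before summarizing it.
    val grouped = predictorRawKey
      .map(x => (x._1, x._2.toDouble))
      .groupByKey()
      .mapValues(values => StatCounter(values))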
Fortunately, StatCounter can work in a streaming fashion, so there is no need to group the values at all:
    import org.apache.spark.util.StatCounter

    val pairs = predictorRawKey.map(x => (x._1, x._2.toDouble))

    val predictorRawKeyReduced = pairs.aggregateByKey(StatCounter(Nil))(
      (acc: StatCounter, x: Double) => acc.merge(x),
      (acc1: StatCounter, acc2: StatCounter) => acc1.merge(acc2)
    )
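Each resulting value is a StatCounter, so per-key statistics come essentially for free. A small usage sketch (count, mean, and stdev are part of StatCounter's API; the variable summary is illustrative):

    // Pull a few summary statistics out of each per-key StatCounter.
    val summary = predictorRawKeyReduced.mapValues(s => (s.count, s.mean, s.stdev))
    summary.collect().foreach { case (k, (n, m, sd)) =>
      println(s"$k: n=$n, mean=$m, stdev=$sd")
    }

Because aggregateByKey merges each value into the accumulator as it arrives and combines partial StatCounters across partitions, it keeps the map-side combining benefit without ever building the full list of values per key.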