scala - Efficient grouping by key and StatCounter
I am aggregating values by key as shown below, using Apache Spark and Scala. The code keeps appending values to a List. Is there a more efficient way to get from the key-value pairs to a StatCounter per key?
    val predictorRawKey = predictorRaw
      .map { x =>
        val param = x._1
        val value: Double = x._2.toDouble
        (param, value)
      }
      .mapValues(num => List(num))
      .reduceByKey((l1, l2) => l1 ::: l2)
      .map { x => (x._1, StatCounter(x._2.iterator)) }
For starters, you shouldn't use reduceByKey just to group values. It is more efficient to omit map-side aggregation and use groupByKey directly.
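For comparison, here is a minimal sketch of that groupByKey variant (assuming, as in the question, that predictorRawKey holds string pairs whose values parse as doubles; the variable name grouped is illustrative):

    import org.apache.spark.util.StatCounter

    // Group all values per key, then build one StatCounter from each group.
    // Simpler than concatenating Lists, but it still materializes every
    // group in memory before summarizing it.
    val grouped = predictorRawKey
      .map(x => (x._1, x._2.toDouble))
      .groupByKey()
      .mapValues(values => StatCounter(values))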
Fortunately, StatCounter can work in a streaming fashion, so there is no need to group the values at all:
    import org.apache.spark.util.StatCounter

    val pairs = predictorRawKey.map(x => (x._1, x._2.toDouble))

    val predictorRawKeyReduced = pairs.aggregateByKey(StatCounter(Nil))(
      (acc: StatCounter, x: Double) => acc.merge(x),
      (acc1: StatCounter, acc2: StatCounter) => acc1.merge(acc2)
    )
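Each resulting value is a StatCounter, so per-key statistics come essentially for free. A small usage sketch (count, mean, and stdev are part of StatCounter's API; the variable summary is illustrative):

    // Pull a few summary statistics out of each per-key StatCounter.
    val summary = predictorRawKeyReduced.mapValues(s => (s.count, s.mean, s.stdev))
    summary.collect().foreach { case (k, (n, m, sd)) =>
      println(s"$k: n=$n, mean=$m, stdev=$sd")
    }

Because aggregateByKey merges each value into the accumulator as it arrives and combines partial StatCounters across partitions, it keeps the map-side combining benefit without ever building the full list of values per key.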