mapreduce - Internal working of Spark - Communication/Synchronization -
i quite new spark have programming experience in bsp model. in bsp model (e.g. apache hama), have handle communication , synchronization of nodes on our own. on 1 side because have finer control on want achieve on other hand adds more complexity.
spark on other hand, takes control , handles on own (which great) don't understand how works internally in cases have alot of data , message passing between nodes. let me put example
zb = sc.broadcast(z) r_i = x_i.map(x => math.pow(norm(x - zb.value), 2)) r_i.checkpoint() u_i = u_i.zip(x_i).map(ux => ux._1 + ux._2 - zb.value) u_i.checkpoint() x_i = f.prox(u_i.map(ui => {zb.value - ui}), rho) x_i.checkpoint() x = x_i.reduce(_+_) / f.numsplits.todouble u = u_i.reduce(_+_) / f.numsplits.todouble z = g.prox(x+u, f.numsplits*rho) r = math.sqrt(r_i.reduce(_+_))
this method taken here, runs in loop (let's 200 times). x_i contains our data (let's 100,000 entries).
in bsp style program if have process map operation, partition data , distribute on multiple nodes. each node process sub part of data (map operation) , return result master (after barrier synchronization). since master node wants process each individual result returned (centralized master- see figure below), send result of each entry master (reduce operator in spark). so, (only) master receives 100,000 messages after each iterations. processes data , sends new values slaves again again start processing next iteration.
now, since spark takes control user , internally everything, unable understand how spark collects data after map operations (asynchronous message passing? heard has p2p message passing ? synchronization between map tasks? if synchronization, right spark bsp model ?). in order apply reduce function, collects data on central machine (if yes, receives 100,000 messages on single machine?) or reduces in distributed fashion (if yes, how can performed ?)
following figure shows reduce function on master. x_i^k-1 represents i-th value calculated (in previous iteration) against x_i data entry of input. x_i^k represents value of x_i calculated in current iteration. clearly, equation, needs results collected.
i want compare both styles of distributed programming understand when use spark , when move bsp. further, looked alot on internet, find how map/reduce works nothing useful available on actual communication/synchronization. helful material useful aswell.
Comments
Post a Comment