Join time series Cassandra tables in Spark
I have 2 tables (agg_count_1 and agg_count_2) in Cassandra, both with the same schema:
create table agg_count_1 (
    pk_1 text,
    pk_2 text,
    pk_3 text,
    window_start timestamp,
    count counter,
    primary key ((pk_1, pk_2, pk_3), window_start)
) with clustering order by (window_start desc);
The window_start timestamp is rounded to the nearest 15 minutes, which means its value is exactly the same in both tables, but rows for some time windows may be missing from either table.
I would like to efficiently (inner) join these 2 tables on the primary key into a third table with a very similar schema, storing the value of agg_count_1.count in the counter_1 column and agg_count_2.count in the counter_2 column:
create table agg_joined (
    pk_1 text,
    pk_2 text,
    pk_3 text,
    window_start timestamp,
    counter_1 int,
    counter_2 int,
    primary key ((pk_1, pk_2, pk_3), window_start)
) with clustering order by (window_start desc);
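For example, one straightforward option seems to be the DataFrame API of the Spark-Cassandra Connector. A minimal sketch of what I have in mind, where the keyspace name my_ks and the connection host are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("agg-join")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  .getOrCreate()

// Load a Cassandra table as a DataFrame; "my_ks" is a placeholder keyspace.
def load(table: String) = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> table))
  .load()

val keyCols = Seq("pk_1", "pk_2", "pk_3", "window_start")

val lhs = load("agg_count_1").withColumnRenamed("count", "counter_1")
val rhs = load("agg_count_2").withColumnRenamed("count", "counter_2")

// Inner join on the full primary key; windows missing from either table drop out.
lhs.join(rhs, keyCols)
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "agg_joined"))
  .mode("append")
  .save()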
This can be done in many ways using a combination of Scala, Spark, and Spark-Cassandra Connector features. What is the recommended way?
I would also appreciate hearing about solutions to avoid. Joins are expensive in general, but I would expect this kind of "zipping" of time series to be efficient if you (actually I) don't get it wrong.
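To make "solutions to avoid" concrete: the naive RDD-level version I can think of keys both tables by the full primary key and uses a plain join, which shuffles both sides. A sketch, under the same placeholder-keyspace assumption and with an existing SparkContext sc:

import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

type Key = (String, String, String, java.util.Date)

// Read (primary key, count) pairs from one of the aggregate tables.
def readCounts(table: String): RDD[(Key, Long)] =
  sc.cassandraTable[(String, String, String, java.util.Date, Long)]("my_ks", table)
    .select("pk_1", "pk_2", "pk_3", "window_start", "count")
    .map { case (p1, p2, p3, ws, c) => ((p1, p2, p3, ws), c) }

// Plain inner join: both RDDs get shuffled unless their partitioners already match.
readCounts("agg_count_1").join(readCounts("agg_count_2"))
  .map { case ((p1, p2, p3, ws), (c1, c2)) => (p1, p2, p3, ws, c1, c2) }
  .saveToCassandra("my_ks", "agg_joined",
    SomeColumns("pk_1", "pk_2", "pk_3", "window_start", "counter_1", "counter_2"))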
Based on the Spark-Cassandra Connector documentation, using joinWithCassandraTable sounds suboptimal because it executes a single query for every partition:

"joinWithCassandraTable utilizes the Java Driver to execute a single query for every partition required by the source RDD so no un-needed data will be requested or serialized."
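For reference, the joinWithCassandraTable variant I am ruling out would look roughly like this (again with the placeholder keyspace; the case class is only for illustration):

import com.datastax.spark.connector._

// Mirrors one row of agg_count_1; used to drive the per-partition lookups.
case class Agg(pk_1: String, pk_2: String, pk_3: String,
               window_start: java.util.Date, count: Long)

val joined = sc.cassandraTable[Agg]("my_ks", "agg_count_1")
  // issues one query against agg_count_2 per partition of the source RDD
  .joinWithCassandraTable("my_ks", "agg_count_2")
  .on(SomeColumns("pk_1", "pk_2", "pk_3", "window_start"))

joined
  .map { case (l, r) =>
    (l.pk_1, l.pk_2, l.pk_3, l.window_start, l.count, r.getLong("count")) }
  .saveToCassandra("my_ks", "agg_joined",
    SomeColumns("pk_1", "pk_2", "pk_3", "window_start", "counter_1", "counter_2"))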