PostgreSQL, R: Multiply all rows of table to create Panel-data (time-series) -

i have table buildings 3.2 million rows. need expand table 11 different periods, handle (balanced) paneldata. means every object there 11 different years (from 2000-2010) observate. periods should called:

2000 2001 ... 2009 2010

table definitions

create table public.buildings (   gid integer not null default nextval('buildings_gid_seq'::regclass),   osm_id character varying(11),   name character varying(48),   type character varying(16),   geom geometry(multipolygon,4326),   centroid geometry(point,4326),   gembez character varying(50),   gemname character varying(50),   krsbez character varying(50),   krsname character varying(50),   pv boolean,   gr smallint,   capac double precision,   instdate date,   pvid integer,   dist double precision,   gemewz integer,   n500 integer,   ibase double precision,   popden integer,   instp smallint,   b2000 double precision,   b2001 double precision,   b2002 double precision,   b2003 double precision,   b2004 double precision,   b2005 double precision,   b2006 double precision,   b2007 double precision,   b2008 double precision,   b2009 double precision,   b2010 double precision,   ibase_id integer[],   ibase_dist integer[],   constraint buildings_pkey primary key (gid) ) (   oids=false ); alter table public.buildings   owner postgres;  create index build_centroid_gix   on public.buildings   using gist   (st_transform(centroid, 31467));  create index buildings_geom_idx   on public.buildings   using gist   (geom);

i want use data regression analysis in r.

ibase_id array of gid. ibase_dist related array distances of gid's obejct. length of 2 arrays same.

the gid's in array belong records of buildings, within radius of 500m around centroid,the center of obeject, , have pv=true (which implies dist,instdate,instp,capac&pvid not null).

select a.gid buildid, array_agg(b.gid) ibase_id, array_agg(round(st_distance(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467))::integer)) ibase_dist   buildings   left join (select * buildings pv=true) b on st_dwithin(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467), 500.0)       , a.gid <> b.gid   group a.gid

example:

ibase_id: {3075528,409073,322311,226643,833798,322344,226609};

ibase_dist {290,293,398,494,411,381,384}

update buildings set ibase=sum(1/s) unnest(select ibasedist buildings (select instp         buildings         gid in unnest(ibase_id))<year) s

for each period, entrys of arrays shall considered, years before observated period of panel-data. (the query above not work, yet, because need concastenate arrays first) right now, 2 arrays hold information of years. that's why thought should added each period of time, after expand panel-data, calculate ibasefor each record (11x 3,2 million).

i don't need of columns regression analysis. in case dramatically improve performance of multiplying, stick rows (basically leaving out geometry columns):

   gid integer not null default nextval('buildings_gid_seq'::regclass),       gembez character varying(50),       gemname character varying(50),       krsbez character varying(50),       krsname character varying(50),       pv boolean,       gr smallint,       capac double precision,       dist double precision,       gemewz integer,       n500 integer,       ibase double precision,       popden integer,       instp smallint,       b2000 double precision,       b2001 double precision,       b2002 double precision,       b2003 double precision,       b2004 double precision,       b2005 double precision,       b2006 double precision,       b2007 double precision,       b2008 double precision,       b2009 double precision,       b2010 double precision,       ibase_id integer[],       ibase_dist integer[],       constraint buildings_pkey primary key (gid)     )     (       oids=false

solution approach

i had basic idea create second table periods containing 11 different periods , multiply table table buildings. not sure how implement this. unfortunately don't have experience r , don't use database interface r, yet.

working postgresql 9.5beta2, compiled visual c++ build 1800, 64-bit , r x64 3.2.1

essentially, panel data set data in long format repeating years each record time column. current structure in wide format. while r can transform large dataset, postgresql can stack years in union query engine , pass resultset r. note data types such geometry objects , arrays may not convert r data types, remove them or convert them string/numeric types.

below such sql union query stacked years. not quite sure mean ibase_id , ibase_dist or "multiplying" aspect year column added corresponding b column. have r script call via rpostgresql module.

import("rpostgresql")  # create connection      drv <- dbdriver("postgresql") con <- dbconnect(drv, dbname = "postgres",                  host = "localhost", port = ####,                  user = "username", password = "password")  strsql <- "select '2000' year,  gid, gembez, gemname, krsbez,                  krsname, pv, gr, capac, dist, gemewz, n500                  popden, instp, b2000 b, (1/ibase_dist) ibase            public.buildings            inner join                 (select a.gid buildid,                          sum(round(st_distance(                                               st_transform(a.centroid, 31467),                                                 st_transform(b.centroid, 31467)                                   )::integer)) ibase_dist                buildings                left join buildings b                        on st_dwithin(st_transform(a.centroid, 31467),                                      st_transform(b.centroid, 31467), 500.0)                     , a.gid <> b.gid                b.pv=true , b.instp < a.instp                group a.gid) distsum            on public.buildings.gid = distsum.buildid            public.buildings.instp = 2000             union             ...other select statements years 2001-2010..."                # import query resultset dataframe df <- dbgetquery(con, strsql)  # close connection dbdisconnect(con)

but sure have necessary ram operation of big dataset. might need allocate memory accordingly. alternatively, can iteratively append each year's select statement growing dataframe object instead of loading @ once.

# ...same connection setup above...  years = c('2000', '2001', '2002', '2003', '2004', '2005',            '2006', '2007', '2008', '2009', '2010')  # creates list of year data frame dflist = lapply(years,                  function(y) {                 # notice concatenation of y in select statement                  strsql <- paste0("select '", y, "' year,  gid, gembez, gemname, krsbez,                                          krsname, pv, gr, capac, dist, gemewz, n500,                                           popden, instp, b", y, ", b, (1/ibase_dist) ibase,                                    public.buildings                                   inner join                                     (select a.gid buildid,                                            sum(round(st_distance(                                               st_transform(a.centroid, 31467),                                                 st_transform(b.centroid, 31467)                                           )::integer)) ibase_dist                                      buildings                                      left join buildings b                                       on st_dwithin(st_transform(a.centroid, 31467),                                                     st_transform(b.centroid, 31467), 500.0)                                      , a.gid <> b.gid                                      b.pv=true , b.instp < a.instp                                      group a.gid) distsum                                   on public.buildings.gid = distsum.buildid                                   public.buildings.instp =", y)                 dbgetquery(con, strsql)                                                })  # append list of data frames 1 large data frame               df <- do.call(rbind, dflist)  # remove previous list memory resources rm(dflist)  # close connection dbdisconnect(con)

Search This Blog

Ben