PostgreSQL, R: Multiply all rows of table to create Panel-data (time-series) -
i have table buildings 3.2 million rows. need expand table 11 different periods, handle (balanced) paneldata. means every object there 11 different years (from 2000-2010) observate. periods should called:
2000 2001 ... 2009 2010 table definitions
create table public.buildings ( gid integer not null default nextval('buildings_gid_seq'::regclass), osm_id character varying(11), name character varying(48), type character varying(16), geom geometry(multipolygon,4326), centroid geometry(point,4326), gembez character varying(50), gemname character varying(50), krsbez character varying(50), krsname character varying(50), pv boolean, gr smallint, capac double precision, instdate date, pvid integer, dist double precision, gemewz integer, n500 integer, ibase double precision, popden integer, instp smallint, b2000 double precision, b2001 double precision, b2002 double precision, b2003 double precision, b2004 double precision, b2005 double precision, b2006 double precision, b2007 double precision, b2008 double precision, b2009 double precision, b2010 double precision, ibase_id integer[], ibase_dist integer[], constraint buildings_pkey primary key (gid) ) ( oids=false ); alter table public.buildings owner postgres; create index build_centroid_gix on public.buildings using gist (st_transform(centroid, 31467)); create index buildings_geom_idx on public.buildings using gist (geom); i want use data regression analysis in r.
ibase_id array of gid. ibase_dist related array distances of gid's obejct. length of 2 arrays same.
the gid's in array belong records of buildings, within radius of 500m around centroid,the center of obeject, , have pv=true (which implies dist,instdate,instp,capac&pvid not null).
select a.gid buildid, array_agg(b.gid) ibase_id, array_agg(round(st_distance(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467))::integer)) ibase_dist buildings left join (select * buildings pv=true) b on st_dwithin(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467), 500.0) , a.gid <> b.gid group a.gid example:
ibase_id: {3075528,409073,322311,226643,833798,322344,226609};
ibase_dist {290,293,398,494,411,381,384}
update buildings set ibase=sum(1/s) unnest(select ibasedist buildings (select instp buildings gid in unnest(ibase_id))<year) s for each period, entrys of arrays shall considered, years before observated period of panel-data. (the query above not work, yet, because need concastenate arrays first) right now, 2 arrays hold information of years. that's why thought should added each period of time, after expand panel-data, calculate ibasefor each record (11x 3,2 million).
i don't need of columns regression analysis. in case dramatically improve performance of multiplying, stick rows (basically leaving out geometry columns):
gid integer not null default nextval('buildings_gid_seq'::regclass), gembez character varying(50), gemname character varying(50), krsbez character varying(50), krsname character varying(50), pv boolean, gr smallint, capac double precision, dist double precision, gemewz integer, n500 integer, ibase double precision, popden integer, instp smallint, b2000 double precision, b2001 double precision, b2002 double precision, b2003 double precision, b2004 double precision, b2005 double precision, b2006 double precision, b2007 double precision, b2008 double precision, b2009 double precision, b2010 double precision, ibase_id integer[], ibase_dist integer[], constraint buildings_pkey primary key (gid) ) ( oids=false solution approach
i had basic idea create second table periods containing 11 different periods , multiply table table buildings. not sure how implement this. unfortunately don't have experience r , don't use database interface r, yet.
working postgresql 9.5beta2, compiled visual c++ build 1800, 64-bit , r x64 3.2.1
essentially, panel data set data in long format repeating years each record time column. current structure in wide format. while r can transform large dataset, postgresql can stack years in union query engine , pass resultset r. note data types such geometry objects , arrays may not convert r data types, remove them or convert them string/numeric types.
below such sql union query stacked years. not quite sure mean ibase_id , ibase_dist or "multiplying" aspect year column added corresponding b column. have r script call via rpostgresql module.
import("rpostgresql") # create connection drv <- dbdriver("postgresql") con <- dbconnect(drv, dbname = "postgres", host = "localhost", port = ####, user = "username", password = "password") strsql <- "select '2000' year, gid, gembez, gemname, krsbez, krsname, pv, gr, capac, dist, gemewz, n500 popden, instp, b2000 b, (1/ibase_dist) ibase public.buildings inner join (select a.gid buildid, sum(round(st_distance( st_transform(a.centroid, 31467), st_transform(b.centroid, 31467) )::integer)) ibase_dist buildings left join buildings b on st_dwithin(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467), 500.0) , a.gid <> b.gid b.pv=true , b.instp < a.instp group a.gid) distsum on public.buildings.gid = distsum.buildid public.buildings.instp = 2000 union ...other select statements years 2001-2010..." # import query resultset dataframe df <- dbgetquery(con, strsql) # close connection dbdisconnect(con) but sure have necessary ram operation of big dataset. might need allocate memory accordingly. alternatively, can iteratively append each year's select statement growing dataframe object instead of loading @ once.
# ...same connection setup above... years = c('2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010') # creates list of year data frame dflist = lapply(years, function(y) { # notice concatenation of y in select statement strsql <- paste0("select '", y, "' year, gid, gembez, gemname, krsbez, krsname, pv, gr, capac, dist, gemewz, n500, popden, instp, b", y, ", b, (1/ibase_dist) ibase, public.buildings inner join (select a.gid buildid, sum(round(st_distance( st_transform(a.centroid, 31467), st_transform(b.centroid, 31467) )::integer)) ibase_dist buildings left join buildings b on st_dwithin(st_transform(a.centroid, 31467), st_transform(b.centroid, 31467), 500.0) , a.gid <> b.gid b.pv=true , b.instp < a.instp group a.gid) distsum on public.buildings.gid = distsum.buildid public.buildings.instp =", y) dbgetquery(con, strsql) }) # append list of data frames 1 large data frame df <- do.call(rbind, dflist) # remove previous list memory resources rm(dflist) # close connection dbdisconnect(con)
Comments
Post a Comment