r - Scraping first two columns of web tables with xml2


I have struggled to use the XML package in R, so I need help scraping well-formatted tables with xml2.

The URL for the first page of tables I'd like to scrape is here. On some pages I want the second and third tables; on others I want the first and second. The common thread is that I want the tables whose 'caption' tag includes the text 'that meet' scraped and stored in one list, and the tables whose 'caption' tag includes the text 'that not meet any' stored in a second list. I don't know how to get at that. My working code follows. I imagine there must be some way to use a regexp condition to select a whole table. I hope the code works.

library(xml2)

# Define the URLs of the 12 batch pages
urls <- lapply(1:12, function(x) paste0('http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-', x, '/index-eng.php'))

# Scrape each page
batches <- lapply(urls, read_html)

# Return the tables from each page
batches_tables <- lapply(batches, function(x) xml_find_all(x, './/table'))

# Get the tables from the first page
out <- batches_tables[[1]]

# Inspect
out[[1]]  # do not want this table
out[[2]]  # want this table pasted in one list, caption = 'that meet'
out[[3]]  # want this table pasted in a second list, caption = 'that not meet'

Target the caption tag using contains(), then move up to the parent:

library(xml2)
library(rvest)

url <- "http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-1/index-eng.php#s1"
pg <- read_html(url)

html_nodes(pg, xpath=".//table/caption[contains(., 'that meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontsize80">&#13;\n          <caption>&#13;\n          ...

html_nodes(pg, xpath=".//table/caption[contains(., 'that not meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontsize85">&#13;\n          <caption>&#13;\n          ...
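To get from the matched table nodes to the first two columns, one option is to pass the nodes to rvest's html_table() and subset each result. The following is a minimal sketch rather than a tested solution for the real pages: the inline HTML, the caption wording, the search strings, and the helper name tables_by_caption are all made up for illustration, so the example runs without network access.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for one batch page (not the real site's markup):
# one table without a caption and two captioned tables.
page <- read_html('
<html><body>
  <table><tr><td>layout only</td></tr></table>
  <table>
    <caption>Substances that meet the criteria</caption>
    <tr><th>CAS RN</th><th>Name</th><th>Notes</th></tr>
    <tr><td>000-00-1</td><td>Substance A</td><td>x</td></tr>
  </table>
  <table>
    <caption>Substances that do not meet any criteria</caption>
    <tr><th>CAS RN</th><th>Name</th><th>Notes</th></tr>
    <tr><td>000-00-2</td><td>Substance B</td><td>y</td></tr>
  </table>
</body></html>')

# Hypothetical helper: find tables whose caption contains `text`,
# convert them to data frames, and keep only the first two columns.
# `text` must not contain a single quote, since it is pasted into the XPath.
tables_by_caption <- function(doc, text) {
  xpath <- sprintf(".//table/caption[contains(., '%s')]/..", text)
  nodes <- xml_find_all(doc, xpath)
  lapply(html_table(nodes), function(tbl) tbl[, 1:2])
}

meet     <- tables_by_caption(page, "that meet")
not_meet <- tables_by_caption(page, "not meet")
```

For the real pages, the same helper could be mapped over the parsed documents, for example lapply(batches, tables_by_caption, text = "that meet"), with the search strings adjusted to the exact caption wording on the site.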
