r - Scraping first two columns of web tables with xml2
I have struggled to use the XML package in R, and I need help scraping some well-formatted tables with xml2.
The URL for the first page of tables I'd like to scrape is here. On some pages I want the second and third tables; on others I want the first and second. The common thread is that I want the tables whose 'caption' tag includes the text 'that meet' scraped and stored in one list, and the tables whose 'caption' tag includes the text 'that not meet any' stored in a second list. I don't know how to do that. The code I am working with follows. I imagine there must be some way to use a regexp condition to select the whole table. I hope the code works.
    library(xml2)

    # Define the URLs of the 12 batch pages
    urls <- lapply(seq(1, 12, 1), function(x) paste('http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-', x, '/index-eng.php', sep = ''))
    # Scrape the text of each page
    batches <- lapply(urls, function(x) read_html(x))
    # Return the tables from each page
    batches_tables <- lapply(batches, function(x) xml_find_all(x, './/table'))
    # Get the tables from the first page
    out <- batches_tables[[1]]
    # Inspect
    out[[1]] # do not want this table
    out[[2]] # want this table pasted in one list, caption = 'that meet'
    out[[3]] # want this table pasted in a second list, caption = 'that not meet'
You can target the caption tag using contains() and then move up to the parent:
    library(xml2)
    library(rvest)

    url <- "http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-1/index-eng.php#s1"
    pg <- read_html(url)

    html_nodes(pg, xpath=".//table/caption[contains(., 'that meet')]/..")
    ## {xml_nodeset (1)}
    ## [1] <table class="fontsize80"> \n <caption> \n ...

    html_nodes(pg, xpath=".//table/caption[contains(., 'that not meet')]/..")
    ## {xml_nodeset (1)}
    ## [1] <table class="fontsize85"> \n <caption> \n ...
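Building on the contains() approach above, here is a minimal sketch of how the matched tables might be collected into two lists and cut down to their first two columns. The inline HTML page, its captions and column names, and the helper tables_by_caption() are all hypothetical stand-ins so that the example runs without network access; the real pages may be structured differently.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for one batch page; in practice this would be read_html(url).
page <- read_html('
<html><body>
<table><caption>Navigation</caption>
  <tr><th>A</th></tr>
  <tr><td>x</td></tr>
</table>
<table><caption>Substances that meet the criteria</caption>
  <tr><th>CAS RN</th><th>Name</th><th>Notes</th></tr>
  <tr><td>100-00-1</td><td>Alpha</td><td>n1</td></tr>
</table>
<table><caption>Substances that do not meet any of the criteria</caption>
  <tr><th>CAS RN</th><th>Name</th><th>Notes</th></tr>
  <tr><td>200-00-2</td><td>Beta</td><td>n2</td></tr>
  <tr><td>300-00-3</td><td>Gamma</td><td>n3</td></tr>
</table>
</body></html>')

# Hypothetical helper: return the first two columns of every table
# whose caption contains `phrase`.
tables_by_caption <- function(doc, phrase) {
  # Select the table itself via a predicate on its caption child,
  # which is equivalent to matching the caption and stepping up with "/..".
  xpath <- sprintf(".//table[caption[contains(., '%s')]]", phrase)
  nodes <- xml_find_all(doc, xpath)
  lapply(nodes, function(tbl) html_table(tbl)[, 1:2])
}

meet     <- tables_by_caption(page, "that meet")
not_meet <- tables_by_caption(page, "not meet")
```

With the list of parsed pages from the question, the same helper could be applied to each page, for example lapply(batches, tables_by_caption, phrase = "that meet"). The phrase passed to contains() has to match the caption text on the live pages exactly, so it is worth checking the captions before relying on it.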