Python - Displaying contents of web scrape
The code below prints the fields to the screen. Is there a way to have the fields appear alongside each other, as they would in a database or a spreadsheet? In the page source, the fields track, date, datetime, grade, distance and prizes are found in the resultsblockheader div class, and fin (finishing position), greyhound, trap, sp, timesec and timedistance are found in the resultsblock div. I am trying to get them displayed as track,date,datetime,grade,distance,prizes,fin,greyhound,trap,sp,timesec,timedistance on one line. Any help is appreciated.
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.gbgb.org.uk/resultsmeeting.aspx?id=135754")
    bsobj = BeautifulSoup(html, 'lxml')

    # Fields from the race header
    namelist = bsobj.find_all("div", {"class": "track"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "date"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "datetime"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "grade"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "distance"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "prizes"})
    for name in namelist:
        print(name.get_text())

    # Fields from the results block
    namelist = bsobj.find_all("li", {"class": "first essential fin"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "essential greyhound"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "trap"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "sp"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "timesec"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "timedistance"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "essential trainer"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "first essential comment"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("div", {"class": "resultsblockfooter"})
    for name in namelist:
        print(name.get_text())
    namelist = bsobj.find_all("li", {"class": "first essential"})
    for name in namelist:
        print(name.get_text())
First of all, make sure you are not violating the website's terms of use; stay on the legal side.
The markup is not easy to scrape. I would iterate over the race headers and, for every header, get the desired information about the race. Then, get the sibling results block and extract the rows. Here is sample code to get you started; it extracts the track and the greyhound:
    from pprint import pprint
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    html = urlopen("http://www.gbgb.org.uk/resultsmeeting.aspx?id=135754")
    soup = BeautifulSoup(html, 'lxml')

    rows = []
    # Each race has a header div followed by a sibling div holding its results
    for header in soup.find_all("div", class_="resultsblockheader"):
        track = header.find("div", class_="track").get_text(strip=True)

        results = header.find_next_sibling("div", class_="resultsblock").find_all("ul", class_="line1")
        for result in results:
            greyhound = result.find("li", class_="greyhound").get_text(strip=True)

            # One dictionary per runner, carrying the race-level fields along
            rows.append({
                "track": track,
                "greyhound": greyhound
            })

    pprint(rows)
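Once rows is a list of dictionaries like the one built above, one way to get the fields alongside each other, as the question asks, is to write them out as CSV, which any spreadsheet can open. This is only a minimal sketch: the sample values and the helper name rows_to_csv are hypothetical, and it assumes every dictionary has the same keys.

```python
import csv
import io

# Hypothetical sample data in the same shape as the rows list above;
# real values would come from the scraped page.
rows = [
    {"track": "Track A", "greyhound": "Dog One"},
    {"track": "Track A", "greyhound": "Dog Two"},
]


def rows_to_csv(rows, fieldnames):
    """Return the rows as CSV text: a header line, then one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["track", "greyhound"])
print(csv_text)
```

To write to a file instead, the same csv.DictWriter can be given a file opened with open("results.csv", "w", newline=""), where results.csv is a placeholder name.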
Note that every row you see in the tables is represented by 3 lines in the markup:
    <ul class="contents line1"> ... </ul>
    <ul class="contents line2"> ... </ul>
    <ul class="contents line3"> ... </ul>
The greyhound value is inside the first ul (the one with the line1 class). You may also need line2 and line3, which you can get using result.find_next_sibling("ul", class_="line2") and result.find_next_sibling("ul", class_="line3").
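The sibling lookup can be sketched on a small inline sample rather than the live page. Everything in the sample is an assumption made for illustration: the values are invented, and placing the trainer and comment fields in line2 and line3 is a guess, so check the real markup before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the three-lines-per-row layout described
# above. Which fields live in line2 and line3 is assumed, not confirmed.
SAMPLE = """
<div class="resultsblock">
  <ul class="contents line1"><li class="greyhound">Dog One</li></ul>
  <ul class="contents line2"><li class="trainer">Trainer A</li></ul>
  <ul class="contents line3"><li class="comment">Led early</li></ul>
  <ul class="contents line1"><li class="greyhound">Dog Two</li></ul>
  <ul class="contents line2"><li class="trainer">Trainer B</li></ul>
  <ul class="contents line3"><li class="comment">Slow away</li></ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

rows = []
for result in soup.find_all("ul", class_="line1"):
    # The nearest following line2/line3 siblings belong to the same table row
    line2 = result.find_next_sibling("ul", class_="line2")
    line3 = result.find_next_sibling("ul", class_="line3")
    rows.append({
        "greyhound": result.find("li", class_="greyhound").get_text(strip=True),
        "trainer": line2.find("li", class_="trainer").get_text(strip=True),
        "comment": line3.find("li", class_="comment").get_text(strip=True),
    })

# Print each runner's fields alongside each other on one line
for row in rows:
    print(",".join([row["greyhound"], row["trainer"], row["comment"]]))
```

Because class_ matches any single class in a multi-class attribute, "line2" finds elements whose class is "contents line2".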