Python 3.x - Web scraping multiple pages with Beautiful Soup
I am writing code to scrape data from Crowdcube.
The idea is to collect the following information for each project: title, description, target capital, raised capital, and category.
First I made an attempt on a single page. The code worked. Here it is:
    from bs4 import BeautifulSoup
    import urllib.request

    data = {'title': [], 'description': [], 'target': [], 'raised': [], 'category': []}

    l = urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
    tree = BeautifulSoup(l, 'lxml')

    # title
    title = tree.find_all('div', {'class': 'cc-pitch__title'})
    data['title'].append(title[0].find('h2').get_text())

    # description
    description = tree.find_all('div', {'class': 'fullwidth'})
    data['description'].append(description[1].find('p').get_text())

    # target
    target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
    data['target'].append(target[0].find('dd').get_text())

    # raised
    raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
    data['raised'].append(raised[0].find('b').get_text())

    # category
    category = tree.find_all('li', {'class': 'sectors'})
    data['category'].append(category[0].find('span').get_text())

    data

I need to download the same information for all the projects on the website.
All the links are included in this page: (https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7)
To do so, I started by creating a list of URLs with this code:
    source = urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')
    get_link = BeautifulSoup(source, 'lxml')
    links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
    links_page = list(set(links_page))  # drops duplicates
    # drop links that do not point to an investment page
    links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l]

This is an example of the links the code returns:
    ['https://www.crowdcube.com/investment/floodkit-16516',
     'https://www.crowdcube.com/investment/east-end-manufacturing-14667',
     'https://www.crowdcube.com/investment/wrap-it-up-18021']

Once I had the list, I thought I could run a loop with the same code as above. Thus:
    for link in links:
        l = urllib.request.urlopen(link)
        tree = BeautifulSoup(l, 'lxml')

        # title
        title = tree.find_all('div', {'class': 'cc-pitch__title'})
        data['title'].append(title[0].find('h2').get_text())

        # description
        description = tree.find_all('div', {'class': 'fullwidth'})
        data['description'].append(description[1].find('p').get_text())

        # target
        target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
        data['target'].append(target[0].find('dd').get_text())

        # raised
        raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
        data['raised'].append(raised[0].find('b').get_text())

        # category
        category = tree.find_all('li', {'class': 'sectors'})
        data['category'].append(category[0].find('span').get_text())

    data

This does not work. I tried to look at the tree created at the first iteration, and it is empty.
Maybe the problem is related to the fact that the links are strings?
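
On the last point: `urllib.request.urlopen` accepts a URL given as a plain string, so the links being strings is not in itself a problem. A separate check that may be worth making before the loop is that every collected href is an absolute, unique project URL. The sketch below is one way to do that with the standard library only. The helper name `clean_links`, the `BASE` and `PREFIX` constants, and the relative href in the sample list are assumptions made for illustration, not something taken from the Crowdcube page.

```python
from urllib.parse import urljoin

# Assumed constants for this sketch.
BASE = "https://www.crowdcube.com/"
PREFIX = "https://www.crowdcube.com/investment/"


def clean_links(hrefs, base=BASE, prefix=PREFIX):
    """Return absolute, de-duplicated project URLs in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip anchors with a missing or empty href
        url = urljoin(base, href.strip())  # resolve relative hrefs against base
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical input: a duplicate, a relative href, a listing page and a None.
sample = [
    "https://www.crowdcube.com/investment/floodkit-16516",
    "/investment/wrap-it-up-18021",
    "https://www.crowdcube.com/investment/floodkit-16516",
    "https://www.crowdcube.com/investments?sort_by=7",
    None,
]
print(clean_links(sample))
```

Unlike `list(set(...))`, this keeps the links in the order they appear on the page, which can make a failing iteration easier to trace back.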
There are far more than 3 links on the page, 292 in fact. If you want to parse each of them, use the following:
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7"

    def parse(so):
        # Extract the fields of interest from one parsed project page.
        return {'title': so.title.text,
                'description': so.find("div", {"class": "pitch-tabs"}).p.text,
                'target': so.find("div", {"class": "cc-pitch__stats clearfix"}).dd.text,
                'raised': so.find("div", {"class": "cc-pitch__raised"}).b.text,
                'category': " ".join(so.find("li", {"class": "sectors"}).span.text.split())}

    req = requests.get(url)
    soup = BeautifulSoup(req.content, "lxml")
    # A set comprehension, so duplicate links are dropped.
    links = {h.a["href"] for h in soup.find_all("h2", {"class": "pitch__title"})}
    for link in links:
        print(link)
        soup = BeautifulSoup(requests.get(link).content, "lxml")
        print(parse(soup))

A snippet of the output:
    https://www.crowdcube.com/investment/property-moose-14045
    {'category': u'other, internet business, technology', 'raised': u'\xa3169,010', 'target': u'\xa360,000', 'description': u'property moose new generation of property investment \u2013 taking equity crowdfunding model , using allow users invest in wide range of properties \xa3500. combining integrated online platform, property moose aspires take crowdfunding revolution storm.', 'title': u'property moose raising \xa360,000 investment on crowdcube. capital at risk.'}
    https://www.crowdcube.com/investment/easyproperty-com-16655
    {'category': u'professional , business services, internet business', 'raised': u'\xa31,358,680', 'target': u'\xa31,000,000', 'description': u'easyproperty, latest company easygroup, offer individually priced property services. venture, has been founded sir stelios (founder of easyjet) , robert ellice (a property entrepreneur 20 years\u2019 experience), has been described ft \u201ceasily biggest brand name yet enter online estate agent business\u201d.', 'title': u'easyproperty.com raising \xa31,000,000 investment on crowdcube. capital at risk.'}
    https://www.crowdcube.com/investment/universal-fuels-phase-1-10466
    {'category': u'oil & gas', 'raised': u'\xa3100,000', 'target': u'\xa3100,000', 'description': u'universal fuels ltd on 2 years old, supply diesel, petrol, lubricants , kerosene uk wide homes, petrol stations, transport companies, construction firms , range of other businesses. have been\u2026', 'title': u'universal fuels phase 1 raising \xa3100,000 investment on crowdcube. capital at risk.'}
    https://www.crowdcube.com/investment/stakis-daycare-nurseries-ltd-12468
    {'category': u'education, other', 'raised': u'\xa3101,230', 'target': u'\xa3100,000', 'description': u'stakis daycare nurseries new franchise provider of daycare nurseries in uk.', 'title': u'stakis daycare nurseries ltd raising \xa3100,000 investment on crowdcube. capital at risk.'}
    https://www.crowdcube.com/investment/bidstack-20749
    {'category': u'media , creative services, internet business, technology', 'raised': u'\xa3138,970', 'target': u'\xa3100,000', 'description': u"bidstack live bidding platform last-minute digital advertising signage, aiming make digital out of home advertising accessible anyone. bidstack launched video at o2 arena, raising brand awareness first steps disrupt growing \xa3multi-billion industry. team's experience includes \xa3multi-million business exit , overfunded crowdcube campaign.", 'title': u'bidstack raising \xa3100,000 investment on crowdcube. capital at risk.'}
    https://www.crowdcube.com/investment/e-sign-14248
    {'category': u'internet business', 'raised': u'\xa364,760', 'target': u'\xa350,000', 'description': u'e-sign offers our clients secure, advanced electronic signature solution enable important documents signed, when required, person, anywhere, at time. traditional hand written signatures on documents can expensive, time consuming , provide opportunity signature forged. e-sign allows companies conclude business more rapidly, whilst reducing running costs , combating fraud.', 'title': u'e-sign raising \xa350,000 investment on crowdcube. capital at risk.'}
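
The question stored its results as one dict of lists (`data`), while a per-page parser returns one dict per page. A minimal sketch of bridging the two, and of turning the pound amounts into integers, might look like the following. The helper names `parse_amount` and `collect` are hypothetical, the two sample records are shortened to three fields taken from the output above, and `parse_amount` assumes amounts are whole pounds with no decimal part.

```python
import re


def parse_amount(text):
    """Convert a string such as '£169,010' to the integer 169010.

    Assumes whole-pound amounts; every non-digit character is dropped.
    """
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def collect(records, fields=("title", "description", "target", "raised", "category")):
    """Turn a list of per-page dicts into one dict of lists.

    A field missing from a record is stored as None, so all lists
    stay the same length.
    """
    data = {field: [] for field in fields}
    for record in records:
        for field in fields:
            data[field].append(record.get(field))
    return data


# Two records shortened from the output above ('\xa3' is the pound sign).
records = [
    {"category": "oil & gas", "raised": "\xa3100,000", "target": "\xa3100,000"},
    {"category": "education, other", "raised": "\xa3101,230", "target": "\xa3100,000"},
]

data = collect(records, fields=("target", "raised", "category"))
raised = [parse_amount(value) for value in data["raised"]]
print(data["category"])
print(raised)
```

Keeping the lists aligned this way means the result can be passed to a tabular tool later without first checking for ragged columns.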