Python 3.x - Web scraping multiple pages with Beautiful Soup

I am writing code to scrape data from Crowdcube.

The idea is to collect the following information for each project: title, description, target capital, raised capital, and category.

I first made an attempt on a single page. The code worked. Here it is:

    from bs4 import BeautifulSoup
    import urllib.request

    data = {
        'title': [],
        'description': [],
        'target': [],
        'raised': [],
        'category': []
    }

    l = urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
    tree = BeautifulSoup(l, 'lxml')

    # title
    title = tree.find_all('div', {'class': 'cc-pitch__title'})
    data['title'].append(title[0].find('h2').get_text())

    # description
    description = tree.find_all('div', {'class': 'fullwidth'})
    data['description'].append(description[1].find('p').get_text())

    # target
    target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
    data['target'].append(target[0].find('dd').get_text())

    # raised
    raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
    data['raised'].append(raised[0].find('b').get_text())

    # category
    category = tree.find_all('li', {'class': 'sectors'})
    data['category'].append(category[0].find('span').get_text())

    data

I need to download the same information for all the projects on the website.

All the links are included in this page: https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7

To do so, I started by creating a list of URLs with this code:

    source = urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')
    get_link = BeautifulSoup(source, 'lxml')

    links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
    links_page = list(set(links_page))  # drops duplicates
    links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l]  # drop corrupted links

This is an example of the links produced by that code:

    ['https://www.crowdcube.com/investment/floodkit-16516',
     'https://www.crowdcube.com/investment/east-end-manufacturing-14667',
     'https://www.crowdcube.com/investment/wrap-it-up-18021']
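The list-building step above can also be sketched so that it keeps the links in page order and copes with relative hrefs. This is only a minimal sketch, not the code from the question: the function name `pitch_links`, the `sample_hrefs` list, and the idea that some hrefs might be relative or missing are assumptions made for illustration. Only the URL prefix is taken from the question.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.crowdcube.com/"
PITCH_PREFIX = "https://www.crowdcube.com/investment/"


def pitch_links(hrefs, base_url=BASE_URL, prefix=PITCH_PREFIX):
    """Return absolute pitch URLs, without duplicates, in first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip anchors with a missing or empty href
        url = urljoin(base_url, href)  # makes relative hrefs absolute
        if url.startswith(prefix) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs, standing in for what select('a[href]') might yield.
sample_hrefs = [
    "https://www.crowdcube.com/investment/floodkit-16516",
    "/investment/wrap-it-up-18021",
    "https://www.crowdcube.com/investment/floodkit-16516",
    "https://www.crowdcube.com/investments?sort_by=7",
    None,
]
links = pitch_links(sample_hrefs)
```

Unlike `list(set(...))`, this keeps the order in which the links appear, which can make a later run easier to compare against the page.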

Once I had the list, I thought I could run a loop with the same code as above. Thus:

    for link in links:
        l = urllib.request.urlopen(link)
        tree = BeautifulSoup(l, 'lxml')

        # title
        title = tree.find_all('div', {'class': 'cc-pitch__title'})
        data['title'].append(title[0].find('h2').get_text())

        # description
        description = tree.find_all('div', {'class': 'fullwidth'})
        data['description'].append(description[1].find('p').get_text())

        # target
        target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
        data['target'].append(target[0].find('dd').get_text())

        # raised
        raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
        data['raised'].append(raised[0].find('b').get_text())

        # category
        category = tree.find_all('li', {'class': 'sectors'})
        data['category'].append(category[0].find('span').get_text())

    data

This does not work. I tried to look at the tree created at the first iteration, and it is empty.

Maybe the problem is related to the fact that the links are strings?

There are way more than 3 links on the page you linked to (292 of them). If you want to parse each of them, you can do the following:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7"


    def parse(so):
        # Pull each field out of a single pitch page.
        return {'title': so.title.text,
                'description': so.find("div", {"class": "pitch-tabs"}).p.text,
                'target': so.find("div", {"class": "cc-pitch__stats clearfix"}).dd.text,
                'raised': so.find("div", {"class": "cc-pitch__raised"}).b.text,
                'category': " ".join(so.find("li", {"class": "sectors"}).span.text.split())}


    req = requests.get(url)
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(req.content, "lxml")

    # A set comprehension, so duplicate links are dropped.
    links = {h.a["href"] for h in soup.find_all("h2", {"class": "pitch__title"})}

    for link in links:
        print(link)
        soup = BeautifulSoup(requests.get(link).content, "lxml")
        print(parse(soup))

A snippet of the output:

    https://www.crowdcube.com/investment/property-moose-14045
    {'category': u'other, internet business, technology', 'raised': u'\xa3169,010', 'target': u'\xa360,000', 'description': u'property moose new generation of property investment \u2013 taking equity crowdfunding model , using allow users invest in wide range of properties \xa3500. combining integrated online platform, property moose aspires take crowdfunding revolution storm.', 'title': u'property moose raising \xa360,000 investment on crowdcube. capital at risk.'}

    https://www.crowdcube.com/investment/easyproperty-com-16655
    {'category': u'professional , business services, internet business', 'raised': u'\xa31,358,680', 'target': u'\xa31,000,000', 'description': u'easyproperty, latest company easygroup, offer individually priced property services. venture, has been founded sir stelios (founder of easyjet) , robert ellice (a property entrepreneur 20 years\u2019 experience), has been described ft \u201ceasily biggest brand name yet enter online estate agent business\u201d.', 'title': u'easyproperty.com raising \xa31,000,000 investment on crowdcube. capital at risk.'}

    https://www.crowdcube.com/investment/universal-fuels-phase-1-10466
    {'category': u'oil & gas', 'raised': u'\xa3100,000', 'target': u'\xa3100,000', 'description': u'universal fuels ltd on 2 years old, supply diesel, petrol, lubricants , kerosene uk wide homes, petrol stations, transport companies, construction firms , range of other businesses. have been\u2026', 'title': u'universal fuels phase 1 raising \xa3100,000 investment on crowdcube. capital at risk.'}

    https://www.crowdcube.com/investment/stakis-daycare-nurseries-ltd-12468
    {'category': u'education, other', 'raised': u'\xa3101,230', 'target': u'\xa3100,000', 'description': u'stakis daycare nurseries new franchise provider of daycare nurseries in uk.', 'title': u'stakis daycare nurseries ltd raising \xa3100,000 investment on crowdcube. capital at risk.'}

    https://www.crowdcube.com/investment/bidstack-20749
    {'category': u'media , creative services, internet business, technology', 'raised': u'\xa3138,970', 'target': u'\xa3100,000', 'description': u"bidstack live bidding platform last-minute digital advertising signage, aiming make digital out of home advertising accessible anyone. bidstack launched video at o2 arena, raising brand awareness first steps disrupt growing \xa3multi-billion industry. team's experience includes \xa3multi-million business exit , overfunded crowdcube campaign.", 'title': u'bidstack raising \xa3100,000 investment on crowdcube. capital at risk.'}

    https://www.crowdcube.com/investment/e-sign-14248
    {'category': u'internet business', 'raised': u'\xa364,760', 'target': u'\xa350,000', 'description': u'e-sign offers our clients secure, advanced electronic signature solution enable important documents signed, when required, person, anywhere, at time. traditional hand written signatures on documents can expensive, time consuming , provide opportunity signature forged. e-sign allows companies conclude business more rapidly, whilst reducing running costs , combating fraud.', 'title': u'e-sign raising \xa350,000 investment on crowdcube. capital at risk.'}
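The question stores its results in a dict of lists, while `parse()` above returns one dict per page. Here is a minimal sketch of how such per-page records could be turned into that layout with numeric amounts. It assumes the amounts always look like the ones in the output above (a pound sign, digits, and comma separators, in whole pounds). The helper names `parse_amount` and `to_columns` are made up for illustration, and the sample titles are shortened.

```python
def parse_amount(text):
    """Turn a string such as '\xa3169,010' into the integer 169010.

    Assumes whole pounds with comma separators, as in the output above.
    Returns None when the text contains no digits at all.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


def to_columns(records, fields):
    """Convert a list of per-page dicts into one dict of lists."""
    return {field: [rec.get(field) for rec in records] for field in fields}


# Two records shaped like the dicts printed above (titles abbreviated).
records = [
    {"title": "property moose", "raised": "\xa3169,010", "target": "\xa360,000"},
    {"title": "e-sign", "raised": "\xa364,760", "target": "\xa350,000"},
]

for rec in records:
    rec["raised"] = parse_amount(rec["raised"])
    rec["target"] = parse_amount(rec["target"])

data = to_columns(records, ["title", "target", "raised"])
```

In the scraping loop, the same idea would mean appending each `parse(soup)` result to `records` instead of printing it, then calling `to_columns` once at the end.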
