ruby - Crawling a list of URLs and bypassing those with no DNS


I am crawling a large list of URLs with Ruby. Some of the URLs are not active or have no associated DNS record, and when the crawler hits one of those URLs it errors out.

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'net/http'
    require 'colorize'

    url_list = [
      'http://website.com',
      'http://website.net'
    ]

    url_list.each do |url|
      item = "#{url}"
      resp = Net::HTTP.get_response(URI.parse(item))

      # Report the outcome based on the HTTP status code
      case resp.code.to_i
      when 200
        puts "success: #{url}".green
      when 301..303
        new_url = resp['location']
        puts "redirect #{url} => #{new_url}".yellow
      else
        puts resp.code
      end
    end

When I run the script and it hits a bad URL, I receive an error like this:

    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: nodename nor servname provided, or not known (SocketError)
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `open'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:878:in `connect'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:852:in `start'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:583:in `start'
    /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:478:in `get_response'
    spider.rb:808:in `block in <main>'
    spider.rb:806:in `each'
    spider.rb:806:in `<main>'

Use a begin/rescue block to rescue the error and output the error information in red:

    url_list = [
      'http://website.com',
      'http://sdfasdfwqeasdfasdfr.com',
      'http://website.net'
    ]

    url_list.each do |url|
      item = "#{url}"

      begin
        resp = Net::HTTP.get_response(URI.parse(item))

        case resp.code.to_i
        when 200
          puts "success: #{url}".green
        when 301..303
          new_url = resp['location']
          puts "redirect #{url} => #{new_url}".yellow
        else
          puts resp.code
        end
      rescue SocketError => e
        # Raised when the host name cannot be resolved; report it and move on
        puts "error: #{url} - #{e}".red
      end
    end

The output looks like:

    redirect http://website.com => http://www.website.com/
    error: http://sdfasdfwqeasdfasdfr.com - getaddrinfo: nodename nor servname provided, or not known
    success: http://website.net
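
On a large list, DNS failures are probably not the only thing that can stop the crawl: timeouts and refused connections raise different exceptions that `rescue SocketError` does not catch. The following is only a sketch of one way to widen the rescue, not part of the original answer. The names `check_url`, `NETWORK_ERRORS` and the `fetcher` parameter are made up for illustration, and the list of exception classes is an assumption you may need to adjust.

```ruby
require 'net/http'
require 'uri'

# Hypothetical list of errors worth skipping; adjust to what your crawl actually hits.
NETWORK_ERRORS = [
  SocketError,          # DNS lookup failed (the getaddrinfo error above)
  Timeout::Error,       # parent of Net::OpenTimeout and Net::ReadTimeout
  Errno::ECONNREFUSED,  # host resolved but nothing is listening
  Errno::EHOSTUNREACH,  # no route to host
  Errno::ECONNRESET     # connection dropped mid-request
].freeze

# Returns [status, detail] instead of printing, so the caller decides how to report.
# `fetcher` is an assumed hook that lets the HTTP call be swapped out, e.g. in tests.
def check_url(url, fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  resp = fetcher.call(URI.parse(url))

  case resp.code.to_i
  when 200      then [:success, url]
  when 301..303 then [:redirect, resp['location']]
  else               [:other, resp.code]
  end
rescue URI::InvalidURIError, *NETWORK_ERRORS => e
  # Any listed failure becomes an :error result and the caller's loop keeps going.
  [:error, e.message]
end
```

Inside the `url_list.each` loop you would call `check_url(url)` and pick the colour from the returned status, so a dead host yields an `:error` result rather than an unhandled exception.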
