Ruby - Crawling a list of URLs and bypassing those with no DNS
I am crawling a large list of URLs with Ruby. Some of the URLs are not active or have no associated DNS record. When the crawler hits one of these URLs, it errors out.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'colorize'

url_list = [
  'http://website.com',
  'http://website.net'
]

url_list.each do |url|
  item = "#{url}"
  # Raises SocketError if the host name cannot be resolved
  resp = Net::HTTP.get_response(URI.parse(item))
  case resp.code.to_i
  when 200
    puts "success: #{url}".green
  when 301..303
    new_url = resp['location']
    puts "redirect #{url} => #{new_url}".yellow
  else
    resp.code
  end
end
When I run the script and it hits a bad URL, I receive an error like this:
/users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: nodename nor servname provided, or not known (SocketError)
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `open'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:878:in `connect'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:852:in `start'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:583:in `start'
    from /users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:478:in `get_response'
    from spider.rb:808:in `block in <main>'
    from spider.rb:806:in `each'
    from spider.rb:806:in `<main>'
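The trace shows the failure happening in `getaddrinfo`, that is, during name resolution and before any HTTP request is sent. One possible way to skip such hosts up front is sketched below using Ruby's standard `resolv` library. The helper name `resolvable?` and its `resolver:` parameter are hypothetical and only there for illustration. Note that `Resolv` does its own lookups (hosts file and DNS), so its answer may occasionally differ from what `Net::HTTP` sees through the system resolver.

```ruby
require 'resolv'
require 'uri'

# Hypothetical helper: returns true if the URL's host resolves to an address.
# The resolver is injectable (assumption, for illustration and testing);
# by default the standard Resolv module is used.
def resolvable?(url, resolver: Resolv)
  host = URI.parse(url).host
  return false if host.nil?

  resolver.getaddress(host) # raises Resolv::ResolvError when nothing is found
  true
rescue Resolv::ResolvError, Resolv::ResolvTimeout, URI::InvalidURIError
  false
end
```

A list could then be filtered before crawling, for example with `url_list.select { |url| resolvable?(url) }`. This costs an extra lookup per URL, so rescuing the error at request time is often the simpler choice.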
Use a begin/rescue block to rescue the error and output the error info in red:
url_list = [
  'http://website.com',
  'http://sdfasdfwqeasdfasdfr.com',
  'http://website.net'
]

url_list.each do |url|
  item = "#{url}"
  begin
    resp = Net::HTTP.get_response(URI.parse(item))
    case resp.code.to_i
    when 200
      puts "success: #{url}".green
    when 301..303
      new_url = resp['location']
      puts "redirect #{url} => #{new_url}".yellow
    else
      resp.code
    end
  rescue SocketError => e
    # Name resolution failed: report it and move on to the next URL
    puts "error: #{url} - #{e}".red
  end
end
The output looks like:
redirect http://website.com => http://www.website.com/
error: http://sdfasdfwqeasdfasdfr.com - getaddrinfo: nodename nor servname provided, or not known
success: http://website.net
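A crawler over a large list will likely meet other failures besides missing DNS, such as timeouts, refused connections, or malformed URLs. The following is a minimal sketch of the same rescue approach extended to those cases, not part of the original answer. The helper name `check_url` and its `fetcher:` parameter are hypothetical; the result is returned as a pair instead of printed so that the caller decides how to colorize it.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns [status, message] for one URL.
# The fetcher is injectable (assumption, for illustration and testing);
# by default it is Net::HTTP.get_response.
def check_url(url, fetcher: Net::HTTP.method(:get_response))
  resp = fetcher.call(URI.parse(url))
  case resp.code.to_i
  when 200      then [:success, url]
  when 301..303 then [:redirect, "#{url} => #{resp['location']}"]
  else               [:other, "#{url} (HTTP #{resp.code})"]
  end
rescue SocketError, Timeout::Error, SystemCallError, URI::InvalidURIError => e
  # SocketError: no DNS; Timeout::Error: open/read timeouts;
  # SystemCallError: Errno::* such as a refused connection
  [:error, "#{url} - #{e.message}"]
end
```

With the list from the answer, `url_list.each { |url| p check_url(url) }` would print one pair per URL and keep going past the unresolvable host.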