Getting started with web crawling with Ruby 2 on CentOS

The Ruby version that ships with CentOS 6 is old and can cause a lot of setup issues when installing crawler packages (gems). The following series of commands will help us install a fresh copy of Ruby 2.2.3. You can select and copy the whole script and paste it into your terminal: the lines starting with a hash sign ('#') and the shell comment blocks (: <<'END' ... END) are there so that the explanatory text added to the script is ignored as you go along and you get the job done :)

# Uninstall previous gems:
  gem update --system
  gem --version
  #should produce something like:
  #2.1.8
  gem uninstall --all
# Remove previous ruby installation:
  rm -f /usr/bin/ruby
  rm -f /usr/local/bin/ruby
  yum remove ruby -y
  yum remove rubygems -y
  #update installed packages
  yum update -y
# Install Ruby 2 on CentOS 6:
  cd /opt/
  #download
  wget --no-check-certificate https://ftp.ruby-lang.org/pub/ruby/ruby-2.2.3.tar.gz
  #extract
  tar xvzf ruby-2.2.3.tar.gz
  #remove the downloaded archive
  rm -f ruby-2.2.3.tar.gz
  cd ruby-2.2.3
  #build
  ./configure
  make
  make install
  #create symlink (make install puts ruby in /usr/local/bin by default,
  #so only /usr/bin needs a link)
  ln -s /usr/local/bin/ruby /usr/bin/ruby
  #check ruby version
  ruby --version
  #should produce something like:
  #ruby 2.2.3p173 (2015-08-18 revision 51636) [i686-linux]

# Install updated rubygems:
  cd /opt/
  #download rubygems 1.8
  wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.24.tgz
  #extract
  tar xvzf rubygems-1.8.24.tgz
  #remove the downloaded archive
  rm -f rubygems-1.8.24.tgz
  cd rubygems-1.8.24
  ruby setup.rb
  #check gem version
  gem --version
  #should produce something like:
  #1.8.24


  #check Ruby REPL version
  irb --version
  #should produce something like:
  #irb 0.9.5(05/04/13)
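
The same version checks can also be made from inside Ruby, which is handy at the top of a crawl script. This is only a minimal sketch: the helper name `at_least?` is made up for this example, and 2.2.3 is simply the version this guide installs.

```ruby
require 'rubygems'

# Hypothetical helper (not from any library): true when the version
# string `actual` is greater than or equal to `wanted`.
def at_least?(actual, wanted)
  Gem::Version.new(actual) >= Gem::Version.new(wanted)
end

# Print the interpreter and rubygems versions that are actually in use.
puts "ruby #{RUBY_VERSION}, rubygems #{Gem::VERSION}"

# 2.2.3 is the version targeted by this guide.
puts 'ruby is older than the version used in this guide' unless at_least?(RUBY_VERSION, '2.2.3')
```

Comparing through Gem::Version instead of plain strings matters because a string comparison would rank "2.10.0" below "2.2.3".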

  #install a sample crawler package
  gem install fech
  #see installed crawler package version
  gem list fech
  #should produce something like this:
  #
  #*** LOCAL GEMS ***
  #
  #fech (1.8)
  #
# Run the ruby REPL:

  irb
  #try out Helloworld!
  puts 'Helloworld'
  #load the crawler package
  require 'fech'
  #run the following lines in the REPL; the installed FEC crawler
  #package saves the crawled data at:
  #/tmp/723604.fec
  filing = Fech::Filing.new(723604)
  filing.download

# See properties and methods of a ruby object:
  filing.inspect
  #will print all properties (instance variables) of this object
  : <<'END'
  "#<Fech::Filing:0x29ba8b4 @filing_id=1029398, @download_dir=\"/tmp\", @translator=nil, @quote_char=\"\\\"\", @csv_parser=Fech::Csv, @resaved=false, @customized=false, @encoding=\"iso-8859-1:utf-8\">"
  END
  filing.methods.sort
  #will print all methods of this object (including the property accessors), sorted by name
  : <<'END'
  [:!, :!=, :!~, :<=>, :==, :===, :=~, :__id__, :__send__, :amendment?, :amends, :class, :clone, :custom_file_path, :define_singleton_method, :delimiter, :display, :download, :download_dir, :download_dir=, :dup, :each_row, :each_row_with_index, :enum_for, :eql?, :equal?, :extend, :file_contents, :file_name, :file_path, :filing_id, :filing_id=, :filing_url, :filing_version, :fix_f99_contents, :form_type, :freeze, :frozen?, :hash, :hash_zip, :header, :inspect, :instance_eval, :instance_exec, :instance_of?, :instance_variable_defined?, :instance_variable_get, :instance_variable_set, :instance_variables, :is_a?, :itself, :kind_of?, :map, :map_for, :mappings, :method, :methods, :nil?, :object_id, :parse_filing_version, :parse_row?, :private_methods, :protected_methods, :public_method, :public_methods, :public_send, :readable?, :remove_instance_variable, :resave_f99_contents, :respond_to?, :rows_like, :send, :singleton_class, :singleton_method, :singleton_methods, :summary, :taint, :tainted?, :tap, :to_enum, :to_s, :translate, :translator, :trust, :untaint, :untrust, :untrusted?]
  END
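
Most of that long list consists of methods that every Ruby object has. To see only what a class adds on top, one option is to subtract the instance methods of Object. The following is a minimal sketch: `SampleFiling` and `own_methods` are made-up names for illustration (they are not part of fech), so the example runs without the gem installed.

```ruby
# Stand-in class for illustration only; it is NOT part of fech.
class SampleFiling
  attr_accessor :filing_id

  def initialize(filing_id)
    @filing_id = filing_id
  end

  def file_name
    "#{@filing_id}.fec"
  end
end

# Hypothetical helper: the methods an object has beyond those of a
# plain Object, sorted by name.
def own_methods(obj)
  (obj.methods - Object.instance_methods).sort
end

sample = SampleFiling.new(723604)
p own_methods(sample)         # prints [:file_name, :filing_id, :filing_id=]
p sample.instance_variables   # prints [:@filing_id]
```

Called on the real object as own_methods(filing), the same subtraction should drop the generic entries like :class and :dup and keep the fech-specific ones such as :download and :form_type.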

Be sure to install wget before running the script (for example with yum install wget -y).

A simple crawl script with Ruby:

The following sample Ruby script downloads every filing in a range of filing IDs from the FEC website and keeps only the F3P filings:
  require 'fech'
  require 'fileutils'
  require 'logger'
  # keep up to 10 rotated log files of about 100MB each
  logger = Logger.new('fec-f3p-filings-downloader.log', 10, 102400000)
  # make sure the directory for the F3P filings exists
  FileUtils.mkdir_p('/usr/local/fec-f3p-filings')
  # download filings from 2001 to 2015 Nov 13
  (11850..1032472).each do |i|
    filing = Fech::Filing.new(i)
    logger.info("Downloading... #{i}.fec")
    filing.download
    # keep the filing only if its form type is F3P
    type = filing.form_type
    if type.include? "F3P"
      logger.info("filing is F3P type")
      logger.info("moving into filings directory...")
      FileUtils.mv("/tmp/#{i}.fec", "/usr/local/fec-f3p-filings/#{i}.fec")
    else
      logger.info("Form type is #{type}")
      logger.info("Deleting... /tmp/#{i}.fec")
      FileUtils.rm("/tmp/#{i}.fec")
    end
  end
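
The keep-or-delete step of this loop can be tried out without contacting the FEC servers at all. What follows is a sketch under stated assumptions: `sort_filing` is a made-up helper name, the file names and form types in the demo are invented, and temporary directories stand in for /tmp and /usr/local/fec-f3p-filings.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: move the file into dest_dir when the form type
# contains "F3P", otherwise delete it. Returns :kept or :deleted.
def sort_filing(path, form_type, dest_dir)
  if form_type.to_s.include?('F3P')
    FileUtils.mkdir_p(dest_dir)
    FileUtils.mv(path, File.join(dest_dir, File.basename(path)))
    :kept
  else
    FileUtils.rm_f(path)
    :deleted
  end
end

Dir.mktmpdir do |dir|
  dest = File.join(dir, 'f3p')
  # Fake filings: the names and form types are invented for this demo.
  { '1.fec' => 'F3PN', '2.fec' => 'F3XN' }.each do |name, type|
    path = File.join(dir, name)
    File.write(path, "HDR\n")
    puts "#{name}: #{sort_filing(path, type, dest)}"
  end
  # List what ended up in the destination directory.
  puts (Dir.entries(dest) - %w[. ..]).inspect
end
```

Inside the crawl loop the equivalent call would be sort_filing("/tmp/#{i}.fec", filing.form_type, "/usr/local/fec-f3p-filings"). Pulling the decision into one small method makes it easy to check the F3P matching on sample files before starting a run over a million filing IDs.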
