Getting started with Python, web scraping and MS SQL Server on Windows, using a web crawler

To get started, install Python 2.7 on Windows 7 with this *.bat script:
http://mohiplanet.blogspot.com/2015/12/install-python-on-windows-7-scriptbat.html

Download SQL Server 2005 :
https://www.microsoft.com/en-us/download/details.aspx?id=21844
SQL Server 2005 Management Studio :
www.microsoft.com/en-us/download/details.aspx?id=8961
If you are used to the terminal, you can install the command-line client instead of the visual Management Studio:
https://www.microsoft.com/en-us/download/details.aspx?id=36433

Make sure you run the installers and the command prompt as Administrator.

After the installation has completed, check out the command-line tool:

  sqlcmd -S .\SQLEXPRESS
  create database some_db
  go
  use some_db
  go
  select * from some_table
  go

Scraping FEC (Federal Election Commission) filings (getting started with a simple crawler):
Download a sample scraper that downloads all Federal Election Commission electronic filings:
  git clone https://github.com/cschnaars/FEC-Scraper/
  cd FEC-Scraper
Load the FEC database schema into SQL Server with the FECScraper.sql script:
  sqlcmd -S .\SQLEXPRESS
  create database FEC
  go
  exit
  sqlcmd -S .\SQLEXPRESS -i FECScraper.sql
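
If you would rather script the database setup than type it, here is a minimal sketch. It assumes sqlcmd is on the PATH and only uses its -S (server), -Q (run a query and exit), -i (input file) and -b (fail with a non-zero exit code on error) options. The helper names sqlcmd_query_args, sqlcmd_script_args and load_fec_schema are made up for this post and are not part of FEC-Scraper.

```python
import subprocess

SERVER = r".\SQLEXPRESS"  # same instance name as in the steps above


def sqlcmd_query_args(query, server=SERVER):
    """Build the argument list for a one-shot query (-Q runs it and exits)."""
    # -b makes sqlcmd return a non-zero exit code when a batch fails
    return ["sqlcmd", "-S", server, "-b", "-Q", query]


def sqlcmd_script_args(script_path, server=SERVER):
    """Build the argument list for running a .sql file (-i)."""
    return ["sqlcmd", "-S", server, "-b", "-i", script_path]


def load_fec_schema(script_path="FECScraper.sql", run=subprocess.call):
    """Create the FEC database, then run the schema script.

    Returns the exit codes of the steps that ran; stops at the first failure.
    The run parameter is only there so the function can be tried without sqlcmd.
    """
    steps = [
        sqlcmd_query_args("CREATE DATABASE FEC"),
        sqlcmd_script_args(script_path),
    ]
    codes = []
    for args in steps:
        code = run(args)
        codes.append(code)
        if code != 0:
            break
    return codes
```

Calling load_fec_schema() from inside the FEC-Scraper directory should then do the same work as the manual session.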

Set the connection string in both FECScraper.py and FECParser.py as follows:
  connstr = 'DRIVER={SQL Server};SERVER=.\SQLEXPRESS;DATABASE=FEC;UID=;PWD=;'
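
To avoid typos when editing the string in two files, you could assemble it from its parts. This is only a sketch: build_connstr and check_connection are hypothetical helpers of mine, and check_connection assumes the scripts talk to the server through pyodbc (which the DRIVER={...} syntax suggests), so it needs pyodbc installed and a running SQL Server instance.

```python
def build_connstr(server, database, uid="", pwd="", driver="SQL Server"):
    """Assemble an ODBC connection string in the same shape as the one above."""
    return "DRIVER={%s};SERVER=%s;DATABASE=%s;UID=%s;PWD=%s;" % (
        driver, server, database, uid, pwd)


def check_connection(connstr):
    """Open a connection and run a trivial query (needs pyodbc and a live server)."""
    import pyodbc  # third-party; imported here so build_connstr works without it
    conn = pyodbc.connect(connstr)
    try:
        return conn.cursor().execute("SELECT 1").fetchone()[0]
    finally:
        conn.close()


connstr = build_connstr(r".\SQLEXPRESS", "FEC")
print(connstr)
# prints: DRIVER={SQL Server};SERVER=.\SQLEXPRESS;DATABASE=FEC;UID=;PWD=;
```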


Create the following directories, which the crawler works in:
  mkdir C:\Data\
  mkdir C:\Data\Python
  mkdir C:\Data\Python\FEC
  mkdir C:\Data\Python\FEC\Import
  mkdir C:\Data\Python\FEC\Review
  mkdir C:\Data\Python\FEC\Processed
  mkdir C:\Data\Python\FEC\Output
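
The same directory layout can be created from Python, which is handy if you rerun the setup and some folders already exist. A small sketch; ensure_dirs is a name I made up, and the subdirectory list simply mirrors the mkdir commands above.

```python
import os

FEC_SUBDIRS = ["Import", "Review", "Processed", "Output"]


def ensure_dirs(root, subdirs=FEC_SUBDIRS):
    """Create each subdirectory under root if it is missing; return the full paths."""
    paths = [os.path.join(root, name) for name in subdirs]
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates any missing parent directories
    return paths


# Only touch the real location when run as a script on Windows.
if __name__ == "__main__" and os.name == "nt":
    ensure_dirs(r"C:\Data\Python\FEC")
```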

If the scraper can't find any filings, check out this working copy of the code:
https://drive.google.com/file/d/0B5hTtesq_tWdZFo3eThQRzY3aEU/view?usp=sharing
Last time I ran it I had to change one CSS query from "Form F3" to "F3" in FECScraper.py.

To download filings for a specific committee, add its committee ID to commidappend.txt.
For example, for one sample committee:
  echo C00494393 > commidappend.txt
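
If you want to add several committees at once, a few lines of Python can write the file for you. This is a sketch under two assumptions of mine: that a committee ID is the letter C followed by eight digits, and that the file takes one ID per line. write_committee_ids is a hypothetical helper, not part of FEC-Scraper. It also avoids the trailing space that echo ... > leaves at the end of the line.

```python
import re

# Assumption: a committee ID is the letter C followed by eight digits.
COMMITTEE_ID = re.compile(r"^C\d{8}$")


def write_committee_ids(ids, path="commidappend.txt"):
    """Validate the IDs and write them one per line; return the cleaned list."""
    cleaned = [i.strip().upper() for i in ids]
    bad = [i for i in cleaned if not COMMITTEE_ID.match(i)]
    if bad:
        raise ValueError("not a committee ID: %s" % ", ".join(bad))
    with open(path, "w") as handle:
        for committee_id in cleaned:
            handle.write(committee_id + "\n")
    return cleaned
```

For the sample above that would be write_committee_ids(["C00494393"]).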




--------------------------------------------------------------------------------------------------------------
Doing more on scraping FEC filings:
The newer FEC-Scraper-Toolbox supports all FEC filing versions from v1 to v8.1:
  git clone https://github.com/cschnaars/FEC-Scraper-Toolbox
  cd FEC-Scraper-Toolbox
  :: make sure you create the following directories
  mkdir C:\Data\FEC\Master
  mkdir C:\Data\FEC\Master\Archive
  mkdir C:\Data\FEC\Reports\ErrorLogs
  mkdir C:\Data\FEC\Reports\Hold
  mkdir C:\Data\FEC\Reports\Output
  mkdir C:\Data\FEC\Reports\Processed
  mkdir C:\Data\FEC\Reports\Review
  mkdir C:\Data\FEC\Reports\Import
  mkdir C:\Data\FEC\Archives\Processed
  mkdir C:\Data\FEC\Archives\Import
  :: run update_master_files.py, which downloads all committee lists along with
  :: tons of other info
  python update_master_files.py
  :: run this to download the daily filings
  python download_reports.py
  :: run this to parse the filing data and map it into the database
  python parse_reports.py
  :: before parsing, make sure you have run the database SQL script first:
  :: https://drive.google.com/file/d/0B5hTtesq_tWdYUVRSzNCcHlJYjA/view?usp=sharing
  :: and that the Import directory contains *.fec files, not the downloaded *.zip files
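
The three scripts can also be chained from a small Python driver that stops as soon as one step fails, with a quick look at an Import directory before parsing. This is a sketch only: run_steps and import_dir_status are names I made up, the script names are the ones from the listing above, and the directory to check is passed in because it depends on your setup.

```python
import os
import subprocess
import sys

STEPS = ["update_master_files.py", "download_reports.py", "parse_reports.py"]


def import_dir_status(import_dir):
    """Return (fec_files, zip_files) found directly inside import_dir."""
    names = sorted(os.listdir(import_dir))
    fec = [n for n in names if n.lower().endswith(".fec")]
    zips = [n for n in names if n.lower().endswith(".zip")]
    return fec, zips


def run_steps(scripts=STEPS, python=sys.executable, run=subprocess.call):
    """Run each script in order; stop at the first non-zero exit code.

    Returns (script, exit_code) pairs for the steps that actually ran.
    """
    results = []
    for script in scripts:
        code = run([python, script])
        results.append((script, code))
        if code != 0:
            break
    return results
```

Run it from inside the FEC-Scraper-Toolbox directory. If import_dir_status reports zip files but no .fec files, unpack the archives before running parse_reports.py.
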
If any of the commands above are not installed on your machine, see:
http://mohiplanet.blogspot.com/2015/10/convert-windows-command-prompt-to-linux.html

References:
https://s3.amazonaws.com/NICAR2015/FEC/MiningFECData.pdf
