
Installing Arelle 32 bit on Windows 7

Current download pages may no longer list the old 32-bit version, but if you are running a 32-bit version of Windows you can still download and install it from an archived copy of the website :)
Download it from here
https://web.archive.org/web/20140731125505/http://arelle.org/wordpress/wp-content/uploads/downloads/2013/08/arelle-win-x86-2013-08-15.exe



After installing, you can update from the menu:
Help -> Check for Updates

Start at:
File -> Open Web -> SEC RSS
which loads the recent RSS feed of SEC filings.
Right-click any filing and select 'Filing -> Open Instance Document' to load and display every part of the filing.
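The SEC RSS feed that Arelle fetches is ordinary RSS 2.0, so you can also peek at it outside Arelle. Here is a minimal Python sketch parsing a made-up miniature of the feed (not live data):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the SEC EDGAR RSS feed Arelle loads; the live
# feed is at https://www.sec.gov/Archives/edgar/xbrlrss.all.xml
sample_rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Latest Filings</title>
    <item><title>ACME CORP (0000123456) (10-Q)</title></item>
    <item><title>EXAMPLE INC (0000654321) (10-K)</title></item>
  </channel>
</rss>"""

root = ET.fromstring(sample_rss)
# each <item> is one filing; its <title> names the registrant and form type
filings = [item.findtext("title") for item in root.iter("item")]
print(filings)
```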

Related:
http://mohiplanet.blogspot.com/2016/02/getting-started-with-open-source-xbrl.html

Getting started with open source xbrl platform Arelle on CentOS

Download:
Download Arelle red hat distribution from http://arelle.org/downloads

http://arelle.org/downloads/16

Install Arelle:
Extract:
  1. tar -zxvf arelle-redhat-x86_64-2014-12-31.tar.gz
Move into directory:
  1. cd arelle-redhat-x86_64-2014-12-31

Run the following to verify the installation and see the help:
  1. ./arelleCmdLine -h

Usage: arelleCmdLine [options]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -f ENTRYPOINTFILE, --file=ENTRYPOINTFILE
                        FILENAME is an entry point, which may be an XBRL
                        instance, schema, linkbase file, inline XBRL instance,
                        testcase file, testcase index file.  FILENAME may be a
                        local file or a URI to a web located file.
  --username=USERNAME   user name if needed (with password) for web file
                        retrieval
  --password=PASSWORD   password if needed (with user name) for web retrieval
  -i IMPORTFILES, --import=IMPORTFILES
                        FILENAME is a list of files to import to the DTS, such
                        as additional formula or label linkbases.  Multiple
                        file names are separated by a '|' character.
  -d DIFFFILE, --diff=DIFFFILE
                        FILENAME is a second entry point when comparing
                        (diffing) two DTSes producing a versioning report.
  -r VERSREPORTFILE, --report=VERSREPORTFILE
                        FILENAME is the filename to save as the versioning
                        report.
  -v, --validate        Validate the file according to the entry file type.
                        If an XBRL file, it is validated according to XBRL
                        validation 2.1, calculation linkbase validation if
                        either --calcDecimals or --calcPrecision are
                        specified, and SEC EDGAR Filing Manual (if --efm
                        selected) or Global Filer Manual disclosure system
                        validation (if --gfm=XXX selected). If a test suite or
                        testcase, the test case variations are individually so
                        validated. If formulae are present they will be
                        validated and run unless --formula=none is specified.
  --calcDecimals        Specify calculation linkbase validation inferring
                        decimals.
  --calcPrecision       Specify calculation linkbase validation inferring
                        precision.
  --efm                 Select Edgar Filer Manual (U.S. SEC) disclosure system
                        validation (strict).
  --disclosureSystem=DISCLOSURESYSTEMNAME
                        Specify a disclosure system name and select disclosure
                        system validation.  Enter --disclosureSystem=help for
                        list of names or help-verbose for list of names and
                        descriptions.
  --hmrc                Select U.K. HMRC disclosure system validation.
  --utr                 Select validation with respect to Unit Type Registry.
  --utrUrl=UTRURL       Override disclosure systems Unit Type Registry
                        location (URL or file path).
  --infoset             Select validation with respect testcase infosets.
  --labelLang=LABELLANG
                        Language for labels in following file options
                        (override system settings)
  --labelRole=LABELROLE
                        Label role for labels in following file options
                        (instead of standard label)
  --DTS=DTSFILE, --csvDTS=DTSFILE
                        Write DTS tree into FILE (may be .csv or .html)
  --facts=FACTSFILE, --csvFacts=FACTSFILE
                        Write fact list into FILE
  --factListCols=FACTLISTCOLS
                        Columns for fact list file
  --factTable=FACTTABLEFILE, --csvFactTable=FACTTABLEFILE
                        Write fact table into FILE
  --concepts=CONCEPTSFILE, --csvConcepts=CONCEPTSFILE
                        Write concepts into FILE
  --pre=PREFILE, --csvPre=PREFILE
                        Write presentation linkbase into FILE
  --cal=CALFILE, --csvCal=CALFILE
                        Write calculation linkbase into FILE
  --dim=DIMFILE, --csvDim=DIMFILE
                        Write dimensions (of definition) linkbase into FILE
  --formulae=FORMULAEFILE, --htmlFormulae=FORMULAEFILE
                        Write formulae linkbase into FILE
  --viewArcrole=VIEWARCROLE
                        Write linkbase relationships for viewArcrole into
                        viewFile
  --viewFile=VIEWFILE   Write linkbase relationships for viewArcrole into
                        viewFile
  --roleTypes=ROLETYPESFILE
                        Write defined role types into FILE
  --arcroleTypes=ARCROLETYPESFILE
                        Write defined arcrole types into FILE
  --testReport=TESTREPORT, --csvTestReport=TESTREPORT
                        Write test report of validation (of test cases) into
                        FILE
  --testReportCols=TESTREPORTCOLS
                        Columns for test report file
  --rssReport=RSSREPORT
                        Write RSS report into FILE
  --rssReportCols=RSSREPORTCOLS
                        Columns for RSS report file
  --skipDTS             Skip DTS activities (loading, discovery, validation),
                        useful when an instance needs only to be parsed.
  --skipLoading=SKIPLOADING
                        Skip loading discovered or schemaLocated files
                        matching pattern (unix-style file name patterns
                        separated by '|'), useful when not all linkbases are
                        needed.
  --logFile=LOGFILE     Write log messages into file, otherwise they go to
                        standard output.  If file ends in .xml it is xml-
                        formatted, otherwise it is text.
  --logFormat=LOGFORMAT
                        Logging format for messages capture, otherwise default
                        is "[%(messageCode)s] %(message)s - %(file)s".
  --logLevel=LOGLEVEL   Minimum level for messages capture, otherwise the
                        message is ignored.  Current order of levels are
                        debug, info, info-semantic, warning, warning-semantic,
                        warning, assertion-satisfied, inconsistency, error-
                        semantic, assertion-not-satisfied, and error.
  --logLevelFilter=LOGLEVELFILTER
                        Regular expression filter for logLevel.  (E.g., to not
                        match *-semantic levels,
                        logLevelFilter=(?!^.*-semantic$)(.+).
  --logCodeFilter=LOGCODEFILTER
                        Regular expression filter for log message code.
  --parameters=PARAMETERS
                        Specify parameters for formula and validation
                        (name=value[,name=value]).
  --parameterSeparator=PARAMETERSEPARATOR
                        Specify parameters separator string (if other than
                        comma).
  --formula=FORMULAACTION
                        Specify formula action: validate - validate only,
                        without running, run - validate and run, or none -
                        prevent formula validation or running when also
                        specifying -v or --validate.  if this option is not
                        specified, -v or --validate will validate and run
                        formulas if present
  --formulaParamExprResult
                        Specify formula tracing.
  --formulaParamInputValue
                        Specify formula tracing.
  --formulaCallExprSource
                        Specify formula tracing.
  --formulaCallExprCode
                        Specify formula tracing.
  --formulaCallExprEval
                        Specify formula tracing.
  --formulaCallExprResult
                        Specify formula tracing.
  --formulaVarSetExprEval
                        Specify formula tracing.
  --formulaVarSetExprResult
                        Specify formula tracing.
  --formulaVarSetTiming
                        Specify showing times of variable set evaluation.
  --formulaAsserResultCounts
                        Specify formula tracing.
  --formulaSatisfiedAsser
                        Specify formula tracing.
  --formulaUnsatisfiedAsser
                        Specify formula tracing.
  --formulaUnsatisfiedAsserError
                        Specify formula tracing.
  --formulaFormulaRules
                        Specify formula tracing.
  --formulaVarsOrder    Specify formula tracing.
  --formulaVarExpressionSource
                        Specify formula tracing.
  --formulaVarExpressionCode
                        Specify formula tracing.
  --formulaVarExpressionEvaluation
                        Specify formula tracing.
  --formulaVarExpressionResult
                        Specify formula tracing.
  --formulaVarFilterWinnowing
                        Specify formula tracing.
  --formulaVarFiltersResult
                        Specify formula tracing.
  --formulaRunIDs=FORMULARUNIDS
                        Specify formula/assertion IDs to run, separated by a
                        '|' character.
  --uiLang=UILANG       Language for user interface (override system settings,
                        such as program messages).  Does not save setting.
  --proxy=PROXY         Modify and re-save proxy settings configuration.
                        Enter 'system' to use system proxy setting, 'none' to
                        use no proxy, 'http://[user[:password]@]host[:port]'
                        (e.g., http://192.168.1.253, http://example.com:8080,
                        http://joe:secret@example.com:8080),  or 'show' to
                        show current setting, .
  --internetConnectivity=INTERNETCONNECTIVITY
                        Specify internet connectivity: online or offline
  --internetTimeout=INTERNETTIMEOUT
                        Specify internet connection timeout in seconds (0
                        means unlimited).
  --internetRecheck=INTERNETRECHECK
                        Specify rechecking cache files (weekly is default)
  --internetLogDownloads
                        Log info message for downloads to web cache.
  --xdgConfigHome=XDGCONFIGHOME
                        Specify non-standard location for configuration and
                        cache files (overrides environment parameter
                        XDG_CONFIG_HOME).
  --plugins=PLUGINS     Modify plug-in configuration.  Re-save unless 'temp'
                        is in the module list.  Enter 'show' to show current
                        plug-in configuration.  Commands show, and module urls
                        are '|' separated: +url to add plug-in by its url or
                        filename, ~name to reload a plug-in by its name, -name
                        to remove a plug-in by its name, relative URLs are
                        relative to installation plug-in directory,  (e.g.,
                        '+http://arelle.org/files/hello_web.py', '+C:\Program
                        Files\Arelle\examples\plugin\hello_dolly.py' to load,
                        or +../examples/plugin/hello_dolly.py for relative use
                        of examples directory, ~Hello Dolly to reload, -Hello
                        Dolly to remove).  If + is omitted from .py file
                        nothing is saved (same as temp).  Packaged plug-in
                        urls are their directory's url.
  --packages=PACKAGES   Modify taxonomy packages configuration.  Re-save
                        unless 'temp' is in the module list.  Enter 'show' to
                        show current packages configuration.  Commands show,
                        and module urls are '|' separated: +url to add package
                        by its url or filename, ~name to reload package by its
                        name, -name to remove a package by its name, URLs are
                        full absolute paths.  If + is omitted from package
                        file nothing is saved (same as temp).
  --packageManifestName=PACKAGEMANIFESTNAME
                        Provide non-standard archive manifest file name
                        pattern (e.g., *taxonomyPackage.xml).  Uses unix file
                        name pattern matching.  Multiple manifest files are
                        supported in archive (such as oasis catalogs).
                        (Replaces search for either .taxonomyPackage.xml or
                        catalog.xml).
  --abortOnMajorError   Abort process on major error, such as when load is
                        unable to find an entry or discovered file.
  --showEnvironment     Show Arelle's config and cache directory and host OS
                        environment parameters.
  --collectProfileStats
                        Collect profile statistics, such as timing of
                        validation activities and formulae.
  --webserver=WEBSERVER
                        start web server on host:port[:server] for REST and
                        web access, e.g., --webserver locahost:8080, or
                        specify nondefault a server name, such as cherrypy,
                        --webserver locahost:8080:cherrypy. (It is possible to
                        specify options to be defaults for the web server,
                        such as disclosureSystem and validations, but not
                        including file names.)
  --store-to-XBRL-DB=STOREINTOXBRLDB
                        Store into XBRL DB.  Provides connection string: host,
                        port,user,password,database[,timeout[,{postgres|rexste
                        r|rdfDB}]]. Autodetects database type unless 7th
                        parameter is provided.
  --load-from-XBRL-DB=LOADFROMXBRLDB
                        Load from XBRL DB.  Provides connection string: host,p
                        ort,user,password,database[,timeout[,{postgres|rexster
                        |rdfDB}]]. Specifies DB parameters to load and
                        optional file to save XBRL into.
  -a, --about           Show product version, copyright, and license.
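All of these options can be scripted. As a sketch, here is how you might assemble an arelleCmdLine call from Python (the flags come from the help text above; the feed URL and log name are just examples, and the command is only printed, not executed):

```python
# Build an arelleCmdLine invocation from the options documented above.
cmd = [
    "./arelleCmdLine",
    "-f", "https://www.sec.gov/Archives/edgar/xbrlrss.all.xml",
    "-v",                       # validate the entry file
    "--logFile", "arelle.log",  # send messages to a file instead of stdout
]
print(" ".join(cmd))
# To actually run it (requires the Arelle distribution installed above):
#   import subprocess
#   subprocess.run(cmd, check=True)
```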

Make sure you have Python and the pg8000 driver installed; if not, run:
  1. pip install pg8000

Creating the database:

Download the DDL script from https://github.com/Arelle/Arelle/blob/master/arelle/plugin/xbrlDB/xbrlSemanticPostgresDB.ddl (wget needs the raw file URL, not the HTML page):
  1. wget --no-check-certificate https://raw.githubusercontent.com/Arelle/Arelle/master/arelle/plugin/xbrlDB/xbrlSemanticPostgresDB.ddl
  1. su - postgres
  2. createdb sec
Run DDL script on 'sec' database:
  1. psql -h HOST -U USERNAME -d sec -a -f xbrlSemanticPostgresDB.ddl

Add plugins:
see all installed plugins:
  1. ./arelleCmdLine --plugins show
Should produce something like:
[info] Plug-in modules: -
[info] Plug-in: XBRL Database; author: Mark V Systems Limited; version: 0.9; status: enabled; date: 2014-12-09T04:41:53 UTC; description: This plug-in implements the XBRL Public Postgres, Abstract Model and DPM Databases.  ; license Apache-2 (Arelle plug-in), BSD license (pg8000 library). - xbrlDB
Install plugin xbrlDB:
  1. ./arelleCmdLine --plugins +xbrlDB
should print:
[info] Addition of plug-in XBRL Database successful. - xbrlDB
Scrape, parse and populate SEC filing data into DB:
run:
  1. ./arelleCmdLine -f https://www.sec.gov/Archives/edgar/xbrlrss.all.xml -v --store-to-XBRL-DB "HOST,5432,usrname,password,sec,120,pgSemantic"
This downloads the latest 100 SEC filings from the feed and stores the tag-based XBRL data in the database at HOST:5432/sec.
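The connection string is just comma-separated fields in the order given by the help text (host, port, user, password, database, timeout, DB interface). A small sketch that splits it apart (values are placeholders):

```python
# Split the --store-to-XBRL-DB connection string into named fields.
conn = "HOST,5432,username,password,sec,120,pgSemantic"
fields = ("host", "port", "user", "password", "database", "timeout", "product")
params = dict(zip(fields, conn.split(",")))
print(params)
```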

To download, parse, and store all SEC filings for a single month, say January 2016, simply run:
  1. ./arelleCmdLine -f https://www.sec.gov/Archives/edgar/monthly/xbrlrss-2016-01.xml -v --store-to-XBRL-DB "HOST,5432,username,password,sec,120,pgSemantic"
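If you want to sweep a whole year, the monthly feed URLs appear to follow the same pattern as the 2016-01 one above, so they can be generated in a loop (this assumes the pattern holds for other months):

```python
# Generate the monthly EDGAR RSS URLs for one year, following the pattern
# of the xbrlrss-2016-01.xml URL above (pattern assumed for other months).
base = "https://www.sec.gov/Archives/edgar/monthly/xbrlrss-{}-{:02d}.xml"
urls = [base.format(2016, month) for month in range(1, 13)]
print(urls[0])
print(urls[-1])
```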

Querying the database:
Say for a single company with central index key 0001372183: queries joining the aspect, data_point, period, and entity_identifier tables pull out the tag-based financial information. For example, the following lists the distinct period end dates:
  1. select distinct period.end_date from aspect,data_point,period,entity_identifier where aspect.aspect_id=data_point.aspect_id and data_point.report_id=entity_identifier.report_id and period.period_id=data_point.period_id and entity_identifier.identifier='0001372183';
Sample output (from a similar query that also selects the tag name and fact value):
dei_AmendmentFlag,false,2015-12-01
dei_CurrentFiscalYearEndDate,--02-29,2015-12-01
dei_DocumentFiscalPeriodFocus,Q3,2015-12-01
dei_DocumentFiscalYearFocus,2016,2015-12-01
dei_DocumentPeriodEndDate,2015-11-30,2015-12-01
dei_DocumentType,10-Q,2015-12-01
dei_EntityCentralIndexKey,0001372183,2015-12-01
dei_EntityCommonStockSharesOutstanding,5491753,2016-01-19
dei_EntityFilerCategory,Smaller Reporting Company,2015-12-01
dei_EntityRegistrantName,"Monaker Group, Inc.",2015-12-01
dei_TradingSymbol,MKGI,2015-12-01
invest_InvestmentWarrantsExercisePrice,0.05,2012-08-22
invest_InvestmentWarrantsExercisePrice,3,2009-03-01
mkgi_AdditionalExpendituresForCostsAssociatedWithEmploymentWebsite,10000,2015-12-01
mkgi_AdditionalOffsettingRentExpenseMonthly,2500,2015-12-01
mkgi_AdvancesConversionConvertedIntoPromissoryNoteAmount,70000,2011-04-14
mkgi_AdvancesToFormerSubsidiary,75000,2015-12-01
mkgi_AssetsImpairmentChargesShares,0,2015-12-01
mkgi_AssignmentOfPrincipalToNonRelatedParty,225000,2012-02-16
mkgi_CarryingValueOfBusinessAfterAdjustments,7811286,2014-11-01
mkgi_CarryingValueOfBusinessAfterAudit,1556098,2014-11-01
mkgi_CashFromMerger,0,2014-12-01
mkgi_CashFromMerger,56902,2015-12-01
...............................................................
...............................................................
...............................................................
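The output rows are plain CSV (tag, value, period end date). A small sketch loading rows like those above with Python's csv module (the csv module matters because values such as company names can contain commas):

```python
import csv
import io

# Rows copied from the sample output above: tag, value, period end date.
sample = """dei_DocumentType,10-Q,2015-12-01
dei_TradingSymbol,MKGI,2015-12-01
dei_EntityRegistrantName,"Monaker Group, Inc.",2015-12-01"""

rows = [
    {"tag": tag, "value": value, "end_date": end_date}
    for tag, value, end_date in csv.reader(io.StringIO(sample))
]
print(rows[2]["value"])
```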

Full Output File:
https://www.dropbox.com/s/m9finc2x88k75jl/sec-filing-1372183.csv?dl=0
Here is a list of all us-gaap tags with descriptions:
https://github.com/ifanchu/pyXBRL/blob/master/us-gaap/concepts_2014.csv
Cleaning up:
By default Arelle stores every downloaded filing under:
/root/.config/arelle/cache/

So periodically you may want to clean it up:
  1. rm -rf /root/.config/arelle/cache/*

References:
http://arelle.org/documentation/xbrl-database/
http://www.openfiling.info/wp-content/upLoads/data/ArelleUsersManual.pdf
http://arelle.org/wordpress/wp-content/uploads/downloads/2011/09/ComparabilityAndDataMiningUnifiedModel-Paper.pdf

Related:
http://mohiplanet.blogspot.com/2016/02/installing-arelle-32-bit-on-windows-7.html

Getting started with Python, Web Scraping, MS SQL Server, Windows with a web crawler

To get started, install Python 2.7 on Windows 7 with this *.bat script:
http://mohiplanet.blogspot.com/2015/12/install-python-on-windows-7-scriptbat.html

Download SQL Server 2005 :
https://www.microsoft.com/en-us/download/details.aspx?id=21844
SQL Server 2005 Management Studio :
www.microsoft.com/en-us/download/details.aspx?id=8961
If you are more comfortable in a terminal, you can install the command-line client instead of the visual Management Studio:
https://www.microsoft.com/en-us/download/details.aspx?id=36433

Make sure you run the installers with Administrator privileges.

After installation completes, check out the command-line tool:

  1. sqlcmd -S .\SQLEXPRESS
  2. create database some_db
  3. go
  4. use some_db
  5. go
  6. select * from some_table
  7. go
Scraping FEC (Federal Election Commission) filings (getting started with a simple crawler):
Download a sample scraper which downloads all Federal Election Commission electronic filings:
  1. git clone https://github.com/cschnaars/FEC-Scraper/
  2. cd FEC-Scraper
Load the FEC database schema into SQL Server with the bundled script:
  1. sqlcmd -S .\SQLEXPRESS
  2. create database FEC
  3. go
  4. exit
  5. sqlcmd -S .\SQLEXPRESS -i FECScraper.sql

Set up the connection string in both FECScraper.py and FECParser.py as follows:
  1. connstr = 'DRIVER={SQL Server};SERVER=.\SQLEXPRESS;DATABASE=FEC;UID=;PWD=;'
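That connection string is a semicolon-separated list of KEY=VALUE pairs; a small sketch that splits it apart so you can check or tweak individual settings:

```python
# Split the pyodbc-style connection string into key/value pairs so
# individual settings (server, database, credentials) can be inspected.
connstr = r'DRIVER={SQL Server};SERVER=.\SQLEXPRESS;DATABASE=FEC;UID=;PWD=;'
parts = dict(item.split("=", 1) for item in connstr.rstrip(";").split(";"))
print(parts["SERVER"], parts["DATABASE"])
```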


Create the following directories for the crawler:
  1. mkdir C:\Data\
  2. mkdir C:\Data\Python
  3. mkdir C:\Data\Python\FEC
  4. mkdir C:\Data\Python\FEC\Import
  5. mkdir C:\Data\Python\FEC\Review
  6. mkdir C:\Data\Python\FEC\Processed
  7. mkdir C:\Data\Python\FEC\Output
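The same directory tree can be created in one pass from Python; a minimal sketch (the Windows base path is the one the scraper expects, shown commented out so the function stays portable):

```python
import os

def make_fec_dirs(base):
    """Create the crawler's working directories under base, mirroring the
    mkdir commands above."""
    for sub in ("Import", "Review", "Processed", "Output"):
        # exist_ok avoids an error if a directory is already there
        os.makedirs(os.path.join(base, sub), exist_ok=True)

# On Windows the scraper expects:
#   make_fec_dirs(r"C:\Data\Python\FEC")
```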

In case you can't find any data filings, check out this working code:
https://drive.google.com/file/d/0B5hTtesq_tWdZFo3eThQRzY3aEU/view?usp=sharing
Last time I had to change one CSS query from "Form F3" to "F3" in FECScraper.py.

Check a sample committee for downloading specific filings:
Add one committee id to commidappend.txt:
  1. echo C00494393 > commidappend.txt
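Equivalently, you can write the committee IDs from Python; a small sketch that produces the same commidappend.txt (one ID per line):

```python
# Write committee IDs (one per line) to commidappend.txt, as the echo
# command above does for the single sample committee C00494393.
committee_ids = ["C00494393"]  # add more IDs as needed

with open("commidappend.txt", "w") as f:
    for cid in committee_ids:
        f.write(cid + "\n")
```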




--------------------------------------------------------------------------------------------------------------
Doing more on scraping FEC filings:
The latest FEC Scraper Toolbox supports all FEC filing versions from v1 to v8.1:
  1. git clone https://github.com/cschnaars/FEC-Scraper-Toolbox
  2. cd FEC-Scraper-Toolbox
  3. :: make sure you create the following directories
  4. mkdir C:\Data\FEC\Master
  5. mkdir C:\Data\FEC\Master\Archive
  6. mkdir C:\Data\FEC\Reports\ErrorLogs
  7. mkdir C:\Data\FEC\Reports\Hold
  8. mkdir C:\Data\FEC\Reports\Output
  9. mkdir C:\Data\FEC\Reports\Processed
  10. mkdir C:\Data\FEC\Reports\Review
  11. mkdir C:\Data\FEC\Reports\Import
  12. mkdir C:\Data\FEC\Archives\Processed
  13. mkdir C:\Data\FEC\Archives\Import
  14. :: run update_master_files.py, which downloads all committee lists along with
  15. :: tons of other info.
  16. python update_master_files.py
  17. :: run this for downloading daily filings
  18. python download_reports.py
  19. :: run this for parsing and mapping the filing data into the database
  20. python parse_reports.py
  21. :: make sure to run the db sql script first
  22. :: https://drive.google.com/file/d/0B5hTtesq_tWdYUVRSzNCcHlJYjA/view?usp=sharing
  23. :: and that the Import directory has *.fec files, not the downloaded *.zip files
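Regarding that last note, a quick Python sketch to verify the Import directory really contains extracted *.fec files rather than the downloaded *.zip archives:

```python
import os

def split_import_dir(path):
    """Return (.fec files, .zip files) found in path, so you can confirm
    the Import directory holds extracted filings, not archives."""
    names = sorted(os.listdir(path))
    fec = [n for n in names if n.lower().endswith(".fec")]
    zips = [n for n in names if n.lower().endswith(".zip")]
    return fec, zips
```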
If any of the commands above are not installed, please see:
http://mohiplanet.blogspot.com/2015/10/convert-windows-command-prompt-to-linux.html

References:
https://s3.amazonaws.com/NICAR2015/FEC/MiningFECData.pdf

Getting started with web crawling with Ruby 2 on CentOS

The Ruby version that ships with CentOS 6 can cause lots of setup issues when installing crawler packages (gems). The following series of scripts installs a fresh copy of Ruby 2.2.3. The hash marks ('#') and shell heredoc blocks (:<<'END' and END) comment out the explanatory text, so you can select and copy entire blocks, paste them into your terminal, and get the job done :)

# Uninstall previous gems :
  1. gem update --system
  2. gem --version
  3. # 2.1.8
  4. gem uninstall --all
# Remove previous ruby installation :
  1. rm -f /usr/bin/ruby
  2. rm -f /usr/local/bin/ruby
  3. yum remove ruby -y
  4. yum remove rubygems
  5. #update yum
  6. yum update -y
# Install Ruby 2 on CentOS 6 :
  1. cd /opt/
  2. #download
  3. wget --no-check-certificate https://ftp.ruby-lang.org/pub/ruby/ruby-2.2.3.tar.gz
  4. #extract
  5. tar xvzf ruby-2.2.3.tar.gz
  6. #remove backup
  7. rm -f ruby-2.2.3.tar.gz
  8. cd ruby-2.2.3
  9. #build
  10. ./configure
  11. make
  12. make install
  13. #create symlinks
  14. ln -s /opt/ruby-2.2.3/ruby /usr/bin/ruby
  15. ln -s /opt/ruby-2.2.3/ruby /usr/local/bin/ruby
  16. #check ruby version
  17. ruby --version
  18. #should produce something like:
  19. #ruby 2.2.3p173 (2015-08-18 revision 51636) [i686-linux]

#Install updated rubygems:
  1. cd /opt/
  2. #download rubygems 1.8
  3. wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.24.tgz
  4. #extract
  5. tar xvzf rubygems-1.8.24.tgz
  6. #remove backup
  7. rm -f rubygems-1.8.24.tgz
  8. cd rubygems-1.8.24
  9. ruby setup.rb
  10. #check gem version
  11. gem --version
  12. #should produce something like
  13. #1.8.24


  1. #check Ruby REPL version
  2. irb --version
  3. #should produce something like
  4. #irb 0.9.5(05/04/13)

  1. #install a sample crawler package
  2. gem install fech
  3. #see installed crawler package version
  4. gem list fech
  5. #should produce something like this:
  6. #
  7. #*** LOCAL GEMS ***
  8. #
  9. #fech(1.8)
  10. #
# Run the ruby REPL:

  1. irb
  2. #Checkout Helloworld!
  3. puts 'Helloworld'
  4. #run the following lines in the REPL and check out the data crawled
  5. # by the installed FEC crawler package at:
  6. #/tmp/723604.fec
  7. filing = Fech::Filing.new(723604)
  8. filing.download

# See properties and methods of a ruby object:
  1. filing.inspect
  2. #will print all properties of this object
  3. : <<'END'
  4. <Fech::Filing:0x29ba8b4 @filing_id=1029398, @download_dir=\"/tmp\", @translator=nil, @quote_char=\"\\\"\", @csv_parser=Fech::Csv, @resaved=false, @customized=false, @encoding=\"iso-8859-1:utf-8\">"
  5. END
  6. filing.methods.sort
  7. #will print properties + all methods as well
  8. : <<'END'
  9. [:!, :!=, :!~, :<=>, :==, :===, :=~, :__id__, :__send__, :amendment?, :amends, :class, :clone, :custom_file_path, :define_singleton_method, :delimiter, :display, :download, :download_dir, :download_dir=, :dup, :each_row, :each_row_with_index, :enum_for, :eql?, :equal?, :extend, :file_contents, :file_name, :file_path, :filing_id, :filing_id=, :filing_url, :filing_version, :fix_f99_contents, :form_type, :freeze, :frozen?, :hash, :hash_zip, :header, :inspect, :instance_eval, :instance_exec, :instance_of?, :instance_variable_defined?, :instance_variable_get, :instance_variable_set, :instance_variables, :is_a?, :itself, :kind_of?, :map, :map_for, :mappings, :method, :methods, :nil?, :object_id, :parse_filing_version, :parse_row?, :private_methods, :protected_methods, :public_method, :public_methods, :public_send, :readable?, :remove_instance_variable, :resave_f99_contents, :respond_to?, :rows_like, :send, :singleton_class, :singleton_method, :singleton_methods, :summary, :taint, :tainted?, :tap, :to_enum, :to_s, :translate, :translator, :trust, :untaint, :untrust, :untrusted?]
  10. END

Be sure wget is installed before running the scripts above.

A simple crawl script with ruby:


The following sample Ruby script downloads all F3P filings from the FEC website:
  1. require 'fech'
  2. require 'fileutils'
  3. require 'logger'
  4. # 100MB logger
  5. logger = Logger.new('fec-f3p-filings-downloader.log', 10, 102400000)
  6. # download filings from 2001 to 2015 Nov 13
  7. for i in 11850..1032472
  8.   filing = Fech::Filing.new(i)
  9.   logger.info("Downloading... #{i}.fec")
  10.   filing.download
  11.   # check the form type; keep F3P filings, delete everything else
  12.   type = filing.form_type
  13.   if type.include? "F3P"
  14.     logger.info("filing is F3P type")
  15.     logger.info("moving into filings directory...")
  16.     FileUtils.mv("/tmp/#{i}.fec", "/usr/local/fec-f3p-filings/#{i}.fec")
  17.   else
  18.     logger.info("Form type is #{type}")
  19.     logger.info("Deleting... /tmp/#{i}.fec")
  20.     FileUtils.rm("/tmp/#{i}.fec")
  21.   end
  22. end