English Datasets

The 10-K Corpus (Extended Version) and Financial Measures of Corporations
# Name Description URL Preview Citation
1 Original 10-K reports Raw, complete 10-K reports that have not been processed. Contains 1996.full.tgz to 2013.full.tgz; each archive holds all reports from that year, named in the format key-date.full, where key is the CUSIP code for the company (10GB). Link
2 MD&A sections Raw MD&A section from original 10-K reports. Contains 1996.mda.tgz to 2013.mda.tgz, where you can find all reports in that year named in the format of key-date.mda (760MB). Link See above
3 Tokenized MD&A sections Tokenized MD&A section from original 10-K reports. Contains 1996.mda.tgz to 2013.mda.tgz, where you can find all reports in that year named in the format of key-date.mda (533MB). Link See above
4 Logarithm post-event return volatility Contains 1996.logfama.txt to 2013.logfama.txt, where you can find the mapping of key (CUSIP code) and its corresponding post-event volatility (Fama-French 3-factor model) in the following year. Logarithms are calculated using base \(e\). Link See above
5 Logarithm volatility Contains 1996.logvol.[+-]12.txt to 2013.logvol.[+-]12.txt, where you can find the mapping of key and its corresponding stock price volatility (standard deviation) in the following (+12) and preceding (-12) year. Logarithms are calculated using base \(e\). Link See above
6 Abnormal trading volume Contains 1996.abnormal.txt to 2013.abnormal.txt, where you can find the mapping of key and its corresponding abnormal trading volume in that year (see [Loughran and McDonald 2011] for the detailed definition). Link See above
7 Excess return Contains 1996.excess.txt to 2013.excess.txt, where you can find the mapping of key and its corresponding excess return in that year (see [Loughran and McDonald 2011] for the detailed definition). Link See above
8 Meta Information about a report, including the issue date, URL, SEC ID, and the company name. Link See above
9 README Brief guide to the above resources. Link See above
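
As a rough illustration of how the files listed above might be consumed, the Python sketch below splits a report file name of the form key-date.full into its parts and reads a key-to-value mapping such as those in the logvol or logfama files, undoing the natural logarithm with exp. The function names, the sample values, and the assumption that each mapping line holds a key and a value separated by whitespace are illustrative guesses, not taken from the README; check the actual files before relying on them.

```python
import math
from pathlib import PurePosixPath


def parse_report_name(filename):
    """Split a name like 'key-date.full' into (key, date, kind).

    Assumes the key itself contains no hyphen (hypothetical layout).
    """
    name = PurePosixPath(filename).name
    stem, _, kind = name.rpartition(".")
    key, _, date = stem.partition("-")
    return key, date, kind


def read_log_mapping(lines):
    """Read 'key value' lines (assumed whitespace-separated) into a dict.

    The stored values are base-e logarithms, so math.exp recovers
    the underlying figure.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, log_value = parts
        mapping[key] = math.exp(float(log_value))
    return mapping
```

With a real file, read_log_mapping could be passed an open file handle, for example one opened on 1996.logvol.+12.txt, since iterating a handle yields lines.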

Pre-Trained Word Embedding Vectors
# Name Description URL Preview Citation
1 Without POS tag Pre-trained vectors via word2vec (with the CBOW model) trained on the above 10-K Corpus (40,708 reports from 18 years). Embedding dimension is 200. Vector, Binary
2 With POS tag Pre-trained vectors via word2vec (with the CBOW model) trained on the above 10-K Corpus (40,708 reports from 18 years). Embedding dimension is 200. Vector, Binary See above
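
The sketch below shows one way the "Vector" download could be read without extra dependencies, assuming it follows the standard word2vec text layout: a header line giving the vocabulary size and dimension, then one line per token with its space-separated values. That layout, the function names, and the tiny sample used for checking are assumptions for illustration; only the dimension of 200 comes from the description above.

```python
import io


def load_text_vectors(handle, expected_dim=200):
    """Parse the word2vec text format: a 'count dim' header, then one
    'token v1 v2 ... v_dim' line per vocabulary entry."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    if dim != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, found {dim}")
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header promised {count} vectors, read {len(vectors)}")
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)
```

If gensim is available, KeyedVectors.load_word2vec_format with binary=True is the usual route for the "Binary" download, again assuming the files use the standard word2vec formats.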

Multi-word Expressions (MWE)
# Name Description URL Preview Citation
1 Label platform An online platform where users can label MWEs in a sentence. Link See link
2 MWE attributes Marks which category of the Loughran and McDonald Sentiment Word Lists each MWE belongs to, using only the first letter of the category (4,722 MWEs in total). Link See link See above
3 Label source Original sentences in which MWEs are labeled. Link See link See above
4 MWE and POS labels The first and second fields are the numbering and the raw sentence, respectively. The last field is a JSON object that stores the positions of strong and weak MWEs and the POS (part-of-speech) tag of each word. Link See link See above
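
A record of the "MWE and POS labels" file, as described above, could be split along these lines. This is a minimal sketch: the tab separator is an assumption, the function name is hypothetical, and the JSON is parsed but its keys are not interpreted because the description does not name them.

```python
import json


def parse_label_line(line, sep="\t"):
    """Split one record into (number, sentence, labels).

    The first field is the numbering, the last field is JSON, and
    everything in between is treated as the raw sentence. The
    separator is an assumption, not documented in the source.
    """
    number, rest = line.rstrip("\n").split(sep, 1)
    sentence, raw_json = rest.rsplit(sep, 1)
    return number, sentence, json.loads(raw_json)
```

Splitting once from the left and once from the right keeps the sentence intact even if it happens to contain the separator itself.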

Sentiment Sentences
# Name Description URL Preview Citation
1 Binary label result Contains sentences extracted from the 10-K Corpus with binary risk labels (2,432 sentences in total). Link See link