English Datasets

The 10-K Corpus (Extended Version) and Financial Measures of Corporations

#	Name	Description	URL	Preview	Citation
1	Original 10-K reports	Raw, unprocessed full 10-K reports. Contains `1996.full.tgz` to `2013.full.tgz`; each archive holds all reports of that year, named in the format `key-date.full`, where `key` is the CUSIP code of the company (10GB).	Link	View	TMIS 2016
2	MD&A sections	Raw MD&A section from original 10-K reports. Contains `1996.mda.tgz` to `2013.mda.tgz`, where you can find all reports in that year named in the format of `key-date.mda` (760MB).	Link	View	See above
3	Tokenized MD&A sections	Tokenized MD&A section from original 10-K reports. Contains `1996.mda.tgz` to `2013.mda.tgz`, where you can find all reports in that year named in the format of `key-date.mda` (533MB).	Link	View	See above
4	Logarithm post-event return volatility	Contains `1996.logfama.txt` to `2013.logfama.txt`, where you can find the mapping of key (CUSIP code) and its corresponding post-event volatility (Fama-French 3-factor model) in the following year. Logarithms are calculated using base \(e\).	Link	View	See above
5	Logarithm volatility	Contains `1996.logvol.[+-]12.txt` to `2013.logvol.[+-]12.txt`, where you can find the mapping of key and its corresponding stock price volatility (standard deviation) in the following (+12) and preceding (-12) year. Logarithms are calculated using base \(e\).	Link	View	See above
6	Abnormal trading volume	Contains `1996.abnormal.txt` to `2013.abnormal.txt`, where you can find the mapping of key and its corresponding abnormal trading volume in that year (see [Loughran and McDonald 2011] for the detailed definition).	Link	View	See above
7	Excess return	Contains `1996.excess.txt` to `2013.excess.txt`, where you can find the mapping of key and its corresponding excess return in that year (see [Loughran and McDonald 2011] for the detailed definition).	Link	View	See above
8	Meta	Information about a report, including the issue date, URL, SEC ID, and the company name.	Link	View	See above
9	README	Brief guide to the above resources.	Link	View	See above
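
The naming scheme and the base-\(e\) logarithms described in the table above can be handled with a few lines of Python. The following is a minimal sketch, not the authors' tooling. It assumes that the part of a file name before the first hyphen is the key, and that each line of a mapping file such as `1996.logvol.+12.txt` holds one whitespace-separated `key value` pair. Neither detail is stated above, so check both against the README before relying on them. The names `parse_report_name` and `load_log_mapping` are hypothetical.

```python
import math
from typing import Dict, Iterable, NamedTuple, Tuple


class ReportName(NamedTuple):
    key: str   # CUSIP code of the company
    date: str  # date part, kept as a raw string (its format is not specified above)
    kind: str  # file suffix, e.g. "full" or "mda"


def parse_report_name(filename: str) -> ReportName:
    """Split a base name of the form `key-date.suffix` into its parts.

    ASSUMPTION: the key contains no hyphen, so the first hyphen
    separates the key from the date.
    """
    stem, _, kind = filename.rpartition(".")
    key, _, date = stem.partition("-")
    if not (key and date and kind):
        raise ValueError(f"not in key-date.suffix form: {filename!r}")
    return ReportName(key, date, kind)


def load_log_mapping(lines: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Map each key to (log value, value), undoing the natural logarithm.

    ASSUMPTION: one whitespace-separated `key value` pair per line.
    Lines that do not have exactly two fields are skipped.
    """
    mapping: Dict[str, Tuple[float, float]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        key, log_value = parts[0], float(parts[1])
        # The files store ln(x), so exp() recovers x.
        mapping[key] = (log_value, math.exp(log_value))
    return mapping
```

Passing an open file object to `load_log_mapping` works because file objects iterate line by line. Joining the resulting dictionary with `parse_report_name(...).key` pairs each report with its financial measure.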

Pre-Trained Word Embedding Vectors

#	Name	Description	URL	Preview	Citation
1	Without POS tag	Pre-trained vectors via word2vec (with the CBOW model) trained on the above 10-K Corpus (40,708 reports from 18 years). Embedding dimension is 200.	Vector Binary	View	TMIS 2016
2	With POS tag	Pre-trained vectors via word2vec (with the CBOW model) trained on the above 10-K Corpus (40,708 reports from 18 years). Embedding dimension is 200.	Vector Binary	View	See above
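
As an illustration of how the vectors might be read, the sketch below parses the standard word2vec text format using only the Python standard library. It assumes the "Vector" download follows that format: a header line with the vocabulary size and the dimension, then one line per word with the word followed by its components. This is not confirmed above. The function names `read_word2vec_text` and `cosine` are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def read_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word2vec text format: a `vocab_size dim` header, then `word v1 ... vdim`."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header[0] is the vocabulary size, unused here
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} components, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```

For the files listed here, `dim` should come out as 200. If gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=True)` is a more convenient way to read the binary variant, again assuming the file is in the standard word2vec binary format.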

Multi-word Expressions (MWE)

#	Name	Description	URL	Preview	Citation
1	Label platform	An online platform where users can label MWEs in a sentence.	Link	See link	ICASSP 2019
2	MWE attributes	Marks which category of the Loughran and McDonald Sentiment Word Lists each MWE belongs to; a category is indicated only by its first letter (4,722 MWEs in total).	Link	See link	See above
3	Label source	Original sentences in which MWEs are labeled.	Link	See link	See above
4	MWE and POS labels	The first and second fields are the sentence number and the raw sentence, respectively. The last field is a JSON object that stores the positions of strong and weak MWEs and the POS (part-of-speech) tag of each word.	Link	See link	See above
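
The three-field layout of the MWE and POS label records can be read as sketched below. The field separator is not stated above, so this sketch assumes a tab. The JSON keys are not documented here either, so the parsed object is returned as-is without interpreting it. `LabeledSentence` and `parse_label_line` are hypothetical names.

```python
import json
from typing import Any, NamedTuple


class LabeledSentence(NamedTuple):
    number: str    # first field: the sentence number
    sentence: str  # second field: the raw sentence
    labels: Any    # last field: parsed JSON with MWE positions and POS tags


def parse_label_line(line: str, sep: str = "\t") -> LabeledSentence:
    """Split one record into number, sentence, and parsed JSON.

    ASSUMPTION: fields are separated by `sep` (a tab by default).
    The first field is cut from the left and the JSON from the right,
    so a separator inside the sentence itself does not break parsing.
    """
    number, rest = line.rstrip("\n").split(sep, 1)
    sentence, payload = rest.rsplit(sep, 1)
    return LabeledSentence(number, sentence, json.loads(payload))
```

Inspecting `parse_label_line(first_line).labels` on a real record is the quickest way to learn the actual JSON keys before writing code that depends on them.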

Sentiment Sentences

#	Name	Description	URL	Preview	Citation
1	Binary label result	Contains sentences extracted from the 10-K Corpus with binary risk labels (2,432 sentences in total).	Link	See link	ICASSP 2020