Lexical resource

English Adverb-Adjective Phrase Dataset for Compositionality Tests


If you use this dataset for research purposes, please cite the following sources:
- Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), Lancaster. UCREL, IDS.
- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution NonCommercial (CC-BY-NC) license.

The 23,148 English adverb-adjective phrases (splits: 16,222 train, 4,618 test, 2,308 dev) were automatically extracted from the ENCOW16AX treebank (Schäfer and Bildhauer, 2012; Schäfer, 2015), which contains crawled web content from different sources. The dataset was constructed using the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled the following requirements:
- the head is an attributive or predicative adjective and governs a dependent via the adverb relation
- the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective that functions as an adverb.

The train/test/dev files have the following format, with the fields separated by tabs: adverb adjective adv-adj_phrase, where the adverb and adjective in the phrase are separated by the string _adv_adj_ (e.g. extremely simple extremely_adv_adj_simple). For results of different composition models on this dataset, see Dima et al. (2019).

The word embeddings were trained on a subcorpus of the ENCOW16AX treebank that contains only sentences with a document quality of a or b. The final training corpus for the word embeddings contains 89.0M sentences and 2.2B tokens. To ensure that trained word embeddings are available for enough adverb-adjective phrases, the embeddings were trained on word forms instead of lemmas. The adverb-adjective phrases were merged into a single unit for embedding training; the embeddings for the individual adverbs and adjectives were trained on the remaining occurrences of the constituents. The embeddings for the adverbs, adjectives and phrases were trained jointly, using the word2vec package (Mikolov et al. 2013).

The word embeddings were trained using the skip-gram model with negative sampling, a symmetric window of 10 as context size, 25 negative samples per positive training instance and a subsampling threshold of 0.0001. The resulting embeddings have a dimension of 200 and the vocabulary contains 278,345 words. The minimum frequency cut-off was set to 50 for all words and phrases. The word representations are stored in the binary word2vec format in encow-adv-adj.bin. This format can be loaded by several packages (e.g. the gensim package; Řehůřek and Sojka, 2010).
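As an illustration of the file format and the embedding file described above, the following is a minimal Python sketch, not part of the distributed resource. The names parse_entries, load_embeddings, PhraseEntry and the strict flag are hypothetical helpers introduced here. The strict check assumes that the phrase column is exactly the adverb and adjective joined by _adv_adj_, as in the example line; the description does not state this for every entry. Loading the vectors assumes that gensim is installed.

```python
from typing import Iterable, Iterator, NamedTuple

SEPARATOR = "_adv_adj_"  # string joining adverb and adjective in the phrase column


class PhraseEntry(NamedTuple):
    adverb: str
    adjective: str
    phrase: str


def parse_entries(lines: Iterable[str], strict: bool = True) -> Iterator[PhraseEntry]:
    """Yield one entry per non-empty line of a train/test/dev file.

    Each line is expected to hold three tab-separated fields:
    adverb, adjective, adv-adj phrase.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        adverb, adjective, phrase = fields
        # Assumption: the phrase is adverb + "_adv_adj_" + adjective.
        if strict and phrase != adverb + SEPARATOR + adjective:
            raise ValueError(f"line {number}: unexpected phrase form {phrase!r}")
        yield PhraseEntry(adverb, adjective, phrase)


def load_embeddings(path: str = "encow-adv-adj.bin"):
    """Load the binary word2vec file with gensim (imported lazily)."""
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


# In-memory sample using the example line from the description.
sample = ["extremely\tsimple\textremely_adv_adj_simple\n"]
entries = list(parse_entries(sample))
```

With a real split file, the lines would come from an opened file object instead of the sample list. Words and phrases that fell below the frequency cut-off of 50 have no vector, so a lookup in the loaded embeddings should check for membership first.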

2019-05-01

1

53b50c8e-af72-4633-843a-ca4e58bf598d

8cefa5dd-f5fb-4527-8acb-88cc6824eb48

23148 phrases

No linked resources are available!