Lexical Resource

German Adverb-Adjective Phrase Dataset for Compositionality Tests

If you want to use this dataset for research purposes, please cite the following sources:
- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen treebank of dependency-parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC-BY-NC) license.

The dataset contains 23,488 German adverb-adjective phrases (split into 16,441 train, 4,701 test and 2,346 dev instances), extracted from the TüBa-D/DP treebank. The treebank consists of articles from the newspaper taz, the German Wikipedia dump from January 20, 2018, and the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012); it comprises 64.9M sentences and 1.3B tokens.

The dataset was constructed with the help of the dependency annotations of the treebank. To collect the adverb-adjective phrases, head-dependent pairs were extracted that fulfilled both of the following requirements:
- the head is an attributive or predicative adjective and governs the dependent via the adverb relation
- the dependent immediately precedes the head

The first element of an extracted word pair can be either a true adverb or an adjective functioning as an adverb.

The train/test/dev files have the following format, with the individual parts separated by spaces: adverb adjective phrase, where the adverb and the adjective in the phrase are joined by the string _adv_adj_ (e.g. immer leer immer_adv_adj_leer). A parsing sketch is given below.

For results of different composition models on this dataset, see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The word representations were trained on the lemmatized TüBa-D/DP treebank with the word2vec package, using the skip-gram model with negative sampling (Mikolov et al., 2013). The embedding size is 200, the context is a symmetric window of 10 words, 25 negative samples were used, and the subsampling threshold was 0.0001. Representations were only trained for words and phrases with a minimum frequency of 30 occurrences; the final vocabulary contains 615,908 words.

The resulting embeddings are stored in the binary word2vec format in twe-adv-adj.bin, which can be loaded by several packages, e.g. the gensim package of Řehůřek and Sojka (2010); see the loading sketch below.
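To make the file format concrete, the following Python sketch reads a dataset file and splits each line into its three fields. The file name train.txt is an assumption, not necessarily the actual name of the file in the distribution.

    # Minimal parsing sketch; "train.txt" is a placeholder file name.
    # Each line holds three space-separated fields: adverb, adjective, phrase.
    with open("train.txt", encoding="utf-8") as f:
        for line in f:
            adverb, adjective, phrase = line.split()
            # The phrase column joins the two components with the marker "_adv_adj_".
            first, second = phrase.split("_adv_adj_")
            assert (first, second) == (adverb, adjective)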
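The stated training settings map onto the parameters of gensim's Word2Vec class. The released embeddings were trained with the word2vec package itself, so the following is only a sketch under assumed parameter correspondences; corpus.txt (lemmatized text, one sentence per line) is a placeholder.

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Sketch only: maps the stated word2vec settings onto gensim (v4 names).
    model = Word2Vec(
        LineSentence("corpus.txt"),  # placeholder: one lemmatized sentence per line
        sg=1,              # skip-gram model
        negative=25,       # 25 negative samples
        vector_size=200,   # embedding size 200
        window=10,         # symmetric context window of 10 words
        sample=0.0001,     # subsampling threshold
        min_count=30,      # minimum frequency of 30 occurrences
    )
    model.wv.save_word2vec_format("embeddings.bin", binary=True)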
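The binary file twe-adv-adj.bin can be loaded with gensim's KeyedVectors, as sketched below. The lookups assume that the component lemmas and the example phrase immer_adv_adj_leer from the format description are in the vocabulary.

    from gensim.models import KeyedVectors

    # Load the released embeddings from the binary word2vec format.
    wv = KeyedVectors.load_word2vec_format("twe-adv-adj.bin", binary=True)

    # Look up the vectors of the two component words and of the phrase.
    adverb_vec = wv["immer"]
    adjective_vec = wv["leer"]
    phrase_vec = wv["immer_adv_adj_leer"]

    # A composition model predicts phrase_vec from the component vectors;
    # cosine similarity gives a quick check of how close two vectors are.
    print(wv.similarity("immer", "immer_adv_adj_leer"))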

2019-05-01

1

139bb894-81e4-4c5c-a380-f2d044849b8a

8cefa5dd-f5fb-4527-8acb-88cc6824eb48

23,488 phrases
