Code for composition models used in "No word is an island"
If you want to use this code or dataset for research purposes, please cite the following sources:
- Daniël de Kok, Sebastian Pütz. 2019. Stylebook for the Tübingen Treebank of Dependency-Parsed German (TüBa-D/DP).
- Corina Dima, Daniël de Kok, Neele Witte, Erhard Hinrichs. 2019. No word is an island — a transformation weighting model for semantic composition. Transactions of the Association for Computational Linguistics.

The dataset is distributed under the Creative Commons Attribution-NonCommercial (CC BY-NC) license.

The 119,434 German adjective-noun phrases in this dataset (splits: 83,603 train, 23,887 test, 11,944 dev instances) were extracted automatically from the TüBa-D/DP treebank. The treebank is composed of three parts: 1) articles from the German newspaper taz; 2) the German Wikipedia dump from January 20, 2018; 3) the German proceedings from the EuroParl corpus (Koehn, 2005; Tiedemann, 2012). In total, the treebank contains 64.9M sentences and 1.3B tokens.

Each line of the train/test/dev files consists of three space-separated fields: the adjective, the noun, and the adjective-noun phrase. Within the phrase field, the adjective and the noun are joined by the string _adj_n_ (e.g. kritisch Film kritisch_adj_n_Film). The phrases were extracted using the part-of-speech tags provided by the treebank. A sketch of how such a file could be read is given below. For results of different composition models on this dataset, see Dima et al. (2019), No word is an island — a transformation weighting model for semantic composition.

The embeddings for all words and phrases in this dataset are stored in the word2vec format in twe-adj-n.bin. This format can be loaded by several packages, e.g. the gensim package of Řehůřek and Sojka (2010); see the loading sketch below. The embeddings for the adjectives, nouns and phrases were trained jointly on the lemmatized version of the TüBa-D/DP treebank using the word2vec package (Mikolov et al., 2013). They were trained with the skipgram model with negative sampling, a symmetric context window of 10, 25 negative samples per positive training instance, and a subsampling threshold of 0.0001. The resulting embeddings have 200 dimensions, and the vocabulary contains 476,137 words in total; the minimum frequency cut-off was 50 for all words. A sketch of these hyperparameters in gensim follows the loading example below.
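For illustration, a minimal Python sketch of reading one of the phrase files; the file name "train" is an assumption and should be replaced with the actual split file name:

    def read_phrases(path):
        """Yield (adjective, noun, phrase) triples from a train/test/dev file."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Each line has three space-separated fields, e.g.
                # "kritisch Film kritisch_adj_n_Film".
                adjective, noun, phrase = line.rstrip("\n").split(" ")
                yield adjective, noun, phrase

    # "train" is an assumed file name; use the actual split file.
    for adjective, noun, phrase in read_phrases("train"):
        print(adjective, noun, phrase)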
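A minimal sketch of loading the embedding file with gensim (4.x API); the lookup keys are just the example words and phrase from above, assumed to be in the vocabulary:

    from gensim.models import KeyedVectors

    # Load the jointly trained word and phrase embeddings (word2vec binary format).
    vectors = KeyedVectors.load_word2vec_format("twe-adj-n.bin", binary=True)

    print(vectors.vector_size)        # 200-dimensional embeddings
    print(len(vectors.key_to_index))  # 476,137 vocabulary entries

    # Words and phrases share one vector space, so the adjective, the noun
    # and the composed phrase can all be looked up directly.
    adj_vec = vectors["kritisch"]
    noun_vec = vectors["Film"]
    phrase_vec = vectors["kritisch_adj_n_Film"]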
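The distributed embeddings were trained with the original word2vec package, not with gensim; purely for reference, the same hyperparameters expressed in the gensim 4.x API would look roughly like the sketch below. The corpus file name and the output file name are assumed placeholders, and the phrase tokens (e.g. kritisch_adj_n_Film) would have to be inserted into the lemmatized corpus beforehand for joint training.

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # "tueba-ddp-lemmas.txt" is an assumed placeholder: one lemmatized,
    # whitespace-tokenized sentence of the treebank per line, with
    # adjective-noun phrases already merged into single _adj_n_ tokens.
    sentences = LineSentence("tueba-ddp-lemmas.txt")

    model = Word2Vec(
        sentences,
        sg=1,             # skipgram with negative sampling
        negative=25,      # 25 negative samples per positive instance
        window=10,        # symmetric context window of 10
        sample=1e-4,      # subsampling threshold of 0.0001
        vector_size=200,  # 200-dimensional embeddings
        min_count=50,     # minimum frequency cut-off of 50
    )

    # Assumed output name, to avoid suggesting this reproduces twe-adj-n.bin exactly.
    model.wv.save_word2vec_format("embeddings.bin", binary=True)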
Date: 2019-05-01
Size: 119,434 phrases