Corpus / text collection

The Royal Society Corpus Version 4.0

The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London, from its beginning in 1665 to 1869. It includes all publications of the journal that are written mainly in English and contain running text. The Philosophical Transactions was the first periodical of scientific writing in England. RSC Version 4 consists of approximately 32 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decade and 50-year period is also available, allowing diachronic analysis at different levels of granularity. We also annotate the two most important topics of each text according to a topic model consisting of 24 topics. The full topic model is also available for download. The corpus is tokenized and linguistically annotated for lemma and part of speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its average surprisal, i.e. the average amount of information it encodes in bits, with words as units and trigrams as contexts (cf. Genzel and Charniak 2002). Release 4.0 of the corpus includes improved OCR correction and the removal of non-text tokens such as formulæ and tables.
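The average surprisal annotation can be illustrated with a minimal sketch. It is not the procedure used to build the RSC: the function name `average_surprisal` and the `context_size` parameter are hypothetical, the probabilities are plain unsmoothed relative frequencies, and "trigram" is assumed here to mean that each word is conditioned on its two preceding words (set `context_size=3` if three preceding words are intended).

```python
import math
from collections import Counter, defaultdict


def average_surprisal(tokens, context_size=2):
    """Return {word: mean surprisal in bits} over all its occurrences.

    Illustrative assumption: P(word | context) is estimated by unsmoothed
    relative frequency, where the context is the preceding `context_size` words.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        context_counts[context] += 1
        ngram_counts[(context, tokens[i])] += 1

    totals = defaultdict(float)   # summed surprisal per word
    occurrences = Counter()       # number of scored occurrences per word
    for (context, word), count in ngram_counts.items():
        p = count / context_counts[context]
        totals[word] += count * -math.log2(p)  # surprisal = -log2 P(word | context)
        occurrences[word] += count
    return {word: totals[word] / occurrences[word] for word in totals}


if __name__ == "__main__":
    toy = "a b c a b d".split()
    print(average_surprisal(toy))
```

In the toy sequence, the context `a b` is followed once by `c` and once by `d`, so each has probability 0.5 and a surprisal of 1 bit, while `a` and `b` are fully predictable from their contexts and receive 0 bits. A realistic estimate would need smoothing for unseen contexts, which this sketch omits.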

English

9779 texts

public

771a56b2-5648-43d1-941d-97c3eb805b9f

c1c9b626-0a08-4962-9a02-04fd60f7cd5f

available

CLARIND-UdS: Repository for Language Resources at Saarland University

corpus

Linguistics

written

No linked resources are available.