LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora
Restricted Availability
Date
2020-08-20
Publication Type
dataset
Description
This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:
- a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
- 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)
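A minimal sketch of pairing the targets with their annotated scores. This description does not specify the file layout, so the one-lemma-per-line target list and the tab-separated `lemma<TAB>score` lines below are assumptions, as are the function names and the sample values.

```python
def parse_targets(text):
    """Return the non-empty lines of a target list (assumed: one lemma per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_scores(text, cast):
    """Parse assumed 'lemma<TAB>score' lines into a dict, converting scores with `cast`."""
    scores = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        lemma, value = line.split("\t")
        scores[lemma] = cast(value)
    return scores


# Made-up sample content, not taken from the dataset.
targets = parse_targets("lemma_a\nlemma_b\n")
binary = parse_scores("lemma_a\t1\nlemma_b\t0\n", int)        # subtask 1: binary change
graded = parse_scores("lemma_a\t0.75\nlemma_b\t0.1\n", float)  # subtask 2: graded change

# Sanity check: every target should have a score for both subtasks.
missing = [t for t in targets if t not in binary or t not in graded]
```

With the real files, the same functions could be applied to the text read from `targets.txt` and from the files under `truth/`, provided their layout matches the assumption above.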
The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemmas are followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym.
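The homonym notation can be handled with a small helper. The sketch below is illustrative only: the function names are hypothetical, and it assumes that '#' occurs in a lemma solely as the homonym separator described above.

```python
def split_homonym(lemma):
    """Split 'dico#2' into ('dico', 2).

    A lemma without '#' is reported with number 1, which covers both
    first homonyms and lemmas that have no homonyms at all.
    """
    base, sep, number = lemma.partition("#")
    return base, (int(number) if sep else 1)


def join_homonym(base, number):
    """Inverse of split_homonym: the suffix is written only when number > 1."""
    return base if number == 1 else f"{base}#{number}"


first = split_homonym("dico")     # ('dico', 1)
second = split_homonym("dico#2")  # ('dico', 2)
```

Keeping the number as a separate field makes it possible to group all homonyms of a form, while `join_homonym` restores the exact string used in the corpus and target files.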
__Corpus 1__
based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
size: ~1.7 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8
__Corpus 2__
based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
language: Latin
time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
size: ~9.4 million tokens
format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
encoding: UTF-8
Find more information on the data and SemEval-2020 Task 1 in the papers referenced below.
Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version.
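Because the token version keeps the sentence order of the lemma version, the two can be read side by side. The sketch below assumes one sentence per line with whitespace-separated items, which this description does not state explicitly; the function name and the sample sentences are made up for illustration.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking the shorter of the two inputs


def aligned_sentences(lemma_lines, token_lines):
    """Yield (lemmas, tokens) for sentences at the same position in both versions."""
    for lemma_line, token_line in zip_longest(lemma_lines, token_lines, fillvalue=_MISSING):
        if lemma_line is _MISSING or token_line is _MISSING:
            raise ValueError("lemma and token versions differ in number of sentences")
        yield lemma_line.split(), token_line.split()


# Made-up sample sentences, not taken from the corpora.
lemma_version = ["puella rosa amo", "dico#2 verbum"]
token_version = ["puella rosam amat", "dicat verba"]

pairs = list(aligned_sentences(lemma_version, token_version))
```

Open file objects can be passed in place of the lists, since the function only iterates over lines. The sketch pairs whole sentences and does not assume that the two versions have the same number of items per sentence.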
The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).
References
Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H. and Tahmasebi, N. (2020). SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.
McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.