LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora

Hengchen, Simon

Date

2020-08-20

Creator/contributor

Hengchen, Simon

Publication Type

dataset

Repositories

Zenodo

Cite this resource

McGillivray, B., Schlechtweg, D., Dubossarsky, H., Tahmasebi, N., & Hengchen, S. (2020). LatinISE test data for SemEval 2020 task 1 with additional token versions of the corpora (Version 3) [Dataset]. Zenodo. https://doi.org/10.5281/ZENODO.3992738

Description

This data collection contains the Latin test data for SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection:

- a Latin text corpus pair (`corpus1/lemma`, `corpus2/lemma`)
- 40 lemmas which have been annotated for their lexical semantic change between the two corpora (`targets.txt`)
- the annotated binary change scores of the targets for subtask 1, and their annotated graded change scores for subtask 2 (`truth/`)

The corpus data have been automatically lemmatized and part-of-speech tagged, and have been partially corrected by hand. For homonyms, the lemma is followed by the '#' symbol and the number of the homonym according to the Lewis-Short dictionary of Latin when this number is greater than 1. For example, the lemma 'dico' corresponds to the first homonym in the Lewis-Short dictionary and 'dico#2' corresponds to the second homonym (cf. the Lewis-Short dictionary).

__Corpus 1__

- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the second century before Christ (BC) to the end of the first century BC
- size: ~1.7 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8

__Corpus 2__

- based on: LatinISE (McGillivray and Kilgarriff 2013), version on Sketch Engine
- language: Latin
- time covered: from the beginning of the first century after Christ (AD) to the end of the twenty-first century AD
- size: ~9.4 million tokens
- format: lemmatized, sentence length >= 2, no punctuation, sentences randomly shuffled
- encoding: UTF-8

Besides the official lemma version of the corpora for SemEval-2020 Task 1, we also provide the raw token version (`corpus1/token/`, `corpus2/token/`). It contains the raw sentences in the same order as in the lemma version.

Find more information on the data and on SemEval-2020 Task 1 in the papers referenced below.
The creation of the data was supported by the CRETA center and the CLARIN-D grant funded by the German Ministry for Education and Research (BMBF).

References

Schlechtweg, D., McGillivray, B., Hengchen, S., Dubossarsky, H. and Tahmasebi, N. SemEval 2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in SemEval@COLING2020.

McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics, Tübingen: Narr.
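The homonym convention given in the description (a lemma optionally followed by '#' and a Lewis-Short homonym number greater than 1, as in 'dico#2') can be handled with a small parser when reading the lemma corpora. The following is a minimal sketch, not part of the dataset: the function names `split_homonym` and `count_lemmas` are hypothetical, the sample sentence is invented for illustration, and the code assumes that each line holds one sentence with whitespace-separated lemmas, which the description does not state explicitly.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# A lemma optionally followed by '#' and a homonym number, e.g. 'dico#2'.
_HOMONYM = re.compile(r"^(?P<lemma>.+?)#(?P<num>\d+)$")


def split_homonym(token: str) -> Tuple[str, int]:
    """Split a corpus token into (lemma, homonym number).

    Tokens without a '#<number>' suffix are treated as homonym 1,
    following the convention stated in the dataset description.
    """
    match = _HOMONYM.match(token)
    if match:
        return match.group("lemma"), int(match.group("num"))
    return token, 1


def count_lemmas(lines: Iterable[str]) -> Counter:
    """Count (lemma, homonym number) pairs over an iterable of sentences.

    Assumption: one sentence per line, lemmas separated by whitespace.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(split_homonym(tok) for tok in line.split())
    return counts


# Invented example sentence, not taken from the corpus.
sample = ["dico dico#2 dico"]
counts = count_lemmas(sample)
```

To apply this to an actual corpus file, `count_lemmas` can be given an open file handle (opened with `encoding="utf-8"`, matching the encoding stated above) instead of the in-memory list.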

Link to original dataset

https://doi.org/10.5281/zenodo.3992738

Keywords

Latin, corpus

University of Helsinki

University of Helsinki Data catalogue
