Refined personal name data from the census book of Vodskaja pjatina
Restricted Availability
Date
2021-01-13
Persistent identifier of the Data Catalogue metadata
Creator/contributor
Editor
Journal title
Journal volume
Publisher
Publication Type
dataset
Peer Review Status
Repositories
Access rights
ISBN
ISSN
Description
The data contains approximately 36,000 personal names derived from medieval Russian documentation. More precisely, the names are collected from an edited version of the census book of Vodskaja pjatina, which was one of the five administrative areas of late 15th-century Novgorod.
The editions were compiled in parts, and the first two, which cover the northernmost region, are called Переписная окладная книга по Новугороду Вотьской пятины (1851, 1852) (POKV I‒II). The third part of the book series Новгородские писцовые книги (1868) (NPK III) covers the southern and western parts of the study area.
The process of obtaining the personal names from the source was as follows. First, editions of the census book were obtained as scanned PDF files. These were converted into editable copies using the OCR (optical character recognition) software Abbyy. The program read the original mid-19th-century Russian text adequately with its old Russian alphabet package.
After the initial corrections, a Python script was written to harvest the personal names. It exploited the systematic way in which most of the names are presented in the census book. The script looked for the abbreviations “дв.” and “д.” and extracted all following capitalized words up to the section end markers “.”, “;” or “:”. As output, a name-by-pogost matrix was produced, which held the raw frequency of each word in each pogost.
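The original script is not part of this record, so the following is only a minimal sketch of the extraction rule described above. The function names and the sample text are hypothetical, and the sketch assumes that every capitalized word between a marker and the next “.”, “;” or “:” is collected.

```python
import re
from collections import Counter, defaultdict

# Markers named in the description: "дв." and "д." as standalone abbreviations.
MARKER = re.compile(r"(?<!\w)(?:дв|д)\.\s*")
# Section end markers named in the description.
SECTION_END = re.compile(r"[.;:]")


def extract_names(text):
    """Collect capitalized words that follow a marker, up to the next end marker."""
    names = []
    for match in MARKER.finditer(text):
        rest = text[match.end():]
        end = SECTION_END.search(rest)
        segment = rest[:end.start()] if end else rest
        for word in re.findall(r"\w+", segment):
            if word[0].isupper():
                names.append(word)
    return names


def name_pogost_matrix(texts_by_pogost):
    """Map each extracted word to a Counter of raw frequencies per pogost."""
    matrix = defaultdict(Counter)
    for pogost, text in texts_by_pogost.items():
        for name in extract_names(text):
            matrix[name][pogost] += 1
    return matrix


# Hypothetical sample text, not quoted from the census book.
sample = "дв. Ивашко Федковъ, сынъ его Гридка; д. Сенка Олексеевъ."
matrix = name_pogost_matrix({"PogostA": sample})
```

Because the rule keys on capitalization alone, capitalized non-name nouns are harvested too, which is why a separate cleaning step is needed afterwards.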
The name data was then cleaned, mostly with the data wrangling program OpenRefine, in the following manner. First, all name forms shorter than four characters were removed, as there were no personal names consisting of three or fewer letters. Furthermore, nouns that were not names were removed. This meant discarding expressions that described a person’s special feature or profession, such as being a widow (“вдова”) or working as a deacon (“діакъ”). For some reason, the editors followed inconsistent conventions in capitalizing these non-name nouns.
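The cleaning itself was done in OpenRefine, so the snippet below is not the authors' procedure but a rough Python equivalent of the two filters described above. The function name is hypothetical, and the stop list contains only the two examples given in the text; the list actually used was presumably longer.

```python
# Non-name nouns mentioned in the description (an incomplete, illustrative list).
NON_NAME_NOUNS = {"вдова", "діакъ"}


def clean_name_forms(forms):
    """Drop forms shorter than four characters and known non-name nouns."""
    kept = []
    for form in forms:
        if len(form) < 4:
            continue  # no personal names have three or fewer letters
        if form.lower() in NON_NAME_NOUNS:
            continue  # descriptive noun, not a name
        kept.append(form)
    return kept


cleaned = clean_name_forms(["Ивашко", "Его", "Вдова", "Діакъ", "Гридка"])
```

Comparing lowercased forms covers the inconsistent capitalization of the non-name nouns noted above.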
In addition, some orthographical and morphological harmonization was done on the data. The letter ы was cut from the end of bynames, where it denotes plurality. The similarity of the so-called soft and hard signs, ь and ъ, caused some problems. As the hard sign ъ is not used in contemporary Russian and was not used in the original documents either (Неволин 1853: 4 (in Appendix 1)), it was removed. The soft sign ь was also removed because it was absent from the original documents and had been used inconsistently by the editors. The letter ѣ (yat) is rarely used in personal names; nevertheless, it was changed to е (as in contemporary Russian), since it was often confused with the soft and hard signs (ь and ъ). Furthermore, the letter ѳ (fita) was often erroneously recognized as о or е. As it is found only in NPK III and only at the beginning of certain names, all of which are also written with “Ф” (e.g. “Ѳедко” vs. “Федко”), it was replaced with Ф.
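The harmonization rules above can be sketched as a small character mapping. This is an illustration under stated assumptions, not the OpenRefine operations actually used: the function name and the `is_byname` flag are hypothetical, and the sketch assumes that ъ and ь are removed in every position.

```python
# Character-level rules taken from the description:
# ъ and ь are dropped, ѣ becomes е, and fita becomes Ф.
HARMONIZE = str.maketrans({
    "ъ": None,
    "ь": None,
    "ѣ": "е",
    "Ѳ": "Ф",
    "ѳ": "ф",
})


def harmonize(form, is_byname=False):
    """Apply the orthographic rules; strip a plural -ы from bynames only."""
    result = form.translate(HARMONIZE)
    if is_byname and result.endswith("ы"):
        result = result[:-1]
    return result


examples = [
    harmonize("Ѳедко"),
    harmonize("Олексѣевъ"),
    harmonize("Федковы", is_byname=True),
]
```

The plural ending is stripped only when the caller marks a form as a byname, since the description limits that rule to bynames.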
In the second phase, most of the erroneous orthographies were corrected. We do not describe here all the OCR errors that were found; instead, a short description of the most significant corrections follows. There were, for example, many letters whose similarity caused problems for the OCR program (e.g. и / й and б / в). In these cases, the correct orthography was sought in the census book editions, and OpenRefine was used to change the erroneous forms to the correct ones.
After the corrections were made, the number of name types (= name variants) was reduced from 4,942 to 2,748. The overall number of name tokens dropped as well, from 36,405 to 35,726. Of the name types, more than half (1,484) have only one occurrence.
The refined and harmonized data is published as pogost-by-name frequency tabulations (a pogost is roughly equivalent to an English parish). The file is in tab-delimited (.tsv) format.
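As a rough guide to working with the file, the sketch below reads a tab-delimited frequency table and counts name types, name tokens and names that occur only once. The exact layout of the published file is not documented here, so the sketch assumes one row per pogost, one column per name and a header row whose first cell labels the pogost column. The function name and the sample table are hypothetical.

```python
import csv
import io


def summarize(tsv_file):
    """Return (types, tokens, single-occurrence names) for a pogost-by-name table."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    names = header[1:]  # assumption: first column holds the pogost label
    totals = dict.fromkeys(names, 0)
    for row in reader:
        for name, cell in zip(names, row[1:]):
            totals[name] += int(cell or 0)
    types = sum(1 for n in totals.values() if n > 0)
    tokens = sum(totals.values())
    singles = sum(1 for n in totals.values() if n == 1)
    return types, tokens, singles


# Hypothetical miniature table standing in for the published file.
sample_tsv = (
    "pogost\tИвашко\tГридка\tСенка\n"
    "PogostA\t3\t1\t0\n"
    "PogostB\t2\t0\t1\n"
)
summary = summarize(io.StringIO(sample_tsv))
```

For the real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`. If the layout assumption holds, the result should be comparable to the type and token counts reported in the description.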
References:
Неволин, К. А. 1853, О пятинах и погостах новгородских в XVI веке, с приложением карты, Санкт-Петербург (Из Записок Императорского русского географического общества, Кн. VIII).
NPK III = Новгородские писцовые книги, Т. 3: Переписная оброчная книга Вотской пятины, 1500 года, 1868, Санкт-Петербург.
POKV I, II = Переписная окладная книга по Новугороду Вотьской пятины, 1851, 1852, Имп. Моск. о-во истории и древностей рос., Москва.