Restoring tone-marks in standard yorùbá electronic text: improved model

Szczegóły
Abstrakt

Tytuł:: Restoring tone-marks in standard yorùbá electronic text: improved model
Autorzy:: Asahiah, F. O.
Odejobi, O. A.
Adagunodo, E. R.
Data publikacji:: 2017
Słowa kluczowe:: diacritic restoration
syllables
characters
word-level accuracy
Język:: angielski
Dostawca treści:: BazTech
: Artykuł

Diacritic Restoration is a necessity in the processing of languages with Latinbased scripts that utilizes letters outside the basic Latin alphabet used by English language. Yorùbá is one such languages, marking underdot (dot-below)on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly addressed using character as linguistic units for restoration. However, the existing characterbased approaches and word-based approach has not been able to sufficiently address restoration of tone marks in Yorùbá. We address in this study tone marks restoration as a subset of diacritic restoration. We proposed using the syllable (derived from word) as the linguistic token for tone marks restoration. In our experimental setup, we used Yoruba text collected from various sources as data with total word count of 250,336 words. These words, on syllabification, yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions ranging from 99% used for training and 1% used for testing to 70% used for training and 30% used for testing. The aim of evaluation different proportions was to determine how the ratio of training-to-test data affect the variations that may occur in the result. We applied Memory-based learning to train the models. We also set up a similar experiment using character token to be able to compare the performance. The result showed that using syllable was able to increase accuracy at word level to 96.23% and an average of almost 15% over that gotten from using character. We also found out that using 75% of data for training and the remaining 25% for testing gives the results with the least variation in a ten-fold cross validation test. Hybridizing this method that uses syllabless as processing linguistic units with other methods like lexicon lookup might likely lead to improvement over the current result.

Opracowanie ze środków MNiSW w ramach umowy 812/P-DUN/2016 na działalność upowszechniającą naukę (zadania 2017).

Informacja

Restoring tone-marks in standard yorùbá electronic text: improved model