In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis.

Language identification (LID) is the task of attributing a spoken language to an utterance of speech. Automatic visual LID (VLID) uses the external appearance and dynamics of the speech articulators for this task, in a process known as computer lip-reading. This has applications to LID where conventional speech recognition is ineffective, such as in noisy environments or where an audio signal is unavailable. This thesis introduces supervised and unsupervised methods for VLID. They are based upon standard audio LID techniques, which use language phonology for discrimination. We test our unsupervised method speaker-dependently, identifying between the languages spoken by individual multilingual speakers. Rate-of-speech and recording-session biases are investigated. We present ways of improving the speaker independence of our active appearance model (AAM) features, in tasks to identify between English and French, and later between Arabic and English. We investigate whether a lack of articulatory information in the AAM features limits phone recognition performance, and enquire how it could be improved were more information available. We show that VLID is possible in both speaker-dependent and speaker-independent modes, and that LID using audio features gives, as expected, superior performance. Rate-of-speech, which can indicate language fluency, is shown to assist speaker-dependent discrimination. Recognition of spoken phones from visual features is shown to be poor, and increased recognition accuracy is found to be key to improved VLID. Finally, we show that the articulatory information contained in our AAM features gives phone recognition performance equal to that of features containing information regarding the front-most speech articulators.
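The abstract describes the VLID methods as following standard audio LID techniques that use language phonology for discrimination. As a rough illustration of that general idea (a minimal sketch, not the thesis's actual pipeline), the code below scores a recognised phone sequence against per-language phone-bigram models and picks the best-matching language; the function names, smoothing scheme, and toy phone strings are all assumptions made for the example.

```python
# Minimal sketch of phonotactic language identification (phone recognition
# followed by language modelling), assuming a phone recogniser has already
# produced a phone string per utterance. Names here are illustrative only.
from collections import defaultdict
import math

def train_bigram_model(phone_sequences, smoothing=1e-3):
    """Estimate phone-bigram probabilities from training utterances of one language."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in phone_sequences:
        for prev, cur in zip(["<s>"] + seq, seq + ["</s>"]):
            counts[prev][cur] += 1.0
    model = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values())
        model[prev] = {p: (c + smoothing) / (total + smoothing * len(nxt))
                       for p, c in nxt.items()}
    return model

def score(model, seq, floor=1e-6):
    """Log-likelihood of a recognised phone sequence under one language's bigram model."""
    logp = 0.0
    for prev, cur in zip(["<s>"] + seq, seq + ["</s>"]):
        logp += math.log(model.get(prev, {}).get(cur, floor))
    return logp

def identify_language(models, seq):
    """Return the language whose phonotactic model best explains the sequence."""
    return max(models, key=lambda lang: score(models[lang], seq))

# Toy usage with made-up phone strings:
models = {
    "english": train_bigram_model([["dh", "ax", "k", "ae", "t"], ["s", "ih", "t", "s"]]),
    "french":  train_bigram_model([["l", "ax", "sh", "ah"], ["b", "on", "zh", "uh", "r"]]),
}
print(identify_language(models, ["dh", "ax"]))
```

In the setting the abstract describes, the phone sequences would come from a visual (lip-reading) phone recogniser rather than an audio one, which is why the reported visual phone recognition accuracy becomes the limiting factor for VLID performance.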