Evaluating naturalness of voice impersonations by subjective and objective measures.
Ng, Chen Yi.
Date of Issue: 2013
School of Electrical and Electronic Engineering
This project investigates why some impersonated voices are able to deceive people and whether such voices can be quantified. As crimes related to voice impersonation are on the rise, it is important to know how a person changes his/her voice to sound like another person and which factors a human uses to decide whether a voice is disguised. Both subjective and objective measures are used in this project.
The project has a database of voices from three speakers. Each speaker has 9 voices: 8 impersonated voices and 1 natural voice. Each voice has 9 sentences, giving a total of 243 files (3 × 9 × 9) for subjective testing. For the subjective measure, a trial test was conducted to find out how well a person can distinguish impersonated voices. A random generator was created to randomize the order of the voices from the database used in the trial test. A graphical user interface was built to play the voices one by one and to let the listener enter his/her decision on whether each voice is disguised. Each listener rated all voices from the 3 speakers. The test showed that about 86% of the listeners were able to correctly identify the natural voice of the speakers. All of the listeners were also able to identify 5 out of 8 impersonated voices from each speaker.
The objective measure examines the effects of changing the pitch and formants of a voice, and the range of pitch and formants over which a synthesized voice sounds natural to a human. Pitch refers to the fundamental frequency of the voice. Formants are the spectral peaks of the sound spectrum |p(f)|, and they characterize the vowels. The investigation found that changing the pitch of a voice changes the perceived gender of the source voice, and changing the formants changes the perceived age category of the voice.
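As an illustration of the subjective test setup described above, the following is a minimal sketch of how the 243 files could be enumerated and randomized. The function name, the 1-based indexing scheme, and the use of a seed are assumptions made for this example; the project's actual random generator is not described in this record.

```python
import random
from itertools import product

# Database layout as stated in the abstract.
SPEAKERS = 3
VOICES_PER_SPEAKER = 9    # 1 natural + 8 impersonated
SENTENCES_PER_VOICE = 9


def build_trial_list(seed=None):
    """Return every (speaker, voice, sentence) triple in random order.

    Hypothetical helper: the indexing scheme is assumed for illustration.
    """
    trials = list(product(range(1, SPEAKERS + 1),
                          range(1, VOICES_PER_SPEAKER + 1),
                          range(1, SENTENCES_PER_VOICE + 1)))
    rng = random.Random(seed)   # a fixed seed gives a repeatable order
    rng.shuffle(trials)
    return trials


trials = build_trial_list(seed=0)
print(len(trials))  # 243
```

Shuffling the full list, rather than drawing files at random, ensures each file is played exactly once per listener.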
If the source voice is that of a middle-aged male, changing the formants can turn it into a young male voice or a child's voice. For a voice to sound natural, a correct combination of pitch and formants is required. Over a pitch range of 50 Hz to 500 Hz with a step size of 30 Hz and a formant range of 0.1 to 2 with a step size of 0.1, at least one natural-sounding synthesized voice was found at each pitch step. These two parameters make it possible to quantify a voice, since the range of pitch and formants over which a synthesized voice sounds natural can be found. Humans can clearly differentiate a disguised voice from a person’s everyday voice. For future work, the subjective testing can be carried out on a larger scale to obtain more accurate results, and can cover more parameters such as the uniqueness and naturalness of a voice. The database used for the test can be expanded with the natural-sounding synthesized voices from the objective measurement. For the objective measurement, a more accurate range of formants that changes the age category of a voice can be investigated.
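The pitch and formant search space described above can be written out as a grid. This is only a sketch: the variable names are hypothetical, and treating the formant values 0.1 to 2 as unitless scaling factors is an assumption, since the record gives no unit for them.

```python
from itertools import product

# Pitch values from 50 Hz to 500 Hz in 30 Hz steps (both ends included).
pitches_hz = list(range(50, 501, 30))

# Formant values from 0.1 to 2.0 in steps of 0.1. Built from integers
# to avoid accumulating floating-point error.
# Assumption: these are unitless scaling factors.
formant_factors = [round(0.1 * k, 1) for k in range(1, 21)]

# Every (pitch, formant) combination that could be synthesized and rated.
grid = list(product(pitches_hz, formant_factors))

print(len(pitches_hz), len(formant_factors), len(grid))  # 16 20 320
```

Under these assumptions there are 16 pitch steps, and the finding above corresponds to at least one natural-sounding formant setting in each of them.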
Final Year Project (FYP)
Nanyang Technological University