There are numerous multimedia applications such as talking head, lip reading, lip synchronization, and computer assisted pronunciation training, which entices researchers to bring clustering and analyzing viseme into focus. With respect to the fact that clustering and analyzing visemes are language dependent process, we concentrated our research on Persian language, which indeed has suffered from the lack of such study. To this end, we proposed a novel adopting image-based approach which consists of four main steps including (a) extracting the lip region, (b) obtaining Eigenviseme of each phoneme considering coarticulation effect, (c) mapping each viseme into its subspace and other phonemes' subspaces in order to create the distance matrix so as to calculate the distance between viseme's cluster, and finally (d) comparing similarity of each viseme based on the weight value of reconstructed one. In order to indicate the robustness of the proposed algorithm, three sets of experiments were conducted on Persian and English databases in which Consonant/Vowel and Consonant/Vowel/Consonant syllables were examined. The results indicated that the proposed method outperformed the observed state-of-the-art algorithms in feature extraction, and it had a comparable efficiency in generating adequate clusters. Moreover, obtained results reached a milestone in grouping Persian visemes with respect to the perceptual test given by volunteers.
- Audio/visual processing
- Computer assisted pronunciation training
- Eigen space
- Multimedia systems
- Persian viseme clustering