Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. Here we develop and evaluate voice quality models for seven voice and speech dimensions (intelligibility, imprecise consonants, harsh voice, naturalness, monoloudness, monopitch, and breathiness). Probes were trained on the public Speech Accessibility Project (SAP) dataset with 11,184 samples from 434 speakers, using embeddings from frozen pre-trained models as features. We found that our probes had both strong performance and strong generalization across speech elicitation categories in the SAP dataset. We further validated zero-shot performance on additional datasets, encompassing unseen languages and tasks: Italian atypical speech, English atypical speech, and affective speech. The strong zero-shot performance and the interpretability of results across an array of evaluations suggest the utility of voice quality dimensions in tasks related to speaking style.
- † Work done while at Apple