Acta Scientiarum Naturalium Universitatis Pekinensis (Journal of Peking University, Natural Science Edition)

Research on Speech Synthesis for Large-Scale Corpora

YU Yansuo, ZHU Fengyun, LI Xiangang, LIU Yi, WU Xihong

  1. Speech and Hearing Research Center, Key Laboratory of Machine Perception (MOE), Peking University, Beijing 100871
  • Received: 2013-05-09  Online: 2014-09-20  Published: 2014-09-20

Abstract: For roughly labeled corpora containing several hundred hours of speech, a novel approach to constructing a text-to-speech system is proposed. First, the large-scale corpus is automatically cleaned and labeled by means of speech recognition, text alignment, and syntactic parsing. Then, to address the excessive memory consumption and training time of acoustic model training on large-scale corpora, the conventional training procedure is optimized, significantly accelerating model training without loss of acoustic model accuracy. Subjective evaluations show that, compared with a small corpus with precise labels, introducing the roughly labeled large-scale corpus yields an improvement of about 0.5 in MOS.

Key words: speech data selection, acoustic model training, HMM-based unit selection and waveform concatenation
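The corpus cleaning described in the abstract relies on checking how well a rough transcript agrees with what a speech recognizer actually hears. A minimal sketch of such agreement-based filtering is shown below; the character-error-rate threshold and all function names are illustrative assumptions, not details from the paper:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two strings, computed with a
    # single rolling row of the dynamic-programming table.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def select_utterances(pairs, max_cer=0.1):
    """Keep utterances whose ASR hypothesis agrees with the rough
    transcript up to a character error rate (CER) of max_cer.

    pairs: iterable of (utterance_id, reference_text, asr_hypothesis).
    """
    kept = []
    for utt_id, ref, hyp in pairs:
        cer = edit_distance(ref, hyp) / max(len(ref), 1)
        if cer <= max_cer:
            kept.append(utt_id)
    return kept
```

Utterances whose recognizer output diverges too far from the transcript are dropped, so only reliably labeled speech enters acoustic model training.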

CLC number: