Practical Guide to Azure Custom Neural Voice: Essential Tips for Success
Custom Neural Voice (CNV) is a feature of Azure Cognitive Services that allows you to create a personalized, synthetic voice for your applications. This text-to-speech capability enables you to develop a highly natural-sounding voice for your brand or characters by using human speech samples as training data.
Recently, I worked on a project involving the generation of a custom voice, and I encountered some features and hidden issues not covered in the official documentation. Therefore, I would like to share some tips and tricks in this article. Since the theoretical aspects are well-documented, the advice in this post is primarily based on my personal experience. I hope you find these insights useful. Let’s dive in!
Audio Recording
First, prepare a well-balanced script. A proper mix of questions, exclamations, and statements matters more than making the training set closely match the target domain. In summary, a good dataset should include:
Statement sentences: 70-80%
Questions: 10-20%, split equally between rising and falling intonation (yes/no questions usually rise, whereas wh-questions usually fall)
Exclamation sentences: 10-20%
Short words/phrases: 10%
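A quick way to check a draft script against these proportions is to count sentence types by their final punctuation. The sketch below is only a rough heuristic and not part of any Azure tooling: the function names are hypothetical, it assumes one utterance per line, and it ignores both the short word/phrase category and the rising/falling split for questions.

```python
from collections import Counter

# Target shares quoted in the list above, as (min, max) fractions.
TARGETS = {
    "statement": (0.70, 0.80),
    "question": (0.10, 0.20),
    "exclamation": (0.10, 0.20),
}


def classify(line):
    """Label a script line by its final punctuation mark (rough heuristic)."""
    text = line.strip().rstrip("\"')")  # ignore closing quotes and brackets
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"


def script_mix(lines):
    """Return the share of each sentence type among the non-empty lines."""
    labels = [classify(line) for line in lines if line.strip()]
    if not labels:
        return {}
    counts = Counter(labels)
    return {kind: counts[kind] / len(labels) for kind in TARGETS}


def out_of_range(mix):
    """List the sentence types whose share falls outside its target range."""
    return [
        kind
        for kind, (low, high) in TARGETS.items()
        if not low <= mix.get(kind, 0.0) <= high
    ]
```

Calling `out_of_range(script_mix(open("script.txt", encoding="utf-8")))` on a hypothetical `script.txt` returns the categories that still need attention; an empty list means the mix is within the ranges above.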
Sound editing software
There are several possible solutions, such as Adobe Audition or Audacity. I recommend Audacity, not just because it’s free while Adobe Audition is paid, but because its limited functionality is ideal for our needs. We only need to select an utterance, export it, and then cut it from the track. Minimalism is the key to success. Additionally, Audacity makes it easier to navigate the tracks and allows you to hide unnecessary toolbars.
The File Menu in Audacity provides commands for creating, opening, and saving projects, as well as importing and exporting audio files. The command for exporting the selected audio has no keyboard shortcut assigned by default, so you can easily bind it to a key of your choice. This significantly speeds up the processing. Having worked with both Adobe Audition and Audacity, I found that I could complete the same amount of work in 2 days with Audacity, compared to 4 days with Adobe Audition.
Price
Model type: Neural V5.2022.05
Engine version: 2023.01.16.0
Training hours: 30.48
Data size: 440 utterances
Price: $1584.27
The price may vary depending on the engine version and the number of training hours, but this sample should give you a reference point.
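To make the sample easier to compare with your own run, the figures above can be turned into simple ratios. This is only arithmetic on the numbers quoted in this section, not an official Azure rate, and the variable names are mine.

```python
# Figures copied from the sample above.
training_hours = 30.48
utterances = 440
price_usd = 1584.27

# Derived ratios; these are not official Azure rates.
price_per_hour = price_usd / training_hours
price_per_utterance = price_usd / utterances

print(f"Price per training hour: ${price_per_hour:.2f}")   # $51.98
print(f"Price per utterance:     ${price_per_utterance:.2f}")  # $3.60
```

In this sample the cost works out to roughly $52 per training hour, so the number of training hours is the figure to watch.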
Intake form
As you probably know, access is granted only after you fill in the Intake Form, and the decision is based on eligibility and usage criteria. Before providing your project information, review Microsoft’s Responsible AI Standards. This will help you adjust the description and the scenario accordingly.
Audio Preparation
The process is quite straightforward. Create a text file (for example, in Notepad) listing all the utterances and their IDs. Select the utterances one by one, export each one, save it under its ID, and then delete that entry from the list. Choose a comfortable zoom level beforehand, and avoid zooming in or out during the work: once you get used to the timeline scale, it becomes much easier to add the required 100-200 milliseconds of silence.
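After exporting, the silence padding can be verified automatically rather than by eye. The sketch below uses only the Python standard library and rests on several assumptions of mine: the clips are 16-bit mono PCM WAV files, the 100-200 ms requirement applies to both the start and the end of each clip, and an amplitude threshold of 500 separates silence from speech. All function names are hypothetical, and the threshold will need tuning for your recordings.

```python
import sys
import wave
from array import array


def edge_silence_ms(samples, sample_rate, threshold=500):
    """Return (leading_ms, trailing_ms) of near-silence around an utterance.

    samples: sequence of signed 16-bit mono PCM values.
    threshold: absolute amplitude below which a sample counts as silence
               (an assumed value, not an Azure requirement).
    """
    loud = [i for i, value in enumerate(samples) if abs(value) >= threshold]
    if not loud:
        # The whole clip is silent.
        total = len(samples) * 1000.0 / sample_rate
        return total, total
    leading = loud[0] * 1000.0 / sample_rate
    trailing = (len(samples) - 1 - loud[-1]) * 1000.0 / sample_rate
    return leading, trailing


def read_mono_16bit(path):
    """Load a 16-bit mono WAV file as (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array("h", wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        return samples, wav.getframerate()


def silence_ok(leading_ms, trailing_ms, low=100, high=200):
    """Check that both edges fall inside the 100-200 ms window."""
    return low <= leading_ms <= high and low <= trailing_ms <= high
```

For a single clip, `silence_ok(*edge_silence_ms(*read_mono_16bit("0001.wav")))` reports whether the padding is within range, where `0001.wav` stands for one of your exported utterance files.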