Practical Guide to Azure Custom Neural Voice: Essential Tips for Success

A teaser image generated by DALL·E 3

Custom Neural Voice (CNV) is a feature of Azure Cognitive Services that allows you to create a personalized, synthetic voice for your applications. This text-to-speech capability enables you to develop a highly natural-sounding voice for your brand or characters by using human speech samples as training data.

Recently, I worked on a project involving the generation of a custom voice, and I encountered some features and hidden issues not covered in the official documentation. Therefore, I would like to share some tips and tricks in this article. Since the theoretical aspects are well-documented, the advice in this post is primarily based on my personal experience. I hope you find these insights useful. Let’s dive in!

Audio Recording

First, prepare a well-balanced script. A proper mix of questions, exclamations, and statements matters more than making the training set closely match the target domain. In summary, a good dataset should include:

Statement sentences: 70-80%

Questions: 10-20%, with an equal number of rising and falling intonations (rising intonation is used for yes/no questions, whereas falling intonation is very common in wh-questions)

Exclamation sentences: 10-20%

Short words/phrases: 10%

You can refer to this repository to compose your dataset. In the repository, statement sentences are labeled starting with 00, questions with 01, and exclamations with 02.
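To check a draft script against the proportions above, a rough sketch like the following can help. It is only a heuristic: it assumes that questions end with "?", exclamations end with "!", and that anything of three words or fewer counts as a short phrase. The function names and the three-word threshold are my own illustration, not part of any Azure tooling.

```python
from collections import Counter


def classify(utterance: str, short_max_words: int = 3) -> str:
    """Classify one script line by its final punctuation and length (heuristic)."""
    text = utterance.strip()
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    # Assumption: a "short word/phrase" is anything of short_max_words words or fewer.
    if len(text.split()) <= short_max_words:
        return "short"
    return "statement"


def script_mix(utterances: list[str]) -> dict[str, float]:
    """Return the percentage share of each sentence type, ignoring blank lines."""
    counts = Counter(classify(u) for u in utterances if u.strip())
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


sample = [
    "The meeting starts at nine tomorrow morning.",
    "Are you coming with us?",
    "What a wonderful surprise!",
    "Good morning.",
]
print(script_mix(sample))
```

The rising/falling balance within questions still has to be checked by ear or by hand, since punctuation alone does not reveal intonation.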

Secondly, it is highly beneficial to have a monitor in the recording room. If a monitor is not available, you can print out three copies of your script: one for yourself, one for the voice talent, and one for the sound engineer (if that’s not you). The most convenient format is a Word table with three columns: number, utterance, and status (to mark the processed phrases). Don’t forget to alternate the shading of rows or columns in the table to facilitate easier navigation.
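If you keep the script in a plain text list, the three-column tracking sheet can be generated rather than typed. The sketch below writes CSV text with the columns number, utterance, and status; the four-digit numbering and the function name are assumptions for illustration. The result can be opened in a spreadsheet or pasted into a Word table.

```python
import csv
import io


def tracking_sheet(utterances: list[str]) -> str:
    """Build CSV text with the columns number, utterance, status."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["number", "utterance", "status"])
    for number, utterance in enumerate(utterances, start=1):
        # The status column is left empty so it can be filled in during the session.
        writer.writerow([f"{number:04d}", utterance, ""])
    return buffer.getvalue()


print(tracking_sheet(["Hello there.", "Is it raining?"]))
```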

Finally, I recommend recording all the utterances in one session with regular pauses, rather than saving the recordings in segments. I’ve tried both approaches and can assure you that there’s no performance gain from multiple exports (e.g., 0-100, 100-200, etc.). Additionally, segmenting the recordings makes the process longer, which is especially problematic if your voice talent has a busy schedule.

If there are errors during the recording, don’t cut out these parts during the session. Instead, note the timestamp and remove the errors during the pre-processing stage. This is why it’s better to make a single, large export; with your timestamps, you can easily locate and correct the errors later. Make sure to include long pauses between the utterances (at least 3 seconds) to facilitate easier editing during the processing stage.
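One way to make the noted timestamps easier to find later is to turn them into a label track. The sketch below assumes the notes are written as mm:ss or hh:mm:ss and produces tab-separated lines of start, end, and label in seconds, which is, as far as I know, the text format Audacity accepts when importing labels. The five-second window and the function names are arbitrary choices of mine.

```python
def to_seconds(stamp: str) -> float:
    """Convert 'mm:ss' or 'hh:mm:ss' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def label_track(notes: list[tuple[str, str]], window: float = 5.0) -> str:
    """Return label text: start<TAB>end<TAB>label, one line per noted error."""
    lines = []
    for stamp, label in notes:
        start = to_seconds(stamp)
        # Assumption: marking a fixed window after the noted time is enough to spot the error.
        lines.append(f"{start:.3f}\t{start + window:.3f}\t{label}")
    return "\n".join(lines) + "\n"


print(label_track([("12:30", "retake 0042"), ("1:02:03", "cough")]))
```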

Sound editing software

There are several possible solutions, such as Adobe Audition or Audacity. I recommend using Audacity, not just because it’s free while Adobe Audition is paid, but because Audacity’s limited functionality is ideal for our needs. We only need to select the utterance, export it, and cut it out. Minimalism is the key to success. Additionally, Audacity makes it easier to navigate the tracks and allows you to minimize unnecessary toolboxes.

The File menu in Audacity provides commands for creating, opening, and saving projects, as well as importing and exporting audio files. The command for exporting the selected audio has no keyboard shortcut assigned by default, so you can assign one yourself and export each selection with a single keystroke. This significantly speeds up the processing. Having worked with both Adobe Audition and Audacity, I found that I could complete the same amount of work in 2 days with Audacity, compared to 4 days with Adobe Audition.

Price

Here are my project details:

Model type: Neural V5.2022.05

Engine version: 2023.01.16.0

Training hours: 30.48

Data size: 440 utterances

Price: $1584.27

The price may vary depending on the engine version and the number of training hours, but at least you have a sample.
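For a rough estimate of your own budget, you can derive the hourly rate implied by the figures above. This is plain arithmetic on my sample, not an official price list, and `estimate_cost` is a hypothetical helper that assumes the cost scales linearly with training hours.

```python
training_hours = 30.48
total_price_usd = 1584.27

# Rate implied by the sample project above; your actual rate may differ.
implied_rate = total_price_usd / training_hours
print(f"Implied rate: ${implied_rate:.2f} per training hour")


def estimate_cost(hours: float, rate: float = implied_rate) -> float:
    """Estimate the training cost in USD, assuming linear scaling with hours."""
    return round(hours * rate, 2)


print(estimate_cost(20))
```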

Intake form

You probably know that access is granted only after you fill in the intake form, and that the decision is based on eligibility and usage criteria. Before providing the project information, please refer to Microsoft’s Responsible AI Standards. This will help you adjust the description and the scenario accordingly.

Audio Preparation

The process is quite straightforward. Create a text file listing all the utterances and their IDs. Select the utterances one by one, export each selection, save it under its ID, and then delete that line from the text file. Choose a comfortable zoom level beforehand and avoid zooming in or out while you work: once you get used to the timeline scale, it becomes much easier to include the required 100-200 milliseconds of silence by eye.
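If some exported clips end up trimmed too tightly, the silence can also be added afterwards. The sketch below uses only Python's standard `wave` module and assumes 16-bit (or wider) PCM WAV files and that the silence is wanted at both the start and the end of each clip. The function name and the 150 ms default are my own choices within the 100-200 ms range.

```python
import wave


def pad_with_silence(src_path: str, dst_path: str, pad_ms: int = 150) -> None:
    """Copy a PCM WAV file, adding pad_ms of digital silence at both ends."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    pad_frames = params.framerate * pad_ms // 1000
    # Assumption: signed PCM of 16 bits or more, where silence is all-zero bytes.
    # (8-bit WAV is unsigned, so zero bytes would not be silent there.)
    silence = b"\x00" * (pad_frames * params.nchannels * params.sampwidth)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected automatically on close.
        dst.writeframes(silence + frames + silence)
```

Pure digital silence can sound slightly different from the room tone of the recording, so selecting the natural pause in the editor remains the better option when it is available.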

You can pre-upload your training files to the Speech Studio before starting the training process. The studio will analyze the quality of your data and generate the following scores and notes:

Pronunciation Score: Measures how correctly each word is spoken in your audio. It is calculated based on a speech recognition model and converted to a score system, with 100 as the highest. Typically, a score below 70 indicates some spoken error or script mismatch. Note that a heavy accent can reduce your pronunciation score and affect the quality of the generated digital voice.

SNR Score: Measures the signal-to-noise ratio. A higher score indicates lower noise in your audio. It is crucial to control the noise level in your training data. Ensure the voice is recorded in a quiet room with a high-quality microphone. Professional studio recordings can typically achieve an SNR score of 50 or higher. Audios with an SNR score below 20 may result in noticeable noise in the generated voice.

Duration: Represents the length of the audio file in the format hh:mm:ss[.fffffff]. Here, hh stands for hours, mm for minutes, ss for seconds, and fffffff for fractional seconds. Fractional seconds are included only if they exist and are always expressed using seven decimal digits.

Import Status: Indicates whether the specified audio file was successfully imported.

Note: Provides the reason for any import failure or additional information regarding the specified audio file.

By pre-uploading your files, you can ensure they meet the necessary quality standards for optimal training results.
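Once you have the scores, the thresholds mentioned above (pronunciation below 70, SNR below 20) can be applied mechanically to decide which utterances to re-record. The sketch below assumes you have copied the values out of Speech Studio into your own records. The `UtteranceReport` class and both helper functions are hypothetical and not part of any Azure SDK.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class UtteranceReport:
    utterance_id: str
    pronunciation: float  # 0-100, higher is better
    snr: float            # signal-to-noise ratio score
    duration: str         # hh:mm:ss[.fffffff]


def parse_duration(text: str) -> timedelta:
    """Parse the hh:mm:ss[.fffffff] duration format into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def needs_review(report: UtteranceReport,
                 min_pronunciation: float = 70.0,
                 min_snr: float = 20.0) -> list[str]:
    """Return the reasons an utterance falls below the thresholds (empty if fine)."""
    reasons = []
    if report.pronunciation < min_pronunciation:
        reasons.append("pronunciation")
    if report.snr < min_snr:
        reasons.append("snr")
    return reasons


reports = [
    UtteranceReport("0001", 92.0, 48.0, "00:00:04.2500000"),
    UtteranceReport("0002", 65.0, 35.0, "00:00:03"),
]
for report in reports:
    print(report.utterance_id, parse_duration(report.duration), needs_review(report))
```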

Automatic Suspend

Endpoint hosting can be expensive, so some companies prefer to keep the endpoint up and running only during working hours. Instead of managing this manually, you might want to automate it. During my first project, I considered creating a Power Automate job to click the suspend button, but fortunately, suspend and resume operations are now available through the REST API. For instance, you can create a time-triggered Azure Function, which can save you at least 30% on costs.

Note that while it is possible to integrate the custom voice endpoint into your Virtual Networks, the API to suspend/resume voice models does not support Private Endpoints.

Conclusion

Azure Custom Neural Voice is an excellent service that enables you to create a high-quality custom text-to-speech solution. Unlike other services that claim to clone any voice with just 3 seconds of audio, this solution is designed for deployment in production environments, ensuring studio-quality recordings and adherence to ethical standards. In this article, I have shared some practical tips that could potentially save you several hours or even days. I hope you find them useful.
