Quickstart Azure Cognitive Services Speech Recognition
In this article I use ACS as an abbreviation for Azure Cognitive Services, for readability.
Cognitive Services Speech Recognition Overview
Before we start, let’s talk about what Cognitive Services is. I will quote Microsoft on this one:
Cognitive Services bring AI within reach of every developer – without requiring machine-learning expertise. All it takes is an API call to embed the ability to see, hear, speak, search, understand and accelerate decision-making into your apps.
Cognitive Services is a set of APIs that let you perform various AI-related operations.
The topic I was curious about is the ability to convert speech to text. You may ask why we need yet another Speech To Text service when we already have, for example, the Web Speech API. First of all, the Web Speech API is not widely supported (for example, at the time of writing Safari doesn’t support it at all). Second, it cannot recognize an audio file, only audio recorded through the browser itself.
The Speech part of Cognitive Services lets you do the following:
- Speech To Text
- Text to Speech
- Intent recognition
- Speech translation
- Conversation transcription
- Voice assistants
In this article, I will mostly cover the usage of Speech To Text.
Create Azure Cognitive Service resource
The first thing that we need is to create our ACS resource. Navigate to your Azure Portal. Press Create a resource and search for Cognitive Services.
Press Create, then enter your resource Name and select the Subscription, Location, Pricing tier (S0 will work for trial and testing) and resource group (if you don’t have one, create a new one). Then press the Create button. It can take a couple of minutes before your resource becomes available.
Use ACS Speech SDK
Install ACS Speech SDK in your project
To use the ACS Speech SDK in a project (PCF or any other JS project), we first need to install it. We can do that with npm. Run the following command in a console: npm install microsoft-cognitiveservices-speech-sdk
After the installation, we need to import the SDK to use it, for example: import * as SpeechSDK from "microsoft-cognitiveservices-speech-sdk";
Now we can use the SDK in our project.
Main methods and events of ACS Speech SDK
To perform speech recognition we need to create a SpeechRecognizer from the SDK. To do so, we first need two supporting objects: AudioConfig and SpeechConfig.
AudioConfig contains all the necessary information about the audio input method.
SpeechSDK.AudioConfig offers several methods for creating an AudioConfig object. The two most commonly used create it from an audio file or from the microphone.
Create from an audio file with SpeechSDK.AudioConfig.fromWavFileInput(audioFile) (NOTE: right now only wav files are supported),
where audioFile is a File object obtained from an HTML input or generated in code.
Create from microphone input with SpeechSDK.AudioConfig.fromDefaultMicrophoneInput().
Now that we have our AudioConfig object, we need to create the next one: SpeechConfig. This object contains information about your subscription, region and recognition language.
You can create a SpeechConfig object using one of several methods:
- fromAuthorizationToken - creates SpeechConfig instance from authorization token and region;
- fromEndpoint - creates SpeechConfig instance with specified endpoint and subscription key;
- fromSubscription - creates SpeechConfig instance with the specified region and subscription key.
To find out more about SpeechConfig, check the official reference page.
In this tutorial, we will use the simplest one: fromSubscription. For this method, we need to know our subscription key and region. To find them, open your Cognitive Services resource. Under the Quick start section, you will find your Subscription Key.
To find your region, open the Overview section and look for Location.
As the subscription region, you need to pass the location in lowercase and without spaces. For example, my resource location is North Europe, so I will pass “northeurope” as the region.
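The location-to-region rule above can be sketched as a small helper. The name toRegionId is hypothetical and not part of the Speech SDK; the sketch only assumes that lowercasing and removing whitespace is all the conversion needs, as described in the text.

```typescript
// Hypothetical helper (not part of the Speech SDK): converts the Location
// label shown in the Azure Portal into the region identifier format
// described above, i.e. lowercase with all whitespace removed.
function toRegionId(location: string): string {
  return location.trim().toLowerCase().replace(/\s+/g, "");
}

console.log(toRegionId("North Europe")); // northeurope
```

The resulting string is what would be passed as the region argument of fromSubscription.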
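Create the SpeechConfig by calling SpeechSDK.SpeechConfig.fromSubscription with your subscription key and region.
Next, we need to specify the speech recognition language as part of SpeechConfig. In my case, I will use English (US): “en-US”. You can find the list of supported languages in the Speech service language support documentation.
To set the language, assign it to the speechRecognitionLanguage property of the SpeechConfig.
Now we have everything we need to create a SpeechRecognizer object by passing the SpeechConfig and AudioConfig to its constructor.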
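The steps so far can be put together roughly as follows. This is a sketch under stated assumptions, not the article's original listing: the SDK is passed in as a parameter typed by SpeechSdkLike, a hypothetical minimal interface that mirrors only the three SDK members used here, so the example stays self-contained. The function name createMicrophoneRecognizer is also hypothetical.

```typescript
// Hypothetical minimal view of the SDK members used in this sketch.
// The real module exposes far more; A, C and R stand for its AudioConfig,
// SpeechConfig and SpeechRecognizer types.
interface SpeechSdkLike<A, C extends { speechRecognitionLanguage: string }, R> {
  AudioConfig: { fromDefaultMicrophoneInput(): A };
  SpeechConfig: { fromSubscription(subscriptionKey: string, region: string): C };
  SpeechRecognizer: new (speechConfig: C, audioConfig: A) => R;
}

// Builds a recognizer that listens to the default microphone.
function createMicrophoneRecognizer<A, C extends { speechRecognitionLanguage: string }, R>(
  sdk: SpeechSdkLike<A, C, R>,
  subscriptionKey: string,
  region: string,   // e.g. "northeurope"
  language: string  // e.g. "en-US"
): R {
  // 1. Where the audio comes from.
  const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
  // 2. Which subscription and region to use.
  const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, region);
  // 3. Which language to recognize.
  speechConfig.speechRecognitionLanguage = language;
  // 4. The recognizer combines both configuration objects.
  return new sdk.SpeechRecognizer(speechConfig, audioConfig);
}
```

With the SDK imported as SpeechSDK, the same steps are written directly as SpeechSDK.AudioConfig.fromDefaultMicrophoneInput(), SpeechSDK.SpeechConfig.fromSubscription(key, region) and new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig).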
After creating the speech recognizer, we can perform the recognition itself. There are two recognition methods: recognizeOnceAsync and startContinuousRecognitionAsync. They suit different purposes:
- recognizeOnceAsync - short-form audio recognition. It waits until a long silence or 15 seconds of audio;
- startContinuousRecognitionAsync - continuous audio stream recognition. It stops only after the stopContinuousRecognitionAsync method is called. You need to subscribe to events to receive and evaluate results.
In this tutorial, we will focus on the first method.
recognizeOnceAsync has two optional parameters: a success callback function and an error callback function. If you call recognizeOnceAsync without parameters, you need to subscribe to the recognized event to receive results (shown later). The success callback receives a SpeechRecognitionResult object, which has two properties that we need to determine the outcome: reason and text. The reason property is an enum value holding a result code such as SpeechSDK.ResultReason.RecognizedSpeech or SpeechSDK.ResultReason.Canceled. If the reason is RecognizedSpeech, the result object also contains the text property with the final recognition result.
See the code below for an example:
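A minimal sketch of how such a result might be handled. The names describeResult, RecognitionResultLike and ReasonCodes are hypothetical; the two interfaces mirror only the fields of SpeechRecognitionResult and SpeechSDK.ResultReason that the sketch reads. The numeric values in demoReasons are stand-ins for illustration, and real code would pass SpeechSDK.ResultReason instead.

```typescript
// The fields of a recognition result that this sketch reads.
interface RecognitionResultLike {
  reason: number;
  text: string;
}

// The subset of result reasons that this sketch distinguishes.
interface ReasonCodes {
  RecognizedSpeech: number;
  NoMatch: number;
  Canceled: number;
}

// Turns a recognition result into a human-readable message.
function describeResult(result: RecognitionResultLike, reasons: ReasonCodes): string {
  switch (result.reason) {
    case reasons.RecognizedSpeech:
      return `Recognized: ${result.text}`;
    case reasons.NoMatch:
      return "Speech could not be recognized.";
    case reasons.Canceled:
      return "Recognition was canceled.";
    default:
      return `Unhandled result reason: ${result.reason}`;
  }
}

// Stand-in reason codes for illustration only (assumption, not read from the SDK).
const demoReasons: ReasonCodes = { NoMatch: 0, Canceled: 1, RecognizedSpeech: 3 };
console.log(describeResult({ reason: 3, text: "hello world" }, demoReasons)); // Recognized: hello world

// With a real recognizer the call would look roughly like this:
// recognizer.recognizeOnceAsync(
//   (result) => console.log(describeResult(result, SpeechSDK.ResultReason)),
//   (error) => console.error(error)
// );
```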
Finally, let’s explore the event handlers that are available to us in the SpeechRecognizer class:
- sessionStarted - triggered when a new session is started with the speech service;
- sessionStopped - triggered when the session ends;
- speechStartDetected - triggered when the start of speech is detected;
- speechEndDetected - triggered when the end of speech is detected;
- recognizing - triggered when intermediate recognition results are received;
- recognized - triggered when the final recognition result is received.
For example, suppose you want to show what the recognition service is processing right now. You can subscribe to the recognizing event by assigning a handler to the recognizer’s recognizing property, and show the intermediate recognition results as they arrive.
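A minimal sketch of such a subscription. In the SDK, event handlers are plain properties on the recognizer that receive the sender and an event object whose result carries the text. InterimSource and showInterimResults are hypothetical names; the interface mirrors only the recognizing property so that the example runs without the SDK.

```typescript
// The part of a recognition event that this sketch reads.
interface InterimEvent {
  result: { text: string };
}

// Hypothetical stand-in for the recognizer: only the recognizing handler.
interface InterimSource {
  recognizing: (sender: unknown, event: InterimEvent) => void;
}

// Forwards every intermediate recognition result to the render callback.
function showInterimResults(recognizer: InterimSource, render: (text: string) => void): void {
  recognizer.recognizing = (_sender, event) => {
    render(event.result.text);
  };
}
```

In a component, render could for example write the text into a DOM element so the user sees the phrase being built up while speaking.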
If you want to find out more about the SpeechRecognizer class, visit the official reference page.
Using ACS in PCF (local development)
This part describes the specifics of using ACS in PCF components or, more precisely, of testing them locally in the harness.
As described previously, the two most common options for creating an AudioConfig are from a wav file or from the microphone. To test the file option, I added an HTML input to my test component. To my surprise, it was hidden. A simple check revealed that by default the harness adds styling that hides all inputs of type “file” and removes all their events. I think this is because we have a method available under context.device for this purpose. It is called pickFile (don’t forget to enable Device.pickFile under feature-usage in ControlManifest.Input.xml).
However, when I tried to use context.device.pickFile, it didn’t open the file selection window and just returned undefined. After a short investigation, I noticed that the harness uses Xrm.Proxy, which returns undefined by default. In a real environment, it works as intended. So what should we do in local development? For me, the solution was to use a regular HTML input in local development and switch to pickFile for real deployment. To override the default styling for the input you can use the CSS below:
The next thing you need to know when using pickFile is that it returns a FileObject array. However, AudioConfig only accepts a File. So you will need a small conversion function.
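A minimal sketch of such a conversion, not the original listing. It assumes that the FileObject's fileContent property holds the file as a base64-encoded string. PickedFile mirrors only the FileObject fields used here, and the names base64ToBytes and fileObjectToFile are hypothetical.

```typescript
// Mirrors the FileObject fields used in this sketch.
interface PickedFile {
  fileContent: string; // assumed to be base64-encoded
  fileName: string;
  mimeType: string;
}

// Decodes a base64 string into raw bytes.
function base64ToBytes(base64: string) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Wraps the decoded bytes in a File that keeps the original name and MIME type.
function fileObjectToFile(picked: PickedFile): File {
  return new File([base64ToBytes(picked.fileContent)], picked.fileName, {
    type: picked.mimeType,
  });
}
```

The resulting File can then be used in the same way as one obtained from a regular HTML input.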
You can find the full code with a sample component via this link.