Phoneme Detection and Classifier Model Codes #238
Open: AnirudhBHarish wants to merge 8 commits into microsoft:master from AnirudhBHarish:kws_phoneme
Changes from 6 commits
Commits (all by AnirudhBHarish):
c43e777  Phoneme detection and classifier model codes
7203332  Add license
43a9b37  Remove redundant functions
113ab23  finish documenting kwscnn
35e0159  Fix typos
11b718e  Fix typos and punctuation
ecd1d09  Minor modifications to comments and punctuation
8c78dad  Incorporate reviewer comments
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| # Phoneme-based Keyword Spotting (KWS) | ||
|  | ||
| # Project Description | ||
| Existing KWS systems have two major issues: (a) they are not robust to heavy background noise and random utterances, and (b) they require collecting a lot of data, which makes it hard to add a new keyword. To tackle these issues from a different perspective, we propose a two-stage scheme: a model predicts phonemes, which are in turn used for phoneme-based keyword classification. | ||
|  | ||
| First, we train a phoneme classification model that gives the phoneme transcription of the input speech snippet. To train this phoneme classifier, we use a large public speech dataset such as LibriSpeech. The public dataset can be aligned (meaning we can get the phoneme labels for each speech snippet in the data) using the Montreal Forced Aligner. We also add reverberation and additive noise to the speech samples from the public dataset to make the phoneme classifier robust to various accents, background noise and varied environments. In this project, we predict phonemes every 10 ms, which is standard practice. You can find the aligned LibriSpeech dataset we used for training here. | ||
|  | ||
| In the second part, we use the phoneme outputs predicted by the phoneme classifier to predict the input keyword. We train a 1-layer FastGRNN classifier that takes the phoneme transcription as input and predicts the keyword. Since the phoneme classifier has been trained to account for diverse accents, background noise and environments, the keyword classifier can be trained using a small number of Text-To-Speech (TTS) samples generated using any standard TTS API from cloud services like Azure, Google Cloud or AWS. | ||
|  | ||
| This gives two advantages: (a) the phoneme model is trained to account for diverse accents and background noise settings, so the flexible keyword classifier training requires only a small number of keyword samples, and (b) empirically, this method was able to detect keywords from as far as 9 ft away. Further, the phoneme model is small, at around 250k parameters, and can fit on a Cortex M7 microcontroller. | ||
|  | ||
| # Training the Phoneme Classifier | ||
| 1) Train a phoneme classification model on a public speech dataset such as LibriSpeech. | ||
| 2) The training speech dataset can be labelled using the Montreal Forced Aligner. | ||
| 3) Speech snippets are convolved with reverberation files, and additive noise from YouTube or other open sources is added. | ||
| 4) We also add white Gaussian noise at various SNRs. | ||
|  | ||
| # Training the KWS Model | ||
| 1) Our method takes the speech snippet as input and passes it through the phoneme classifier. | ||
| 2) Keywords are detected by training a keyword classifier over the detected phonemes. | ||
| 3) To train the keyword classifier, we use the Azure and Google Text-To-Speech APIs to get the training data (keyword snippets). | ||
| 4) For example, if you want to train a keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named after the keyword. More details about how the generated TTS data should be stored are given below in the sample use case for classifier model training. | ||
|  | ||
| # Sample Use Cases | ||
|  | ||
| ## Phoneme Model Training | ||
| The following command can be used to instantiate and train the phoneme model. | ||
| ``` | ||
| python train_phoneme.py --base_path=/path/to/librispeech_data/ --rir_base_path=/path/to/reverb_files/ --additive_base_path=/path/to/additive_noises/ --snr_samples="0,5,10,25,100,100" --rir_chance=0.5 | ||
| ``` | ||
| Some important command line arguments: | ||
| 1) base_path : Path of the speech data folder. The data in this folder should be in accordance with the data-loader code written here. | ||
| 2) rir_base_path, additive_base_path : Paths to the reverb and additive noise files. | ||
| 3) snr_samples : List of SNRs at which the additive noise is to be added. | ||
| 4) rir_chance : Probability with which reverberation is applied to a given speech sample. | ||
|  | ||
| ## Classifier Model Training | ||
| The following command can be used to instantiate and train the classifier model. | ||
| ``` | ||
| python train_classifier.py --base_path=/path/to/train_and_test_data_folders/ --train_data_folders=google30_azure_tts,google30_google_tts --test_data_folders=google30_test --phoneme_model_load_ckpt=/path/to/checkpoint/x.pt --rir_base_path=/mnt/reverb_noise_sampled/ --additive_base_path=/mnt/add_noises_sampled/ --synth | ||
| ``` | ||
| Some important command line arguments: | ||
|  | ||
| 1) base_path : Path to train and test data folders. | ||
| 2) train_data_folders, test_data_folders : These folders should have the .wav files for each keyword in a separate subfolder, in accordance with the data-loader here. | ||
| 3) phoneme_model_load_ckpt : The full path of the checkpoint file used to load the weights into the instantiated phoneme model. | ||
| 4) rir_base_path, additive_base_path : Paths to the reverb and additive noise files. | ||
| 5) synth : Boolean flag specifying whether reverberation and noise addition have to be done. | ||
|  | ||
| Copyright (c) Microsoft Corporation. All rights reserved. | ||
| Licensed under the MIT license. | 
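
The augmentation steps described in the README (optional reverberation, then additive noise at a chosen SNR) can be sketched as follows. This is a minimal illustration and not the code in `train_phoneme.py`: the function names, the toy impulse response and the use of plain Python lists are all assumptions made to keep the example self-contained, and the SNR values are assumed to be in dB. A real pipeline would more likely operate on NumPy arrays.

```python
import math
import random


def mean_power(x):
    """Average power (mean of squared samples) of a sequence."""
    return sum(v * v for v in x) / len(x)


def convolve_same_length(signal, rir):
    """Direct-form convolution of signal with rir, truncated to len(signal)."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            if i + j < len(signal):
                out[i + j] += s * h
    return out


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        return list(speech)
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    # zip() assumes the noise is at least as long as the speech
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, rir, noise, snr_choices, rir_chance, rng):
    """Apply reverb with probability rir_chance, then add noise at a random SNR."""
    if rng.random() < rir_chance:
        speech = convolve_same_length(speech, rir)
    snr_db = rng.choice(snr_choices)
    return mix_at_snr(speech, noise, snr_db), snr_db


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
rir = [1.0, 0.0, 0.3, 0.0, 0.1]  # toy impulse response, not a measured one
noisy, snr_db = augment(speech, rir, noise, [0, 5, 10, 25], 0.5, rng)
print(f"mixed {len(noisy)} samples at {snr_db} dB SNR")
```

In this sketch the SNR is measured against the (possibly reverberated) speech, which is one reasonable convention; the PR's data loader may define it differently.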
  
    
    
  
  
    
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| # Auxiliary Files to help Download and Prepare the Data | ||
|  | ||
| ## YouTube Additive Noise | ||
| Run the following command to download the CSV file that lists the YouTube additive noise data: | ||
|  | ||
| ``` | ||
| wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/balanced_train_segments.csv | ||
| ``` | ||
| Then run the extraction script to download the actual data: | ||
| ``` | ||
| python download_youtube_data.py --csv_file=/path/to/csv_file.csv --target_folder=/path/to/target/folder/ | ||
| ``` | ||
|  | ||
| Please check [Google's Audioset data page](https://research.google.com/audioset/download.html) for further details. | ||
|  | ||
| The downloaded files need to be converted to 16 kHz for our pipeline. Run the following to do so: | ||
| ``` | ||
| python convert_sampling_rate.py --source_folder=/path/to/source/folder/ --target_folder=/path/to/target/16KHz_folder/ --fs=16000 --log_rate=100 | ||
| ``` | ||
| The script can convert the sampling rate of any .wav file to the specified --fs, but for our applications we use 16 kHz only.<br/> | ||
| [Review thread: ShikharJ marked this conversation as resolved (outdated).] | ||
| Choose the log rate to set how often progress is printed during the sample rate conversion: a line is printed every log_rate files. | ||
|  | ||
| Copyright (c) Microsoft Corporation. All rights reserved. | ||
| Licensed under the MIT license. | ||
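
As a rough illustration of what the downloaded CSV provides, the sketch below reads segment rows (YouTube ID, start time, end time) from CSV text. The sample text is illustrative only: the first ID and times come from the comments in `download_youtube_data.py`, while the second row and all labels are made up. The sketch assumes that header lines start with `#`; the script in this PR skips the first three lines by position instead. `read_segments` is a hypothetical helper name.

```python
import csv
import io

# Illustrative CSV text; the second row and the label values are hypothetical.
SAMPLE_CSV = """# header line 1
# header line 2
# YTID, start_seconds, end_seconds, positive_labels
-0RWZT-miFs, 420.000, 430.000, "label_a,label_b"
hypothetical01, 30.000, 40.000, "label_c"
"""


def read_segments(csv_text):
    """Return (youtube_id, start_sec, end_sec) tuples, skipping '#' header lines."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    segments = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        # Times are stored as decimals; truncate to whole seconds as the script does
        segments.append((row[0], int(float(row[1])), int(float(row[2]))))
    return segments


for youtube_id, start, end in read_segments(SAMPLE_CSV):
    print(youtube_id, start, end)
```

Filtering on `#` rather than skipping a fixed number of lines keeps the reader working if the number of header lines changes, at the cost of assuming that no real row starts with `#`.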
        
          
          
applications/KWS_Phoneme/auxiliary_files/convert_sampling_rate.py (45 changes: 45 additions & 0 deletions)
    
  
  
    
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| # Copyright (c) Microsoft Corporation. All rights reserved. | ||
| # Licensed under the MIT license. | ||
|  | ||
| import os | ||
| import librosa | ||
| import numpy as np | ||
| import soundfile as sf | ||
| import argparse | ||
|  | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument('--source_folder', default=None, required=True) | ||
| parser.add_argument('--target_folder', default=None, required=True) | ||
| parser.add_argument('--fs', type=int, default=16000) | ||
| parser.add_argument('--log_rate', type=int, default=1000) | ||
| args = parser.parse_args() | ||
|  | ||
| source_folder = args.source_folder | ||
| target_folder = args.target_folder | ||
| fs = args.fs | ||
| log_rate = args.log_rate | ||
| print(f'Source Folder :: {source_folder}\nTarget Folder :: {target_folder}\nSampling Frequency :: {fs}', flush=True) | ||
|  | ||
| source_files = [] | ||
| target_files = [] | ||
| list_completed = [] | ||
|  | ||
| # Get the list of wav files from the source folder and create target file names (full paths) | ||
| for f in os.listdir(source_folder): | ||
|     if f[-4:].lower() == '.wav': | ||
|         source_files.append(os.path.join(source_folder, f)) | ||
|         target_files.append(os.path.join(target_folder, f)) | ||
| print(f'Saved all the file paths, Number of files = {len(source_files)}', flush=True) | ||
|  | ||
| # Convert the files to args.fs | ||
| # Read with librosa (resampled to fs, mono) and write the audio using soundfile | ||
| print(f'Converting all files to {fs/1000} kHz', flush=True) | ||
| for i, file_path in enumerate(source_files): | ||
|     y, sr = librosa.load(file_path, sr=fs, mono=True) | ||
|     sf.write(target_files[i], y, sr) | ||
|     list_completed.append(target_files[i]) | ||
|     if i % log_rate == 0: | ||
|         print(f'File Number {i+1}, Shape of Audio {y.shape}, Sampling Frequency {sr}', flush=True) | ||
|  | ||
| print(f'Number of Files saved {len(list_completed)}') | ||
| print('Done', flush=True) | 
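
The file-collection step of the script above can be sketched on its own with `pathlib`. This is an illustrative variant, not the PR's code: `pair_wav_files` is a hypothetical helper, and creating the target folder when it is missing is an addition that the script as shown does not make. The resampling itself is left out so that the example runs without `librosa` or `soundfile`.

```python
import tempfile
from pathlib import Path


def pair_wav_files(source_folder, target_folder):
    """Return (source_path, target_path) pairs for every .wav file in source_folder."""
    source_folder = Path(source_folder)
    target_folder = Path(target_folder)
    # Assumption: it is acceptable to create the target folder if it is missing
    target_folder.mkdir(parents=True, exist_ok=True)
    pairs = []
    for src in sorted(source_folder.iterdir()):
        # Case-insensitive extension match, like f[-4:].lower() == '.wav'
        if src.is_file() and src.suffix.lower() == ".wav":
            pairs.append((src, target_folder / src.name))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    for name in ("a.wav", "B.WAV", "notes.txt"):
        (source / name).write_bytes(b"")
    pairs = pair_wav_files(source, Path(tmp) / "target_16k")
    pair_names = sorted(src.name for src, _ in pairs)
    target_exists = (Path(tmp) / "target_16k").is_dir()
    print(pair_names)
```

Each pair could then be passed to `librosa.load(..., sr=fs, mono=True)` and `sf.write(...)` as in the script.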
        
          
          
applications/KWS_Phoneme/auxiliary_files/download_youtube_data.py (42 changes: 42 additions & 0 deletions)
    
  
  
    
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| # Copyright (c) Microsoft Corporation. All rights reserved. | ||
| # Licensed under the MIT license. | ||
|  | ||
| import csv | ||
| import os | ||
| import argparse | ||
|  | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument('--csv_file', default=None, required=True) | ||
| parser.add_argument('--target_folder', default=None, required=True) | ||
| args = parser.parse_args() | ||
|  | ||
| with open(args.csv_file, 'r') as csv_f: | ||
|     reader = csv.reader(csv_f, skipinitialspace=True) | ||
|     # Skip 3 lines ; Header | ||
|     next(reader) | ||
|     next(reader) | ||
|     next(reader) | ||
|     for row in reader: | ||
|         # Logging | ||
|         print(row, flush=True) | ||
|         # Link for the Youtube Video | ||
|         YouTube_ID = row[0] # "-0RWZT-miFs" | ||
|         start_time = int(float(row[1])) # 420 | ||
|         end_time = int(float(row[2])) # 430 | ||
|         # Construct downloadable link | ||
|         YouTube_link = "https://youtu.be/" + YouTube_ID | ||
|         # Output Filename | ||
|         output_file = f"{args.target_folder}/ID_{YouTube_ID}.wav" | ||
|         # Start time in hrs:min:sec format | ||
|         start_sec = start_time % 60 | ||
|         start_min = (start_time // 60) % 60 | ||
|         start_hrs = start_time // 3600 | ||
|         # End time in hrs:min:sec format | ||
|         end_sec = end_time % 60 | ||
|         end_min = (end_time // 60) % 60 | ||
|         end_hrs = end_time // 3600 | ||
|         # Start and End time args | ||
|         time_args = f"-ss {start_hrs}:{start_min}:{start_sec} -to {end_hrs}:{end_min}:{end_sec}" | ||
|         # Command Line Execution | ||
|         os.system(f"youtube-dl -x -q --audio-format wav --postprocessor-args '{time_args}' {YouTube_link}" + " --exec 'mv {} " + f"{output_file}'") | ||
|         print('', flush=True) | 
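
The time formatting and command construction in the script above can be sketched as two small helpers. This is a hedged variant rather than the PR's implementation: `seconds_to_hms` and `build_command` are hypothetical names, the command is returned as an argument list (suitable for `subprocess.run` without a shell) instead of a single `os.system` string, and only the `youtube-dl` options already used in the script appear. The example only builds the command and does not execute it.

```python
import shlex


def seconds_to_hms(total_seconds):
    """Format a whole number of seconds as h:mm:ss."""
    hours = total_seconds // 3600
    minutes = (total_seconds // 60) % 60
    seconds = total_seconds % 60
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def build_command(youtube_id, start_time, end_time, target_folder):
    """Build the youtube-dl argument list for one segment (nothing is run here)."""
    time_args = f"-ss {seconds_to_hms(start_time)} -to {seconds_to_hms(end_time)}"
    output_file = f"{target_folder}/ID_{youtube_id}.wav"
    return [
        "youtube-dl", "-x", "-q",
        "--audio-format", "wav",
        "--postprocessor-args", time_args,
        # {} is the placeholder youtube-dl replaces with the downloaded file path
        "--exec", f"mv {{}} {shlex.quote(output_file)}",
        f"https://youtu.be/{youtube_id}",
    ]


cmd = build_command("-0RWZT-miFs", 420, 430, "/path/to/target/folder")
print(cmd)
```

Passing a list avoids a layer of shell quoting, and `shlex.quote` guards the output path inside the `--exec` string, which matters when the target folder contains spaces.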
      