<h2>EnergyDetector</h2>

<p>For Voice Activity Detection, I have been using the EnergyDetector program of the <a href="http://mistral.univ-avignon.fr/en/index.html" rel="noreferrer">MISTRAL</a> (formerly LIA_RAL) speaker recognition toolkit, based on the ALIZE library.</p>

<p>It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use <code>sfbcep</code>, a utility that is part of the <a href="http://www.irisa.fr/metiss/guig/spro/download.html" rel="noreferrer">SPro</a> signal processing toolkit, in the following way:</p>

<pre><code>sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
</code></pre>

<p>This extracts 19 MFCCs plus the log-energy coefficient, along with the first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.</p>

<p>You will then run EnergyDetector in this way:</p>

<pre><code>EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
</code></pre>

<p>If you use the configuration file that you find at the end of the answer, you need to put <code>output.prm</code> in <code>prm/</code>, and you'll find the segmentation in <code>lbl/</code>.</p>

<p>As a reference, I attach my EnergyDetector configuration file:</p>

<pre><code>*** EnergyDetector Config File ***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
</code></pre>

<h2>CMU Sphinx</h2>

<p>The <a href="http://cmusphinx.sourceforge.net/" rel="noreferrer">CMU Sphinx</a> speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.</p>

<p>A very recent addition is GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See <a href="http://cmusphinx.sourceforge.net/wiki/gstreamer#the_vader_element" rel="noreferrer">Using PocketSphinx with GStreamer and Python -> The 'vader' element</a>.</p>

<h2>Other VADs</h2>

<p>I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.</p>
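<p>As a small post-processing sketch: EnergyDetector writes its segmentation to a label file in <code>lbl/</code>. Assuming each line of that file holds a <code>start end label</code> triple with start/end given as frame indices (check the actual output of your ALIZE build, since the exact format may differ), a few lines of Python can convert it to seconds using the <code>frameLength 0.01</code> value from the configuration above. The function name and file path below are just for illustration:</p>

```python
# Sketch: parse an EnergyDetector label file, ASSUMING each line is
# "start end label" with start/end as frame indices. Verify this
# against the files your ALIZE/LIA_RAL build actually produces.

FRAME_LENGTH = 0.01  # seconds per frame, matching frameLength in the config


def read_segments(path, frame_length=FRAME_LENGTH):
    """Return a list of (start_sec, end_sec, label) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or unexpected lines
            start, end, label = parts
            segments.append((int(start) * frame_length,
                             int(end) * frame_length,
                             label))
    return segments
```

<p>For example, <code>read_segments("lbl/output.lbl")</code> would give you the speech segments in seconds, ready to cut the original audio with your tool of choice.</p>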
 
