FXE Lipsyncer tool

[continued from Voice over questions... c]

original code by 0100010
original NwVault release:

I’ve adopted 0100010’s codebase and created a repository on GitHub:

demo video:
total: 2m22s, lipsyncs start at 30s

what is a lipsyncer? what is an .FXE file? what the heck is going on?

tl;dr: i don’t really know.

That is, I understand the application and what it does: it takes a wave file of the spoken voice, plus the text of what is said, and breaks it down into distinct sound-types (phonemes) that are used to generate a .FXE file. The FXE file is read by the NwN2 engine to play pseudo-realistic mouth movements on the creature that speaks a voiceover.
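To illustrate the idea only (the tool itself is C#, and I haven't inspected its actual phoneme-to-viseme table, so the names and timings below are hypothetical), a timed phoneme list can be mapped onto mouth shapes roughly like this:

```python
# Illustrative sketch only: the real FXE format is proprietary and the
# Lipsyncer is written in C#. This table and these names are hypothetical.

# A handful of phonemes mapped to mouth shapes (visemes).
PHONEME_TO_VISEME = {
    "AA": "open",      # as in "father"
    "IY": "wide",      # as in "see"
    "UW": "round",     # as in "boot"
    "M":  "closed",    # as in "mom"
    "F":  "teeth",     # as in "fee"
}

def visemes_for(phonemes):
    """Turn a timed phoneme list into timed viseme keyframes."""
    return [(t, PHONEME_TO_VISEME.get(p, "rest")) for t, p in phonemes]

# e.g. recognition output as (start-time-in-seconds, phoneme) pairs:
track = visemes_for([(0.00, "M"), (0.12, "AA"), (0.31, "M")])
print(track)  # [(0.0, 'closed'), (0.12, 'open'), (0.31, 'closed')]
```

The engine then interpolates between keyframes like these to animate the mouth.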

why is text important? why can’t the stupid computer just figure it out from the wavefile?

Because… the spoken voice has an infinity of nuances. Intonation, pronunciation, and the recording itself all affect how well a SpeechRecognitionEngine can interpret what is said. So the Lipsyncer makes two passes: the first pass tries to get a Recognition from a dictation lexicon, then a second pass tries to get a Recognition based only on any provided text. The results of both passes are displayed and you can pick the best one.
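The Lipsyncer shows both passes and lets the user choose. Purely as a sketch of the idea (my illustration, not the tool's behavior), two recognition results could also be ranked automatically by their similarity to the provided text:

```python
# Sketch: rank recognition results against the user-provided text.
# The strings below are hypothetical recognizer outputs, not real ones.
import difflib

def score(candidate, provided_text):
    """Similarity of a recognition result to the provided text (0..1)."""
    return difflib.SequenceMatcher(
        None, candidate.lower(), provided_text.lower()).ratio()

provided = "well met traveler"
pass1 = "well met travel or"   # dictation-lexicon pass (hypothetical)
pass2 = "well met traveler"    # text-grammar pass (hypothetical)

best = max([pass1, pass2], key=lambda c: score(c, provided))
print(best)  # well met traveler
```

A human still beats this heuristic, which is presumably why the tool shows both results instead of choosing silently.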

what is a SpeechRecognitionEngine?

Ah, there’s the rub. The Lipsyncer is NOT a SpeechRecognitionEngine. SpeechRecognitionEngines ship with Windows, and which ones you get depends on what language of Windows you’re using. You may or may not be able to download additional SpeechRecognitionEngines for your OS.

what? why so many SpeechRecognitionEngines?

Because each one is designed to be used in a specific language or set of languages. There may be universal SpeechRecognitionEngines but I haven’t seen any in my limited investigations yet.

To find what SpeechRecognitionEngines are currently installed on your computer, go to ControlPanel|SpeechRecognition|AdvancedSpeechOptions|SpeechRecognition|Language. The dropdown lists your SpeechRecognizers with their languages.

At present, the Lipsyncer works only with engines listed as “Microsoft Speech Recognizer”. You don’t have to change anything in the dropdown as far as the Lipsyncer is concerned; the Lipsyncer has its own dropdown list that works independently – that is, changing the SpeechRecognizer in the Lipsyncer’s dropdown does NOT affect that of your operating system or vice versa.

However. I have a recognizer for EnglishUS and another for EnglishGB on my system. The US version works fine but I get garbled results with the GB version. Go figure, because at this point I can’t even assume that an EnglishUS version on say a French Windows OS is the same EnglishUS version on a US Windows OS. It’s spaghetti, especially when one tries to account for the progression of SpeechRecognition software over the years.

technical: There are several .NET namespaces that can be used. “System.Speech”, “Microsoft.Speech”, and the one that 0100010 chose for the FxeGenerator, “Interop.SpeechLib”. There also appear to be an increasing number of independently produced (non-Microsoft) APIs available.

They each appear to have their specialties and idiosyncrasies. The priority seems geared toward telling your computer what to do; that’s not what we want here.

“Interop.SpeechLib” interfaces quite well through SAPI 5.4 with the EnglishUS SpeechRecognizer that’s on my machine. But a newer platform might be better for the future … because the availability of SpeechRecognizers for various languages is limited at present. Your version of Windows, in your language, might not even have one.

I don’t have a release of 0100010’s FXE Generator yet. But if you have custom voiceovers and want to demo a debug version of the rewritten and upgraded FXE Lipsyncer, just send me a PM.

Windows w/ .NET 3.5 req’d


First success at French transcription …

kudos to 4760 for all testing and feedback  :)


Not only does the tool create an fxe file, but it also makes the lips match the sound.
Congratulations @kevL_s


trial release

2020 may 18


Tested with French lines of dialog:

  1. it never crashed,
  2. it always succeeded in creating an fxe file (even if sometimes I had to change to phonetic spelling),
  3. the lip movements match the sound.

Awesome work!


as noted by 4760, sometimes punctuation hinders recognition



and sometimes spelling should be altered:



However, I suggest first attempting a generation with grammatically correct punctuation and spelling. A speech-recognition engine will assume that spelling and punctuation are correct and use them if it can (I believe).


phoneme editing ?



That’d be awesome! There’s only one correct spelling, but if it doesn’t work, it might not be that easy to find the changes the SpeechRecognition will accept.




“You don’t even know what that means, do you? You’re just having fun with your hands.”


because why not /eh

“I … thank you.”


progress: I’m confused by the data from the speech engine, specifically the undocumented format of its proprietary filestream … e.g. the lines above have a fudged x-offset; it’s my estimation based on amplitude instead of a direct value from the speech engine

It seems best to crop custom vo-files close to the start of the first phone (to avoid audio lagging behind lip movements).

la da de da

by heck or high water …


there’s a problem and a workaround …

If there’s silence at the beginning of a vo-file, the SR-engine ignores it and thinks that the first phoneme starts at t=0 sec.

I believe that throws the visemes out of sync with the phones in-game. (haven’t checked that specifically yet) Unfortunately the SR-engine makes things difficult when i try to determine the duration of any beginning silence …
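One rough way to estimate the beginning silence yourself, independent of the SR-engine, is to scan the wave samples for the first one above an amplitude threshold. A minimal Python sketch (the threshold value and the 16-bit mono PCM assumption are mine):

```python
# Sketch: estimate leading silence by amplitude, not via the SR-engine.
# Assumes 16-bit mono PCM; the threshold of 500 is an arbitrary guess.
import io
import struct
import wave

def leading_silence_seconds(wav_bytes, threshold=500):
    """Return the time of the first sample whose |amplitude| > threshold."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i / rate
    return len(samples) / rate  # the whole file is silence

# Build a test clip in memory: 0.5 s of silence at 8 kHz, then one loud sample.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(struct.pack("<4000h", *([0] * 4000)))  # 0.5 s silence
    w.writeframes(struct.pack("<h", 20000))              # first "phone"

print(leading_silence_seconds(buf.getvalue()))  # 0.5
```

A fixed threshold is crude (breathy or noisy recordings will fool it), which is presumably why a user-tweakable value is the safer design.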

hence, “sync Delay”: i’ll hardcode an estimation but allow the user to tweak it.

I could even implement a Trim operation via NAudio
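NAudio is a C# library, but the trim itself is just frame-slicing. A plain-Python analog of the idea, assuming 16-bit mono PCM wav data (my sketch, not the tool's code):

```python
# Sketch: drop the first `seconds` of audio from wav bytes.
# A Python analog of what an NAudio-based Trim might do in the C# tool.
import io
import struct
import wave

def trim_leading(wav_bytes, seconds):
    """Return new wav bytes with the first `seconds` of audio removed."""
    with wave.open(io.BytesIO(wav_bytes)) as src:
        params = src.getparams()
        src.readframes(int(seconds * params.framerate))  # skip this much
        rest = src.readframes(params.nframes)            # keep the remainder
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)  # header frame-count is fixed up on close
        dst.writeframes(rest)
    return out.getvalue()

# Build a 1 s silent clip at 8 kHz, then trim 0.5 s off the front.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(struct.pack("<8000h", *([0] * 8000)))

trimmed = trim_leading(buf.getvalue(), 0.5)
```

Trimming the file itself would make the sync Delay unnecessary for that file, at the cost of modifying the user's audio.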
