Voice over Wi-Fi with ESP8266


I was reading about this CIA hack, published by WikiLeaks, that uses a Samsung Smart TV to record audio in order to potentially spy on its users: https://wikileaks.org/ciav7p1/cms/page_12353643.html
I wondered what it would take to build a similar spy microphone that could send audio over Wi-Fi and be cheap and small enough to hide.
What I came up with is a system that uses one ESP8266 (NodeMCU) board, one PIC18F25K22 microcontroller and one MCP6024 amplifier to prove that this is possible.
The system is pretty cheap but not very small.

I built this system on a breadboard and, as expected, it is bulky. For a proof of concept I think it’s good enough.
It could be made smaller by using only the ESP8266 chip instead of the whole NodeMCU board, and by replacing the PIC and the audio amplifier with smaller chips, maybe their SMD versions or other small chips with the same capabilities. A small battery would also be necessary. Of course, all of this would be soldered on a custom PCB.


The system works like this:
– the PIC microcontroller samples audio from the microphone at 10 kHz with 8-bit resolution and puts the samples in a 512-byte buffer
– when the buffer is full, the PIC pulls down a GPIO connected to the NodeMCU board
– the GPIO triggers an interrupt on the ESP8266 MCU of the NodeMCU board, and the ESP8266 reads the 512 bytes of audio data from the PIC over SPI
– the ESP8266 then sends the audio data over Wi-Fi to a listening TCP server
– the TCP server can record or play the audio it receives
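
The buffering step on the PIC side can be sketched in plain C. This is only a minimal model, not the actual firmware: it assumes a ping-pong (double) buffer so that sampling can continue while the ESP8266 reads the full half, and the names sampler_t, sampler_push and sampler_take are made up for illustration. The "ready" flag stands in for the GPIO that is pulled low.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AUDIO_BUF_SIZE 512u  /* buffer size from the description above */

/* Hypothetical model of the sampler state (assumption: two buffers). */
typedef struct {
    uint8_t  buf[2][AUDIO_BUF_SIZE];
    unsigned fill;   /* index of the buffer currently being filled (0 or 1) */
    size_t   count;  /* samples already stored in that buffer */
    bool     ready;  /* models the "data ready" GPIO being pulled low */
} sampler_t;

/* Call once per sample period (every 100 us at 10 kHz).
 * Returns true when this sample completed a 512-byte buffer. */
static bool sampler_push(sampler_t *s, uint8_t sample)
{
    assert(s != NULL);
    s->buf[s->fill][s->count++] = sample;
    if (s->count < AUDIO_BUF_SIZE) {
        return false;
    }
    s->count = 0;
    s->fill ^= 1u;   /* keep sampling into the other buffer */
    s->ready = true; /* on real hardware: pull the GPIO low here */
    return true;
}

/* Call when the master starts reading: returns the buffer that was just
 * completed and clears the ready flag. */
static const uint8_t *sampler_take(sampler_t *s)
{
    assert(s != NULL);
    s->ready = false;
    return s->buf[s->fill ^ 1u];
}
```

On the real PIC, sampler_push would be called from the ADC or timer interrupt, and the SPI slave code would send out the bytes returned by sampler_take.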

The code for the PIC microcontroller is written in C and compiled with Microchip’s free XC8 compiler.
The code for the ESP8266 is written in C using the API from the ESP8266 NON-OS SDK.

The sampling rate is 10 kHz, which is OK for voice but not good enough for music.
The sampling rate used in classic telephony is 8 kHz, and you can understand what the other person is saying and even identify them by their voice.
A low-pass filter is necessary to remove the components above 5 kHz (half the sampling rate) before sampling, to avoid aliasing.
My schematic doesn’t include this low-pass filter.
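
To see why the filter matters, here is a small sketch that computes where a tone ends up after sampling. The function name alias_hz is made up for this example; it just folds the input frequency around multiples of the sampling rate, which is the standard way to find the apparent frequency of an undersampled tone.

```c
#include <assert.h>
#include <stdint.h>

/* Apparent frequency (in Hz) of a pure tone f_hz after sampling at fs_hz.
 * Anything above fs_hz / 2 folds back into the 0..fs_hz/2 band. */
static uint32_t alias_hz(uint32_t f_hz, uint32_t fs_hz)
{
    assert(fs_hz > 0);
    uint32_t r = f_hz % fs_hz;                /* position inside one fs period */
    return (r > fs_hz / 2u) ? fs_hz - r : r;  /* fold the upper half down */
}
```

With a 10 kHz sampling rate, alias_hz(4000, 10000) stays at 4000 Hz, but alias_hz(7000, 10000) gives 3000 Hz: a 7 kHz sound that reaches the ADC unfiltered is recorded as a 3 kHz tone mixed into the voice band.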

Since the audio sample buffer is 512 bytes and the sample rate is 10 kHz, a buffer fills in 512 / 10000 s = 51.2 ms, so the time needed to transfer the audio samples from the PIC to the ESP and from the ESP to the TCP server over Wi-Fi needs to be less than 51.2 ms.
The time to transfer the audio data over SPI is ~13 ms and the time to transfer the data over Wi-Fi is usually less than 10 ms.
This is well below our limit of 51.2 ms.
But sometimes the transfer over Wi-Fi takes much longer, even 2-3 seconds, which means that we can’t respect the 51.2 ms limit, so buffer underruns happen.
These buffer underruns cause a pause and a glitch in the audio, but after that the audio recovers OK.
It’s strange that the transfer over Wi-Fi usually takes less than 10 ms but sometimes blocks for 2-3 seconds.
In order to measure the transfer time over Wi-Fi, I used a GPIO that I pull low when starting a data transfer.
When the transfer is finished, the ESP executes a callback function where I put the GPIO back high.
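
The timing budget above is simple arithmetic, and a short sketch makes it easy to try other buffer sizes or sample rates. The function names are made up for this example, and it assumes one byte per sample, as in this project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Time in microseconds needed to fill a buffer of buf_bytes
 * one-byte samples at sample_rate_hz. */
static uint32_t buffer_period_us(uint32_t buf_bytes, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (uint32_t)(((uint64_t)buf_bytes * 1000000u) / sample_rate_hz);
}

/* True if the SPI read plus the Wi-Fi send finish before the next
 * buffer is full. */
static bool transfer_fits(uint32_t spi_us, uint32_t wifi_us, uint32_t period_us)
{
    return (uint64_t)spi_us + wifi_us < period_us;
}
```

buffer_period_us(512, 10000) gives 51200 us. With the measured values, transfer_fits(13000, 10000, 51200) is true, while a 2 second Wi-Fi stall, transfer_fits(13000, 2000000, 51200), is false.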

You can see the communication between the PIC and the ESP in the logic analyzer capture below.

The ESP8266 also controls the reset pin of the PIC.
When the ESP8266 starts, it holds the PIC in reset and only releases it once the ESP has managed to connect to the TCP server.
This way, if the ESP8266 powers up or resets, it can make sure that it’s always in sync with the PIC.

If the ESP8266 disconnects from the TCP server, it will continuously try to re-connect.

The software for both the PIC and the ESP can be downloaded from here

Playing/recording audio
In order to transfer the audio data over Wi-Fi, the ESP connects to a Wi-Fi hotspot.

After the Wi-Fi connection is successfully established, the ESP tries to connect to a TCP server, using its IP address and port 1235.
The listening TCP server is actually very simple to set up from the Linux command line.
It uses the versatile “nc” command, which can open a TCP/UDP server or client and transfer data.
In order to record audio data we use the command:

nc -l 1235 > audio-data

This command will put all the received samples in a file named audio-data.

To play the recorded audio data we can pipe the samples from the audio-data file to the “aplay” command:

cat audio-data | aplay -r 10000 -f U8

The “-r 10000” parameter means a 10 kHz sample rate and the “-f U8” parameter means that the samples are unsigned 8-bit values.

Or if we need to listen live, we can use the combination of “nc” and “aplay” commands:

nc -l 1235 | aplay -r 10000 -f U8
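
The audio-data file is raw samples with no header, which is why the rate and format have to be passed to aplay every time. One optional extra, not part of the original setup, is to put a standard 44-byte WAV header in front of the samples so that any player can open the file. The sketch below builds such a header for mono, unsigned 8-bit PCM; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WAV_HEADER_SIZE 44u

/* WAV fields are little-endian. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)(v >> 24);
}

/* Fill out[] with a canonical PCM WAV header for mono, unsigned 8-bit
 * audio. data_len is the number of sample bytes that will follow. */
static void wav_header_u8_mono(uint8_t out[WAV_HEADER_SIZE],
                               uint32_t sample_rate, uint32_t data_len)
{
    assert(out != NULL);
    memcpy(out, "RIFF", 4);
    put_le32(out + 4, 36u + data_len);  /* bytes remaining after this field */
    memcpy(out + 8, "WAVE", 4);
    memcpy(out + 12, "fmt ", 4);
    put_le32(out + 16, 16u);            /* size of the PCM fmt chunk */
    put_le16(out + 20, 1u);             /* audio format 1 = PCM */
    put_le16(out + 22, 1u);             /* channels: mono */
    put_le32(out + 24, sample_rate);
    put_le32(out + 28, sample_rate);    /* byte rate: 1 channel * 1 byte */
    put_le16(out + 32, 1u);             /* block align: 1 byte per frame */
    put_le16(out + 34, 8u);             /* bits per sample */
    memcpy(out + 36, "data", 4);
    put_le32(out + 40, data_len);
}
```

Writing this header followed by the contents of audio-data into a .wav file gives a recording that aplay can play without the -r and -f parameters, since 8-bit WAV samples are unsigned just like the U8 format used above.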

Encountered problems

By far the worst problem encountered was the SPI interface on the ESP8266.
I configured the ESP8266 as an SPI master and the PIC as a slave.
The code I used for the SPI interface was the one from the SDK example, with some modifications.
When transferring the 512 bytes from the PIC over SPI, the ESP would reset.
I tried different SPI clock speeds, from 8 MHz down to 500 kHz. The result was the same.
If I transferred fewer bytes over SPI, everything was OK, but 512 bytes would always reset my ESP.
Then I tried to use the high-speed UART interface between the two CPUs and the result was the same: transferring 512 bytes resets the ESP.
So no SPI and no UART; how can I transfer data from one CPU to the other?
The solution was to bit-bang the SPI interface. This way, even though the communication is slower, at least I can transfer all the data without my ESP crashing.
I can’t understand why transferring this amount of data over SPI or UART would reset the ESP, but without a proper datasheet it’s hard to tell.
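
For readers who have not bit-banged SPI before, here is a minimal sketch of the idea. It is not the code from this project: it assumes SPI mode 0 (clock idle low, data sampled on the rising edge), MSB first, and it reaches the pins through function pointers so that the same loop can run on a PC. The spi_pins_t type, the spi_bb_* functions and the loopback stand-in (MOSI wired straight to MISO) are all made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pin access is abstracted so the loop does not depend on one SDK.
 * On the ESP8266 these would wrap the SDK's GPIO calls. */
typedef struct {
    void (*set_sck)(bool level);
    void (*set_mosi)(bool level);
    bool (*get_miso)(void);
} spi_pins_t;

/* Exchange one byte, MSB first, SPI mode 0. */
static uint8_t spi_bb_transfer(const spi_pins_t *p, uint8_t out)
{
    uint8_t in = 0;
    assert(p != NULL);
    for (int bit = 7; bit >= 0; bit--) {
        p->set_mosi(((out >> bit) & 1u) != 0); /* present the next bit */
        p->set_sck(true);                      /* rising edge: data is sampled */
        in = (uint8_t)((in << 1) | (p->get_miso() ? 1u : 0u));
        p->set_sck(false);                     /* falling edge: next bit is set up */
    }
    return in;
}

/* Read len bytes from the slave by clocking out zeros. */
static void spi_bb_read(const spi_pins_t *p, uint8_t *dst, size_t len)
{
    assert(p != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        dst[i] = spi_bb_transfer(p, 0x00);
    }
}

/* Host-side stand-in for real pins: MOSI is looped back to MISO. */
static bool loop_wire;
static void loop_set_sck(bool level)  { (void)level; }
static void loop_set_mosi(bool level) { loop_wire = level; }
static bool loop_get_miso(void)       { return loop_wire; }

static const spi_pins_t loopback_pins = {
    loop_set_sck, loop_set_mosi, loop_get_miso
};
```

With the loopback pins, spi_bb_transfer(&loopback_pins, 0xA5) returns 0xA5, which is a quick way to check the bit order. On real hardware a short delay after each clock edge is usually needed so that the slave can keep up.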

Unfortunately, the ESP8266 datasheet has only a very superficial description of the component blocks and almost nothing about the registers.
This is not really a datasheet but more of an overview of the device.
Also, the API code from the SDK is shipped as precompiled libraries, so you can’t see the source.
The ESP is not really that Open Source friendly. :(

The schematic looks ghetto, but it’s easy to understand.

5 comments on “Voice over Wi-Fi with ESP8266”

  1. bushra September 6, 2017 3:32 pm

    my project is same …can you share the source code
    i have to send voice data between 2 esp8266. time is short :/

  2. Andriy March 14, 2018 6:43 am

    Nice project!
    Why do you need to use second microcontroller here? Can’t ESP8266 do sampling alone?

    • admin March 14, 2018 7:57 am

      Hi Andriy,

      You can’t use the ESP alone because it has a very low sampling rate.
      Another option would be to use ESP with an I2S microphone https://www.adafruit.com/product/3421 as others did.
      Of course, the I2S microphone has a microcontroller of some kind in it.


  3. vlad April 30, 2018 3:24 am


    Very nice project, i have installed all the software and configured the services as per your tutorial. Still i have some issues and i am kindly asking for your help. When i power the nodemcu and i read the serial port i get “shit’s broken” until i start the tcp server. After that it connects and it creates the file audio-data, but the file contains one and the same character repeated many times. I got different files of 95-100k. After a few seconds it looks like the nodemcu reboots (wdt reset) and i get the following output on serial. What might be the problem ? Thanks for your support.

    Now we are connected to TCP server

    ets Jan 8 2013,rst cause:4, boot mode:(3,4)

    wdt reset
    load 0x40100000, len 27880, room 16
    tail 8
    chksum 0x2e
    load 0x3ffe8000, len 900, room 0
    tail 4
    chksum 0x21
    load 0x3ffe8390, len 1260, room 4
    tail 8
    chksum 0xf0
    csum 0xf0
