The proposed audio codec (audio compression and decompression method) is optimized for voice.
It is based on the FFT (Fast Fourier Transform), on the points of greatest magnitude and on the most energetic bands. It can also use only the local peaks.
It splits each audio frame into two planes: a foreground (the points of greatest magnitude) and a background (the most energetic bands).
In the simplest and fastest version, one must:
- Choose a sampling frequency (e.g. 8 or 16 kHz).
- Choose an FFT buffer size (e.g. 256 or 512).
- Choose a number N of largest local peaks to keep (e.g. 16).
See notes below.
- Apply an FFT to each buffer.
- Encode N magnitudes (for example using the logarithm).
- Send N positions or relative positions and N encoded magnitudes.
The positions represent the frequencies and must be accurate.
- For decompression, for each non-zero point, set the amplitude of the cosine
to 0 and the amplitude of the sine to the value of the decoded magnitude.
- For the decompression, perform the inverse FFT (iFFT).
There is no need to take into account phases or magnitude sign.
There is no frame overlapping.
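The steps of this simplest version can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `encode_frame`, `decode_frame` and `STEP`, the 8-bit logarithmic code and the small pure-Python FFT are all assumptions made for the example, and a full sort is used for brevity.

```python
import cmath
import math

STEP = 8  # hypothetical resolution: 8 codes per octave of magnitude


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT computed with the forward FFT and conjugation."""
    n = len(spectrum)
    y = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]


def encode_frame(samples, n_peaks=16):
    """Return up to n_peaks (position, code) pairs, sorted by position."""
    half = len(samples) // 2
    mags = [abs(v) for v in fft(samples)[:half]]
    # Local peak: magnitude larger than both immediate neighbours.
    peaks = [k for k in range(1, half - 1)
             if mags[k - 1] < mags[k] > mags[k + 1]]
    # Full sort for brevity only; a partial selection would be faster.
    strongest = sorted(peaks, key=lambda k: mags[k], reverse=True)[:n_peaks]
    # Assumed logarithmic code, clipped to one byte.
    return sorted((k, min(255, round(STEP * math.log2(1 + mags[k]))))
                  for k in strongest)


def decode_frame(frame, size):
    """Rebuild `size` samples: cosine part 0, sine part = decoded magnitude."""
    spectrum = [0j] * size
    for k, code in frame:
        if code == 0:
            continue  # only non-zero points contribute
        mag = 2 ** (code / STEP) - 1
        spectrum[k] = complex(0, -mag)        # pure sine component
        spectrum[size - k] = complex(0, mag)  # mirror bin keeps the output real
    return [v.real for v in ifft(spectrum)]
```

With a 256-sample frame holding a single tone centred on bin 10, `encode_frame` keeps position 10 and `decode_frame` rebuilds the tone to within the precision of the assumed logarithmic code.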
The algorithm is very effective against very low-level background noise, since it ignores it.
However, for higher quality and in order to remove all the noise caused by edge effects, one must take into account the phases of the first or the greatest points or local peaks, and implement a frame overlap of 50% or less.
Furthermore, for higher quality, one must use both the greatest points and the most energetic bands.
Note that if the algorithms are applied to non-audible signals with frame overlapping, the importance of the phases and the laterals can be negligible. If only the energy or the frequencies are needed, the phases and the frame overlapping can be ignored.
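To illustrate why a 50% frame overlap can remove edge effects, here is a minimal overlap-add sketch. The periodic Hann window is an assumption (the text does not name a window), and `hann` and `overlap_add` are hypothetical names. With this window and a hop of half a frame, the overlapping windows sum to 1, so interior samples are rebuilt exactly when the frames are left unmodified.

```python
import math


def hann(size):
    """Periodic Hann window; two copies shifted by size/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / size) for i in range(size)]


def overlap_add(signal, size=8):
    """Window the signal into 50%-overlapping frames and add them back."""
    hop = size // 2
    window = hann(size)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(size)]
        # A codec would encode and decode `frame` here (phases included).
        for i in range(size):
            out[start + i] += frame[i]
    return out
```

Only the first and last half frames, which are covered by a single window, differ from the input.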
- In the frequency domain, a local peak is a point whose magnitude is larger than that of the points immediately to the left and right.
- The quality of the codec depends on the number of local peaks and on the quality of the encoding. Good quality is obtained with N between 8 and 32 and logarithmic encoding.
- To speed up the calculations, it is not necessary to sort all the points.
One must choose a fast algorithm that can stop once N points have been found.
- This case corresponds to having no background, or a background composed only of two-point bands. With local peaks, a band has only one useful point, so only the foreground needs to be taken into account.
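The local-peak definition and the remark about avoiding a full sort can be sketched as below. The function names are hypothetical. `heapq.nlargest` from the Python standard library is used here as one possible partial-selection algorithm, which tracks only the N best candidates instead of ordering every point.

```python
import heapq


def local_peaks(mags):
    """Positions whose magnitude exceeds both immediate neighbours."""
    return [k for k in range(1, len(mags) - 1)
            if mags[k - 1] < mags[k] > mags[k + 1]]


def largest_local_peaks(mags, n):
    """Positions of the n largest local peaks, in ascending order."""
    best = heapq.nlargest(n, local_peaks(mags), key=mags.__getitem__)
    return sorted(best)
```

For example, with magnitudes `[0, 3, 1, 5, 2, 2, 4, 1, 9, 0]` the local peaks are at positions 1, 3, 6 and 8, and the two largest are at positions 3 and 8.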
This method will be used for VLC (Very Low Consumption).
For VLR (Very Low Radiation), we add the algorithm described below.
Note that voice communications contain more than 50% silence and many successive and non-successive repeats (in voice recognition, an important preliminary step is to remove repeats in spoken sentences, for example using phoneme models).
By restricting itself to successive repeats and emitting only successive frames that are not identical, one can significantly reduce the number of audio frames emitted, and therefore the electromagnetic emissions of mobile phones.
- One must choose a repeats credit C (e.g. 31).
The receiver is allowed to repeat the last frame up to C times, until it receives another frame carrying the remaining repeats credit to cancel.
- If the repeats credit is exhausted, the transmitter sends the current frame with the repeats credit to cancel set to zero.
- If the repeats credit is not exhausted, the transmitter compares the current frame with the previous frame (each frame contains N positions and N magnitudes).
- If both frames are identical or almost identical (the similarity index is to be defined by the sender), the transmitter sends nothing.
- If the two frames are different, the transmitter sends the current frame with a number indicating the remaining repeats credit to cancel.
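One possible reading of these rules is sketched below. The class and method names are hypothetical, the similarity test defaults to plain equality, and the "previous frame" is taken to be the last frame actually sent, which is an assumption. The receiver in this sketch simply restarts its credit on every packet and does not check the cancelled-credit field.

```python
class Transmitter:
    def __init__(self, credit=31, similar=None):
        self.credit = credit
        self.remaining = 0          # exhausted at start: first frame is sent
        self.last = None
        self.similar = similar or (lambda a, b: a == b)

    def push(self, frame):
        """Return (frame, credit_to_cancel) to send, or None to stay silent."""
        if self.remaining > 0 and self.similar(frame, self.last):
            self.remaining -= 1     # the receiver will repeat the last frame
            return None
        cancelled = self.remaining  # zero when the credit is exhausted
        self.last = frame
        self.remaining = self.credit
        return (frame, cancelled)


class Receiver:
    def __init__(self, credit=31):
        self.credit = credit
        self.remaining = 0
        self.last = None

    def tick(self, packet):
        """Called once per frame period; returns the frame to play, or None."""
        if packet is not None:
            self.last, _cancelled = packet
            self.remaining = self.credit
            return self.last
        if self.remaining > 0:
            self.remaining -= 1
            return self.last
        return None
```

With a credit of 2 and the input sequence a, a, a, a, b, b, the transmitter sends only three packets (the first a, a refresh of a once the credit is exhausted, and the first b), while the receiver still plays all six frames.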
The magnitudes and the relative positions are highly redundant, especially at an 8 kHz sampling rate and with low-precision magnitudes. For communications, a frame-by-frame lossless compression will be added. For files and safe media, LZW lossless compression can be added. For safe media, a group of frames will be used to build the dictionary, with a complete flush of the compressed data between frames.
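The idea of a dictionary shared across a group of frames, with a flush between frames, can be illustrated as follows. This sketch substitutes DEFLATE (Python's standard `zlib` module) for LZW, since the standard library has no LZW coder. The one-byte layout of `pack_frame` and all function names are assumptions. `Z_SYNC_FLUSH` makes each frame decodable as soon as it arrives while keeping the dictionary built from earlier frames.

```python
import zlib


def pack_frame(positions, codes):
    """Serialise a frame: delta-coded ascending positions, then magnitude codes.

    Assumes every delta and every code fits in one byte.
    """
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return bytes(deltas) + bytes(codes)


def compress_stream(frames):
    """Compress frames with one shared dictionary, flushing after each frame."""
    comp = zlib.compressobj()
    return [comp.compress(f) + comp.flush(zlib.Z_SYNC_FLUSH) for f in frames]


def decompress_stream(chunks):
    """Recover the frames one by one from the flushed chunks."""
    decomp = zlib.decompressobj()
    return [decomp.decompress(c) for c in chunks]
```

Each element returned by `compress_stream` can be transmitted on its own, and the receiver obtains the corresponding frame from `decompress_stream` without waiting for later data.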
- If there are 31 frames per second in normal mode, a repeats credit of 31 represents one second and a repeats credit of 62 represents two seconds. One can easily achieve very low rates with these values.
- This algorithm is chiefly useful for the small buffers, the silences and the stationary parts.
- Assuming 31 frames per second and a repeats credit of 127 or 255 (7 or 8 bits):
  - in silent areas, one byte is transmitted about every 4 or 8 seconds;
  - in stationary areas, one frame is transmitted about every 4 or 8 seconds.
- If very small FFT buffers are used, successive frames of voice and music are in fact very redundant. If latency matters more than the instantaneous compression ratio (notably for some voice communications), these algorithms provide low latency while emitting as few frames as possible and keeping a good average compression ratio.
- If HRTF filters are used (as with the spatial audio in the binaural listening), the VLC and VLR algorithms do not increase the algorithmic latencies, since there is no need to redo FFT and iFFT (inverse FFT) in order to apply these filters in the frequency domain.
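The durations quoted in the notes on the repeats credit follow from dividing the credit by the frame rate. A small check, with a hypothetical helper name:

```python
def seconds_per_credit(credit, frames_per_second=31):
    """Time covered by `credit` repeated frames, in seconds."""
    return credit / frames_per_second


for bits in (5, 7, 8):
    credit = 2 ** bits - 1  # largest credit that fits in `bits` bits
    print(bits, credit, round(seconds_per_credit(credit), 2))
```

At 31 frames per second, credits of 31, 127 and 255 cover about 1.0, 4.1 and 8.2 seconds respectively.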
- The so-called VLC method is patented in France and is being studied in the U.S. (USPTO).
- The so-called VLR method is being studied in France (INPI).
- A proprietary, non-optimized implementation of VLC can be found in the WhMic software (running on Windows) as WHM Voice. One must install the program in demo mode (without the Web Server option), choose the WHM Voice codec to communicate, and possibly change the default sampling rate (which is 22 kHz).
- A portable implementation can be found in the PJSIP library at the following address:
vlrPhone - PJSIP