The VLC and VLR codecs and the Emotion

Taking the Emotion into Account

This document describes how the VLC and VLR codecs can be used to take emotion into account.
For more information on the VLC and VLR codecs, see the following pages:
   Algorithms
   Home Page

The codebook version and the unilateral codebook version are used in particular.
For more information on these versions, see the following pages:
   Codebook Version
   Unilateral Codebook Version

There are several databases, located locally or on remote servers. Each database represents a complete codebook and contains only one type of emotion.
Communication uses the VLC and VLR codecs, including a codebook version. For each frame, the similarity searches in the databases use two kinds of vectors: the vectors of magnitudes, which represent the timbre, and the vectors of positions, which represent the frequencies.
On receipt of a frame, a similarity search request is sent to the databases. For each type of vector, the answer can be a discrete value (there is, or is not, a neighboring vector in the database) or a floating-point value (a real number giving the distance between the vector and its nearest neighbor in the database).
The set of answers is sent to a classifier, which decides the type of emotion. The classifier is trained beforehand on the database data.
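
The search-then-classify flow can be sketched in Python. This is only an illustration under stated assumptions: the database layout, the Euclidean distance, the `threshold` parameter and the toy values are all hypothetical, and the `classify` function is a simple stand-in (smallest summed distance) rather than the trained classifier described above. In practice the two distances live on different scales and would be normalized before being combined.

```python
import math

def nearest_distance(vector, codebook):
    """Euclidean distance from `vector` to its nearest neighbor in `codebook`."""
    return min(math.dist(vector, entry) for entry in codebook)

def emotion_features(magnitudes, positions, databases, threshold=None):
    """Query every emotion database with both vectors of one frame.

    Returns {emotion: (magnitude_answer, position_answer)}.
    With a threshold the answers are discrete (True if a neighbor lies
    within the threshold); without one they are floating-point distances.
    """
    features = {}
    for emotion, db in databases.items():
        d_mag = nearest_distance(magnitudes, db["magnitudes"])
        d_pos = nearest_distance(positions, db["positions"])
        if threshold is not None:
            features[emotion] = (d_mag <= threshold, d_pos <= threshold)
        else:
            features[emotion] = (d_mag, d_pos)
    return features

def classify(features):
    """Stand-in classifier: pick the emotion with the smallest summed distance."""
    return min(features, key=lambda emotion: sum(features[emotion]))

# Toy databases (hypothetical values), one codebook per emotion.
databases = {
    "joy": {"magnitudes": [(0.9, 0.4, 0.1)], "positions": [(12, 30, 55)]},
    "anger": {"magnitudes": [(0.5, 0.8, 0.6)], "positions": [(8, 21, 40)]},
}

frame_magnitudes = (0.85, 0.45, 0.1)
frame_positions = (12, 31, 55)

features = emotion_features(frame_magnitudes, frame_positions, databases)
print(classify(features))  # the frame is closest to the "joy" codebook
```

Passing `threshold=2` instead returns the discrete form of the answers, one pair of booleans per database.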



One can use multiple databases, located locally or on remote servers, to add emotion to voice generators. The Tacotron 2 project (a neural network architecture) generates synthetic voices that are almost identical to real voices. It maps character sequences to Mel-scale spectrograms, which a vocoder then converts into time-domain signals.
For more information on the Tacotron 2 project, see the following page:
   Tacotron 2

The vector of magnitudes and the vector of positions of the VLC and VLR codecs can be used to create such spectrograms. To change the type of emotion, one simply changes the database. The low-quality versions of the VLC and VLR codecs can also be improved by adding these synthesis methods. The other method is to use directly the time-domain frames that were used to generate the databases.



Emotion can also be read on the face. Smartphone cameras can transmit pictures of the face, so they can transmit the emotion as well. Our codecs were developed for audio but can also be used for images. An image is treated as a set of lines (horizontal or vertical). An FFT is performed on each line (horizontally or vertically), then an FFT is performed on each line of the result in the other direction (vertically or horizontally) to form the k-space (after a few other small changes).
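
The two passes can be sketched as follows. This is a minimal illustration, not the codecs' implementation: it uses a naive DFT from the standard library in place of an FFT (the values are the same), and the `center` step, which moves the zero-frequency term to the middle of the array, is only an assumption about one of the "small changes" mentioned above. All function names are hypothetical.

```python
import cmath

def dft(samples):
    """Naive 1D discrete Fourier transform (O(n^2)); an FFT gives the same values."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

def k_space(image):
    """Transform every row, then every column of the intermediate result."""
    rows_done = [dft(row) for row in image]
    cols_done = [dft(col) for col in transpose(rows_done)]
    return transpose(cols_done)

def center(spectrum):
    """Move the zero-frequency term to the middle of the array."""
    h, w = len(spectrum), len(spectrum[0])
    return [
        [spectrum[(r - h // 2) % h][(c - w // 2) % w] for c in range(w)]
        for r in range(h)
    ]

# Tiny 4x4 grayscale image (hypothetical pixel values).
image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]

centered = center(k_space(image))
total = sum(sum(row) for row in image)
# After centering, the middle sample holds the sum of all pixels.
print(abs(centered[2][2] - total) < 1e-9)
```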

Most of the information lies in the center of the k-space. A line of the k-space passing through the center carries information about the whole image. The VLC and VLR algorithms can be applied to each line of the k-space, in particular to a central line. As with audio, one can generate databases to identify the type of emotion from images of the face. Generally, the image is preprocessed so that only the edges remain: low-contrast points are removed and only the contours are kept.

Using the properties of the FFT, one can build databases with a triple invariance (translation, rotation and scaling):
- In the k-space, the magnitudes are invariant under simple translations.
- A rotation of the image corresponds to a rotation of the points of the k-space. For rotation invariance, it suffices to store in the database as many lines Li as possible, each passing through the center and making an angle Ai with the horizontal or vertical axis.
- For scale invariance, one must store as many different scales as possible.
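
The first property can be checked numerically. The sketch below assumes a circular (wrap-around) translation, for which the invariance of the magnitudes is exact; for a real photograph, where content enters or leaves the frame, it holds only approximately. The function names and pixel values are hypothetical.

```python
import cmath

def dft2_magnitudes(image):
    """Magnitudes of the 2D discrete Fourier transform, computed directly."""
    h, w = len(image), len(image[0])
    magnitudes = []
    for u in range(h):
        row = []
        for v in range(w):
            total = sum(
                image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            row.append(abs(total))
        magnitudes.append(row)
    return magnitudes

def translate(image, dy, dx):
    """Circular shift of the image by dy rows and dx columns."""
    h, w = len(image), len(image[0])
    return [[image[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]

image = [
    [0, 0, 0, 0],
    [0, 5, 1, 0],
    [0, 2, 7, 0],
    [0, 0, 0, 0],
]

original = dft2_magnitudes(image)
shifted = dft2_magnitudes(translate(image, 1, 2))

# A translation changes only the phases, so the magnitudes agree
# up to floating-point rounding.
max_difference = max(
    abs(a - b)
    for row_a, row_b in zip(original, shifted)
    for a, b in zip(row_a, row_b)
)
print(max_difference < 1e-9)
```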



Notes

- Other codecs can be used for communication as long as they do not distort frequencies and magnitudes too much, because the similarity searches are based on the magnitudes of the points or local peaks. These include PCM (WAVE), Mu-Law or A-Law, and ADPCM. After decompression, one works with the VLC and VLR codecs. Some parameters must remain identical to those used to generate the databases (sampling rate, number of bits per sample, size of the FFT buffers, number of foreground points, number of background bands, ...).



- If we search for a vector that exists in a database, even a huge one, we find it very quickly. If we reuse the samples used to generate the databases, we should find the type of emotion easily and without error, even with a simple classifier.



- With voice search, one talks to remote servers. These servers can use these methods to be sensitive to emotion and refine the results.



- Similarly, chatbots can use these methods to be sensitive to emotion and refine their responses.



- In the medical field, the raw data from MRI (Magnetic Resonance Imaging) are not pixels but k-space lines. By using artificial intelligence and these methods, one can, from time to time, generate only a few lines to follow the evolution of a disease and reduce the duration of the analyses.



- In the medical field, the principle of computed tomography (CT) is based on Radon's theorem (1917), which describes how a two-dimensional geometry of an object can be reconstructed from a series of projections measured all around it. The Fourier transform of a projection corresponds to a line of the Fourier transform of the image that passes through the origin and makes an angle A with the abscissa axis (central-slice theorem).
CT uses X-rays. From time to time, one can use only a few cross-sections to follow the evolution of a disease and reduce the X-ray doses.
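
The central-slice theorem can be verified on a small example. The sketch below covers only the simplest case, A = 0: the image is projected along the vertical direction, and the 1D transform of that projection is compared with the line of the 2D transform that passes through the origin. Other angles would require rotating or interpolating the image, which is not shown. A naive DFT stands in for the FFT, and the names and pixel values are hypothetical.

```python
import cmath

def dft(samples):
    """Naive 1D discrete Fourier transform (O(n^2))."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def dft2(image):
    """2D transform: rows first, then columns. Result is indexed [u][v]."""
    rows = [dft(row) for row in image]
    cols = [dft(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

image = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [4, 0, 1, 0],
    [0, 2, 0, 5],
]

# Projection at angle A = 0: sum the pixels of each column.
projection = [sum(column) for column in zip(*image)]

slice_from_projection = dft(projection)
central_line = dft2(image)[0]  # line with zero vertical frequency

max_error = max(abs(a - b) for a, b in zip(slice_from_projection, central_line))
print(max_error < 1e-9)
```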