Microsoft has announced a new type of AI that can generate realistic-sounding speech with natural intonation. The tool is called “VALL-E”. Most strikingly, it needs to analyze only 3 seconds of a recording of a person’s voice in order to copy it convincingly.
Microsoft has not disclosed how the new AI works, does not plan to release its source code, and is unlikely to build a public commercial tool based on VALL-E. Rather, it is an experiment, an intermediate stage in the development of an add-on to another language model, GPT-3. Microsoft’s ultimate goal is probably to create a universal speech generator that could replace human work in producing arbitrary content.
The main difficulty, which the developers do not hide, is the need for markers that would help distinguish an AI-generated voice from the voices of real people. Otherwise, such a tool will quickly be adopted by malicious actors: it is enough to visit any social network page and “borrow” voice samples from users’ many personal videos. Not to mention the public speeches of politicians and celebrities: with this AI, scammers could easily place calls and impersonate a well-known person for their own gain.