OpenAI has built a voice-cloning technology, but it is not yet available.

OpenAI is actively advancing voice replication technology while upholding a strong commitment to responsible development.

OpenAI’s Voice Engine, an expansion of the company’s existing text-to-speech API, debuts today. In development for about two years, Voice Engine lets users upload a 15-second voice sample and generate a synthetic copy of that voice. However, the company has not announced a date for public availability, which gives it time to address concerns about how the model is used and misused.

“It is crucial to ensure that all stakeholders have confidence in the deployment process. We must thoroughly assess the potential risks associated with this technology and implement appropriate measures to address them,” said Jeff Harris, a member of the product staff at OpenAI, during an interview with Eltrys.

Overseeing the model’s training process
The AI model behind Voice Engine has been readily available for quite some time, according to Harris.

ChatGPT, OpenAI’s AI-powered chatbot, utilizes a similar model to power its voice and “read aloud” capabilities. Additionally, OpenAI’s text-to-speech API offers a range of preset voices. Spotify has been utilizing this technology since early September to provide dubbed podcasts in various languages for esteemed hosts such as Lex Fridman.

I inquired with Harris about the origin of the model’s training data, which seemed to be a sensitive topic. Harris simply mentioned that they trained the voice engine model using a combination of licensed and publicly accessible data.

Models like Voice Engine are trained on a vast number of examples, in this case speech recordings, typically sourced from public sites and internet datasets. Some generative AI vendors are secretive about their training data, considering it a valuable asset that gives them a competitive edge. Disclosing training data details can also invite IP-related lawsuits, which is another deterrent to revealing much.

OpenAI is currently facing a lawsuit for allegedly infringing on IP law. The lawsuit accuses the company of training its AI on copyrighted content, including photos, artwork, code, articles, and e-books, without providing proper credit or compensation to the creators or owners.

OpenAI has established licensing agreements with various content providers, such as Shutterstock and the news publisher Axel Springer. It also lets webmasters block its web crawlers from extracting data from their sites for training purposes, and it gives artists the option to exclude and delete their work from the datasets used to train its image-generating models, such as the recent DALL-E 3.

However, OpenAI does not provide an opt-out scheme for its other products. In a recent statement to the U.K.’s House of Lords, OpenAI made the bold claim that creating useful AI models without copyrighted material is an impossible task. They argued that fair use, which permits the use of copyrighted works to create something new, protects them when it comes to model training.

Bringing together different elements to create a harmonious voice
Interestingly, Voice Engine is not trained or fine-tuned on user data. The reason is the transient way the model generates speech, which combines a diffusion process and a transformer.

“We have the ability to generate realistic speech that matches the original speaker using a small audio sample and text,” explained Harris. “Once the request is complete, we discard the used audio.”

According to his explanation, the model is capable of analyzing both speech data and text data and generating a matching voice without the need for custom models for each speaker.

This technology is not groundbreaking. Several startups have been offering voice-cloning products for quite some time, including ElevenLabs, Replica Studios, Papercup, Deepdub, and Respeecher. So have Big Tech incumbents like Amazon, Google, and Microsoft, the last of which is a significant investor in OpenAI, by the way.

Harris argued that OpenAI’s approach produces speech of superior quality.

We understand that the pricing will be competitive. OpenAI omitted pricing details for Voice Engine from its recent marketing materials, but documents reviewed by Eltrys list Voice Engine at $15 per one million characters, or roughly 162,500 words. That is enough to cover Dickens’ “Oliver Twist” with room to spare. (A high-definition quality option is available at double the price, although an OpenAI spokesperson told Eltrys that there is no discernible difference between HD and non-HD voices. Interpret that as you wish.)

One million characters amounts to approximately 18 hours of audio, which puts the price at less than $1 per hour. That is considerably cheaper than what ElevenLabs, one of the more popular rival vendors, charges: $11 for 100,000 characters per month. However, there is a trade-off in terms of customization.
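The arithmetic behind that comparison can be sketched in a few lines of Python. This is only a rough illustration built from the figures quoted above; the function and variable names are made up for the example, and it assumes the ElevenLabs plan can be compared on a simple per-character basis even though it is sold as a monthly allowance.

```python
# Figures quoted in the article; all names here are illustrative.
VOICE_ENGINE_USD_PER_MILLION_CHARS = 15.0
HOURS_OF_AUDIO_PER_MILLION_CHARS = 18.0  # the article's estimate
ELEVENLABS_PLAN_USD = 11.0
ELEVENLABS_PLAN_CHARS = 100_000


def usd_per_million_chars(price_usd: float, chars: int) -> float:
    """Scale a price for `chars` characters to a price per one million characters."""
    return price_usd * 1_000_000 / chars


def usd_per_audio_hour(usd_per_million: float, hours_per_million: float) -> float:
    """Convert a per-million-character price into a price per hour of audio."""
    return usd_per_million / hours_per_million


voice_engine_hourly = usd_per_audio_hour(
    VOICE_ENGINE_USD_PER_MILLION_CHARS, HOURS_OF_AUDIO_PER_MILLION_CHARS
)
elevenlabs_per_million = usd_per_million_chars(
    ELEVENLABS_PLAN_USD, ELEVENLABS_PLAN_CHARS
)
price_ratio = elevenlabs_per_million / VOICE_ENGINE_USD_PER_MILLION_CHARS

print(f"Voice Engine: ${voice_engine_hourly:.2f} per hour of audio")
print(f"ElevenLabs:   ${elevenlabs_per_million:.2f} per million characters")
print(f"ElevenLabs costs about {price_ratio:.1f}x as much per character")
```

On these numbers, Voice Engine works out to roughly 83 cents per hour of audio, and the ElevenLabs plan is about seven times as expensive per character.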

Voice Engine does not provide any options to modify the tone, pitch, or cadence of a voice, and Harris confirmed that there are currently no fine-tuning options. However, Harris notes that any expressiveness in the 15-second voice sample carries through to the generated speech. For instance, if you speak with excitement, the synthetic voice will consistently reflect that excitement. We will assess the reading quality once it can be compared directly with other models.

Voice talent as a commodity
Salaries for voice actors on ZipRecruiter vary significantly, ranging from $12 to $79 per hour. Even at the low end, those rates are far higher than Voice Engine’s, and voice actors with agents may charge a higher price per project. If OpenAI’s tool gains popularity, it could turn voice work into a commodity. So, what is the impact on actors?

The talent industry has long been aware of the potential impact of generative AI and has been actively addressing this existential threat. Clients are increasingly asking voice actors to relinquish their voice rights, enabling them to use AI technology to create synthetic versions that could potentially replace them in the future. AI-generated speech poses a threat to voice work, particularly low-cost and beginner-level opportunities.

Now, certain AI voice platforms are attempting to find a middle ground.

Last year, Replica Studios entered into a deal with SAG-AFTRA to create and license copies of the voices of the media artist union’s members. There was some controversy surrounding the agreement. The organizations stated that it established equitable and ethical terms and conditions to guarantee the consent of performers when negotiating the use of synthetic voices in new works, such as video games.

Meanwhile, ElevenLabs hosts a marketplace for synthetic voices where users can create, verify, and share their voices publicly. When others utilize a voice, the original creators are duly compensated with a fixed dollar amount for every 1,000 characters.

OpenAI has no plans to establish any labor union deals or marketplaces in the near future. However, they do require users to obtain explicit consent from the individuals whose voices are cloned. Additionally, users must make clear disclosures to indicate which voices are AI-generated and agree not to use the voices of minors, deceased individuals, or political figures in their generated content.

“We are closely monitoring how this intersects with the voice actor economy and are genuinely curious about it,” Harris stated. “I believe there will be ample opportunities to expand your reach as a voice actor with this technology. However, this is all information that will be acquired through hands-on experience as individuals begin to implement and experiment with the technology.”

Exploring the ethical implications of deepfakes
Voice cloning apps have been misused in various ways that extend far beyond the potential harm to actors’ careers.

Users of 4chan, an online message board notorious for its controversial content, used ElevenLabs’ platform to spread derogatory messages imitating well-known figures such as Emma Watson. James Vincent from The Verge used AI tools to rapidly clone voices, creating samples that included violent threats and offensive remarks of a racist and transphobic nature. Meanwhile, at Vice, journalist Joseph Cox detailed the creation of a remarkably realistic voice clone that successfully deceived a bank’s authentication system.

There are concerns that individuals with malicious intent may try to influence elections through the use of voice cloning technology. Those concerns are not without basis: In January, a phone campaign used a deepfaked President Biden to discourage New Hampshire citizens from voting. This prompted the FCC to take steps towards making similar campaigns illegal in the future.

What measures is OpenAI implementing to ensure that Voice Engine is not misused, in addition to policy-level restrictions on deepfakes? Harris mentioned a few.

Initially, Voice Engine will only be accessible to a select group of developers, numbering around 10. OpenAI is focusing on use cases that are considered to be low risk and socially beneficial, such as healthcare and accessibility. They are also exploring the possibilities of responsible synthetic media.

Some of the early adopters of Voice Engine include Age of Learning, an edtech company utilizing the tool to create voice-overs from pre-cast actors, and HeyGen, a storytelling app that is leveraging Voice Engine for translation purposes. Companies such as Livox and Lifespan are utilizing Voice Engine technology to develop customized voices for individuals facing speech impairments and disabilities. Additionally, Dimagi is actively working on a Voice Engine-powered tool that will provide health workers with feedback in their native languages.

Additionally, clones generated with Voice Engine undergo a watermarking process that incorporates OpenAI’s unique method of embedding imperceptible identifiers into the recordings. (Other vendors, such as Resemble AI and Microsoft, also use similar watermarks.) Harris mentioned that while there may be ways to bypass the watermark, it is designed to be “tamper resistant.”

“When analyzing an audio clip, it is quite simple for us to identify if it was created by our system and the specific developer responsible for its generation,” Harris explained. “Currently, it is not open sourced; we are using it internally at the moment. We’re considering the possibility of making it publicly accessible, but we need to be cautious about the potential risks and vulnerabilities that may arise.”

Additionally, OpenAI intends to grant members of its red teaming network, a specialized group of professionals who contribute to the company’s assessment and mitigation of AI model risks, the ability to utilize Voice Engine for the purpose of identifying and addressing potential malicious applications.

Some experts suggest that AI red teaming may not cover all possible scenarios, and they believe that it is the responsibility of vendors to create tools that can protect against any potential harm caused by their AI systems. OpenAI is not taking such extreme measures with Voice Engine, but Harris emphasizes that the company’s utmost priority is to ensure the safe release of the technology.

Ready for distribution
Considering the outcome of the preview and the response from the public towards Voice Engine, OpenAI may consider making the tool available to a larger group of developers. However, the company is currently hesitant to make any definite commitments.

Harris provided a glimpse into Voice Engine’s future plans, disclosing that OpenAI is currently experimenting with a security feature. This feature requires users to read randomly generated text as a way to verify their presence and ensure they are conscious of how their voice is being utilized. According to Harris, this development has the potential to boost OpenAI’s confidence in expanding the reach of Voice Engine to a wider audience. It could also mark the start of something even bigger.

“The progress of the voice matching technology will be driven by the insights gained from the pilot, the identification of safety concerns, and the implementation of necessary measures,” he stated. “Ensuring clarity between artificial voices and actual human voices is crucial.”

And I must say, we are in complete agreement on that final point.

Juliet P.
