Empowering Conversation: "Chat GPT now has the ability to Speak, Listen, and Process Images"

"Empowering Conversations: Chat GPT's Multimodal Abilities - Speech, Listening, and Image Processing" says OpenAI

Machine can talk illustration -Univesity of York

As of Monday, OpenAI's ChatGPT could "see, hear, and speak," or at the very least understand spoken language, give responses in a synthetic voice, and process images.

Users are able to opt into voice discussions on ChatGPT's mobile application and select one of five artificial voices for the chatbot to answer, thanks to the update, OpenAI's biggest since the launch of GPT-4. Users will also be able to upload photographs to ChatGPT and annotate or highlight certain regions within them (for example, "What kinds of clouds are these?").

More methods to use ChatGPT in your life are made possible by voice and image. While traveling, take a picture of a notable location and engage in a live discussion about what is intriguing about it. As soon as you go home, take images of the fridge and pantry in order to plan your supper (ask follow-up questions for a recipe with step-by-step instructions). Support the kid in your life with a math problem after dinner by snapping a picture, circling the problem, and having it share tips with both of you.

Over the following two weeks, ChatGPT Plus and Enterprise subscribers will start receiving voice and image support. Images will be available across all platforms, and voice will soon be accessible through iOS and Android (opt-in in your settings).

Talk to ChatGPT and hear it respond.

Now, you can talk back and forth with your assistance using your voice. Use it to communicate while you're on the move, ask for a family goodnight story, or resolve a conflict at the dinner table.

Go to Settings New Features on the mobile app and select "Voice Conversations" to begin using voice. Once you've selected your preferred voice from a list of five options, hit the headphone button in the top-right corner of the home screen.

The new voice functionality is supported by a new text-to-speech algorithm that can produce human-like sounds using only text and a short sample of speech. Each voice was produced in collaboration with experienced voice performers. Your spoken words are also converted into text using Whisper, an open-source speech recognition system.

Playback of voice samples

In a pocket or under a tree,

Oh, where could my little keys be?

I checked in the fridge, behind the TV,

Even the cat looked up, as if to plea.

They jingle, they jangle, they open the door,

Yet they always end up on a different floor.

I searched high and low, left and right,

Why must you vanish, out of plain sight?

Talk about pictures

Users can now display one or more photos in ChatGPT. Investigate the cause for your grill's failure to ignite, look through the refrigerator to make a meal plan, or examine a challenging graph to find information relevant to your job. Using the sketching tool in the mobile app, you can concentrate on a particular region of the image.

Start by using the photo button to take or select an image. You should first press the + button if you're using iOS or Android. You can instruct your assistant by using a drawing tool, talking about numerous images, or both. Multimodal GPT-3.5 and GPT-4 are the engines behind image comprehension. These models use screenshots, pictures, and papers with both text and images to apply their language-thinking skills to a variety of images.

Image and audio capabilities will be steadily implemented.

OpenAI's mission centers around the development of Artificial General Intelligence (AGI) that prioritizes safety and societal benefits. OpenAI author said we are committed to a gradual release approach for our AI tools. This deliberate pace enables us to continuously enhance the technology, fine-tune risk management strategies, and, importantly, ensure that everyone is well-prepared for the advent of increasingly powerful systems. This strategy takes on heightened significance when we're dealing with advanced models that encompass capabilities such as voice and vision.

Voice Input

The emerging voice technology, with its ability to create highly lifelike synthetic voices from just a short snippet of real speech, offers exciting opportunities for creativity and accessibility. Nonetheless, these capabilities come with fresh challenges, including the risk of misuse, such as malicious individuals impersonating public figures or engaging in fraudulent activities.

This is precisely why OpenAI has harnessed this technology for a specific purpose: voice chat. In developing voice chat, They have engaged directly with professional voice actors. Moreover, They are extending thier collaborative efforts to other partners. An excellent example of this collaboration is with Spotify, where the technology is being leveraged for the initial phase of their Voice Translation feature. This innovation empowers podcasters to broaden the reach of their narratives by translating podcasts into various languages while preserving their own distinctive voices.

Image Input

Models based on vision capabilities introduce a host of fresh challenges, including the generation of false images involving people and the model's reliance on its interpretation of images, especially in critical domains. Before proceeding with wider implementation, they conducted rigorous testing of the model. This involved engaging red teamers to evaluate potential risks in areas such as extremism and scientific competence, as well as soliciting feedback from a diverse group of alpha testers. This extensive research effort has allowed OpenAI to establish clear guidelines for the responsible utilization of this technology.

Enhancing vision for Safety and Use

Similar to other features of ChatGPT, the addition of vision capabilities aims to enhance and facilitate your daily activities. It is most effective when it can perceive and understand the visual context just as you do, allowing for a more seamless and integrated experience.

This approach has been significantly influenced by collaboration with Be My Eyes, a free mobile application designed for individuals with visual impairments. This partnership has provided invaluable insights into both the applications and constraints of this technology. Users have shared that they find immense value in having casual conversations about images, particularly those that incidentally feature people in the background. For instance, discussing a person who appears on television while you're trying to adjust your remote control settings can be quite helpful.

To uphold individuals' privacy and maintain responsible usage, OpenAI has implemented strict technical measures to restrict ChatGPT's capacity to analyze images and provide direct statements about people. It's important to note that these systems may not always offer precise assessments, and as such, this approach is designed to prioritize privacy and accuracy.

The Author also said Real-world usage and the feedback we receive are invaluable in our ongoing efforts to enhance these safeguards, all while ensuring the tool remains practical and beneficial for users. Your input and experiences play a vital role in shaping the continuous improvement of our technology.

Transparency regarding the restrictions of the model

OpenAI author said in line with new announcemnets we recognize that users may rely on ChatGPT for specialized topics, including research. In line with our commitment to transparency, we are forthright about the model's limitations and actively discourage its use in high-risk scenarios without proper verification. It's important to note that while the model excels at transcribing English text, its performance diminishes notably in some other languages, particularly those that employ non-Roman scripts. As a result, we advise our non-English users against relying on ChatGPT for such purposes to ensure the best outcomes.

For further details on safety approach and collaboration with Be My Eyes, you can refer to the information provided in the system card dedicated to image input. This resource will provide you with a comprehensive overview of the efforts in this regard.

The major feature development comes with the ongoing artificial intelligence arms race between the top chatbot developers, including OpenAI, Microsoft, Google, and Anthropic. Tech firms are rushing to release new chatbot apps and features, mainly this summer, in an attempt to persuade users to incorporate generative AI into their daily routines. Microsoft has added visual search to Bing, and Google has revealed a number of upgrades to its Bard chatbot.

Microsoft's substantial investment in OpenAI, totaling an additional $10 billion earlier this year, marked one of the most significant AI investments of the year, as reported by PitchBook. In April, OpenAI conducted a share sale, raising $300 million and valuing the startup between $27 billion and $29 billion. Notably, this funding round attracted investments from prominent firms, including Sequoia Capital and Andreessen Horowitz. These investments demonstrate the growing recognition of OpenAI's work and its importance in the field of artificial intelligence.

The concerns surrounding AI-generated synthetic voices are indeed valid. While these technologies can enhance user experiences with more natural-sounding interactions, they also raise the potential for more convincing deepfake content. Cybersecurity experts and researchers have started to investigate how deepfake technology can be exploited to breach cybersecurity systems.

The ability to create highly convincing audio and video content using AI poses risks in various domains, including disinformation campaigns, identity theft, and cyberattacks. As a result, there's a pressing need for robust methods to detect and mitigate deepfakes, as well as clear guidelines and regulations to govern their responsible use.

Balancing the benefits of AI-generated synthetic voices with the potential risks is an ongoing challenge, and it underscores the importance of both technological innovation and ethical considerations in the development and deployment of AI technologies.

In its release on Monday, OpenAI acknowledged these worries and stated that synthetic voices were "created with voice actors we have directly worked with," as opposed to being gathered from random people.

It's great to hear that there are plans to expand access to these new capabilities. In the next two weeks, Plus and Enterprise users will have the opportunity to experience voice and image capabilities, which is an exciting development. Furthermore, the intention to make these features available to a wider range of users, including developers, in the near future indicates a commitment to bringing these advanced capabilities to a broader audience. This expansion of access can lead to more innovative and diverse applications of the technology.

It's important to be aware that the company considers transcriptions as inputs, and these inputs may be utilized to enhance the performance of large-language models like ChatGPT. This practice is common in the development of AI systems, where data is used to train and improve the models. However, it's crucial for organizations to be transparent about their data usage policies and ensure that privacy and ethical considerations are taken into account when using such data for model improvement. This transparency helps users make informed choices about their data interactions with AI systems.