Breaking New Ground: A Dive Into Multimodal Generative AI

Multimodal generative AI is considered the next big thing on our path to achieving Artificial General Intelligence.

October 26, 2023

What is multimodal generative AI?
  • Multimodal generative AI is considered the next big thing on our path to achieving Artificial General Intelligence.
  • It is a concept devised, theorized, and now being implemented to deliver multisensory immersive experiences.
  • It draws outputs from a combination of multiple data types to provide responses as insights, content, and more.
  • Read on to learn about multimodal generative AI, its benefits, potential and adoption, and associated issues.

The launch of multiple generative AI tools in the last few years is a testament to the breakthroughs achieved in artificial intelligence (AI) technologies over the past decade. In its relatively short history, GenAI has created a sense of urgency around its adoption into organizations’ routine and niche operations.

While the photorealism of DALL-E, OpenAI’s first text-to-image generator, served a limited purpose for most organizations, the launch of ChatGPT less than a year ago has had organizations scrambling for an edge, looking for avenues where generative AI can positively impact their operations.

According to McKinsey’s The State of AI in 2023 report, one-third of organizations have incorporated GenAI into at least one business function. Moreover, approximately 75% of respondents to McKinsey’s survey expect GenAI to usher in disruptive change in their industry.

Take contact centers, for instance. Aberdeen Strategy & Research calls GenAI deployed in the contact center “an empowerment multiplier” because it can help companies achieve the following:

  • 5.2x greater year-on-year (YoY) growth in revenue
  • 4.3x higher YoY increase in customer lifetime value
  • 7.2x greater YoY increase in cross-sell/up-sell revenue
  • 3.0x higher YoY increase in customer satisfaction

Findings from Aberdeen’s CX Trends 2023 survey indicate that for contact center leaders, incorporating AI capabilities is the second-highest technology investment priority for 2023 and beyond. Tangible benefits can certainly be a motivator, which is why McKinsey found that organizations that attributed at least 20% of their 2022 EBIT to AI are more likely to keep investing in the technology.

AI capabilities are used primarily for product development, feature additions, customer experience, and marketing functions.

However, GenAI is now taking a new form known as multimodality, in which it accepts multiple types of sensory input and delivers outputs in the same or different data types. For example, ChatGPT’s latest update gives it multimodal abilities: it can not only generate stories, essays, and other text but also read them aloud. ChatGPT can also be prompted to perform a task through voice, and it can read an image to identify specific objects.
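For illustration, a text-plus-image request of the kind ChatGPT’s update enables might look like the following through OpenAI’s API. This is a minimal sketch, assuming the official OpenAI Python SDK and access to a vision-capable model; the model name and image URL are placeholders rather than details confirmed in this article.

```python
# Minimal sketch of a multimodal (text + image) request, assuming the
# official OpenAI Python SDK and a vision-capable model. The model name
# and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumption: any vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image parts travel in the same message,
                # which is what makes the request multimodal.
                {"type": "text", "text": "What objects are in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```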

Multimodal GenAI is considered the next big thing on our path to achieving Artificial General Intelligence. Let us take a look at what it is.

What Is Multimodal Generative AI?

Multimodal GenAI is a concept devised, theorized, and now being implemented to deliver multisensory immersive experiences. It draws outputs from a combination of multiple data types to provide responses as insights, content, and more.

Camden Swita, senior product manager at New Relic, explained to Spiceworks, “Multimodal generative AI is a type of artificial intelligence that combines multiple types or modes of data — text, images, video, audio, depth, etc., — to create more accurate determinations or make more precise predictions about ‘real-world’ settings, scenarios, or problems. These models are trained on data sets from the multiple modes/data types they need to interpret or respond in.”

Multimodal GenAI resembles regular GenAI, except that it leverages multidimensional embeddings or indexing and can rely on vector databases for operation. These multidimensional embeddings are what allow multimodal GenAI to ingest, process, and output multiple types of data.
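To make that concrete, here is a toy sketch of a shared embedding space backed by a tiny in-memory vector index. The `embed_text` and `embed_image` encoders below are deterministic stand-ins, not learned models (real systems train encoders jointly, CLIP-style, so related items land near each other), and the `VectorIndex` class is a minimal illustration of what a vector database provides.

```python
# Toy sketch: different modalities embedded into one shared vector space,
# then retrieved from a tiny in-memory "vector database" by cosine similarity.
# The encoders are deterministic stand-ins for trained multimodal models.
import hashlib
import numpy as np

DIM = 64  # toy embedding dimensionality


def _stand_in_encoder(kind: str, payload: str) -> np.ndarray:
    # Deterministic pseudo-random unit vector; a real system would use a
    # jointly trained neural encoder per modality.
    seed = int(hashlib.sha256(f"{kind}:{payload}".encode()).hexdigest()[:8], 16)
    vec = np.random.default_rng(seed).normal(size=DIM)
    return vec / np.linalg.norm(vec)


def embed_text(text: str) -> np.ndarray:
    return _stand_in_encoder("text", text)


def embed_image(image_id: str) -> np.ndarray:
    return _stand_in_encoder("image", image_id)


class VectorIndex:
    """Minimal in-memory stand-in for a vector database."""

    def __init__(self):
        self.vectors, self.items = [], []

    def add(self, vector: np.ndarray, item: str) -> None:
        self.vectors.append(vector)
        self.items.append(item)

    def query(self, vector: np.ndarray, k: int = 3) -> list[str]:
        # Vectors are unit-length, so the dot product equals cosine similarity.
        sims = np.stack(self.vectors) @ vector
        return [self.items[i] for i in np.argsort(sims)[::-1][:k]]


index = VectorIndex()
index.add(embed_text("a cat on a sofa"), "caption #1")
index.add(embed_image("cat_photo.jpg"), "image #1")
index.add(embed_text("quarterly revenue chart"), "caption #2")
# Cross-modal lookup (the toy encoders won't capture real semantics).
print(index.query(embed_text("feline"), k=2))
```

A production system would swap the stand-in encoders for trained multimodal models and the list-based index for a dedicated vector database, but the ingest-embed-query flow is the same.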

Meta’s ImageBind multimodal AI, released earlier this year, goes a step beyond ChatGPT and integrates six modalities: text, image/video, audio, depth, infrared-based thermal radiation, and inertial measurement unit (IMU) data. The company has also touted the integration of even more senses, including olfactory and haptic signals as well as brain fMRI data, as part of its multimodality research.

See More: How is Generative AI Forcing Software Development to Evolve?

Multimodal Generative AI Benefits

Combining and processing information from multiple sources can potentially reconcile discrepancies between those sources and deliver contextually relevant results. In a corporate environment, this can stimulate greater employee productivity.

Multimodal GenAI can reshape the user experience for both the end-user and the business user by creating new avenues for machine interaction.

It also presents certain societal and scientific benefits, given its potential for deployment across research in the physical sciences, life/biological sciences, and social sciences. Even before the rise of GenAI, in June 2021, Google leveraged its reinforcement learning-based machine learning models to execute floorplanning for semiconductors, a key step in chip design.

“Ultimately, we’re talking about tireless tools that can make increasingly accurate determinations and predictions in multisensory/multimodal spaces based on vast stores of data across multiple modalities. These can not only be trained on the data much faster than a human but can also make decisions based on it faster, too,” Swita added.

“Assuming the accuracy of output continues to improve, the implications for innovation are hard to really wrap my head around. There are a lot of ‘ifs’ around the availability of good training data in certain modalities and considerations around the cost of inference. But those loose ends will likely get tied off sooner than some people might think.”

Multimodal Generative AI Adoption

Swita highlighted that multimodal GenAI has already seen some applications, such as Adobe’s Firefly and Midjourney, text-to-image tools whose multimodality spans textual input and visual output.

Another business use case where we already see multimodality, according to Swita, is a phone-based automated support system, which “might translate the sentiment apparent in our tone of voice into textual data the company can use for reporting and analysis.”
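As a rough illustration of that data flow, the sketch below chains a speech-to-text step with a sentiment scorer. Both pieces are stand-ins: `transcribe` returns a canned string where a real system would run a speech recognition model, and the keyword heuristic scores the transcript, whereas a true multimodal model could infer sentiment from the acoustics of the voice itself.

```python
# Toy sketch of the support-line pipeline: caller audio -> transcript ->
# sentiment label that can be logged for reporting and analysis.
# `transcribe` stands in for a real speech-to-text model, and the scorer
# is a crude keyword heuristic, not a trained classifier.

NEGATIVE = {"angry", "frustrated", "cancel", "terrible", "refund"}
POSITIVE = {"great", "thanks", "helpful", "resolved", "happy"}


def transcribe(audio_path: str) -> str:
    # Stand-in: a real system would run an ASR model on the audio file.
    return "I am frustrated this is terrible and I want a refund"


def score_sentiment(transcript: str) -> str:
    words = set(transcript.lower().split())
    neg, pos = len(words & NEGATIVE), len(words & POSITIVE)
    return "negative" if neg > pos else "positive" if pos > neg else "neutral"


def handle_call(audio_path: str) -> dict:
    # The structured record a company could store for reporting.
    transcript = transcribe(audio_path)
    return {"transcript": transcript, "sentiment": score_sentiment(transcript)}


print(handle_call("caller_123.wav"))
```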

On the flip side, the technology can also be used to users’ detriment. “Other ‘businesses’ have started using text-to-audio multimodal models to generate more realistic and dynamic voices to scam people over the phone.”

However, like previous technological inventions, multimodal and regular GenAI allow dozens of professions to evolve. Lawyers, writers, scientists, teachers, and others could streamline time-consuming tasks such as research, strategy development, and document drafting and generation, provided the task falls within the purview of the data the multimodal GenAI tool is trained on.

In short, the knowledge economy could see a massive shift if the right data is available.

That’s a big “if.”

It is also precisely why we are years away from mainstream use that permeates deep into the social fabric: primarily because of the technical difficulty of creating multimodal AI, and secondarily because of present-day limitations in data.

“It may be a couple of years or more before we see widespread use of more complicated multimodal generative AI applications in a major industry,” Swita continued.

“All in all, the ability to use multimodal generative AI in meaningful ways will not only be complicated because the models themselves are more complicated and expensive to make, and due to the sensitivity of the data in the examples above, the red tape around making and using them may be trickier to cut through. All of which will likely slow widespread adoption.”

Still, it is quite enticing to imagine, and equally difficult to predict, how this emerging tech will shape businesses, society, governments, and more. Predictions of this sort have a poor track record: former Microsoft CEO Steve Ballmer heavily discounted the iPhone in 2007 (“There’s no chance that the iPhone is going to get any significant market share”), and Apple founder and CEO Steve Jobs wrote off the business models of Spotify and other music subscription companies (“The subscription model of buying music is bankrupt”). With those cautionary tales in mind, we asked Swita to take a swing at the potential of multimodal GenAI.

“Multimodal generative AI can be leveraged to improve quality control in manufacturing, predictive maintenance of automobiles, and supply chain optimization in manufacturing. In healthcare, it can process a patient’s vital signs, diagnostic data, scan images, appearance, and other text/image/audio/video records to improve diagnoses and treatment plans. In retail, it can analyze data from various sources, including sensors, cameras, and audio recordings, to identify patterns and predict future customer behavior. And so much more.” — Camden Swita, senior product manager at New Relic

Broadly speaking, multimodal GenAI could spawn a new take on the visual side of software development. “We should see some types of visual and interaction design be disrupted, as it’s fairly likely that we’ll be able to generate some aspects of software frontends and user experiences using models that can take text input or visual examples and generate the design assets and frontend code needed to implement them,” Swita opined.

Further, Swita expects multimodal GenAI to give rise to virtual services in patient-facing healthcare operations and multiple other industries by facilitating creativity in AR/VR. Improvements in immersive technology have “obvious applications in the entertainment industry, but could also make new and exciting things possible in the MedTech/accessibility device industries, for manufacturing, or even for ‘knowledge work’ industries, like design and architecture.”

See More: Multimodality: A Must for Effective Human-AI Collaboration

Issues With Multimodal Generative AI

ChatGPT’s rise shows how even regular GenAI can be a source of conflict. ChatGPT creator OpenAI received a subpoena from the FTC in July this year seeking answers on dozens of concerns, including but not limited to:

  • Data collection practices
  • Algorithm management
  • AI hallucination (i.e., making false claims about individuals, etc.)
  • Corporate governance model
  • Security vulnerability or incident management
  • Whether it upholds appropriate security measures
  • Marketing efforts

At Dreamforce 2023 in September, OpenAI CEO Sam Altman framed GenAI systems’ tendency to hallucinate as a value proposition. “The fact that these AI systems can come up with new ideas and be creative is a lot of their power. You want them to be creative when you want, and that’s what we’re working on,” Altman said.

However, the degenerative effects of AI models learning from (possibly incorrect) data created by other GenAI systems, spiraling into a chain of misinformation, are becoming apparent on social media. This is why Altman’s value proposition is debatable. What isn’t debatable is that hallucinating GenAI systems should be kept on a short leash when online.

Additionally, the absence of defined laws around using training data for large language and other models, and around the output generated by GenAI tools, is brewing as a source of friction between AI developers and creators.

Microsoft will also have to deal with the fallout of the lawsuit against not only OpenAI (into which Microsoft has pumped $11 billion so far) but also its subsidiary GitHub, where the litigants claim the companies used licensed code to train the AI coding assistant, Copilot.

Meanwhile, Google was sued in July 2023 for privacy and copyright violations in training Bard, Duet AI, Imagen, and Gemini on data collected from the internet. Further, Getty Images (US) sued Stability AI (over Stable Diffusion) in February 2023, claiming the reproduction of its copyrighted material.

Similarly, several authors, including former Arkansas Governor Mike Huckabee, actor and comedian Sarah Silverman, and novelists George R.R. Martin, Richard Kadrey, Michael Chabon, Christopher Golden, Mona Awad, Jodi Picoult, and Paul Tremblay, are suing OpenAI for copyright infringement, claiming that ChatGPT was trained on their books without permission.

Here, there is little anyone can do except engage lawmakers on Capitol Hill and let the law take its course.

Meanwhile, the availability of high-quality, relevant data is one of the most crucial elements of any GenAI, multimodal or unimodal. Companies and websites that have so far given free rein over their data are now voicing demands for payment. Reddit became a center of controversy earlier this year over changes to its API pricing policy that affected third-party developers, changes it later clarified were intended to ensure it is paid for its valuable data being scraped and used to train GenAI models.

Reddit is now reportedly pressing its claim further by threatening to block Google and Bing web crawlers. X (formerly Twitter), too, created new API pricing tiers in March 2023.

“It’s hard to believe that the cost of training, hosting, and using these models won’t be cheaper than more traditional alternatives sometime in the near future. It’s possible that the cost of good training data might become prohibitive (as original content creators/capturers seek to put monetization barriers around their data sets) for some time, but that will probably level out in the long run,” Swita concluded.

Multiple news publishers, including CNN, BBC, The New York Times, Reuters, and others, have already restricted Common Crawl and OpenAI crawlers.

How do you expect multimodal GenAI to grow? Share your thoughts with us on LinkedIn, X, or Facebook. We’d love to hear from you!

Image source: Shutterstock


Sumeet Wadhwani

Asst. Editor, Spiceworks Ziff Davis

An earnest copywriter at heart, Sumeet is what you'd call a jack of all trades, rather techs. A self-proclaimed 'half-engineer', he dropped out of Computer Engineering to answer his creative calling pertaining to all things digital. He now writes what techies engineer. As a technology editor and writer for News and Feature articles on Spiceworks (formerly Toolbox), Sumeet covers a broad range of topics from cybersecurity, cloud, AI, emerging tech innovation, hardware, semiconductors, et al. Sumeet compounds his geopolitical interests with cartophilia and antiquarianism, not to mention the economics of current world affairs. He bleeds Blue for Chelsea and Team India! To share quotes or your inputs for stories, please get in touch on sumeet_wadhwani@swzd.com