Maria Korolov
Contributing writer

5 ways to deploy your own large language model

Feature
Nov 16, 202310 mins
Artificial IntelligenceData and Information SecurityDatabases

Building a new large language model (LLM) from scratch can cost a company millions — or even hundreds of millions. But there are several ways to deploy customized LLMs that are faster, easier, and, most importantly, cheaper.

office workers at desk
Credit: iStock

It’s the fastest-moving new technology in history. Generative AI is transforming the world, changing the way we create images and videos, audio, text, and code.

According to a September survey of IT decision makers by Dell, 76% say gen AI will have a “significant if not transformative” impact on their organizations, and most expect to see meaningful results within the next 12 months.

A large language model (LLM) is a type of gen AI that focuses on text and code instead of images or audio, although some have begun to integrate different modalities. The most popular LLMs in the enterprise today are ChatGPT and other OpenAI GPT models, Anthropic’s Claude, Meta’s Llama 2, and Falcon, an open-source model from the Technology Innovation Institute in Abu Dhabi best known for its support for languages other than English.

There are several ways companies deploy LLMs, like giving employees access to public apps, using prompt engineering and APIs to embed LLMs into existing software, using vector databases to improve accuracy and relevance, fine-tuning existing models, or building their own.

Deploying public LLMs

Dig Security is an Israeli cloud data security company, and its engineers use ChatGPT to write code. “Every engineer uses stuff to help them write code faster,” says CEO Dan Benjamin. And ChatGPT is one of the first and easiest coding assistants out there. But there’s a problem with it — you can never be sure if the information you upload won’t be used to train the next generation of the model. Dig Security addresses this possibility in two ways. First, the company uses a secure gateway to check what information is being uploaded.

“Our employees know they can’t upload anything sensitive,” says Benjamin. “It’s blocked.”

Second, the company funnels its engineers to a version of ChatGPT running on a private Azure cloud. This means Dig Security gets its own self-contained instance of ChatGPT. Even with this belt-and-suspenders approach to security, it’s not a perfect solution, Benjamin says. “There’s no perfect solution. Any organization that thinks there is, is fooling itself.”

For example, someone can use a VPN or a personal computer and access the public version of ChatGPT. That’s where another level of risk mitigation comes in.

“It’s all about employee training,” he says, “and making sure they understand what they need to do, and they’re well trained on data security.”

Dig Security isn’t alone.

Skyhigh Security in California says that close to a million end users accessed ChatGPT through corporate infrastructures during the first half of 2023, with the volume of users increasing by 1,500% between January and June, says Tracy Holden, Skyhigh’s director of corporate marketing.

And in a July report from Netskope Threat Labs, source code is posted to ChatGPT more than any other type of sensitive data at a rate of 158 incidents per 10,000 enterprise users per month.

More recently, companies have been getting more secure, enterprise-friendly options, like Microsoft Copilot, which combines ease of use with additional controls and protections. And at the OpenAI DevDay in early November, CEO Sam Altman said there are now 100 million active users using the company’s ChatGPT chatbot, two million developers using its API, and more than 92% of Fortune 500 companies are building on top of the OpenAI platform.

Vector databases and RAG

For most companies looking to customize their LLMs, retrieval augmented generation (RAG) is the way to go. If someone is talking about embeddings or vector databases, this is what they normally mean. The way it works is a user asks a question about, say, a company policy or product. That question isn’t set to the LLM right away. Instead, it’s processed first. Does the user have the right to access that information? If the access rights are there, then all potentially relevant information is retrieved, usually from a vector database. Then the question and the relevant information is sent to the LLM and embedded into an optimized prompt that might also specify the preferred format of the answer and tone of voice the LLM should use.

A vector database is a way of organizing information in a series of lists, each one sorted by a different attribute. For example, you might have a list that’s alphabetical, and the closer your responses are in alphabetical order, the more relevant they are.

An alphabetical list is a one-dimensional vector database, but vector databases can have an unlimited number of dimensions, allowing you to search for related answers based on their proximity to any number of factors. That makes them perfect to use in conjunction with LLMs.

“Right now, we’re converting everything to a vector database,” says Ellie Fields, chief product and engineering officer at Salesloft, a sales engagement platform vendor. “And yes, they’re working.”

And it’s more effective than using simple documents to provide context for LLM queries, she says.

The company primarily uses ChromaDB, an open-source vector store, whose primary use is for LLMs. Another vector database Salesloft uses is Pgvector, a vector similarity search extension for the PostgreSQL database.

“But we’ve also done some research using FAISS and Pinecone,” she says. FAISS, or Facebook AI Similarity Search, is an open-source library provided by Meta that supports similarity searches in multimedia documents.

And Pinecone is a proprietary cloud-based vector database that’s also become popular with developers, and its free tier supports up to 100,000 vectors. Once the relevant information is retrieved from the vector database and embedded into a prompt, the query gets sent to OpenAI running in a private instance on Microsoft Azure.

“We had Azure certified as a new sub-processor on our platform,” says Fields. “We always let customers know when we have a new processor for their information.”

But Salesloft also works with Google and IBM, and is working on a gen AI functionality that uses those platforms as well.

“We’ll definitely work with different providers and different models,” she says. “Things are changing week by week. If you’re not looking at different models, you’re missing the boat.” So RAG allows enterprises to separate their proprietary data from the model itself, making it much easier to swap models in and out as better models are released. In addition, the vector database can be updated, even in real time, without any need to do more fine-tuning or retraining of the model.

“We’ve switched out models, from OpenAI to OpenAI on Azure,” says Fields. “And we’ve switched among different OpenAI models. We may even support different models for different parts of our customer base.”

Sometimes different models have different APIs, she adds. “It’s not trivial,” she says. But switching out a model is still easier than retraining. “We haven’t yet found a use case that’s better served by fine tuning rather than a vector database,” Fields adds. “I believe there are use cases out there, but so far, we haven’t found one that performs better.”

One of the first applications of LLMs that Salesloft rolled out was adding a feature that lets customers generate a sales email to a prospect. “Customers were taking a lot of time to write those emails,” says Fields. “It was hard to start, and there’s a lot of writer’s block.” So now customers can specify the target persona, their value proposition, and the call to action — and they get three different draft emails back they can personalize. Salesloft uses OpenAI’s GPT 3.5 to write the email, says Fields.

Locally run open source models

Boston-based Ikigai Labs offers a platform that allows companies to build custom large graphical models, or AI models designed to work with structured data. But to make the interface easier to use, Ikigai powers its front end with LLMs. For example, the company uses the seven billion parameter version of the Falcon open source LLM, and runs it in its own environment for some of its clients.

To feed information into the LLM, Ikigai uses a vector database, also run locally. It’s built on top of the Boundary Forest algorithm, says co-founder and co-CEO Devavrat Shah.

“At MIT four years ago, some of my students and I experimented with a ton of vector databases,” says Shah, who is also a professor of AI at MIT. “I knew it would be useful, but not this useful.”

Keeping both the model and the vector database local means no data can leak out to third parties, he says. “For clients who are okay with sending queries to others, we use OpenAI,” says Shah. “We are LLM agnostic.”

PricewaterhouseCoopers, which built its own ChatPWC tool, is also LLM agnostic. “ChatPWC makes our associates more capable,” says Bret Greenstein, the firm’s partner and leader of the gen AI go-to-market strategy. For example, it includes pre-built prompts to generate job descriptions. “It has all my formats, templates, and terminology,” he says. “We have an HR, data and prompt experts, and we design something that generates very good job postings. Now nobody needs to know how to do the amazing prompting that generates job descriptions.”

The tool is built on top of Microsoft Azure, but the company also built it for Google Cloud Platform and AWS. “We have to serve our clients, and they exist on every cloud,” Says Greenstein. Similarly, it’s optimized to use different models on the back end, because that’s how clients want it. “We have every model working,” he adds. “Llama 2, Falcon — we have everything.”

The market is changing quickly, of course, and Greenstein suggests enterprises adopt a “no regrets” policy to their AI deployments.

“There’s a lot people can do,” he says, “like building up their data that’s independent of models, and building up the governance.” Then, when the market changes, and a new model comes out, the data and governance structure will still be relevant.

The fine tuning

Management consulting company AArete took open source model GPT 2 and fine tuned it on its own data. “It was lightweight,” says Priya Iragavarapu, the company’s VP of digital technology services. “We wanted an open source one to be able to take it and post it internally in our environment.”

If AArete used a hosted model and connected to it via API, trust issues come up. “We’re concerned where the data from the prompting might end up,” she says. “We don’t want to take those risks.”

When choosing an open source model, she looks at how many times it was previously downloaded, its community support, and its hardware requirements.

“The foundational model should also have some task relevancy,” she says. “There are some models for specific tasks. For example, I recently looked at a Hugging Face model that parses content from PDFs into a structured format.”

Many companies in the financial world and in the health care industry are fine-tuning LLMs based on their own additional data sets.

“The basic LLMs are trained on the whole internet,” she says. With fine tuning, a company can create a model specifically targeted at their business use case.

A common way of doing this is by creating a list of questions and answers and fine tuning a model on those. In fact, OpenAI began allowing fine tuning of its GPT 3.5 model in August, using a Q&A approach, and unrolled a suite of new fine tuning, customization, and RAG options for GPT 4 at its November DevDay.

This is particularly useful for customer service and help desk applications, where a company might already have a data bank of FAQs.

Also in the Dell survey, 21% of companies prefer to retrain existing models, using their own data in their own environment.

“The most popular option seems to be Llama 2,” says Andy Thurai, VP and principal analyst at Constellation Research Inc. Llama 2 comes in three different sizes, and is free for companies with fewer than 700 million monthly users. Companies can fine-tune it on their own data sets and have a new, custom model fairly quickly, he says. In fact, the Hugging Face LLM leaderboard is currently dominated by different fine-tunings and customizations of Llama 2. Before Llama 2, Falcon was the most popular open source LLM, he adds. “It’s an arms race right now.” Fine tuning can create a model that’s more accurate for specific business use cases, he says. “If you’re using a generalized Llama model, the accuracy can be low.”

And there are some advantages to fine-tuning over RAG embedding. With embedding, a company has to do a vector database search for every query. “And you’ve got the implementation of the database,” Thurai says. “That’s not going to be easy, either.”

There are no context window limits on fine tuning, either. With embedding, there’s only so much information that can be added to a prompt. If a company does fine tune, they wouldn’t do it often, just when a significantly improved version of the base AI model is released.

Finally, if a company has a quickly-changing data set, fine tuning can be used in combination with embedding. “You can fine tune it first, then do RAG for the incremental updates,” he says.

Rowan Curran, analyst at Forrester Research, expects to see a lot of fine-tuned, domain-specific models arising over the next year or so, and companies can also distil models to make them more efficient at particular tasks. But only a small minority of companies — 10% or less — will do this, he says.

Software companies building applications such as SaaS apps, might use fine tuning, says PricewaterhouseCoopers’ Greenstein. “If you have a highly repeatable pattern, fine tuning can drive down your costs,” he says, but for enterprise deployments, RAG is more efficient in 90 to 95% of cases.

“We’re actually looking into fine-tuning models for specific verticals,” adds Sebastien Paquet, VP of ML at Coveo, a Canadian enterprise search and recommendations company. “We have some specialized verticals with specialized vocabulary, like the medical vertical. Enterprises selling truck parts have their own way of how the parts are named.”

For now, however, the company is using OpenAI’s GPT 3.5 and GPT 4 running on a private Azure cloud, with the LLM API calls isolated so Coveo can switch to different models if needed. It also uses some open source LLMs from Hugging Face for specific use cases.

Build an LLM from scratch

Few companies are going to build their own LLM from scratch. After all, they are, by definition, quite large. OpenAI’s GPT 3 has 175 billion parameters and was trained on a data set of 45 terabytes and cost $4.6 million to train. And according to OpenAI CEO Sam Altman, GPT 4 cost over $100 million.

That size is what gives LLMs their magic and ability to process human language, with a certain degree of common sense, as well as the ability to follow instructions.

“You can’t just train it on your own data,” says Carm Taglienti, distinguished engineer at Insight. “There’s value that comes from training on tens of millions of parameters.”

Today, nearly all LLMs come from the big hyperscalers or AI-focused startups like OpenAI and Anthropic.

Even companies with extensive experience building their own models are staying away from creating their own LLMs.

Salesloft, for example, has been building their own AI and machine learning models for years, including gen AI models using earlier technologies, but is hesitant about building a brand-new, cutting edge foundation model from scratch.

“It’s a massive computational step that, at least at this stage, I don’t see us embarking on,” says Fields.