Over the next few years, we’ll see a seismic shift in how DevOps organizations, agile development teams, site reliability engineers, and IT Ops will achieve an increasingly complex mission.
How can IT improve reliability, performance, and security while deploying
more innovations at increasing release frequency and with fewer incidents
and defects?
Over the last ten years, solutions have included migrating to the cloud, centralizing observability data, automating operations, leveraging machine learning, and deploying other AIOps capabilities.
And for the next several years (One? Three? Five? What’s your estimate?),
we’ll see new generative AI platforms emerge, and existing platforms add LLM
capabilities that will transform how IT teams operate.
“In platforms targeting DevOps, IT Ops, and ITSM, the remarkable
capabilities of GPT and LLM are transforming operations,” says Vijay Iyer,
president of Americas at Mastek. “With
advanced problem-solving abilities, GPT and LLM platforms empower
organizations to efficiently address complex issues, optimize efficiency,
and drive innovation in the IT landscape.”
What can IT, DevOps, SREs, and developers do today with gen AI and LLM
capabilities to improve IT operations? Here’s a list:
1. Generate service level objectives
Kit Merker, chief growth officer of
Nobl9, has an optimistic viewpoint on
generative AI’s impact on DevOps, SRE, and IT Ops. “I don’t believe that GPT
technologies will put developers or DevOps folks out of a job soon — to the
contrary, it will create more jobs! — a lot of mundane and repetitive
code-adjacent tasks can be further automated using specialty LLMs,” he says.
Merker shares a great example of how generative AIs can capture reliability
data, and help site reliability engineers create service-level objectives.
“SLOgpt.ai is an example of this, which uses Google Vertex AI and PaLM2, and
is trained to understand reliability engineering concepts and can even
answer questions about a Service Level Objective (SLO) generated from a
user-uploaded screenshot of an observability metric,” he says. “You can ask
SLOgpt.ai to create an OpenSLO yaml or to write a song about your SLO; the
choice is yours.”
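The end product of a tool like SLOgpt.ai is an OpenSLO YAML document. As a rough illustration of the artifact such a tool emits, here is a minimal sketch that renders one from plain parameters — the field layout approximates the OpenSLO v1 spec, and `build_openslo_yaml` and its arguments are hypothetical, so check the spec before feeding the output to real tooling:

```python
def build_openslo_yaml(service: str, slo_name: str, target: float, window: str) -> str:
    """Render a minimal OpenSLO SLO spec as YAML text.

    The field layout approximates the OpenSLO v1 spec; validate against
    the spec before using with real SLO tooling.
    """
    return f"""\
apiVersion: openslo/v1
kind: SLO
metadata:
  name: {slo_name}
spec:
  service: {service}
  timeWindow:
    - duration: {window}
      isRolling: true
  objectives:
    - displayName: {slo_name}
      target: {target}
"""

slo_yaml = build_openslo_yaml("checkout-api", "checkout-availability", 0.999, "28d")
```

The point of generating the spec rather than hand-writing it is that the LLM fills in the parameters from your observability data; the YAML skeleton stays boring and reviewable.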
2. Propose incident root cause
Marko Anastasov, co-founder of
Semaphore CI/CD, says that instead of
gathering in war rooms and organizing bridge calls to review mounds of
operational data, IT Ops can use LLMs to identify the root cause of
incidents. “In this field, GPT and LLM can be used to automate incident
response by providing real-time insights into the root cause of an incident
and suggesting remediation steps,” he says. “This reduces the time to solve
incidents, improves customer satisfaction, and makes the lives of support
staff much easier.”
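Under the hood, much of root-cause analysis is correlation: what changed right before things broke? Here’s a deliberately tiny, deterministic stand-in for the correlation an LLM-backed incident tool performs across far richer telemetry — the data shape and function name are illustrative:

```python
from datetime import datetime

def candidate_root_cause(changes, incident_start):
    """Return the most recent change deployed before the incident began.

    A toy heuristic standing in for the change/incident correlation that
    LLM-backed incident tools perform over much richer telemetry.
    """
    prior = [c for c in changes if c["deployed_at"] <= incident_start]
    if not prior:
        return None
    return max(prior, key=lambda c: c["deployed_at"])

changes = [
    {"service": "payments", "deployed_at": datetime(2023, 8, 1, 9, 0)},
    {"service": "search",   "deployed_at": datetime(2023, 8, 1, 11, 30)},
    {"service": "checkout", "deployed_at": datetime(2023, 8, 1, 13, 45)},
]
suspect = candidate_root_cause(changes, datetime(2023, 8, 1, 14, 0))
```

An LLM adds value on top of this kind of heuristic by reading the deploy diff, the error messages, and the runbooks, then explaining in plain language why the suspect change is the likely culprit.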
3. Grind out troubleshooting, documentation, and policy management
Working in IT has many bright spots to showcase innovations, automate
processes, and improve system reliability, but some responsibilities are
time-consuming drudgeries. Tony Johnson, CI/CD Engineer III at
Rise8, says Gen AI can be a powerful
assistant. “With the evolution of GPT and LLMs, DevOps, IT Ops, and ITSM
platforms now house predictive troubleshooting, automated documentation, and
real-time policy enforcement capabilities, unleashing new heights in
operational efficiency and resilience,” he says.
4. Query log files to find anomalies
When one user generates expensive queries that undermine performance for all
active users, how do you find the needle in the haystack? Emily Arnott, content
marketing manager at Blameless,
suggests using an LLM to query log files to find the answers. “A capability
on the close horizon for LLMs is parsing huge log files that typical regex
searching can’t make sense of,” she says. “Operations people often end up
with a huge surplus of data and want to find any patterns or anomalies that
can be detected in them. LLMs make this easy: even if you don’t know exactly
what you’re looking for, they’re sophisticated enough to highlight things
worth seeing.”
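Arnott’s point is that you don’t always know the pattern in advance. A classical baseline for “highlight things worth seeing” is frequency analysis: mask the variable parts of each log line and flag the templates that rarely occur. This sketch is deterministic and illustrative — an LLM does the same triage without a hand-written masking rule:

```python
import re
from collections import Counter

def rare_log_lines(lines, max_count=1):
    """Flag log lines whose message template appears at most max_count times.

    Masking digits turns 'took 12 ms' and 'took 48 ms' into one template,
    so genuinely unusual messages stand out from high-volume noise.
    """
    def template(line):
        return re.sub(r"\d+", "<N>", line)

    counts = Counter(template(l) for l in lines)
    return [l for l in lines if counts[template(l)] <= max_count]

logs = [
    "GET /search took 12 ms",
    "GET /search took 48 ms",
    "GET /search took 9 ms",
    "slow query: full table scan on orders took 52140 ms",
]
anomalies = rare_log_lines(logs)
```

Where the LLM earns its keep is on the messy cases this misses: multi-line stack traces, reworded messages, and anomalies that are only anomalous in context.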
5. Migrate scripts and automations across platforms
When you need to change platforms, do you have to rewrite all the scripts
and automations or hire someone to do all the work to port code across
platforms? Not so, says Andrew Amann, CEO of
NineTwoThree Studio. “We’ve
recently leveraged ChatGPT’s innate ability to translate from one language
to another to convert Terraform scripts to CloudFormation,” he says.
“ChatGPT reduced 90% of the effort, requiring minimal edits and freeing time
to test ported scripts thoroughly. We also did the opposite (CloudFormation
to Terraform) for another client to become cloud agnostic.”
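To make the translation concrete, here is the kind of before/after pair involved in a Terraform-to-CloudFormation port — the resource names are hypothetical, and as Amann’s team did, you should review and test the model’s output rather than apply it blindly:

```python
import json

# The Terraform resource you would paste into the LLM prompt:
terraform_snippet = '''
resource "aws_s3_bucket" "artifacts" {
  bucket = "my-artifacts-bucket"
}
'''

# The equivalent CloudFormation template the model should emit. Always
# review and test translated infrastructure code before applying it:
cloudformation = {
    "Resources": {
        "Artifacts": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "my-artifacts-bucket"},
        }
    }
}

cfn_json = json.dumps(cloudformation, indent=2)
```

Simple one-resource mappings like this are where the claimed 90% effort reduction comes from; the remaining 10% is the edge cases — provider-specific features, state migration, and naming constraints — that still need a human.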
6. Reduce time to resolve incidents
When there’s a major incident causing an outage or slow performance, IT Ops
teams feel their customers’ and employers’ breath on their necks as they work
to resolve issues and restore systems.
“We’ve recently added AI-driven insights to our open-source-powered platform
via a large language model (LLM) to reduce the critical mean time to
recovery (MTTR) metric while extracting more value from less telemetry data
and lower storage costs,” says Asaf Yigal, co-founder and VP of product at
Logz.io. “Generative AI is proving to be
key at helping engineering, DevOps and ITOps teams optimize cloud
applications and infrastructure to handle new and emerging availability,
performance, resilience and security issues.”
Logz.io recently announced the integration of generative AI into its Open 360 Platform and AI alert recommendations to reduce MTTR.
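It’s worth being precise about the metric Yigal cites: MTTR is just the average time from detection to resolution across incidents, so it’s easy to compute from incident records — a minimal sketch with illustrative field names:

```python
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents):
    """MTTR = average of (resolved - detected) across all incidents."""
    durations = [i["resolved"] - i["detected"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    {"detected": datetime(2023, 8, 1, 10, 0), "resolved": datetime(2023, 8, 1, 11, 0)},
    {"detected": datetime(2023, 8, 2, 14, 0), "resolved": datetime(2023, 8, 2, 14, 30)},
]
mttr = mean_time_to_recovery(incidents)
```

The generative AI claim is that the model shortens the “detected to resolved” interval by surfacing the likely cause and fix faster than a human paging through dashboards.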
7. Search to find new technologies and platforms
If you’re researching new technologies, platforms, and services, Amann
suggests trying an LLM-powered search like Bing to find answers quickly. He
asks, “Do you find yourself looking for a step-by-step guide, not knowing the
exact name of the technology?” The new Bing search handles questions like,
“Find me a recent guide to set up access from GCP to an on-prem SQL database,
similar to what is called site-to-site VPN on AWS.”
8. Remediate issues with an LLM’s recommendations
“Houston, we have a problem.” Now comes the tough task of recommending
appropriate fixes that are easy to execute and carry a low risk of breaking
other services. Matt Riley, GM of enterprise search at
Elastic, suggests, “With generative AI
technology taking the world by storm, we’ve already seen several leading
DevOps solutions add copilots to their toolkits that enable teams to move
from just observing and monitoring their data to also receiving effective
remediation steps immediately when they need to resolve an issue.”
Developing automations is a key step for creating recipes to remediate
issues. Riley adds, “In ITOps, large language models like GPT—especially
when augmented by enhanced search capabilities—are helping teams quickly
find the information they need and automate previously manual
processes.”
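One design choice matters when moving from “receiving remediation steps” to automating them: don’t execute raw model output. A common guardrail pattern is to map the LLM’s free-text suggestion onto an allow-list of pre-approved runbook actions — the action names and matching rule below are illustrative:

```python
# Guardrail pattern: only pre-approved runbook actions are ever executed.
# The LLM's free-text suggestion selects an action; it never supplies the
# command itself. Action names and commands here are illustrative.
APPROVED_ACTIONS = {
    "restart-service": "systemctl restart {service}",
    "scale-out": "kubectl scale deploy/{service} --replicas={n}",
    "rollback-deploy": "kubectl rollout undo deploy/{service}",
}

def resolve_action(llm_suggestion: str):
    """Return the approved command template, or None if nothing matches."""
    for action, command in APPROVED_ACTIONS.items():
        if action in llm_suggestion.lower():
            return command
    return None

cmd = resolve_action("Recommended fix: rollback-deploy for the checkout service")
```

Anything the model suggests outside the allow-list returns `None` and falls back to a human, which keeps the blast radius of a hallucinated fix at zero.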
9. Add an AI observability assistant to your NOC
Every network operations center (NOC) is under pressure to support more
mission-critical applications, increase uptime, and resolve issues faster.
Camden Swita, senior product manager at New Relic, says New Relic Grok is the
world’s first generative AI observability assistant, “making observability
practices ubiquitous by removing barriers to adoption, like learning bespoke
query languages or navigating the massive amount of telemetry most engineers
confront every day.”
Every new app, database, and service brings more observability data, and “all
that data may overwhelm a human,” says Swita. “Pairing New Relic’s unified
telemetry data platform with OpenAI’s LLMs works: we take the reasoning power
of the LLM, give it tools to translate plain language into queries, look for
deviations, and more. Now, engineers don’t need to slog through data
manually—they can just ask, ‘What’s on fire?’ and cut right to the chase.”
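The “translate plain language into queries” step usually comes down to prompt construction: hand the model the question plus the telemetry schema it is allowed to reference. This sketch is illustrative, not New Relic’s actual implementation — the prompt wording, function name, and schema are all assumptions:

```python
def nl_to_query_prompt(question: str, schema: dict) -> str:
    """Assemble a prompt asking an LLM to translate a plain-language
    question into a telemetry query, grounded in the event schema it may
    reference so it can't invent attribute names."""
    schema_lines = "\n".join(
        f"- {event}: {', '.join(attrs)}" for event, attrs in schema.items()
    )
    return (
        "Translate the question into a single telemetry query.\n"
        f"Available event types and attributes:\n{schema_lines}\n"
        f"Question: {question}\n"
        "Query:"
    )

prompt = nl_to_query_prompt(
    "What's the error rate on checkout in the last hour?",
    {"Transaction": ["appName", "duration", "error"]},
)
```

Grounding the prompt in the real schema is the difference between a query that runs and one that references attributes that don’t exist.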
10. Increase dev velocity with code examples
“If you’re a software developer or a devops engineer, you might experiment
with generative AI tools and wonder what it will mean for your profession
and how it will change your work,” I wrote in a recent article on
ChatGPT and software development.
More code = more code to test, so look out for my upcoming article on how
LLMs will impact continuous testing.
11. Innovate and deliver new natural language query capabilities
The most exciting area for IT is identifying generative AI and LLM innovation areas. Here are some of my suggestions from recent articles:
- What can ChatGPT and LLMs really do for your business? – I share several platforms adding LLM capabilities, including Adobe, Atlassian, Coveo, Crowdstrike, GitHub, Google, Microsoft, Salesforce, and ServiceNow.
- Business drivers to deliver significant ROI from AI search include improving customer satisfaction and accelerating employee onboarding.
- How generative AI impacts your digital transformation priorities includes recommendations for Digital Trailblazers to prep data for private LLMs and communicate an LLM governance model.
AI is evolving quickly, and I plan to add to this article as platforms
release new generative AI and LLM capabilities targeting DevOps and IT Ops.
Please sign up for the
Driving Digital Newsletter
to access all of my thought leadership, including updated versions of this
post. Also, please consider joining us for a future session of
Coffee with Digital Trailblazers, where we discuss topics for aspiring transformation leaders, including
AI, DevOps, and digital transformation.