How DevOps Teams Can Manage Telemetry Data Complexity

Processing telemetry data at scale and in near real time presents many challenges for DevOps teams.

July 12, 2023

In today’s software-driven world, the amount of data generated is growing exponentially and will only continue to increase. Tucker Callaway, CEO of Mezmo, discusses the challenges of processing telemetry data and how DevOps teams can overcome them effectively. 

With hundreds of millions of new applications set to be deployed in the coming years, the compound annual growth rate (CAGR) for data is expected to be 22%.

This data explosion presents opportunities for organizations to gain valuable insights and maintain a competitive edge. However, managing and securing this data creates challenges. Compliance and security costs are rising while IT budgets are decreasing. Additionally, a shortage of skilled personnel makes it challenging to hire DevOps employees.

These obstacles pose a problem for already overworked DevOps teams, which must manage and analyze telemetry data (traces, metrics, and logs) to ensure applications run smoothly. Traditionally, organizations have attempted to manage telemetry data in a data lake and tried to make sense of it later. But this method is becoming less effective as data volume and complexity surge, leading to data hoarding in which organizations accumulate data with no clear strategy for using it. 

To solve this problem, DevOps teams need a more effective way to process telemetry data, specifically in a stream, where the data is acted on while in transit, before it is stored. This change in approach requires a multi-pronged effort that involves normalizing, filtering, and reducing telemetry data before it reaches its destination, which is essential for observability, security, compliance, and analytics use cases. However, processing telemetry data at scale and in near real time presents several obstacles that DevOps teams must overcome to achieve efficiency and effectiveness.
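
To make the stream model concrete, here is a minimal Python sketch of an in-transit pipeline with normalize, filter, and reduce stages chained as generators. The field names and the /healthz filter are illustrative assumptions, not any particular product’s behavior.

```python
# A minimal generator pipeline: each stage acts on records while they are
# in transit, before anything is written to storage.

def normalize(records):
    for r in records:
        r.setdefault("level", "INFO")      # enforce a common schema
        yield r

def filter_noise(records):
    for r in records:
        if r.get("path") != "/healthz":    # drop low-value records in flight
            yield r

def reduce_fields(records):
    for r in records:
        yield {k: r[k] for k in ("level", "message") if k in r}

stream = [{"message": "ok", "path": "/healthz"},
          {"message": "timeout", "level": "ERROR", "path": "/pay"}]
for record in reduce_fields(filter_noise(normalize(stream))):
    print(record)  # only the /pay record survives, trimmed to two fields
```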

Too Much Telemetry Data With Little Value

Reducing the total volume of logs without sacrificing visibility requires a combination of strategies tailored to an organization’s unique needs. One approach is routing logs based on specific conditions: by grouping similar apps and applying if-then-else conditions, organizations can cut the number of logs to be analyzed, storing low-value data in an S3 bucket while keeping high-value logs for further analysis.
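
A minimal sketch of this kind of conditional routing in Python; the sink functions and the notion of a “low-value” app group are hypothetical placeholders, not a specific product’s API.

```python
import json

LOW_VALUE_APPS = {"healthcheck", "static-assets"}  # hypothetical grouping

def archive_to_s3(record: dict) -> None:
    # Placeholder sink: in practice, batch and upload to an S3 bucket.
    print("archive:", json.dumps(record))

def send_to_analysis(record: dict) -> None:
    # Placeholder sink: in practice, forward to the analysis platform.
    print("analyze:", json.dumps(record))

def route_log(record: dict) -> None:
    """If-then-else routing: errors always go to analysis; logs from
    low-value apps go to cheap storage; everything else is analyzed."""
    if record.get("level") in ("ERROR", "CRITICAL"):
        send_to_analysis(record)
    elif record.get("app") in LOW_VALUE_APPS:
        archive_to_s3(record)
    else:
        send_to_analysis(record)

route_log({"app": "healthcheck", "level": "INFO", "msg": "ok"})    # archived
route_log({"app": "billing", "level": "ERROR", "msg": "timeout"})  # analyzed
```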

Another effective approach is sampling and aggregation. Sampling reduces log volume by keeping a representative fraction that provides visibility without overwhelming the system. Aggregation combines related metrics to eliminate redundancy and produce a more compact, representative view of the data.
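
A sketch of both techniques, assuming log records arrive as dictionaries; the 10% sampling rate and the "status" aggregation key are arbitrary choices for illustration.

```python
import random
from collections import defaultdict

def sample(records, rate=0.1, seed=None):
    """Keep a representative fraction of records (probabilistic sampling)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]

def aggregate_counts(records, key="status"):
    """Collapse repetitive events into per-key counts."""
    counts = defaultdict(int)
    for r in records:
        counts[r.get(key, "unknown")] += 1
    return dict(counts)

logs = [{"status": "200"}] * 95 + [{"status": "500"}] * 5
print(len(sample(logs, rate=0.1, seed=42)))  # roughly 10 of 100 kept
print(aggregate_counts(logs))                # {'200': 95, '500': 5}
```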

Deduplication and reduction are more complex strategies that require identifying specific conditions before logs can be filtered out. Automatic solutions claiming to deduplicate logs are often ineffective because each company has unique data requiring a tailored approach. Instead, identifying patterns or keywords within logs can help organizations match and filter logs that meet specific criteria.
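
For example, a pattern-based filter plus exact-duplicate suppression might look like the sketch below; the drop patterns are invented for illustration, since, as noted, each company’s data calls for its own rules.

```python
import re

# Invented examples of low-value patterns; every organization needs its own.
DROP_PATTERNS = [
    re.compile(r'GET /healthz HTTP/1\.1" 200'),
    re.compile(r'connection reset by peer'),
]

def filter_logs(lines):
    """Drop lines matching low-value patterns; suppress exact duplicates."""
    seen = set()
    kept = []
    for line in lines:
        if any(p.search(line) for p in DROP_PATTERNS):
            continue                    # matched a known low-value pattern
        if line in seen:
            continue                    # exact duplicate of an earlier line
        seen.add(line)
        kept.append(line)
    return kept

lines = ['10.0.0.1 "GET /healthz HTTP/1.1" 200',
         'worker: connection reset by peer',
         'billing: payment failed id=7',
         'billing: payment failed id=7']
print(filter_logs(lines))  # only one copy of the billing line survives
```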

Telemetry Data Not Standardized

Once DevOps teams have completed the bulk removal of unnecessary data, the next step is to transform the data into a more manageable format. To ensure a successful transformation, it is important to establish standards such as JSON logging or OpenTelemetry. While not every application will fit these standards immediately, promoting them can lead to eventual adoption. Standards can also serve as a guide for dev teams even when they are not fully enforceable, helping ensure that data is organized consistently across teams.
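
As one concrete example of such a standard, every log line can be emitted as a JSON object with a fixed set of fields. A minimal sketch using Python’s standard logging module; the field set is an assumed convention, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every record as one JSON object with a fixed field set."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("payment processed")  # -> {"ts": "...", "level": "INFO", ...}
```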

Once standards are established, custom fields can be moved into a separate section of the logs, such as a metaobject, making them easier to reference and manage. Additionally, setting limits on the cardinality of tags or labels on metrics helps ensure the metrics meet specific needs when they reach their downstream destination. For example, removing extra hostnames or unnecessary IDs reduces log size and downstream processing cost.
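
A sketch of both ideas, assuming a hypothetical standard field set and label budget; note that full cardinality control also means bounding the number of distinct label values, which this simplification does not attempt.

```python
STANDARD_FIELDS = {"ts", "level", "app", "message"}  # assumed org standard
MAX_META_KEYS = 5                                    # hypothetical budget

def normalize(record: dict) -> dict:
    """Move non-standard fields under a 'meta' object; drop noisy IDs
    and cap how many extra keys survive."""
    out = {k: v for k, v in record.items() if k in STANDARD_FIELDS}
    extras = {k: v for k, v in record.items() if k not in STANDARD_FIELDS}
    extras.pop("request_id", None)   # example of an unnecessary ID
    if extras:
        out["meta"] = dict(sorted(extras.items())[:MAX_META_KEYS])
    return out

print(normalize({"ts": 1, "level": "INFO", "app": "api", "message": "ok",
                 "host": "web-1", "request_id": "abc123"}))
# -> {'ts': 1, 'level': 'INFO', 'app': 'api', 'message': 'ok',
#     'meta': {'host': 'web-1'}}
```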

High Automation Overhead

Although automation can bring many benefits, it can also introduce unnecessary complexity and overhead. To optimize automation and ensure its value, it is crucial to evaluate where to use code, how to manage configuration, and how to maintain agents.

In terms of using code, it should be reserved for business logic. The real value of automation comes from capturing an organization’s unique business processes and codifying them in a way that enables repeatable and scalable automation.

One common mistake DevOps teams make is relying on cron jobs to automate business processes. While cron jobs can be helpful in some cases, they tend to be difficult to troubleshoot when things go wrong. By minimizing cron jobs, DevOps teams can reduce the risk of automation failures and simplify troubleshooting when problems occur.
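
Where a cron job must stay, one mitigation, not specific to any tool, is to wrap the task so that failures leave a structured trail instead of disappearing. A minimal sketch, assuming a Unix-like system:

```python
import logging
import subprocess
import sys

# Structured-ish log format; a real JSON formatter would escape quotes.
logging.basicConfig(
    level=logging.INFO,
    format='{"ts": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s"}',
)
log = logging.getLogger("scheduled-job")

def run_job(cmd):
    """Run a scheduled task, recording exit code and stderr on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        log.error("job failed rc=%s stderr=%s",
                  result.returncode, result.stderr.strip())
        sys.exit(result.returncode)
    log.info("job succeeded")

run_job(["true"])  # replace with the real task command
```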

Another key consideration is how to manage configuration. Although every configuration can be expressed as code, starting with code may not always be the best approach. Depending on the flexibility required, it may make sense to begin with declarative configuration and transition to code later, or vice versa. It is worth using tools that offer both options and assessing the components involved, such as filtering patterns and logic, to identify the most suitable solution.
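
For instance, filtering rules can start life as plain data, a dictionary here, though it could equally be YAML, and only graduate to code once the logic outgrows simple patterns. The rule names and fields below are assumptions for illustration.

```python
import re

# Rules expressed as data; promote to code only when logic outgrows this.
FILTER_CONFIG = {
    "drop_patterns": [r"/healthz", r"DEBUG"],
    "keep_levels": ["ERROR", "CRITICAL"],
}

def build_filter(config):
    drops = [re.compile(p) for p in config["drop_patterns"]]
    keeps = set(config["keep_levels"])
    def should_keep(line: str, level: str) -> bool:
        if level in keeps:
            return True               # severity overrides drop patterns
        return not any(p.search(line) for p in drops)
    return should_keep

keep = build_filter(FILTER_CONFIG)
print(keep("GET /healthz 200", "INFO"))   # False: matches a drop pattern
print(keep("GET /healthz 500", "ERROR"))  # True: level override wins
```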

Maintaining agents is another consideration. Many organizations struggle to update agents with the latest patches and features, resulting in lost telemetry data. Establishing standardized processes for updating agents as part of the data collection process is essential to address this issue.
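
One small building block for such a process is an automated version gate; the sketch below assumes a hypothetical agent binary named telemetry-agent that prints a plain "X.Y.Z" version string.

```python
import subprocess

MIN_VERSION = (3, 2, 0)  # hypothetical minimum supported agent version

def agent_version(binary: str = "telemetry-agent") -> tuple:
    """Parse 'X.Y.Z' from the agent's --version output (assumed format)."""
    out = subprocess.run([binary, "--version"],
                         capture_output=True, text=True, check=True)
    return tuple(int(x) for x in out.stdout.strip().split(".")[:3])

def needs_update(binary: str = "telemetry-agent") -> bool:
    return agent_version(binary) < MIN_VERSION
```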

A Growing Tool Stack

The final challenge to address is tool sprawl. Many DevOps teams use more than 10-15 tools daily, and that number is rising rapidly, with some teams juggling 20-30. This proliferation makes the tools difficult to manage, and some inevitably fall into neglect. The massive amounts of data these tools must process compound the problem.

One way to reduce tool sprawl is to choose the “first mile” carefully. This means selecting the best tools to capture and process data at the source and testing them before using them. Switching tools downstream is much easier once the data flows through a data pipeline.

However, changing agents responsible for collecting and sending data to the pipeline can be challenging. To mitigate this, DevOps teams can run the agents in parallel with an open-source agent or alternative tool to provide flexibility. Testing specific agents before committing to them can also save time in the long run.

As more organizations adopt digital technologies, the challenge of processing vast amounts of telemetry data continues to grow. While obstacles will inevitably arise, moving from a store-first model to processing data in a stream is a step in the right direction. This approach can help organizations stay ahead of the competition by unlocking insights and opportunities that might otherwise be missed.

Tucker Callaway
Tucker Callaway is the CEO of Mezmo. He has more than 20 years of experience in enterprise software, with an emphasis on developer and DevOps tools. He is responsible for driving Mezmo’s growth across all revenue streams and creating the foundation for future revenue streams and go-to-market strategies. He joined Mezmo in January 2020 as president and CRO and took the torch as CEO six months later. Prior to Mezmo, he served as CRO of Sauce Labs and vice president of worldwide sales at Chef.