Telemetry: Ensuring Code That Works
When people discuss Craftsmanship in the software development community, they often talk about code-level techniques such as Clean Code, SOLID, and YAGNI, or about Clean Architecture more broadly, with its onion layers, hexagonal architecture, and so on.
Such discussions are valuable and can reveal much about a development team’s maturity. However, by focusing solely on these aspects, we tend to overlook an equally important factor: how the code we write affects the quality attributes of our systems (see our previous post on How to Choose the Best Software Architecture with Architectural Drivers?).
Telemetry is quickly becoming an indispensable tool within this framework, complementing craftsmanship and bridging the gap between code quality and system quality attributes. Telemetry lets developers track, analyze, and improve code performance and reliability by giving them valuable real-time information about system behavior.
In this post, we’ll look at how integrating telemetry into the development process helps us stick to the principles of craftsmanship, ensuring clean and well-organized code, system efficiency, and the required quality.
What Is Telemetry?
Telemetry is the process of collecting, measuring, and analyzing data about a software system’s performance, use, and behavior, either in real time or retrospectively.
Whether your system is a monolith or a distributed architecture, you need telemetry data to identify problems, improve functionality, and make decisions based on objective data.
Telemetry’s Four-Step Process for Success
Telemetry has four main components:
- Code instrumentation: Instrumentation is the process of adding data collection points, such as performance counters, event trackers, and logs, to the software source code.
- Data collection: Once the code is instrumented, the data it produces is collected and sent to a central system where it is stored and analyzed. Data collection can be done in real time or at regular intervals depending on the application’s needs and performance constraints.
- Data aggregation and analysis: The collected data is then aggregated and analyzed using methods such as statistical analysis, time series analysis, and machine learning to identify trends, patterns, and potential problems.
- Visualization of results: The data analysis results are presented in graphs, dashboards, and reports so that they are easy to interpret and understand. This enables teams and stakeholders to make decisions based on accurate information.
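The four steps above can be sketched in miniature with plain Python. This is purely illustrative: the instrumented function, the sample sizes, and the one-line "dashboard" all stand in for what a real telemetry pipeline and visualization tool would do.

```python
import statistics
import time

samples = []  # step 2: collected data points (here, an in-memory store)

def instrumented_work():
    # Step 1: instrumentation - a timing collection point embedded in the code.
    start = time.perf_counter()
    time.sleep(0.005)  # the actual work being measured
    samples.append((time.perf_counter() - start) * 1000)

for _ in range(5):
    instrumented_work()

# Step 3: aggregation and analysis.
avg_ms, worst_ms = statistics.mean(samples), max(samples)

# Step 4: visualization of results (here, a one-line report).
print(f"calls={len(samples)} avg={avg_ms:.1f}ms max={worst_ms:.1f}ms")
```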
Telemetry Data Collection Types
Telemetry typically collects three categories of data: traces, metrics, and logs.
Traces show transactions or execution flows across system components. This helps you understand how the components interact and identify bottlenecks or performance issues. Here are some examples of traces you might want to collect:
- API request tracing: When a client sends a request to an application programming interface (API), a trace can be created to follow the request as it moves through the various services and components involved. This includes service latency, any errors encountered, and overall response times.
- Database transaction tracing: Traces can be used to track requests and transactions performed in a database. This helps identify slow requests, locking problems, and other database performance bottleneck issues.
- Tracing operations between microservices: Traces can be used in a microservices architecture to track interactions between the various microservices that make up an application. This can identify service-to-service communication problems, service failures, and overall performance issues.
- Function call tracing: Function calls within an application can also be tracked with traces. This can be useful for identifying problematic sections of code that take too long to run or cause errors.
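Here is a minimal, hand-rolled sketch of function call tracing in Python, using only the standard library. A real system would use a tracing SDK such as OpenTelemetry and export spans to a backend; all names and durations here are illustrative.

```python
import time
import uuid
from contextlib import contextmanager

# Collected spans; a real tracer would export these to a tracing backend.
spans = []

@contextmanager
def span(name, trace_id=None):
    """Record the duration of one operation as a span within a trace."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    try:
        yield trace_id
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name, "duration_ms": duration_ms})

def handle_request():
    with span("GET /api/v1/users/42") as trace_id:
        with span("db.query", trace_id):         # child span: database call
            time.sleep(0.01)
        with span("render.response", trace_id):  # child span: serialization
            time.sleep(0.005)

handle_request()
for s in spans:
    print(f"{s['name']}: {s['duration_ms']:.1f} ms")
```

Because all three spans share one trace ID, the request's path through its components, and where the time went, can be reconstructed afterwards.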
Metrics are quantitative measurements that provide insight into specific aspects of software performance, resource usage, or behavior, such as response time, error rate, and latency. For example:
- Requests per second (RPS) rate: This metric measures the number of requests processed per second by an application or service. A high RPS could indicate the system is experiencing a heavy load and may need optimizing or scaling.
- Error rate: The error rate measures how often errors, such as connection or data processing errors, occur in the system. A high error rate could indicate stability or performance issues.
- Memory (RAM) usage: This metric shows how much memory an application or service is using. High memory use can degrade system performance or even cause a crash.
- Processor (CPU) usage: This metric measures the percentage of processor use by an application or service. High CPU usage could indicate the system is experiencing a heavy load and may need optimizing or scaling.
- Response time: Response time is the time from sending a request to a system to receiving a response. A high response time could indicate performance issues or bottlenecks.
- Availability rate: The availability rate measures the proportion of time a service or application is operational and accessible. A high availability rate is essential for a good user experience.
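A quick sketch of how a few of these metrics might be computed from raw request records; the data, field names, and 10-second window are all hypothetical.

```python
import statistics

# Sample request records collected over a 10-second window (hypothetical data).
window_seconds = 10
requests = [
    {"status": 200, "duration_ms": 120},
    {"status": 200, "duration_ms": 95},
    {"status": 500, "duration_ms": 340},
    {"status": 200, "duration_ms": 110},
    {"status": 404, "duration_ms": 20},
]

# Requests per second over the collection window.
rps = len(requests) / window_seconds
# Fraction of requests that ended in a server error (5xx).
error_rate = sum(1 for r in requests if r["status"] >= 500) / len(requests)
# Average response time across the window.
avg_response_ms = statistics.mean(r["duration_ms"] for r in requests)

print(f"RPS: {rps:.1f}, error rate: {error_rate:.0%}, avg response: {avg_response_ms:.0f} ms")
```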
Logs are chronological records of events or actions that happen in software. They are usually used for debugging, problem analysis, and preventive system activity monitoring.
- Error logs: Error logs record errors encountered by an application or service, such as unhandled exceptions, connection errors, or problems accessing resources. These logs can help identify and fix problems.
[2023-04-20 14:35:12] ERROR: Failed to connect to database - ConnectionTimeoutException
- Information logs: Information logs provide general information about the normal operation of the application or service. They are useful for tracking the system status and user actions.
[2023-04-20 14:35:30] INFO: User 'johndoe' successfully logged in
- Debug logs: Debug logs contain detailed information about an application or service’s internal functioning. They are useful for development and debugging complex issues.
[2023-04-20 14:35:45] DEBUG: Executing SQL query: SELECT * FROM users WHERE id = 42
- Alert logs: Alert logs record potentially problematic situations or unusual events that may need attention but do not always result in errors.
[2023-04-20 14:36:10] WARNING: Disk usage exceeded 90%, consider cleaning up or expanding storage
- Access logs: Access logs record the requests and responses that a service or application receives. This includes information about the clients, the requested URLs, HTTP status codes, and response times.
[2023-04-20 14:36:25] ACCESS: 192.168.1.2 - - [20/Apr/2023:14:36:25 +0000] "GET /api/v1/users/42 HTTP/1.1" 200 356
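Python's built-in logging module can produce lines in this style; the format string below is one possible way to mirror the examples above.

```python
import logging

# Format mirrors the sample log lines above: [timestamp] LEVEL: message.
logging.basicConfig(
    level=logging.DEBUG,
    format="[%(asctime)s] %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger("app")

log.error("Failed to connect to database - ConnectionTimeoutException")
log.info("User 'johndoe' successfully logged in")
log.debug("Executing SQL query: SELECT * FROM users WHERE id = 42")
log.warning("Disk usage exceeded 90%, consider cleaning up or expanding storage")
```

In production you would typically ship these lines to a central system (e.g. the ELK Stack) rather than leave them on local disk.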
Telemetry for Code Quality
In this section, we’ll look at how telemetry can improve code quality and ensure applications and services are working as well as possible. We will also explore how telemetry can help find and fix problems before they affect end users.
Telemetry for Monitoring Quality Attributes
Telemetry can continuously monitor software quality attributes, such as performance, reliability, security, and scalability. By incorporating telemetry into the development process, developers can identify potential problems quickly and fix them before they become major issues.
For example, monitoring performance metrics like response time and latency can help developers identify bottlenecks and improve application speed and efficiency by optimizing their code.
Telemetry for Debugging and Problem Analysis
Telemetry can also help with debugging and problem analysis by providing detailed information about system behavior and the errors encountered. The traces, metrics, and logs collected by telemetry help developers understand the underlying causes of problems and swiftly fix errors and performance and security issues.
Telemetry for Data-Driven Decision Making
By incorporating telemetry into the development process, development teams can make decisions based on objective and measurable data instead of hunches or guesses. Telemetry data can be used to assess how effective code changes are, to determine development priorities, and to aid architecture or design decision-making.
Telemetry for Proactive Monitoring and Preventive Maintenance
Telemetry enables the proactive monitoring of applications and services so development and operations teams can find and fix problems before they affect end users. By monitoring trends and patterns in the telemetry data, teams can also plan and perform preventative maintenance to avoid problems in the future and ensure maximum system availability.
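A simple threshold check is one possible building block for this kind of proactive monitoring; the metric names and limits below are hypothetical, and real alerting would run in a tool such as Prometheus or Azure Monitor.

```python
def check_thresholds(metrics, thresholds):
    """Return alert messages for every metric that exceeds its threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {name} = {value} exceeds threshold {limit}")
    return alerts

# Hypothetical snapshot of current telemetry values and their limits.
current = {"disk_usage_pct": 92, "error_rate_pct": 0.5, "p95_latency_ms": 180}
limits = {"disk_usage_pct": 90, "error_rate_pct": 1.0, "p95_latency_ms": 250}

for alert in check_thresholds(current, limits):
    print(alert)
```

Here only disk usage trips its limit, so the team can clean up or expand storage before users ever see a failure.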
Developing a Reliable and Effective Telemetry Strategy
Most developers can implement telemetry, but a solid, well-thought-out strategy helps ensure it is used effectively to improve code quality and system performance.
Here are the key elements and steps needed to design and implement an effective, reliable telemetry strategy:
Step 1: Define Your Telemetry Objectives
The first step in devising a telemetry strategy is defining your goals. For example:
- Improve system performance and reliability
- Find and fix problems quickly
- Identify and eliminate bottlenecks
- Make informed architecture and design decisions
Step 2: Identify the Relevant Metrics and Data
As we saw in the section above, once you have set your code quality goals, it is important to identify the metrics and data that will help you achieve them. Consider the following information types:
- Traces to monitor transactions and interactions between components. This helps improve code quality.
- Metrics to measure specific performance aspects, resource usage, and how the software is functioning. This allows you to assess the impact of the code quality on the system as a whole.
- Logs to record events and actions that occur within the software. This provides insights into how the system behaves in terms of code quality, making it easier to find and fix any associated problems.
Step 3: Instrument the Code
The next step is to instrument the code to collect the required data. This is done by embedding data collection points in the source code, such as performance counters, event trackers, and logs.
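One common way to embed such collection points is a decorator that records call counts and durations. This is a minimal sketch, with in-memory dictionaries standing in for real performance counters and a telemetry backend; the instrumented function is purely illustrative.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stores standing in for real performance counters and event logs.
call_counts = defaultdict(int)
timings_ms = defaultdict(list)

def instrumented(func):
    """Embed a data collection point: count calls and record their duration."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_counts[func.__name__] += 1
            timings_ms[func.__name__].append((time.perf_counter() - start) * 1000)
    return wrapper

@instrumented
def process_order(order_id):
    time.sleep(0.01)  # simulate work
    return f"order {order_id} processed"

for i in range(3):
    process_order(i)
print(call_counts["process_order"], len(timings_ms["process_order"]))
```

Because the decorator records even when the function raises, error paths are measured too, which matters for the error-rate metrics discussed earlier.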
Step 4: Collect, Aggregate, and Analyze the Data
Once the code has been instrumented, set up systems to collect, aggregate, and analyze the data. This can include using telemetry services and tools to collect and store the data and using analysis methods to identify trends, patterns, and potential problems. For example:
- Prometheus: Prometheus is an open-source monitoring and alert system that collects and stores metrics from your services and applications. It is designed to handle time series data and is often used with Grafana for data visualization.
- Elasticsearch, Logstash, Kibana (ELK Stack): ELK Stack combines three open-source tools (Elasticsearch, Logstash, and Kibana). These tools enable the centralized collection, storage, indexing, and visualization of log data. Elasticsearch is a distributed search engine, Logstash is a data processing pipeline, and Kibana is a data visualization tool.
- Datadog: Datadog is a cloud application performance analysis and monitoring service. Datadog collects and aggregates data from metrics, traces, and logs to help you monitor, troubleshoot and optimize your apps.
- Azure Monitor: Azure Monitor is a monitoring and diagnostic service built into the Azure platform. It allows you to collect, analyze and act upon telemetry data from your Azure resources and applications. Azure Monitor supports the collection of metrics, logs, and traces and provides advanced features, such as alerting, data analysis, and integration with other Azure services for standardized monitoring.
Step 5: Visualize and Act on the Results
Lastly, present the data analysis results in graphs, dashboards, and reports so they are easy to understand and interpret. Then, you can use this information to decide how to improve your code and system performance and to find and fix problems.
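Real dashboards come from tools like Grafana or Kibana, but even a tiny ASCII rendering illustrates the idea; the metric values below are made up.

```python
def render_dashboard(metrics, width=20):
    """Render each metric as an ASCII bar scaled to the largest value."""
    peak = max(metrics.values()) or 1
    lines = []
    for name, value in sorted(metrics.items()):
        bar = "#" * round(width * value / peak)
        lines.append(f"{name:<12} {bar} {value}")
    return "\n".join(lines)

# Hypothetical per-endpoint latencies in milliseconds.
latency_ms = {"checkout": 420, "login": 95, "search": 210}
print(render_dashboard(latency_ms))
```

At a glance, the checkout endpoint stands out as the place to investigate first.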
Telemetry Tips and Best Practices
Telemetry is a valuable tool for software developers because it gives them real-time information about software usage. The collected data can be used to identify problems, improve performance, and enhance the user experience. However, users may worry about privacy and security when data is collected.
Here are some best practices for using telemetry responsibly and effectively in software development:
- Clearly define which data will be collected: Before you start collecting data, you should clearly define the data to be collected and ensure that it is relevant to the software development. Determining how the data will be used, stored, and protected is also essential.
- Obtain the user’s consent: Users must expressly consent before data is collected. It is also important to allow users to opt out of data collection.
- Ensure data security: The collected data must be stored and protected securely to prevent unauthorized access. Therefore, it is essential to follow best data security practices, such as encrypting data and limiting access to it.
- Use the data responsibly: Collected data must be used responsibly and transparently. Developers must ensure that data is not used for malicious purposes and is not shared with third parties without the user’s consent.
- Data anonymization: To protect users’ privacy, ensure the data you collect does not include personally identifiable information (PII) or anonymize it before storing and analyzing it.
- Limit data collection: Only collect the data you need to achieve your telemetry objectives. Do not collect sensitive or irrelevant information.
- Secure the data: Protect the data you collect by encrypting and storing it securely. Control who can access the data by defining appropriate permissions and roles.
- Documentation: Document the telemetry processes, metrics, and the tools used to help with understanding and collaboration across the team.
- Don’t reinvent the wheel: Start with a proven method, such as Utilization, Saturation, and Errors (USE); Rate, Errors, Duration (RED); or the Four Golden Signals. These methods provide the key metrics you need to monitor the health and performance of a system. You can use one of them or combine several to meet your specific needs.
| Method | Goal | Key Metrics |
|--------|------|-------------|
| USE | Identify system resource performance problems | Utilization: the percentage of time the resource was busy serving requests · Saturation: the amount of work a resource has to do, often in a queue · Errors: the number of errors generated by the resource |
| RED | Identify service performance problems | Rate: the number of requests per second handled by the service · Errors: the number of errors per second generated by the service · Duration: the duration of the requests handled by the service |
| Four Golden Signals | Monitor the health and performance of a system | Latency: the time required to serve a request · Traffic: the number of requests received by the system · Errors: the rate of errors generated by the system · Saturation: how system resources are being used and the remaining capacity |
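As a sketch, the three RED metrics could be computed from raw request events like this; the event data and 60-second window are hypothetical.

```python
import statistics

# Hypothetical per-request events for one service over a 60-second window.
window_s = 60
events = [
    {"ok": True, "duration_ms": 40},
    {"ok": True, "duration_ms": 55},
    {"ok": False, "duration_ms": 500},
    {"ok": True, "duration_ms": 45},
] * 30  # 120 requests in total

red = {
    # Rate: requests per second handled by the service.
    "rate_rps": len(events) / window_s,
    # Errors: errors per second generated by the service.
    "errors_per_s": sum(1 for e in events if not e["ok"]) / window_s,
    # Duration: median duration of the requests handled by the service.
    "duration_p50_ms": statistics.median(e["duration_ms"] for e in events),
}
print(red)
```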
The Future of Telemetry with OpenTelemetry
OpenTelemetry is an open-source project that aims to make it easier to integrate telemetry into applications. It provides a complete solution for collecting telemetry data from various applications, services, and platforms. It does this by using a standard API and software development kits (SDKs) for different programming languages and a plugin system that makes connecting to various telemetry services easy. This standardized, open approach to telemetry should help developers work more efficiently and make the most of system monitoring and optimization.
If you want to learn more, read our post on OpenTelemetry: .NET Instrumentation in the Future.
Telemetry: Key Takeaways
Telemetry is an essential tool for any modern developer. By monitoring system performance, spotting problems before they happen, and quickly fixing them, developers can give users the best experience possible and make better software. This is what the Craftsmanship philosophy is all about.
Want to learn more about Craftsmanship? See all Cellenza’s Craft Month posts here:
- Is the Craft Still Relevant?
- How to Choose the Best Software Architecture with Architectural Drivers?
- How to Build an Infrastructure with Terraform?
- Craft and PowerShell: Why Software Engineering Practices Need to Be Applied to Infrastructure
- PySpark Unit Test Best Practice
- How to Boost Your Apps’ Performance with Asyncio: A Practical Guide for Python Developers