Transforming DevOps With AI: Practical Strategies To Supercharge Your Workflows
As someone interested in DevOps, I wondered how all the AI advances could benefit my field. OpenAI used deep learning to release groundbreaking products like ChatGPT and Sora. Microsoft used similar technologies to revamp its products, notably enhancing GitHub with Copilot. Many startups have sprung up, and large tech companies have poured billions into AI research.
Ultimately, how can engineers use this technology to reduce toil, add value to the software development lifecycle (SDLC), and increase development velocity? There are a few interesting options.
Say you have a distributed systems environment with lots of pods serving different services. You also have observability tooling, such as Prometheus, providing a stream of telemetry: system metrics such as CPU usage, memory usage, disk I/O and network statistics, and even container logs.
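To make this concrete, here is a small Python sketch of pulling a window of per-pod CPU metrics from Prometheus's HTTP API. The Prometheus address and the PromQL query are placeholders for whatever your environment exposes.

```python
# A minimal sketch of fetching a metrics window from the Prometheus HTTP API.
# The URL and query below are assumptions -- adjust them for your own cluster.
import time
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster address
QUERY = 'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])'

end = time.time()
start = end - 3600  # look at the last hour
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "60s"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod", "unknown")
    values = [float(v) for _, v in series["values"]]
    if values:
        print(f"{pod}: avg CPU rate over the last hour = {sum(values) / len(values):.3f}")
```

Windows like this are the raw material for the feature engineering described next.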
Random Forests can help you build a self-healing system. With some creative feature engineering to derive rolling averages, rates of change and error log counts, a Random Forest classifier can recommend corrective actions. For example, it can roll back a service if pods fail to start after a deployment, or add pods to a service if CPU usage and network queries per second (QPS) exceed a threshold.
Since a Random Forest classifier is an ensemble of many decision trees, each trained on a random subset of the data and features, it models non-linear relationships well, keeps overfitting in check by averaging across trees and naturally supports multi-class outputs. That’s why I would prefer it over support vector machines (SVMs) and single decision trees, which don’t offer all of those advantages.
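As a rough sketch of the classifier step with scikit-learn, the following assumes you have already exported the engineered features and a labeled action column to a CSV; the file name, column names and action labels are hypothetical.

```python
# A hedged sketch, assuming features and action labels have been exported
# from your observability pipeline into a CSV (names here are hypothetical).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("pod_metrics_labeled.csv")
feature_cols = ["cpu_rolling_avg", "mem_rolling_avg", "qps_rate_of_change", "error_log_count"]
X, y = df[feature_cols], df["action"]  # e.g., "noop", "scale_out", "rollback"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")

# At inference time, the predicted label drives remediation, e.g., calling
# the Kubernetes API to scale a deployment or trigger a rollback.
latest = X_test.iloc[[0]]
print("Suggested action:", model.predict(latest)[0])
```

In production, you would gate the predicted action behind the same automation and approval paths you already trust.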
Metrics such as CPU usage and network statistics like QPS are continuously emitted by a service. You can naturally interpret them as time series and use a long short-term memory (LSTM) neural network to find anomalies. LSTMs are a type of recurrent neural network (RNN) that is good at “remembering” long-range patterns in sequential data.
An LSTM trained on, say, CPU utilization data can learn to ignore seasonal patterns such as nightly backups and benign blips such as minor rises during maintenance events like host-to-host migrations, while still alerting on something like an unexplained rise during off-peak hours that could indicate a security breach.
For example, you could train the LSTM on historical overall traffic QPS. Once deployed, it takes the most recent QPS readings and predicts the values that should come next; when the observed values diverge from the forecast by more than a threshold, it raises an alert, as sketched below. The detected anomalies can also be cataloged and used to fine-tune the model, improving its performance over time.
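Here is a minimal sketch of that train-predict-compare loop using Keras; the saved QPS array, the 60-sample window and the three-sigma threshold are assumptions you would tune to your own traffic.

```python
# A hedged sketch of prediction-based anomaly detection on a QPS series,
# assuming a saved 1-D NumPy array of historical queries-per-second samples.
import numpy as np
from tensorflow import keras

WINDOW = 60  # look back 60 samples to predict the next one

def make_windows(series, window=WINDOW):
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y  # LSTM expects (samples, timesteps, features)

qps = np.load("historical_qps.npy")  # hypothetical dump of past traffic
X, y = make_windows(qps)

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, batch_size=64, verbose=0)

# Alert when the newest observation diverges too far from the forecast.
residuals = y - model.predict(X, verbose=0).ravel()
threshold = 3 * residuals.std()  # e.g., three standard deviations of training error

recent = qps[-WINDOW - 1:-1].reshape(1, WINDOW, 1)  # window preceding the latest sample
predicted = float(model.predict(recent, verbose=0)[0, 0])
actual = float(qps[-1])  # the newly observed value
if abs(predicted - actual) > threshold:
    print(f"Anomaly: observed QPS {actual:.0f} vs forecast {predicted:.0f}")
```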
Getting a bit more advanced, Q-learning is a reinforcement learning algorithm that can look at performance metrics, logs and events such as restarts and deployments, and then update the system configuration accordingly. For example, if the metrics point to strained CPUs, the algorithm can decide whether to scale the existing pods vertically by allocating more CPU or to scale horizontally by spinning up additional pods. A reinforcement learning agent learns from the outcome of each action and adjusts its policy without human intervention.
To build this, you would define the states (snapshots of the performance metrics), the actions (scale up pods, roll back, etc.) and the rewards (system performance following an action). The model can then be trained on historical data without much manual labeling and back-tested to confirm it performs well.
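To show the mechanics, here is a toy tabular Q-learning loop; the discretized states, action set, reward shaping and transition model are all stand-ins for what you would derive from real cluster telemetry and historical events.

```python
# A toy tabular Q-learning sketch. Everything here is illustrative: a real
# agent would learn from historical metrics, logs and deployment events.
import random

STATES = ["cpu_low", "cpu_ok", "cpu_high"]
ACTIONS = ["noop", "add_pod", "remove_pod"]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration

# Q-table mapping (state, action) pairs to expected long-term reward.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(state, action):
    # Hypothetical reward shaping: favor scaling out under load,
    # scaling in when idle and doing nothing when healthy.
    if state == "cpu_high":
        return 1.0 if action == "add_pod" else -1.0
    if state == "cpu_low":
        return 1.0 if action == "remove_pod" else -0.5
    return 1.0 if action == "noop" else -0.5

def next_state(state, action):
    # Crude transition model standing in for real cluster dynamics.
    if action == "add_pod":
        return "cpu_ok" if state == "cpu_high" else "cpu_low"
    if action == "remove_pod":
        return "cpu_ok" if state == "cpu_low" else "cpu_high"
    return state

state = "cpu_ok"
for _ in range(5000):
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)                      # explore
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
    r = reward(state, action)
    s_next = next_state(state, action)
    best_next = max(Q[(s_next, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    state = s_next

# The learned policy: the best action per state.
for s in STATES:
    print(s, "->", max(ACTIONS, key=lambda a: Q[(s, a)]))
```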
I bet you know by now what a large language model (LLM) is and how powerful LLMs can be. While LLMs can realistically be trained from scratch only by well-resourced companies like OpenAI, the APIs for these models are available for public use, albeit at a price. If you do have access to an LLM, you can build a bot that accepts natural-language instructions, like “Set up a new service using X image in the development environment and write a deployment report,” after which a well-integrated LLM can carry out the actions.
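One way to sketch such a bot is to ask the model for a structured plan and hand each step to automation you already trust. The example below uses the OpenAI Python SDK; the model name, the allowed action list and the dispatch step are assumptions, not a prescribed setup.

```python
# A hedged sketch of turning a natural-language instruction into a structured plan.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You translate DevOps instructions into a JSON list of actions. "
    "Allowed actions: create_service, scale, rollback, write_report. "
    "Respond with JSON only."
)

instruction = (
    "Set up a new service using X image in the development environment "
    "and write a deployment report"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; use whatever model your account exposes
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": instruction},
    ],
)

# In practice, validate and sanitize the model's output before acting on it.
plan = json.loads(response.choices[0].message.content)
for step in plan:
    # A real bot would dispatch each step to vetted automation
    # (Kubernetes client, CI pipeline, reporting job) with guardrails and approvals.
    print("Would execute:", step)
```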
If an LLM is not at hand, Bidirectional Encoder Representations from Transformers (BERT) is a far smaller encoder-only model that can be fine-tuned to parse commands into structured actions (intent classification and slot filling rather than open-ended generation). It can accomplish similar things in a more limited but more controlled way. For example, “Restart the web server on node 5” would translate to {"action": "restart", "target": "web server", "node": 5}, which a traditional program can then interpret.
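A small sketch of that parsing step with the Hugging Face Transformers library might look like the following; the fine-tuned checkpoint, the intent labels and the regex-based slot filling are hypothetical stand-ins for whatever you would train and build.

```python
# A hedged sketch of command parsing with a fine-tuned BERT intent classifier.
# The checkpoint name and label order are hypothetical.
import re
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "your-org/bert-ops-intents"  # hypothetical fine-tuned checkpoint
INTENTS = ["restart", "scale", "rollback"]  # must match the fine-tuned label order

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

command = "Restart the web server on node 5"
inputs = tokenizer(command, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
intent = INTENTS[int(logits.argmax(dim=-1))]

# Slot filling kept deliberately simple here: a pattern for the node number.
node_match = re.search(r"node\s+(\d+)", command, re.IGNORECASE)
action = {
    "action": intent,
    "target": "web server",  # in practice an NER or slot-filling model extracts this
    "node": int(node_match.group(1)) if node_match else None,
}
print(action)  # {'action': 'restart', 'target': 'web server', 'node': 5}
```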
AI is changing the face of product development and has become a common fixture of software and technology in general. While it’s exciting to use the cutting edge of AI, it’s also enlightening to learn how it can be used directly to reduce toil and improve business outcomes. Simple models can achieve this end by offering proven results for smaller investments, while larger models could revolutionize the SDLC at your company. The only wrong choice is not thinking about it at all and continuing with business as usual.