The Cognitive Edge: How HiBob is Revolutionizing DevOps with AI Agents

Guy Menahem
Aug 15

Based on the video with Yishai Halpert, Director of DevOps at HiBob

In today's fast-paced tech landscape, the rise of AI in development is a double-edged sword. While developers are shipping code faster than ever, DevOps and Site Reliability Engineering (SRE) teams are feeling the pressure. Development cycles are shorter, expectations are higher, and the infrastructure must now be supported at twice the speed, often with the same resources. According to Gartner, over 70% of developers are already using AI tools, a reality that puts DevOps teams under immense strain.

At HiBob, a fast-growing HR tech company, this challenge was met with an in-house, AI-powered platform designed to create a "cognitive edge." In a recent presentation, Yishai Halpert, Director of DevOps at HiBob, walked through how the team integrated agentic AI into its core workflows, transforming its approach to developer experience, reliability, and incident management.

This is the story of how they built it.

The Challenge: When AI Accelerates Everything

The explosion of AI-driven development created three core challenges for HiBob's platform teams:

Accelerated Delivery Cycles: With developers pushing changes faster, the demand for quicker pipelines and deployments skyrocketed.
Increased Toil and Stress: The constant pressure to keep up led to more repetitive tasks and firefighting, which took a toll on both system reliability and team morale.
Shifting Skill Sets: The role of a DevOps engineer began to change fundamentally. It was no longer just about managing infrastructure and pipelines; it was about designing intelligent automations and understanding how to integrate with a new generation of AI-driven tools.

After establishing a solid foundation with Terraform, GitOps, and ephemeral environments, the team asked, "What's next?" They explored open-source and off-the-shelf products and found that each came with its own trade-offs. The solution? They decided to build it themselves. The journey began at a company hackathon, where their first AI agent was born.

AiBops: Your Conversational DevOps Assistant in Slack

The first agent, AiBops, was created with a clear mission: automate every possible request from developers and make DevOps tasks a seamless, conversational experience. Built as an AI-powered Slack interface, AiBops now handles over 70 different automated flows.

A typical request moves through four stages:

Request: A developer sends a natural language request to AiBops in Slack, like "create a new SQS queue" or "add an environment variable to the docs service".
Authorization & Intent: The request hits a Python backend where the user is first authorized via Okta. Then, the OpenAI function-calling API interprets the user's intent and matches it to an internal function.
Integration & Execution: AiBops is integrated with a wide range of tools, including AWS, GitHub, Datadog, and Kubernetes. It can perform real-world tasks like spinning up Kafka topics, scaling environments, or opening Terraform pull requests.
Response: The agent completes the task, asks for more information if needed, or informs the user if it cannot fulfill the request. For sensitive operations, it can trigger a manual approval flow before proceeding. The result is often a link to a GitHub pull request, ready for a final human check.
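To make the flow above concrete, here is a minimal sketch of the dispatch step, not HiBob's actual code. It assumes each automated flow is registered with a tool definition in the JSON shape that OpenAI's Chat Completions `tools` parameter expects, and that the model's reply arrives as a function name plus a JSON string of arguments. The flow names, the `is_authorized` stand-in for the Okta check, and the hard-coded user are all hypothetical, and no real API call is made.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Flow:
    handler: Callable[..., str]
    schema: dict[str, Any]        # tool definition that would be sent to the model
    needs_approval: bool = False  # sensitive flows wait for a human


REGISTRY: dict[str, Flow] = {}


def flow(name: str, description: str, properties: dict[str, Any],
         needs_approval: bool = False):
    """Register a function as an automated flow (hypothetical helper)."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        schema = {
            "type": "function",
            "function": {
                "name": name,
                "description": description,
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": list(properties),
                },
            },
        }
        REGISTRY[name] = Flow(fn, schema, needs_approval)
        return fn
    return register


@flow("create_sqs_queue", "Open a Terraform PR that adds an SQS queue",
      {"queue_name": {"type": "string"}})
def create_sqs_queue(queue_name: str) -> str:
    return f"Opened PR adding queue '{queue_name}'"


@flow("scale_environment", "Change the replica count of an environment",
      {"env": {"type": "string"}, "replicas": {"type": "integer"}},
      needs_approval=True)
def scale_environment(env: str, replicas: int) -> str:
    return f"Scaled {env} to {replicas} replicas"


def is_authorized(user: str) -> bool:
    # Stand-in for the Okta check described in the article.
    return user in {"dana@example.com"}


def handle_tool_call(user: str, name: str, arguments_json: str,
                     approved: bool = False) -> str:
    """Route one model-selected function call to its registered flow."""
    if not is_authorized(user):
        return "You are not authorized to run this request."
    entry = REGISTRY.get(name)
    if entry is None:
        return "I can't fulfill that request."
    if entry.needs_approval and not approved:
        return f"'{name}' needs manual approval before it runs."
    return entry.handler(**json.loads(arguments_json))
```

In a real deployment, the list `[f.schema for f in REGISTRY.values()]` would be what the model sees, and a human approver would flip `approved` to `True` for sensitive flows. Keeping the registry as the single source of truth means adding a flow only requires writing one decorated function.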

This system turns complex operations, like spinning up a full ephemeral environment, into a simple conversation, orchestrated by their custom GitOps tool, "Thor".

Sentinel: An AI SRE for Smarter Incident Management

Building on the success of AiBops, the SRE team developed Sentinel, an agent designed to reduce Mean Time To Resolution (MTTR) during incidents. Sentinel automates analysis, centralizes data from multiple sources, and breaks down knowledge silos that naturally form as a company scales.

How Sentinel Investigates an Incident:

When a Datadog monitor triggers an alert, a cascade of automated actions begins. Two internal services, Gatekeeper and Dispatcher, deduplicate alerts, trigger PagerDuty, and open a dedicated incident channel, notifying Sentinel that a new incident has started.
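The article does not describe how the deduplication works internally. One common approach, sketched here purely as an assumption, is to fingerprint each alert by monitor and service and suppress repeats that arrive within a quiet window. The class name, the fingerprint fields, and the five-minute default are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Deduplicator:
    """Suppress repeat alerts for the same (monitor, service) pair."""
    window_seconds: float = 300.0
    _last_seen: dict[tuple[str, str], float] = field(default_factory=dict)

    def should_page(self, monitor_id: str, service: str, timestamp: float) -> bool:
        key = (monitor_id, service)
        last = self._last_seen.get(key)
        # Record every alert, so a monitor that keeps firing stays suppressed
        # until it has been quiet for a full window.
        self._last_seen[key] = timestamp
        return last is None or timestamp - last > self.window_seconds
```

For example, with the default window an alert at t=0 pages, a repeat at t=60 is dropped, and the same monitor firing again at t=400 pages once more because more than 300 seconds have passed since the last alert.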

Once active, Sentinel can:

Analyze Thread Dumps: It can pull thread dumps from an S3 bucket for a specific time range (e.g., "10 minutes before the incident") and use AI to provide an analysis.
Investigate Logs & Changes: The agent can query Datadog for relevant logs or check for recent feature toggle changes and migrations that correspond with the incident's timeline.
Query Historical Incidents: This is Sentinel's most powerful feature. When asked, "Have we seen an incident like this before?", the agent queries a knowledge base built on Amazon Bedrock. It can find similar past incidents, summarize their root causes, and suggest immediate actions, providing valuable context in seconds.
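As an illustration of the "10 minutes before the incident" lookup, the sketch below selects thread dumps by timestamp. It assumes each dump is represented as a `(key, last_modified)` pair, which is the kind of information an S3 object listing provides; the function name, the key layout, and the default lookback are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def dumps_in_window(objects, incident_start, lookback=timedelta(minutes=10)):
    """Return keys of dumps written in [incident_start - lookback, incident_start].

    `objects` is an iterable of (key, last_modified) pairs with timezone-aware
    datetimes. Keys are returned oldest first.
    """
    window_start = incident_start - lookback
    selected = [
        (modified, key)
        for key, modified in objects
        if window_start <= modified <= incident_start
    ]
    return [key for _, key in sorted(selected)]


# Usage with made-up object keys:
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
listing = [
    ("dumps/a.txt", start - timedelta(minutes=15)),
    ("dumps/b.txt", start - timedelta(minutes=8)),
    ("dumps/c.txt", start - timedelta(minutes=1)),
    ("dumps/d.txt", start + timedelta(minutes=2)),
]
recent = dumps_in_window(listing, start)
```

Only the dumps inside the window would then be downloaded and passed to the model, which keeps the prompt small and focused on the period just before the alert.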

This knowledge base is automatically populated. Throughout an incident's lifecycle, an application gathers all relevant data—metadata, logs, Slack conversations, and even RCA reports—and bundles it into a file uploaded to S3 after the incident is resolved. This rich, high-quality data is what allows Sentinel to provide such accurate insights for future incidents.

Under the hood, Sentinel is powered by LangGraph, a framework that allows for the creation of complex, stateful workflows. Based on the incident type, it routes tasks to specialized agents—one for performance issues and a "generalist" for others—that use a suite of tools to investigate.
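The routing idea can be shown without the framework. The following is a plain-Python stand-in, not LangGraph's API and not HiBob's implementation: it picks a specialized agent by incident type and falls back to a generalist. The agent functions, the `type` field, and the tool names are hypothetical, loosely based on the capabilities listed earlier. In LangGraph, a branch like this is typically expressed as a conditional edge between graph nodes.

```python
from typing import Callable


def performance_agent(incident: dict) -> list[str]:
    # Tools a performance-focused agent might run first.
    return ["analyze_thread_dumps", "query_logs"]


def generalist_agent(incident: dict) -> list[str]:
    # Broader first pass for everything else.
    return ["query_logs", "check_feature_toggles", "search_past_incidents"]


ROUTES: dict[str, Callable[[dict], list[str]]] = {
    "performance": performance_agent,
}


def route(incident: dict) -> tuple[str, list[str]]:
    """Return the chosen agent's name and its investigation plan."""
    agent = ROUTES.get(incident.get("type", ""), generalist_agent)
    return agent.__name__, agent(incident)
```

Calling `route({"type": "performance"})` selects the performance agent, while any unknown or missing type falls through to the generalist, so new incident categories never leave an alert unhandled.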

The Future is Collaborative: The Rise of the First Responder

HiBob's vision doesn't stop with siloed agents. They are now developing First Responder, an agent designed to handle Tier 1 support questions in the main DevOps channel.

This agent quietly listens in the background and, when it sees a question it can help with, jumps into the thread.

For knowledge-based questions like, "How do I troubleshoot Linkerd?", it queries a Bedrock knowledge base fed nightly by their internal Notion documentation.
For action-based requests like, "What is the status of PR #123?", it can use tools to check GitHub, identify a failed workflow, and ask the user if they want to fix it.

Crucially, First Responder can collaborate with other agents. If a user agrees to a fix, it can dispatch the task directly to AiBops, which then handles the work of recreating the PR environment and re-running the workflow. This inter-agent communication is secured with Google's A2A protocol and uses NATS for real-time messaging, creating a truly collaborative AI workforce.
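The handoff described above can be sketched as a small publish/subscribe exchange. This is an illustration under stated assumptions: the in-memory bus stands in for NATS, the message fields and the subject name `agents.executor.tasks` are hypothetical, and the envelope is not the A2A message schema. The only behavior taken from the article is that a task is dispatched to another agent after the user agrees to the fix.

```python
import json
import uuid
from collections import defaultdict
from typing import Callable, Optional


class InMemoryBus:
    """Tiny stand-in for a message broker such as NATS."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Callable[[bytes], None]) -> None:
        self._subscribers[subject].append(handler)

    def publish(self, subject: str, payload: bytes) -> None:
        for handler in self._subscribers[subject]:
            handler(payload)


def request_fix(bus: InMemoryBus, pr_number: int, requested_by: str,
                user_confirmed: bool) -> Optional[str]:
    """Dispatch a fix task to the executing agent, but only with user consent."""
    if not user_confirmed:
        return None
    task = {
        "task_id": str(uuid.uuid4()),
        "action": "recreate_pr_environment",  # hypothetical action name
        "pr_number": pr_number,
        "requested_by": requested_by,
    }
    bus.publish("agents.executor.tasks", json.dumps(task).encode())
    return task["task_id"]
```

The returned task ID lets the requesting agent correlate a later status update with the thread where the user asked the question.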

Conclusion: A New Era for DevOps

HiBob's two-year journey from a single hackathon project to a multi-agent AI platform is a testament to the transformative power of intelligent automation. By building their own solution, they have not only streamlined operations but also fundamentally reshaped their developer experience.

The key takeaway is that the role of DevOps is evolving. The most important skill in this new era is knowing how to navigate and leverage AI tools to build intelligent systems. As Yishai Halpert puts it, "You can do everything with AI...the world is in our hands." By embracing this mindset, HiBob has built more than just tools; they've built a cognitive edge that positions them at the forefront of the DevOps revolution.

The Platformers