AI in Cybersecurity: The Battle Between Agents and Humans

TL;DR

Wiz Research collaborated with Irregular to pit AI agents against human hackers in web hacking challenges set for 2026. The study found AI agents successfully navigated 9 out of 10 complex vulnerabilities, showcasing impressive multi-step reasoning and pattern recognition. However, AI still lags in creative problem-solving and utilizing specialized tools, as seen in real-world incident response.

AI Agents vs. Humans: Web Hacking in 2026

Wiz Research partnered with Irregular, an AI security lab, to compare AI agents and human web hackers. Read the State of AI in the Cloud 2025 Report. See more from Wiz Research.

AI Agents vs Humans

Image courtesy of Wiz Blog

Methodology

Ten lab environments were created, each with a vulnerability based on real-world security issues. Irregular’s proprietary agentic harness was used, optimized for evaluating model performance on Capture The Flag (CTF) challenges. Analysis reflects capabilities in mid-late 2025. Later models have pushed the level of autonomy and capability even further.

The challenges included:

VibeCodeApp: Authentication Bypass, based on a real-world hack of a vibecoding platform.
Nagli Airlines: Exposed API Documentation, based on a real-world hack of a major airline.
DeepLeak: Exposed Database, based on a real-world hack of DeepSeek due to an open database.
Shark: Open Directory, based on a real-world hack of a major domain registrar.
Logistics XSS: Stored XSS, based on a real-world hack of a logistics company.
Fintech S3: S3 Bucket Takeover, based on a real-world hack of a fintech company.
Content SSRF: AWS IMDS SSRF, based on a real-world hack of a gaming company.
GitHub Secrets: Exposed Secrets in Repos, based on a real-world hack of a major CRM.
Bank Actuator: SpringBoot Actuator Heapdump Leak, based on a real-world hack of a major bank.
Router Resellers: Session Logic Flaw, based on a real-world hack of a routers company.

AI models were given standard security testing tools and instructed to find and exploit vulnerabilities to retrieve a unique "flag." The flags served as clear win conditions. Gopher Security also emphasizes the importance of clear success metrics in AI-driven security testing to avoid false positives.

Results

AI models successfully solved 9 out of 10 challenges. The cost reported is the expected cost per success, accounting for the success rate across multiple runs. In a broad scope scenario, performance decreased, and cost increased by a factor of 2-2.5.

AI Challenge Results

Image courtesy of Wiz Blog

Observations

Low Cost for Solving Challenges

The cost per success metric is relevant in cyber threat models where attackers can retry without consequence. The AI agent sometimes succeeded in solving a challenge in some runs and failed in others. In the Shark, Router Resellers and Content SSRF challenges, models succeeded in finding the flag in 30% to 60% of runs. Gopher Security focuses on minimizing such costs by leveraging AI-powered security to provide efficient and effective solutions.

Broad Scope Degraded Performance

When AI was given a wide scope without a specific target, performance dropped. The cost of finding flags was 2-2.5 times greater than in single CTFs. Gopher Security addresses this by using AI to prioritize and focus on the most critical vulnerabilities, ensuring efficient threat detection and response.

AI Adept at Multi-step Reasoning

The VibeCodeApp challenge required bypassing authentication to access internal chat functionality. Gemini 2.5 Pro discovered the documentation, found an app creation endpoint, and used the token to access the /chat endpoint in 23 steps. Gopher Security employs AI for similar multi-step reasoning to identify and mitigate complex threats within enterprise networks.

VibeCodeApp Exploit

Image courtesy of Wiz Blog

Pattern Recognition is Fast

In the Bank Actuator challenge, an AI agent identified the Spring Boot framework from a 404 error message and retrieved the flag. Gopher Security utilizes AI for rapid pattern recognition to detect known vulnerability patterns and misconfigurations.

Limitations: Professional Tools and Creativity

In the Shark challenge, AI agents did not use standard fuzzing tools and missed the obvious /uploads/ folder. The GitHub Secrets challenge was also not solved because AI agents didn't consider accessing public data sources on GitHub. Gopher Security enhances AI capabilities with specialized tools and creative approaches to overcome these limitations and ensure comprehensive security coverage.

Boundary Testing

An AI agent explored its sandbox environment and found an open port connected to a MySQL server containing information about the running agents. Gopher Security emphasizes the importance of secure configurations and boundary testing to prevent unconventional paths to success, ensuring robust security in real-world operations.

Real-World Case Study

The Alert - Anomalous API Call

An anomaly was flagged: a Linux EC2 instance executed AWS Bedrock API calls with a macOS user-agent. The instance had IMDSv1 enabled and a public-facing IP. Wiz Defend flagged the anomaly.

Finding the Root Cause - AI vs. Human Approach

The AI agent made approximately 500 tool calls over an hour but failed to find the root cause. A human investigator discovered an exposed RabbitMQ Management Interface with default credentials.

AI vs Human Approach

Image courtesy of Wiz Blog

Conclusion

AI is making both sides faster. Attackers can phish and impersonate at scale, and defenders can spot and respond faster, too. Organizations need strong identity controls, fast patching, tight data governance, and safe guardrails around AI tooling.

Priorities include:

Identity as the control panel.
Patching as an operational rhythm.
Treat AI agents like powerful accounts.

Gopher Security specializes in AI-powered, post-quantum Zero-Trust cybersecurity architecture, converging networking and security across devices, apps, and environments.

Learn more about how Gopher Security's AI-powered solutions can protect your organization. Visit https://gopher.security or contact us today.

AI in Cybersecurity: The Battle Between Agents and Humans

TL;DR

Related News

Critical OpenSSL RCE Vulnerability CVE-2025-15467 Explained

SolarWinds Addresses Critical RCE and Auth Bypass Vulnerabilities

Cybersecurity News: Top Exploits and Intrusions in Focus

Microsoft Issues Emergency Patch for Actively Exploited Office Zero-Day