Recent reports describe how sophisticated threat actors use consumer AI tools like ChatGPT, Gemini, and Claude to support cyber campaigns. These tools were never designed for offensive security, yet they are increasingly part of real-world attack operations.
A Military-Style Approach to AI-Assisted Campaigns
The way attackers choose the right tool for each phase resembles a military-style operation.
- Reconnaissance: AI helps craft targeted spear-phishing messages from information gathered on LinkedIn, GitHub, and company websites to gain initial access.
- Post-breach intelligence: LLMs mimic admin workflows and use native tools (PowerShell, ServiceNow, Salesforce) to map the environment and blend in.
- Lateral movement: Authentication attempts adapt to failures, probing misconfigurations and escalating privileges to reach adjacent systems.
- Data discovery: AI helps identify high-value data (IP, trade secrets, financial records).
- Exfiltration: Data exits through trusted cloud services (OneDrive, Dropbox, AWS S3) via encrypted, legitimate-looking traffic.
The real question
The attack chain is interesting. But the real question is: why would nation states and well-funded attackers rely on the same tools we use to spell-check emails or generate memes?
Why This Shouldn’t Work
At first glance, there should be good reasons not to use these tools:
Not designed for offense
These tools are not built for offensive security operations.
Actively restricted by vendors
The companies behind these products don’t want their tools misused, and they invest significant effort in safety measures and abuse detection to prevent it.
Lack of target-specific context
Unlike purpose-built red-team tools, they’re not trained with domain-specific information.
Observable and centralized
Using popular, centrally operated services puts the actor at risk of being detected.
Risk of losing access
Malicious use violates the terms and conditions of these services, so accounts can be suspended at any time.
Given all the reasons why using these generic generative AI tools to hack others shouldn’t work, the obvious alternative would be to build bespoke ones. The military-industrial complex is known for building state-of-the-art technology for the purpose of offensive (usually ballistic) attacks against its enemies.
Nation states routinely invest tens of billions into bespoke military technology. Against that backdrop, building custom AI systems would be financially possible. Yet many actors still rely on consumer LLMs instead of developing and operating their own models.
While I don’t know whether the US military uses ChatGPT to support its operations, there are certainly plenty of offensive organisations that don’t have that kind of budget.
But It Does
Despite all these reasons why generic off-the-shelf AI tools like ChatGPT and Gemini shouldn’t be suitable for offensive security operations (or other activities the vendors deem malicious), threat actors use them anyway.
Why “Good Enough” Is Enough
The short answer is: because they can.
The more useful answer is: because consumer LLMs are often “good enough” — and they’re operationally convenient.
The key insight is that most cyber operations are not limited by access to secret exploits or exotic tooling. They are limited by cognition: the ability to process information, recognize patterns, adapt to feedback, and operate consistently over time. These are exactly the kinds of tasks modern language models excel at.
So, let’s dig into it a bit more. Why are these tools good enough for this purpose?
There are clear advantages to using tools relied on by billions of other people:
- Availability: Consumer LLMs are always on, globally accessible, and require no infrastructure or setup.
- “Good enough” performance: While not purpose-built, their speed and capability are sufficient for many stages of an attack, especially planning, automation, and content generation.
- Low friction: They are easy to use, require no specialized training, and integrate naturally into existing workflows.
- Scalability: LLMs dramatically reduce the cost and time of repetitive tasks, allowing attackers to operate at scale.
- Plausible deniability: Using tools like ChatGPT or Gemini does not inherently signal malicious intent; millions of legitimate users rely on the same services every day.
- Human amplification: These tools don’t replace skilled operators; they amplify them, making generic tools effective even in highly specialized contexts.
These advantages hold in most situations: “good enough” often meets the needs of even very specialized use cases.
What’s special about AI?
Using what’s available, even if imperfect, is often more convenient and cheaper than building something bespoke for very specific (current) needs. This applies to large language models too: training a new model is expensive and time-consuming.
Much of campaign work is generic: OSINT collection, content generation, summarization, translation, prioritization, and turning messy inputs into a plan. Consumer LLMs are strong at exactly these tasks, which reduces the need for bespoke systems.
What about domain expertise?
At first glance, this is where the argument should break down. Sophisticated cyber operations clearly require deep domain expertise: understanding systems, protocols, software stacks, and operational constraints.
But much of this expertise is not proprietary or secret. It is accumulated knowledge, patterns, and practices that are openly documented, discussed, and refined over time. Modern LLMs are trained on large volumes of publicly available technical material, including open-source code, developer documentation, standards, academic research, vulnerability disclosures, bug reports, and technical discussions from forums and Q&A sites.
As a result, they don’t just generate fluent text — they internalize common structures in how software is built, configured, broken, and fixed. This gives them a practical working knowledge of programming languages, system architectures, networking concepts, and common classes of vulnerabilities.
Crucially, much of offensive security does not depend on discovering entirely new techniques. It depends on combining known tools, misconfigurations, and behaviors in context, adapting them to a specific environment. This kind of synthesis — turning fragmented, well-known information into coherent plans and workflows — is exactly where language models perform well.
LLMs don’t replace domain experts. They compress experience, accelerate analysis, and reduce the cost of iteration. In the hands of skilled operators, that makes general-purpose models effective even in domains that appear highly specialized.
Why Use Multiple AI Tools?
There are several reasons why these organisations might use a variety of AI tools rather than running their whole operation on a single one:
- Some tools perform better for specific use cases: Running experiments can show that one tool outperforms another in a specific task or can provide the same quality for a lower price.
- Spread out the load: Not running all workloads on one system can reduce costs or improve performance.
- Stay undetected: By running smaller workloads on each of the systems, the risk of being detected might be reduced. There’s a reason cannabis farms are often detected through their electricity bills.
- Separation of information: Another motivation could be to compartmentalise the information processed and potentially visible to each provider.
Implications for defenders
The takeaway is not that attackers now possess unstoppable technology. It’s that defensive models which rely on detecting “exotic” tools or rare techniques are increasingly outdated. When attackers use the same platforms as legitimate users, intent and behavior matter more than tooling. Effective defense will depend less on identifying specific technologies, and more on detecting abnormal workflows, misuse of legitimate tools, and deviations from expected patterns.
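As a minimal illustration of the “behavior over tooling” idea, the sketch below flags accounts whose mix of otherwise-legitimate actions deviates sharply from their historical baseline. All event names and the rarity threshold are hypothetical; a real deployment would use richer features and proper statistics.

```python
from collections import Counter

def baseline_profile(events):
    """Build per-user action-frequency profiles from historical logs.
    `events` is an iterable of (user, action) pairs."""
    profiles = {}
    for user, action in events:
        profiles.setdefault(user, Counter())[action] += 1
    return profiles

def anomaly_score(profile, recent_actions):
    """Fraction of recent actions that are rare (or unseen) for this user.
    A high score means the workflow deviates from the user's norm, even if
    every individual action is legitimate."""
    total = sum(profile.values())
    if total == 0:
        return 1.0
    rare = sum(1 for a in recent_actions if profile.get(a, 0) / total < 0.01)
    return rare / max(len(recent_actions), 1)

# Hypothetical history: an account that normally only works tickets.
history = [("alice", "read_ticket")] * 200 + [("alice", "update_ticket")] * 50
profiles = baseline_profile(history)

# A burst of discovery-style actions, each harmless in isolation
# but unusual for this account:
recent = ["export_report", "list_all_users", "read_ticket", "bulk_download"]
print(f"anomaly score: {anomaly_score(profiles['alice'], recent):.2f}")
# 3 of the 4 recent actions are unseen in the baseline -> 0.75
```

The point of the sketch is that nothing in it inspects *which* tool generated the actions; it only asks whether the workflow matches the account’s expected pattern.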
Conclusion
It’s not surprising that widely available AI tools and language models are being used for seemingly specialized applications. Bespoke solutions require large upfront investment and huge amounts of computing capacity, and they don’t seem to deliver a better outcome.
While the use of generative AI allows these actors to step up their operational capability, it certainly doesn’t make them perfect or invincible.
Like any other capability, generative AI only becomes effective when embedded into a broader strategy and guided by skilled operators.
Consumer AI tools are not a shortcut to sophistication. They are an accelerant. Used well, they reduce friction, compress experience, and scale human intent. That is why they fit so naturally into modern cyber operations, and why understanding how they are used matters more than focusing on the tools themselves.
