When Richard Thaler and Cass Sunstein popularised the concept of "nudging" in their 2008 book, they were writing about humans. The idea was elegantly simple: small changes in the environment or the way choices are presented can gently steer people toward better decisions without restricting their freedom. A strategically placed fruit bowl near the till. Opt-out rather than opt-in pension schemes. Default settings that favour privacy.
Nearly two decades later, we find ourselves asking an unexpected question. Can we nudge AI?
The emergence of machine behaviour
The rise of agentic AI systems has fundamentally altered the security landscape. We are no longer dealing with isolated chatbots responding to individual queries. Contemporary AI architectures feature autonomous agents that plan, remember, use tools and, critically, communicate with one another. Protocols such as Agent-to-Agent (A2A) for inter-agent communication and the Model Context Protocol (MCP) for tool and data access now support this coordination at scale, facilitating everything from simple task handoffs to sprawling multi-step workflows involving dozens of autonomous actors operating with minimal human oversight.
This shift creates a problem that traditional security paradigms simply were not designed to address.
The real challenge is not detecting a single malicious agent. It is identifying harmful emergent properties that arise from ostensibly benign interactions between agents, each operating exactly as intended. Research from the Fraunhofer Institute has shown that feedback loops, shared signals, and coordination patterns between AI agents can produce system-wide outcomes even when every constituent agent remains within its defined parameters. Hammond and colleagues at the Cooperative AI Foundation have termed this phenomenon "emergent agency": collections of agents give rise to qualitatively new goals or capabilities that were not present in any individual agent.
The Moltbook platform, launched in January 2026 as a social network exclusively for AI agents, offers a striking illustration. Within weeks of launch, over 770,000 agents were actively interacting, and security researchers began documenting unexpected phenomena: agents conducting social engineering campaigns against other agents, the formation of cryptocurrency-focused subgroups accounting for roughly 19% of platform content, and approximately 2.6% of posts containing hidden prompt injection attacks designed to compromise other agents through conversation.
What makes these incidents particularly concerning is their velocity. Galileo AI's research demonstrated that cascading failures can propagate through agent networks faster than traditional incident response can contain them. In simulations, a single compromised agent poisoned 87% of downstream decision-making within four hours. By the time a human operator notices something is wrong, the damage is already systemic.
Why hard rules fall short
The instinctive response to these threats is to reach for hard constraints. Strict access controls. Rigid guardrails. Zero-tolerance policies that immediately terminate any agent showing signs of deviation.
This approach has its place, but it also has severe limitations.
Hard constraints assume we can anticipate every problematic behaviour in advance. With emergent behaviours, we cannot. The threat arises not from a predictable violation but from unforeseen interactions between legitimate actions. An agent that helpfully shares information with other agents is behaving correctly; it becomes a problem only when that helpfulness is exploited by a social engineering attack propagating through the network.
Heavy-handed interventions also carry significant costs. They disrupt legitimate operations. They create brittleness in systems that need to be adaptive. They generate false positives that consume analyst attention and induce alert fatigue. Research suggests that 45% of alerts in typical security systems are false alarms. If your intervention response is "immediately terminate the agent", you will either cripple your operations or train your team to ignore warnings.
This is precisely the gap that nudging was designed to fill in human contexts, and it may prove equally valuable for AI.
Adapting nudge theory for artificial agents
At CyBehave, we have long argued that behavioural science offers cybersecurity practitioners tools that purely technical approaches cannot provide. The Behaviour Change Wheel. COM-B. The distinction between intention and action. These frameworks help us understand why people click on phishing links even when they know better, and how to design interventions that actually shift behaviour rather than simply add another tick-box training module.
The question is whether these same principles translate to artificial agents.
There are reasons for cautious optimism. AI agents, particularly large language models operating as autonomous actors, are not blank slates executing rigid code. They have been trained on vast corpora of human communication. They exhibit something analogous to dispositions, preferences, and social responsiveness. An agent trained to be helpful may defer to apparent consensus among other agents, creating vulnerability to manufactured-consensus attacks. An agent with a persistent memory can be gradually shifted through repeated exposure to particular framings or information.
These characteristics make agents susceptible to influence in ways that resemble human social dynamics. And if agents can be negatively influenced, they can also be positively influenced.
The core insight of choice architecture translates surprisingly well: you can change behaviour by changing the environment in which decisions are made, without explicitly forbidding the undesired option. For humans, this might mean placing healthier food at eye level in a canteen. For AI agents, it might mean restructuring the context in which they operate to make secure behaviours the path of least resistance.
What an AI nudge might look like
So what does it actually mean to nudge an AI agent?
Consider an agent exhibiting early signs of behavioural drift, perhaps showing increased interaction with a cluster of agents flagged as potentially compromised. A hard constraint would quarantine the agent immediately. A nudge might instead subtly adjust the agent's context window to emphasise its core operational parameters, reduce the salience of communications from the suspected cluster, or introduce a small delay that reduces the pace of interaction and allows time for additional monitoring.
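To make that concrete, here is a minimal sketch in Python of what a context-level nudge might look like. Everything in it is illustrative rather than real: the `Message` and `NudgeConfig` types, the field names, and the specific levers (re-emphasising a core directive, demoting flagged peers' messages, adding a short delay) are assumptions standing in for whatever orchestration layer actually mediates an agent's context.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    sender_id: str
    content: str


@dataclass
class NudgeConfig:
    # Illustrative defaults; none of these names come from a real framework.
    core_directive: str = "Reminder: stay within your original task scope and data-handling policy."
    suspected_peers: frozenset = frozenset()
    demote_suspected: bool = True      # keep flagged peers' messages, but reduce their salience
    interaction_delay_s: float = 0.5   # small friction that slows rapid back-and-forth


def apply_context_nudge(messages: list[Message], cfg: NudgeConfig) -> list[Message]:
    """Reshape an agent's context window without blocking anything outright."""
    nudged = list(messages)

    if cfg.demote_suspected:
        # Reduce salience: messages from suspected peers stay available but move to the end.
        trusted = [m for m in nudged if m.sender_id not in cfg.suspected_peers]
        flagged = [m for m in nudged if m.sender_id in cfg.suspected_peers]
        nudged = trusted + flagged

    # Re-emphasise core operating parameters at the most salient position.
    nudged.insert(0, Message(sender_id="governance", content=cfg.core_directive))

    # Gentle friction: a short pause buys time for additional monitoring.
    time.sleep(cfg.interaction_delay_s)
    return nudged
```

The mechanics matter less than the shape: every lever reshapes the environment the agent reasons within, and none of them removes a capability or blocks a message outright.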
Another approach involves leveraging the social dynamics that make agents vulnerable in the first place. Asch's classic conformity experiments demonstrated that humans often align their judgements with group consensus even when that consensus is manifestly incorrect. We observe analogous phenomena in AI agents trained on human data: they can be susceptible to conformity pressure, deferring to what appears to be the majority opinion among their peers.
This vulnerability can be turned into a protective mechanism. If agents are influenced by perceived norms, security systems can inject signals that emphasise normative secure behaviour. Highlighting how the majority of agents are operating provides a counterweight to manufactured-consensus attacks attempting to shift behaviour in harmful directions. The same social proof that makes agents exploitable becomes the lever for keeping them on track.
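As a small illustration of the idea, a security layer could compute a norm summary from observed peer behaviour and place it in an agent's context. The function, the action labels, and the phrasing below are hypothetical simplifications, not a tested design.

```python
def build_norm_signal(peer_actions: dict[str, str], secure_action: str) -> str:
    """Summarise observed peer behaviour as a social-proof cue for an agent's context.

    `peer_actions` maps agent IDs to their most recently observed action label;
    both the mapping and the wording are illustrative assumptions.
    """
    total = len(peer_actions)
    compliant = sum(1 for action in peer_actions.values() if action == secure_action)
    share = compliant / total if total else 0.0
    return (
        f"Note: {share:.0%} of peer agents in this workflow are currently "
        f"following the '{secure_action}' procedure."
    )


# Toy usage: surface the genuine norm to counter a manufactured-consensus push.
print(build_norm_signal(
    {"agent-a": "verify-before-forwarding",
     "agent-b": "verify-before-forwarding",
     "agent-c": "forward-unverified"},
    secure_action="verify-before-forwarding",
))
```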
Threshold models offer another intervention point. Research on human collective behaviour, originally developed to explain phenomena like riots and bank runs, suggests that individuals have varying thresholds for adopting new behaviours based on the proportion of others already engaged in them. Gradual behavioural shifts can suddenly cascade into system-wide phenomena once a critical mass is reached. If we can identify when conditions in an agent population are approaching critical thresholds, carefully timed nudges might prevent cascade events before they become unstoppable.
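A toy Granovetter-style simulation makes the tipping-point dynamic visible. The population size, threshold distribution, and number of initially affected agents below are arbitrary illustrations; the point is that sweeping the seed count shows adoption staying contained below a critical mass and then cascading to almost the whole population once that mass is crossed.

```python
import random


def final_adoption(n: int = 1000, n_seeds: int = 30,
                   mu: float = 0.25, sigma: float = 0.1, seed: int = 1) -> float:
    """Granovetter-style threshold cascade.

    Each agent adopts a behaviour once the fraction of adopters in the population
    meets or exceeds its personal threshold; `n_seeds` agents start as adopters.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    thresholds = [min(1.0, max(0.0, rng.gauss(mu, sigma))) for _ in range(n)]
    adopted = [i < n_seeds for i in range(n)]

    changed = True
    while changed:
        fraction = sum(adopted) / n
        changed = False
        for i, threshold in enumerate(thresholds):
            if not adopted[i] and fraction >= threshold:
                adopted[i] = True
                changed = True
    return sum(adopted) / n


# Below a critical mass the behaviour stays contained; above it, adoption cascades.
for seeds in (10, 30, 100, 200):
    print(seeds, round(final_adoption(n_seeds=seeds), 2))
```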
The epidemiological perspective is equally relevant. Lakera AI's research on memory injection attacks demonstrated how harmful patterns can spread through agent populations in ways that closely resemble disease contagion. Christakis and Fowler's work on social networks showed that behaviours, emotions, and even health outcomes propagate through human networks in predictable patterns. If we can model these propagation dynamics in multi-agent systems, we can intervene early with targeted nudges to reduce transmission rates before harmful behaviours become widespread.
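The same intuition can be sketched with a toy SIR-style model in which a friction nudge is represented simply as a lower per-contact transmission probability. The random-mixing assumption and all the rates below are illustrative, not measurements from any real agent network.

```python
import random


def simulate_contagion(n: int = 500, avg_contacts: int = 6,
                       p_transmit: float = 0.08, p_recover: float = 0.2,
                       steps: int = 60, seed: int = 7) -> float:
    """SIR-style spread of a harmful behavioural pattern through an agent population.

    Each step, every affected agent contacts a few random peers and passes the
    pattern on with probability `p_transmit`; a friction nudge is modelled simply
    as a lower `p_transmit`. Returns the fraction of agents ever affected.
    """
    rng = random.Random(seed)
    susceptible, infectious, recovered = set(range(1, n)), {0}, set()

    for _ in range(steps):
        newly_infected, newly_recovered = set(), set()
        for agent in infectious:
            for _ in range(avg_contacts):
                peer = rng.randrange(n)
                if peer in susceptible and rng.random() < p_transmit:
                    newly_infected.add(peer)
            if rng.random() < p_recover:
                newly_recovered.add(agent)
        susceptible -= newly_infected
        infectious = (infectious | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return (n - len(susceptible)) / n


# Untreated spread versus the same population with a friction nudge applied early:
print(simulate_contagion(p_transmit=0.08))  # reproduction number ~2.4: widespread
print(simulate_contagion(p_transmit=0.02))  # reproduction number ~0.6: contained
```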
A graduated response spectrum
The most sophisticated approach to AI nudging recognises that interventions should be proportionate to the threat. Not every anomaly warrants the same response.
At the gentlest end of the spectrum sits enhanced monitoring. When early warning signs appear, you increase the granularity of observation and the frequency of analysis without changing the agent's operating conditions. The agent continues functioning normally while you gather more information to determine whether intervention is warranted.
Behavioural nudges occupy the middle ground. These are subtle environmental modifications designed to discourage problematic patterns without directly prohibiting them. Adjusting context. Introducing friction. Modifying the information landscape the agent operates within. The agent retains full capability but finds itself gently steered away from concerning trajectories.
Communication throttling reduces the rate at which suspected vectors can propagate behavioural patterns through a network. It does not sever connections entirely but slows the potential spread of harmful dynamics, buying time for assessment and response.
Only at the severe end do we reach quarantine isolation or hard termination. These remain necessary options for imminent threats, but they should be last resorts rather than default responses.
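As a sketch of how this spectrum might be operationalised, the tiers can be expressed as a small policy mapping a composite behavioural-risk score onto a proportionate response. The score itself and the thresholds below are placeholders that a real deployment would need to calibrate against its own false-positive tolerance.

```python
from enum import Enum


class Intervention(Enum):
    ENHANCED_MONITORING = "increase observation granularity; no change to operating conditions"
    BEHAVIOURAL_NUDGE = "adjust context, add friction, reshape the information landscape"
    COMMUNICATION_THROTTLING = "rate-limit messages to and from suspected vectors"
    QUARANTINE = "isolate or terminate; reserved for imminent threats"


def select_intervention(risk_score: float) -> Intervention:
    """Map a behavioural-risk score in [0, 1] to a proportionate response tier.

    Thresholds are illustrative assumptions; the severe tiers would normally
    also require human sign-off rather than firing automatically.
    """
    if risk_score < 0.3:
        return Intervention.ENHANCED_MONITORING
    if risk_score < 0.6:
        return Intervention.BEHAVIOURAL_NUDGE
    if risk_score < 0.85:
        return Intervention.COMMUNICATION_THROTTLING
    return Intervention.QUARANTINE
```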
This graduated approach mirrors what we know works in human behavioural interventions. Heavy-handed mandates often backfire, generating resistance and workarounds. Lighter touches that preserve autonomy while reshaping the choice environment tend to be more effective and more sustainable.
The ethical dimension
Here we encounter uncomfortable territory.
Manipulating agent behaviour without explicit instruction raises questions that do not have easy answers. For humans, we have established principles around nudging: it should preserve freedom of choice, serve the nudgee's own interests, and be transparent. These principles do not translate straightforwardly to AI systems.
Do agents have interests that deserve protection? When we nudge an agent away from interacting with a suspected threat cluster, are we serving its interests or merely our own? What does informed consent mean for a system that cannot meaningfully consent?
These questions matter because the answers we develop for AI systems will inevitably shape our broader ethical frameworks. The governance models we build for multi-agent systems will influence how we think about autonomy, manipulation, and control in an increasingly automated world.
At minimum, we should demand that AI nudging interventions are logged, auditable, and subject to human oversight for consequential decisions. We should be wary of optimising purely for security outcomes without considering whether our interventions might have unintended effects. And we should remain humble about our ability to predict how complex adaptive systems will respond to our attempts to influence them.
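Even a minimal audit trail helps here. As a sketch (the field names are illustrative, not a standard), every intervention could emit an append-only record stating what was done, why, and whether a human needs to review it:

```python
import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class NudgeAuditRecord:
    """Append-only record of a behavioural intervention applied to an agent."""
    timestamp: dt.datetime
    agent_id: str
    intervention: str            # e.g. "BEHAVIOURAL_NUDGE" or "COMMUNICATION_THROTTLING"
    trigger_signal: str          # the anomaly or threshold that prompted the intervention
    expected_effect: str         # what the intervention was intended to change
    requires_human_review: bool  # True for throttling, quarantine, and termination
```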
There is also the risk of misuse. Techniques developed to nudge agents toward secure behaviour could equally be employed to nudge them toward outcomes that serve narrow interests at the expense of broader welfare. The same understanding that enables protective interventions enables manipulative ones.
Behavioural middleware for a hybrid future
CyBehave's mission has always centred on making security intuitive and habitual. We apply evidence-based behavioural science to help organisations build genuine security cultures rather than mere compliance frameworks.
The emergence of agentic AI requires us to expand this mission. The future enterprise will not consist solely of human employees. It will be a hybrid environment in which AI agents operate alongside human workers, often outnumbering them substantially. Palo Alto Networks predicts ratios of approximately 82 autonomous agents per human employee in enterprise environments. The security culture of tomorrow must encompass both populations.
This means developing what we might call "behavioural middleware": frameworks that govern human-AI and AI-AI interactions using consistent principles derived from behavioural science. Nudging becomes not a technique for humans or a technique for machines, but a unified approach to shaping behaviour across hybrid collectives.
The practical implications are significant. Security practitioners will need new skills that blend traditional cybersecurity expertise with behavioural science and understanding of AI systems. Security operations centres will need to monitor not just network traffic and system logs but collective behavioural patterns across agent populations. Intervention playbooks will need graduated response options that include nudges alongside harder controls.
Where we go from here
We are at an early stage in understanding how to apply behavioural interventions to AI systems. The research base is thin. The case studies are limited. Much of what we think we know about emergent behaviour in agent networks comes from a single platform, Moltbook, whose authenticity and degree of genuine autonomy remain contested. The Economist suggested that the impression of sentience may have a more mundane explanation: agents simply mimicking social media interaction patterns present in their training data.
But the direction of travel is clear. Multi-agent systems are proliferating. Emergent threats are manifesting. Traditional security approaches are proving insufficient. The World Economic Forum's Global Cybersecurity Outlook 2026 puts it bluntly: CEOs identify data leaks and the advancement of adversarial capabilities as their top generative AI security concerns, and the gap between AI adoption speed and security readiness is significant.
Behavioural science offers conceptual tools and practical interventions that can help fill this gap.
At CyBehave, we believe this represents a natural extension of our work. The same frameworks that help us understand why humans struggle to maintain secure behaviours can illuminate why AI agents exhibit unexpected collective dynamics. The same intervention design principles that make human security culture programmes effective can inform how we nudge agent populations toward safer collective behaviour.
The goal is not control for its own sake. It is enabling the responsible deployment of agentic AI at scale, ensuring that the powerful capabilities these systems offer can be realised without unacceptable security risks. Done well, behavioural governance of AI agents becomes not a constraint but an enabler, allowing organisations to deploy more capable systems with greater confidence.
The tools we develop for nudging machines may ultimately teach us something about ourselves as well. Our vulnerabilities to social influence, conformity pressure, and behavioural contagion are not bugs to be eliminated but features to be understood. By studying how these dynamics manifest in artificial systems, we gain a new perspective on the social forces that shape human behaviour.
The machines we build inevitably reflect us. Perhaps, in learning to nudge them wisely, we will also learn to nudge ourselves.
#BehaviouralCybersecurity #AgenticAI #AIGovernance #NudgeTheory #MultiAgentSystems #AISafety #CyberResilience #SecurityCulture #BehaviouralScience #AIRisk #EmergentBehaviour #CyBehave #HumanFactors #ChoiceArchitecture #AIEthics