AI Agents Just Became Your Cheapest Employee

TL;DR

OpenAI's GPT-5.4 just scored higher than the human baseline on a benchmark that simulates real office work: opening apps, filling in spreadsheets, moving data between systems. Meanwhile, Anthropic's leaked source code reveals an always-on AI agent that works while you sleep. For SMEs, this isn't science fiction anymore. An AI agent costing a few hundred quid a month can now handle work that used to require hiring someone. But only if you know where to point it.

A Machine Just Beat You at Your Own Job

Last week, OpenAI released GPT-5.4. Normally I'd skim the announcement and move on — new models drop every few months and the benchmarks blur together. But this one stopped me.

They tested it against a benchmark called OSWorld-V, which simulates actual desktop productivity tasks. Not trivia questions or maths puzzles. Real work — open this spreadsheet, find the relevant rows, copy the data into a presentation, format it correctly. The kind of stuff someone in your office does for three hours every Tuesday.

GPT-5.4 scored 75%. The human baseline is 72.4%.

Read that again. The AI didn't just pass the test. It beat the human baseline at routine office work by 2.6 percentage points. Not a huge margin, and not at everything. But the line has been crossed. And it won't uncross itself.

Then Anthropic Accidentally Showed Us What's Coming

The same week, Anthropic had an embarrassing leak. Their source code for Claude Code — the developer tool I use daily — got published through a packaging error. Buried in the code was something called KAIROS: a persistent AI agent designed to run in the background 24/7.

Not “run when you ask it a question.” Run continuously. Watching your projects. Checking for errors. Taking action on things you set up rules for. And when you're asleep, it has a feature called autoDream that consolidates everything it learned that day — merging observations, resolving contradictions, updating its own memory.

Anthropic hasn't officially launched it yet. But the code is written, tested, and staged for rollout. This isn't a concept. It's a product waiting to be switched on.

I've been building exactly this kind of always-on agent for my own consultancy — an operations system that monitors client deployments, catches errors, and alerts me before anyone notices. The difference is I built mine manually over weeks. KAIROS suggests this will soon be something you configure in an afternoon.

What This Actually Means for a 20-Person Company

Forget the enterprise stats for a moment. Yes, McKinsey now runs 25,000 AI agents alongside their 40,000 human consultants. Yes, 75% of businesses plan to deploy agents by year-end, according to Deloitte. Those numbers are interesting, but they describe a different world from the one most of my clients operate in.

Here's what matters at your scale: an AI agent that costs a few hundred pounds a month can now reliably do work that previously required a person. Not all work. Not creative strategy or complex client relationships. But the repetitive, multi-step, cross-system tasks that eat up someone's week.

I see this with my clients constantly. A property management company was paying a part-time admin to process tenant applications — pulling data from forms, checking references, updating their CRM, emailing confirmations. We replaced the manual steps with a workflow agent. The process went from 25 minutes per application to 3 minutes of human review. That admin now spends their time on work that actually requires judgement.

A jewellery e-commerce client had someone manually updating product availability across three platforms whenever stock changed. An agent now watches the inventory database, detects changes, and pushes updates within minutes. No human in the loop. The stock discrepancies that used to generate customer complaints dropped to nearly zero.
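To make the inventory example concrete, here is a minimal sketch of the detect-and-push idea, not the client's actual system. It assumes stock can be read as a simple SKU-to-quantity mapping, and the names detect_changes, sync_once and the three "shop" pushers are hypothetical stand-ins for whatever each platform's update call looks like.

```python
from typing import Callable, Dict

Stock = Dict[str, int]  # SKU -> units available (hypothetical data shape)
Pusher = Callable[[str, int], None]  # stand-in for one platform's update call


def detect_changes(previous: Stock, current: Stock) -> Stock:
    """Return the SKUs whose quantity differs from the previous snapshot."""
    changes = {sku: qty for sku, qty in current.items() if previous.get(sku) != qty}
    # A SKU that vanished from the inventory is reported as zero stock.
    for sku, old_qty in previous.items():
        if sku not in current and old_qty != 0:
            changes[sku] = 0
    return changes


def sync_once(previous: Stock, current: Stock, pushers: Dict[str, Pusher]) -> Stock:
    """Push every detected change to every platform; return what was pushed."""
    changes = detect_changes(previous, current)
    for sku, qty in changes.items():
        for push in pushers.values():
            push(sku, qty)
    return changes


if __name__ == "__main__":
    sent = []  # records (platform, sku, qty) instead of calling real APIs
    pushers = {
        name: (lambda sku, qty, name=name: sent.append((name, sku, qty)))
        for name in ("shop_a", "shop_b", "shop_c")
    }
    before = {"ring-01": 3, "necklace-02": 1}
    after = {"ring-01": 2, "necklace-02": 1}
    print(sync_once(before, after, pushers))
    print(sent)
```

Only the items that actually changed get pushed, which is what keeps a loop like this cheap enough to run every few minutes. A real deployment would also need retries and logging for the times a platform's API is down.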

The Three Jobs Agents Are Already Good At

Not everything should be handed to an AI agent. After building these systems for the past year, the pattern is clear. Agents excel at work that is:

Repetitive and rule-based — if you can write a checklist for it, an agent can follow it. Invoice processing, data entry, status updates, compliance checks. The more predictable the task, the better the agent performs.
Cross-system — humans are slow at switching between tools. An agent doesn't care whether it's pulling data from a spreadsheet, checking an API, or writing to a database. It does all three in the same breath. This is where the biggest time savings come from.
Time-sensitive but not urgent enough for a person to watch — monitoring, alerting, scheduled reporting. The stuff that falls through cracks because nobody's job is to sit and watch a dashboard all day.

Where agents still struggle: anything requiring genuine relationship management, nuanced judgement about ambiguous situations, or creative problem-solving that doesn't follow patterns. A salon client of mine uses an agent for booking confirmations and reminders, but the actual conversation with a nervous bride about her wedding-day hair? That stays human. Rightly so.

Why Most SMEs Still Haven't Done This

If agents are this capable, why isn't every small business using them? Three reasons, all fixable.

The tooling was too technical until recently. Building an agent even 18 months ago meant writing code, managing APIs, handling authentication, and debugging failures manually. That's changing fast. Tools like n8n (a workflow automation platform) and Make let you wire up agent-like systems visually. Claude and GPT both have built-in tool-use capabilities that didn't exist two years ago. You still need some technical knowledge — or a consultant — but the barrier has dropped dramatically.

People don't know what to automate. This is the bigger problem. Most business owners I talk to either think they need to automate everything or nothing. The answer is usually one or two specific processes that are painful, repetitive, and well-defined. Finding those is the real skill.

Trust. Handing work to an AI agent feels risky. What if it makes a mistake? What if it sends the wrong email? This is legitimate. The answer isn't blind trust — it's starting with a human-in-the-loop setup where the agent prepares and proposes, but a person approves. Every agent I've built for clients starts this way. You earn trust incrementally, then loosen the reins.
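The "agent proposes, person approves" setup can be sketched in a few lines. This is an illustration under my own assumptions rather than any particular tool's API: ProposedAction and ApprovalQueue are hypothetical names, and the execute callback stands in for whatever the agent would do (send the email, update the CRM).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProposedAction:
    description: str              # what the reviewer sees
    execute: Callable[[], str]    # the side effect, deferred until approval
    status: str = "pending"       # pending -> approved or rejected
    result: str = ""


@dataclass
class ApprovalQueue:
    items: List[ProposedAction] = field(default_factory=list)

    def propose(self, description: str, execute: Callable[[], str]) -> ProposedAction:
        """Agent side: queue an action without running it."""
        action = ProposedAction(description, execute)
        self.items.append(action)
        return action

    def pending(self) -> List[ProposedAction]:
        """Human side: everything still waiting for a decision."""
        return [a for a in self.items if a.status == "pending"]

    def review(self, action: ProposedAction, approved: bool) -> None:
        """Human side: run the action only if it was approved."""
        if action.status != "pending":
            raise ValueError("action already reviewed")
        if approved:
            action.result = action.execute()
            action.status = "approved"
        else:
            action.status = "rejected"


if __name__ == "__main__":
    queue = ApprovalQueue()
    draft = queue.propose("Email confirmation to applicant", lambda: "email sent")
    print(len(queue.pending()))   # nothing has been sent at this point
    queue.review(draft, approved=True)
    print(draft.status, draft.result)
```

The point of the design is that nothing with side effects runs at proposal time. Loosening the reins later can be as simple as auto-approving the action types that have a clean track record.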

The maths that matters: if an agent saves one person 8 hours a week on data entry, cross-system updates, and reporting, and that agent costs you a few hundred pounds a month to run, you're getting back a full working day every week. That's not a marginal improvement. For a small team, that's transformational.
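If you want to run that sum for your own team, here is a rough sketch. The £300 a month is my assumed stand-in for "a few hundred pounds", not a quoted price, and the calculation deliberately ignores setup and maintenance time, so treat the result as a floor rather than a forecast.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def hours_saved_per_month(hours_per_week: float) -> float:
    """Convert weekly hours saved into an average month."""
    return hours_per_week * WEEKS_PER_YEAR / MONTHS_PER_YEAR


def break_even_hourly_cost(agent_cost_per_month: float, hours_per_week: float) -> float:
    """Hourly staff cost above which the agent pays for itself."""
    return agent_cost_per_month / hours_saved_per_month(hours_per_week)


if __name__ == "__main__":
    monthly_cost = 300.0  # assumption: stand-in for "a few hundred pounds"
    weekly_hours = 8.0    # figure from the example above
    print(round(hours_saved_per_month(weekly_hours), 1))                 # 34.7
    print(round(break_even_hourly_cost(monthly_cost, weekly_hours), 2))  # 8.65
```

Under those assumptions the agent breaks even if the person's time costs you more than about £8.65 an hour. Swap in your own figures before drawing conclusions.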

What I'd Actually Do This Month

You don't need to wait for KAIROS or GPT-6 or whatever drops next. The capabilities are already here. What you need is a clear target.

Audit your team's week. Ask everyone: what do you do repeatedly that involves moving information between systems? What do you do that feels like it should be automatic? You'll hear the same three or four things.

Pick the most painful one. Not the most complex. Not the most strategic. The one that makes someone groan on Monday morning. That's your first agent project.

Start with the boring version. Don't try to build an autonomous system that handles everything. Build a simple workflow that handles the repetitive middle — the data extraction, the system updates, the notifications. Keep humans at the decision points. You can remove them later once you trust the output.

The businesses that are going to benefit most from this shift aren't the ones with the biggest AI budgets. They're the ones that identify the right process, start small, and expand from there. An agent that handles one thing well is worth more than a grand AI strategy that never ships.

The cheapest employee you'll ever hire doesn't need a desk, doesn't call in sick, and just edged past the human baseline at routine office work. The only question is what you're going to have it do first.