Every founder we speak to in 2026 has the same backlog: a dozen workflows that should be AI-assisted, and one engineer who is already overloaded. The question is whether to hire an in-house AI engineer or outsource the buildout. This is the framework we use with our clients.
What does "AI automation" actually mean in 2026?
It splits into three buckets that have very different cost and risk profiles:
- Workflow automation with LLM steps — n8n / Make / Zapier with an OpenAI or Anthropic call in the middle. Customer support triage, lead enrichment, invoice extraction.
- Custom AI agents — multi-step agents with tools, memory and evaluation. Sales BDR agents, internal research agents, document review.
- LLM features inside your product — RAG search, AI-assisted onboarding, semantic filters. These ship inside your codebase, not as a sidecar.
The decision to outsource or hire in-house depends heavily on which bucket you are in.
When does in-house win?
Hiring an in-house AI engineer is the right call when at least three of these are true:
- You are in bucket 3 (LLM features inside the product) and they will be a permanent surface area.
- Your AI quality is a competitive moat — eval data and prompt iteration belong inside the company.
- You have 9-12 months of runway or more and can absorb a 60-90 day ramp.
- You already have one strong engineering manager who can review AI work credibly.
- You expect the AI surface to grow into a team of 3+ in the next 18 months.
If those are true, an in-house senior AI engineer at USD 180k-240k base in the US (or USD 60k-90k all-in if you hire offshore) pays back inside year one.
When does outsourcing win?
Outsourcing is the right call when at least three of these are true:
- You are in bucket 1 or 2 (workflow automation or sidecar agents) and the work is project-shaped, not surface-shaped.
- You need it shipped inside 6-10 weeks and cannot wait 90 days for a hire to ramp.
- The work spans skill areas a single hire would not cover — prompt engineering, infra, vector stores, evals, and frontend glue.
- You are still validating which workflows are worth automating. Outsourcing lets you ship three experiments for the cost of one hire.
- You do not yet have an in-house engineering manager who can credibly review LLM work.
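Both checklists reduce to a simple count. Here is a minimal sketch in Python, assuming you answer each item as true or false. The key names are hypothetical labels invented for this example, and the "mixed" result for the case where both lists clear the threshold is our assumption rather than part of the framework.

```python
from typing import Iterable, Mapping

# Hypothetical labels for the five in-house signals listed above.
IN_HOUSE_SIGNALS = (
    "llm_features_in_product",
    "ai_quality_is_moat",
    "runway_covers_ramp",
    "manager_can_review_ai",
    "expect_team_of_3_plus",
)
# Hypothetical labels for the five outsourcing signals listed above.
OUTSOURCE_SIGNALS = (
    "project_shaped_work",
    "need_ship_in_6_to_10_weeks",
    "spans_multiple_skills",
    "still_validating_workflows",
    "no_manager_to_review_llm",
)
THRESHOLD = 3  # "at least three of these are true"


def count_true(answers: Mapping[str, bool], keys: Iterable[str]) -> int:
    """Count how many of the given checklist items were answered True."""
    return sum(1 for key in keys if answers.get(key, False))


def recommend(answers: Mapping[str, bool]) -> str:
    """Return 'in-house', 'outsource', 'mixed' or 'neither'."""
    in_house = count_true(answers, IN_HOUSE_SIGNALS) >= THRESHOLD
    outsource = count_true(answers, OUTSOURCE_SIGNALS) >= THRESHOLD
    if in_house and outsource:
        return "mixed"  # assumption: both lists cleared the threshold
    if in_house:
        return "in-house"
    if outsource:
        return "outsource"
    return "neither"


if __name__ == "__main__":
    print(recommend({
        "project_shaped_work": True,
        "need_ship_in_6_to_10_weeks": True,
        "still_validating_workflows": True,
    }))
```

A "mixed" or "neither" result is a prompt to talk it through, not a verdict.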
What does it actually cost either way?
| Approach | Year-1 cost (USD) | Time to first ship | Risk profile |
|---|---|---|---|
| In-house senior (US) | 220,000 - 320,000 | 10-14 weeks | Hiring risk + retention risk |
| In-house senior (offshore direct) | 72,000 - 120,000 | 10-14 weeks | Same risks, lower stakes |
| Outsourced project (3 workflows) | 35,000 - 65,000 | 4-6 weeks | Vendor lock + handover risk |
| Outsourced retainer (ongoing) | 72,000 - 144,000 | 2-3 weeks | Scope drift if not capped |
What is the hybrid play that actually works?
Two-thirds of our AI clients run the same pattern: outsource the first 2-3 workflows to learn what is worth automating, then hire in-house once a clear winner emerges. We have shipped this pattern with founders who later hired the engineer who replaced us — and that is fine. Our job is to make their product better, not to keep the retainer.
The reverse pattern — hire first, then outsource overflow — almost always wastes 2-3 months because the hire gets stuck on the first workflow and the rest of the backlog stalls.
What are the hidden ongoing costs of an AI automation?
The upfront build cost is only part of what an AI automation costs over its lifetime. The line items founders miss:
- Token spend — production volume scales the model bill. For a customer-support agent handling 800 conversations/day on a GPT-4-class model, budget USD 600-1,400/month at 2026 rates.
- Vector database storage — Pinecone, Weaviate or pgvector add USD 0-300/month depending on scale.
- Eval re-runs after every prompt change — each golden-set run is a few dollars of inference. If you change prompts weekly, this is real.
- Human-in-the-loop reviewers — 1-2 hours of someone's time per week to review sampled output. This is the line item founders underestimate the most.
- Model deprecation — OpenAI and Anthropic deprecate models every 6-12 months. Budget at least one engineer-week per year for model migrations.
Together, ongoing run costs typically land at 15-25% of the original build cost per year. Plan for it in the original proposal, not as a surprise in month four.
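Token spend is the easiest of these line items to estimate before you build. Here is a rough sketch in Python, assuming a simple per-token pricing model. The token counts and per-million-token prices are placeholders, not any provider's actual rates, so substitute the numbers from your own logs and your provider's price list.

```python
def monthly_token_cost(
    conversations_per_day: int,
    input_tokens: int,           # average prompt + context tokens per conversation
    output_tokens: int,          # average generated tokens per conversation
    usd_per_m_input: float,      # placeholder price per 1M input tokens
    usd_per_m_output: float,     # placeholder price per 1M output tokens
    days_per_month: int = 30,
) -> float:
    """Estimate the monthly model bill in USD for one workflow."""
    cost_per_conversation = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return conversations_per_day * days_per_month * cost_per_conversation


if __name__ == "__main__":
    # 800 conversations/day as in the support-agent example above;
    # every other number here is a hypothetical input.
    estimate = monthly_token_cost(
        conversations_per_day=800,
        input_tokens=8_000,
        output_tokens=1_000,
        usd_per_m_input=3.0,
        usd_per_m_output=15.0,
    )
    print(f"Estimated token spend: USD {estimate:,.0f}/month")
```

Re-run the estimate whenever average context length changes, because retrieved documents and conversation history usually dominate the input token count.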
How do you evaluate whether the AI automation actually works?
This is the part both in-house and outsourced teams underinvest in. Set this up before you write a single prompt:
- Golden set — 50-200 real examples with expected outputs. No LLM project ships without this.
- Eval harness — automated scoring against the golden set, ideally with both rule-based and LLM-as-judge scoring.
- Production sampling — log a random 1-5% of production runs into a review queue.
- Weekly review — 30 minutes, founder + engineer, walk through 10 sampled cases.
If a vendor pitches an AI buildout and does not mention evals, walk away. We have rescued three projects in the last year where the previous vendor shipped a "working" agent that was correct 40% of the time and nobody knew because there was no harness.
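To make the golden set, harness and production sampling concrete, here is a minimal sketch in Python. It assumes your automation can be called as a plain function from input text to output text. `stub_triage`, the example tickets and the 2% default sampling rate are all hypothetical. Only a rule-based scorer is shown; an LLM-as-judge scorer would be passed in as another `scorer` callable.

```python
import hashlib
from typing import Callable, Dict, Sequence


def exact_match(output: str, expected: str) -> bool:
    """Rule-based scorer: case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def run_golden_set(
    predict: Callable[[str], str],
    golden_set: Sequence[Dict[str, str]],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of golden examples the system gets right."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for ex in golden_set if scorer(predict(ex["input"]), ex["expected"])
    )
    return passed / len(golden_set)


def should_sample(run_id: str, rate: float = 0.02) -> bool:
    """Deterministically pick roughly `rate` of production runs for review."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def stub_triage(text: str) -> str:
    """Hypothetical stand-in for the real LLM call."""
    return "billing" if "invoice" in text.lower() else "other"


# Hypothetical golden examples; a real set would hold 50-200 real cases.
GOLDEN = [
    {"input": "Where is my invoice?", "expected": "billing"},
    {"input": "The app crashes on login", "expected": "bug"},
    {"input": "Invoice total looks wrong", "expected": "billing"},
]

if __name__ == "__main__":
    print(f"golden-set accuracy: {run_golden_set(stub_triage, GOLDEN):.0%}")
```

Hashing the run ID instead of calling a random number generator means the same run is always in or out of the sample, which makes the review queue reproducible.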
What about no-code platforms like Lindy, Relevance AI or n8n?
They have a real place. Use them when:
- The workflow is genuinely simple — under 10 steps, one or two integrations, no complex branching.
- You have an operations person who will own the workflow, not an engineer.
- You are testing a hypothesis and will rewrite in code if it works.
Do not use them when (a) the workflow touches sensitive data and you cannot show a signed Data Processing Agreement with the platform, (b) the cost scales linearly with usage and your usage is about to 10x, or (c) you need to embed the output inside your own product UI. In those cases custom code is cheaper and more controllable within 6 months.
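Case (b) is a break-even question you can check on the back of an envelope. Here is a small sketch in Python, assuming the platform bills a flat price per run and custom code has a one-off build cost plus a flat monthly run cost. Every figure in the example is hypothetical and is not a quote from any platform.

```python
from typing import Optional


def breakeven_month(
    platform_usd_per_run: float,
    runs_per_month: int,
    build_usd: float,
    custom_usd_per_month: float,
    horizon_months: int = 24,
) -> Optional[int]:
    """First month in which cumulative custom cost is at or below platform cost.

    Returns None if custom code does not catch up within the horizon.
    """
    for month in range(1, horizon_months + 1):
        platform_total = platform_usd_per_run * runs_per_month * month
        custom_total = build_usd + custom_usd_per_month * month
        if custom_total <= platform_total:
            return month
    return None


if __name__ == "__main__":
    # Hypothetical inputs: USD 0.10 per run, USD 11,000 build, USD 1,000/month to run.
    for runs in (5_000, 50_000):
        print(runs, "runs/month ->", breakeven_month(0.10, runs, 11_000, 1_000))
```

With these made-up inputs, custom code never catches up at 5,000 runs per month but breaks even in month 3 at 50,000, which is why the 10x question matters more than the current bill.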
What does our AI automation engagement look like?
We run 4-week sprints. Week 1 is workflow audit and eval setup. Weeks 2-3 are build. Week 4 is production rollout and a knowledge-transfer day with your team. After that, you can keep us on a capped retainer or take it in-house — we hand over the repo, the eval set and the runbook either way.
What are the most common AI automation projects we ship in 2026?
| Project shape | Typical stack | Time to ship | Cost band (USD) |
|---|---|---|---|
| Inbound lead triage and enrichment | n8n + GPT-4 class model + HubSpot | 3-4 weeks | 9,000 - 16,000 |
| Customer-support first-touch agent | Custom Python agent + RAG over Intercom + handover | 5-7 weeks | 22,000 - 38,000 |
| Document extraction (invoices, contracts) | Vision-capable LLM + Postgres + queue | 4-6 weeks | 14,000 - 28,000 |
| Sales BDR agent (outbound) | Apollo / Clay + LLM + CRM webhook | 4-5 weeks | 16,000 - 28,000 |
| Internal research agent (RAG over your data) | Vector DB + LLM + Next.js UI | 6-8 weeks | 28,000 - 48,000 |
What is the failure rate of AI automation projects?
Higher than founders expect. Industry surveys from 2025 put the share of AI pilots that reach production at roughly 30-40%. The pattern we see in rescued projects is consistent: no eval harness, no golden set, no human-in-the-loop review, and a budget that ran out before the agent worked reliably. The fix is upfront: spend 15-20% of the budget on evaluation before you spend a dollar on prompts.
Ready to scope your AI workload?
If you have a list of workflows and are not sure which ones to automate first, send them to us on our quote page — we reply within 48 hours with a prioritised plan. You can also see a few of our shipped AI projects on our portfolio.