Data Foundation for AI: What to Build Before the Model
Most AI projects fail before the model is ever the bottleneck. The real problem is the data foundation underneath. Here is what it takes to build one that actually supports production AI.
The Model Is Not the Problem
In 2025, global enterprises invested $684 billion in AI initiatives. According to RAND Corporation, more than 80% of that investment failed to deliver its intended business value. MIT's Project NANDA, which tracked 300-plus AI implementations through practitioner interviews and structured surveys, found that 95% of generative AI pilots delivered zero measurable financial return.
The instinct is to blame the technology. The model hallucinated. The vendor oversold. The algorithm wasn't ready for production data. But that diagnosis is wrong, and it is expensive to be wrong about it.
Research from Gartner, Deloitte, McKinsey, and RAND consistently traces 70 to 85% of AI project failures back to the same root cause: the data foundation was not ready. Not the model. Not the infrastructure spend. Not the talent gap. The data underneath the model.
Gartner said it plainly in February 2025: through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data. S&P Global found that 42% of companies scrapped most of their AI initiatives in 2025 alone, up from 17% the year before.
The companies that are generating real returns from AI share one pattern. McKinsey identified it clearly: organizations reporting significant financial returns from AI are twice as likely to have redesigned their end-to-end data workflows before they ever selected a model.
They fixed the data foundation first. Then they applied AI. That sequence is not optional.
What "Data Foundation for AI" Actually Means
The phrase gets used so loosely it has started to mean nothing. Vendors use it to describe their cloud storage tier. Consultancies use it to sell data lake migrations. Neither definition will help you ship a working AI system.
A production-ready data foundation for AI is a set of four interdependent layers, each of which must exist and function correctly before the next layer can be trusted. The order matters as much as the layers themselves.
Most teams understand this intellectually. Very few build it this way in practice. The pressure to demo AI to the board accelerates spending on Layer 4, the AI layer, while Layers 1 through 3 remain undefined. The result is predictable: the top of the stack collapses into the cracks at the bottom.
I call this the Intelligence Allocation problem. Companies keep pouring intelligence into the wrong layer. The fix is not a better model. The fix is building the right foundation in the right order.
Layer 1: Ingestion and Storage
Every AI system runs on data that was created somewhere else. The ingestion layer is where that data enters your architecture and the first place where quality is either enforced or abandoned.
Most enterprise data arrives fragmented. The ERP holds operational data. The CRM holds customer data. Finance runs on separate planning tools. Marketing platforms hold campaign performance data. None of these systems share a common data model, consistent field definitions, or synchronized update schedules.
This fragmentation is not just inconvenient. It is architecturally fatal for AI. When a model needs to combine signals across these sources, there is no clean way to do it. "Revenue" in the sales system might mean booked orders. In the ERP, it might mean shipped and invoiced. In the financial planning tool, it might mean recognized revenue per ASC 606 rules. An AI system asked to "analyze revenue performance" will produce different answers depending on which system it queries first.
A production-ready ingestion layer enforces three things at entry:
Schema validation. Data is checked against expected structure before it enters the warehouse. Malformed records are flagged, not silently absorbed.
Lineage tracking. Every record carries metadata about its origin: source system, ingestion timestamp, pipeline version. This is not optional for AI. When a model produces a wrong answer, lineage is the only way to trace the problem back to its source.
Quality gates. Automated checks for completeness, freshness, and referential integrity run at ingestion time, not after the fact. Traditional data teams run quality audits quarterly. AI systems in production need quality signals measured in hours, not months.
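The three enforcements above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the schema, field names, and `ingest` function are hypothetical, and a real system would use a dedicated validation framework rather than hand-rolled checks.

```python
from datetime import datetime, timezone

# Hypothetical expected schema for an incoming order record.
EXPECTED_SCHEMA = {"order_id": str, "customer_id": str, "amount": float}

def ingest(record: dict, source_system: str, pipeline_version: str) -> dict:
    # 1. Schema validation: reject malformed records before they land,
    #    rather than silently absorbing them.
    for field_name, field_type in EXPECTED_SCHEMA.items():
        if field_name not in record:
            raise ValueError(f"missing field: {field_name}")
        if not isinstance(record[field_name], field_type):
            raise TypeError(f"bad type for {field_name}")

    # 2. Quality gates: basic integrity checks at entry, not after the fact.
    if record["amount"] < 0:
        raise ValueError("amount must be non-negative")

    # 3. Lineage: stamp every record with origin metadata so a wrong AI
    #    answer can be traced back to its source.
    return {
        **record,
        "_source_system": source_system,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_pipeline_version": pipeline_version,
    }
```

The point of the sketch is the placement: all three checks run at ingestion time, so nothing downstream has to guess whether a record is trustworthy.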
On the storage side, the choice between a data warehouse (Snowflake, BigQuery, Redshift) and a lakehouse (Databricks, Snowflake with Iceberg) matters less than most teams think. What matters is that the storage layer enforces row-level and column-level security, supports open formats that downstream AI tooling can read without proprietary connectors, and separates raw ingested data from curated, governed data that is cleared for AI consumption.
Layer 2: Transformation and Data Modeling
Raw ingested data is not AI-ready data. Transformation is where raw records become the governed, semantically consistent datasets that AI systems can actually use.
This is the layer most organizations underinvest in. The assumption is that a sufficiently powerful model can clean and interpret messy data on its own. It cannot, not reliably, and not in a way you can audit or explain to a regulator.
The transformation layer has two jobs. First, clean and normalize: standardize field names, resolve entity conflicts (the same customer appearing under three different IDs across source systems), handle nulls and outliers consistently. Second, model: build the dimensional structures and aggregation logic that represent how your business actually works.
dbt has become the standard tool for this layer in modern data stacks, and for good reason. Transformation logic lives in version-controlled SQL, tests run automatically on every model run, and lineage is built in. When an AI system queries a dbt model and returns a wrong answer, you can trace that answer through the transformation logic back to the raw source in minutes.
The most important outcome of a well-built transformation layer is a single source of truth for your core business entities: customers, products, orders, revenue. One definition per concept, enforced across every downstream system. Without this, every AI output becomes a negotiation over which number is real.
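Entity resolution is the concrete core of that single source of truth. A minimal sketch of the idea, assuming a hypothetical cross-reference table that maps each source system's customer IDs to one canonical ID (in practice this lives in the warehouse and is maintained by the transformation layer, not hardcoded):

```python
# Hypothetical cross-reference: the same customer known under three
# different IDs across source systems, resolved to one canonical ID.
ID_XREF = {
    ("crm", "C-1001"): "cust_42",
    ("erp", "88321"): "cust_42",
    ("billing", "ACME-01"): "cust_42",
}

def resolve_customer(source: str, source_id: str) -> str:
    """Map a source-system customer ID to its canonical ID."""
    try:
        return ID_XREF[(source, source_id)]
    except KeyError:
        # Unmatched IDs are surfaced for review, never passed through.
        raise LookupError(f"no canonical ID for {source}:{source_id}")

def normalize(records: list) -> list:
    """Rewrite each record against the canonical customer ID."""
    return [
        {**r, "customer_id": resolve_customer(r["source"], r["customer_id"])}
        for r in records
    ]
```

Once every downstream table speaks in canonical IDs, "customer" means one thing, and an AI system joining across sources inherits that consistency for free.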
Layer 3: The Semantic Layer
The semantic layer is where data infrastructure becomes AI infrastructure. It is also the layer most commonly skipped, which is why so many AI systems that look promising in demos fail in production.
A semantic layer sits between your transformed data and everything that consumes it: BI tools, AI agents, LLMs, APIs. Its job is to translate physical data structures (table names, column names, join logic) into business vocabulary that both humans and machines can interpret without needing to understand the underlying schema.
In practice, this means a governed metric called "ARR Run-Rate" that carries its full definition: the calculation formula, the filters that apply (active contracts only, specific date windows), the joins that enrich it (customer segments, product families), and the security policies that restrict who can see what. Any system that queries "ARR Run-Rate" gets the same answer, because the business logic lives in one place.
For AI specifically, the semantic layer solves a problem that better models cannot: grounding. An LLM asked to analyze revenue without a semantic layer will generate SQL against whatever schema it can find, using its best guess at what column names mean. It will produce confident, plausible-sounding answers that are often wrong. With a semantic layer, the same LLM queries governed metric definitions and inherits all the security, filters, and business logic that your data team has already validated.
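What a governed metric like "ARR Run-Rate" looks like as a data structure can be sketched roughly as follows. The field names, formula, and roles here are invented for illustration; real semantic layers (MetricFlow, Cube, Snowflake Semantic Views) have their own definition formats, but the principle is the same: calculation, filters, joins, and access policy live together in one place, and every consumer goes through the same gate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One governed metric: logic and policy defined once, queried everywhere."""
    name: str
    expression: str        # the calculation formula, as SQL
    filters: tuple         # rows the metric is allowed to see
    joins: tuple           # enrichment dimensions
    allowed_roles: frozenset  # who may query it

ARR_RUN_RATE = MetricDefinition(
    name="ARR Run-Rate",
    expression="SUM(contract_value) * 12",
    filters=("contract_status = 'active'", "period = current_month"),
    joins=("customer_segments", "product_families"),
    allowed_roles=frozenset({"finance", "exec"}),
)

def query_metric(metric: MetricDefinition, role: str) -> str:
    # Every consumer -- BI tool, LLM, API -- hits the same definition
    # and inherits the same filters and security policy.
    if role not in metric.allowed_roles:
        raise PermissionError(f"role '{role}' may not query {metric.name}")
    where = " AND ".join(metric.filters)
    return f"SELECT {metric.expression} FROM contracts WHERE {where}"
```

An LLM that can only reach the warehouse through `query_metric` cannot invent its own definition of ARR; it gets the validated one or nothing.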
Zalando demonstrated this at scale. Their Genie AI analytics interface, built on Databricks Metric Views, delivers reliable natural language queries because it is grounded in a unified semantic layer. Solutions they evaluated without a semantic layer struggled to generate accurate SQL for complex business logic. The model was not the differentiator. The semantic layer was.
The tooling options here have expanded significantly. dbt's Semantic Layer with MetricFlow defines metrics as code and serves them via API. Cube and AtScale offer headless semantic layers that sit above any warehouse. Snowflake's Semantic Views (now with Autopilot) bring this capability directly into the warehouse. The right choice depends on your stack. The wrong choice is skipping this layer entirely because you are in a hurry.
Layer 4: AI and Serving
This is the layer where most organizations spend 80% of their AI budget. It is also the layer that does the least work when Layers 1 through 3 are missing.
AI agents, LLMs, predictive models, recommendation systems: all of these are consumers of the foundation below them. They do not fix bad data. They do not resolve semantic conflicts. They do not enforce governance. They amplify whatever they receive. Clean, governed, semantically consistent data produces reliable AI outputs. Fragmented, ungoverned data produces confident, fast, wrong outputs at scale.
The serving layer is also where AI-specific infrastructure requirements become real: vector databases for retrieval-augmented generation, embedding pipelines for unstructured data, orchestration frameworks for multi-agent workflows. These are genuine engineering challenges. But they are secondary challenges. Building serving infrastructure before the foundation is ready is like optimizing the delivery route before you have a product to ship.
When the foundation is in place, the AI layer becomes the competitive differentiator it is supposed to be. McKinsey's research on agentic AI at scale found that success depends on a data architecture that supports increasing levels of autonomy, coordination, and real-time decision-making. That architecture is Layers 1 through 3. The AI is just what you put on top.
Where Teams Actually Break Down
After working through data foundation builds across scale-ups and enterprises in Europe, I have watched the same failure modes appear in the same places. They are worth naming directly.
Skipping Layer 2 to reach Layer 4 faster. The board wants an AI demo. The pressure is real. Teams stand up a model against raw or lightly cleaned data, get an impressive demo, and then spend the next six months explaining why production results do not match. The time saved in the sprint costs twice as much in rework.
Treating governance as a post-deployment audit. Traditional data management runs at reporting cadences: quarterly audits, annual governance reviews, monthly pipeline checks. AI systems in production need data quality signals measured in hours. Teams that bolt governance on after deployment discover this the hard way, usually during an incident.
Building a semantic layer for BI but not for AI. Many organizations have a semantic layer of some kind, even if they do not call it that. It lives inside Looker as LookML, or inside Power BI as a semantic model. The problem is that these layers were designed for human BI consumers, not for AI agents querying autonomously via API. Extending an existing BI semantic layer for AI consumption is a real architectural decision, not a checkbox.
Fragmented ownership across layers. Data engineering owns ingestion. Analytics engineering owns transformation. BI owns the semantic layer. AI engineering owns the model. In practice, no one owns the seam between layers, and that is exactly where things break in production.
The Four Questions That Tell You Where You Stand
Before committing budget to an AI initiative, every data and engineering leader should be able to answer these four questions honestly:
Can you define your core business entities in one place, with one definition? If "customer" means different things in different systems, your semantic layer is the priority. Everything else waits.
Do your data pipelines have lineage and quality monitoring? If you cannot trace a data point from its source to its AI output, you cannot debug failures, explain decisions to regulators, or trust the results. Your data foundation is the priority.
Can you trace every write to a production system back to a source, a reason, and an actor? For AI agents specifically, this is the minimum bar for governance. If not, your orchestration layer is not ready for autonomous systems.
Do you know which systems an AI agent is allowed to touch, and how you would stop it if it behaved unexpectedly? If the answer is uncertain, you are not ready to deploy agents. Not because the model is wrong, but because the guardrails do not exist yet.
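The last two questions reduce to a small amount of mandatory plumbing around every agent write. A hedged sketch, assuming a hypothetical allow-list and audit record; the names are invented, and a real deployment would back the log with durable storage and enforce the gate at the infrastructure level, not in application code:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical allow-list: the systems an agent is permitted to touch.
ALLOWED_SYSTEMS = {"crm", "warehouse"}

@dataclass
class WriteAudit:
    """The minimum record for every production write: source, reason, actor."""
    target_system: str
    source: str
    reason: str
    actor: str
    timestamp: str

AUDIT_LOG = []

def audited_write(target: str, payload: dict, *,
                  source: str, reason: str, actor: str) -> None:
    """Gate and log every production write an agent attempts."""
    # Guardrail: refuse writes to systems outside the allow-list.
    if target not in ALLOWED_SYSTEMS:
        raise PermissionError(f"agent may not write to '{target}'")
    AUDIT_LOG.append(WriteAudit(
        target_system=target,
        source=source,
        reason=reason,
        actor=actor,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ))
    # The actual write to `target` would happen here.
```

If every write an agent can make passes through a gate like this, both questions have concrete answers: the audit log traces the write, and the allow-list is the kill switch.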
These questions are not a maturity checklist. They are the architectural preconditions for AI that works in production, not just in demos. Most organizations fail two or more of them.
The Ratio That Separates the Winners
The companies generating real returns from AI in 2026 are not the ones with the most advanced models. They are the ones that invested in their data foundation before they invested in their AI layer.
For every dollar spent on AI, spend six on data architecture. That is not a conservative estimate. That is the ratio the winning organizations are actually running. Everyone else is spending Q1 and Q2 explaining to their board why the pilots never made it to production.
The data foundation is not the exciting part. It does not generate demo moments or press releases. But it is the only part of the stack that determines whether AI produces value or just produces cost.
Systems beat individuals at scale. A well-governed AI system running on a clean, semantically consistent data foundation will outperform a state-of-the-art model running on fragmented data every time. Not because the model is better, but because the foundation underneath it is.
Build the foundation first. The AI will work when you do.