AI Data Bill of Materials (DBOM): Why AI Security Needs a Data Supply Chain

A Data Bill of Materials (DBOM) inventories the data feeding an AI model. Here is what belongs in one, why SBOM does not cover it, and how to start.

What Is a Data Bill of Materials (DBOM)?

A Data Bill of Materials, or DBOM, is an inventory of every dataset that fed an AI model. It records the source of each dataset, the processing steps applied, the sensitivity level, the contractual terms governing use, and the model or agent the data ultimately shaped.

Pranava Adduri, Co-Founder and CTO of Bedrock Data, described it directly on The Security Podcast of Silicon Valley with host Jon McLachlan (co-founder of YSecurity and Cyberbase.ai).

"You have your binary that you're inventing. There's the software dependencies of those binaries, and then there's the downstream dependencies of those binaries as well, and libraries as well. In that same way, models have a DBOM, a data bill of materials. What data made its way into the training phase? Where did that data come from? What sets of processing did that data go from? Original data to refined data to get it to a stage where it was training ready."

That working definition matters because most enterprise AI programs today have no usable inventory of the data feeding their models. They have spreadsheets, lineage tools for the data warehouse, and informal documentation, but nothing that survives an auditor asking which customer records ended up in the latest fine-tune.

A DBOM closes that gap by making the data supply chain explicit.

Why an SBOM Does Not Cover AI Training Risk

The Software Bill of Materials has had a decade of regulatory and industry attention. Executive Order 14028 in 2021 formalized SBOM expectations for federal software supply chains. CISA and NTIA have published guidance. Tools like Syft and Trivy, together with standard formats such as SPDX, make SBOM generation routine.

None of that helps when the risk lives in the data, not the code.

An AI model can be wrapped in clean, well-audited software and still expose the business to harm if the training data contained customer PII that should never have left a regulated zone. The binary is identical whether the model was fine-tuned on synthetic data or on a leaked production dump. The SBOM cannot tell you which one happened.

That is why the agentic AI security conversation now extends past the agent's permissions and into the training set itself. The model's behavior is a function of the data that built it. Without an inventory of that data, you are governing only the surface.

DBOM is the layer SBOM does not cover.

What Belongs in a DBOM

A useful DBOM lists more than dataset names. The minimum fields are:

  • Source. Where the dataset came from. Internal system, third-party vendor, public corpus, customer-uploaded, synthetic.

  • Licensing and contract terms. What the company is contractually allowed to do with the data. Granularity caps. Geographic restrictions. Retention limits.

  • Sensitivity classification. Public, internal, confidential, restricted, regulated. Tied to the company's existing classification scheme.

  • Processing steps. Each transformation between original collection and training-ready state. Anonymization, de-identification, sampling, filtering, augmentation.

  • Training stage. Where in the model lifecycle the data was used. Pretraining, fine-tuning, supervised tuning, RAG grounding, evaluation set.

  • Consuming model or agent. The downstream artifact that inherited this data's risk profile.

  • Owner. A named person or team accountable for the dataset's accuracy and compliance.

Adduri's framing extends one step further. The DBOM should reach into the data vendor contracts themselves: "In certain domains where the company might be procuring data from data vendors, there might be terms around what granularity of data can be used when training a model."

A DBOM that lists only internal datasets misses the contractual risk that comes with the procured ones.
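As a rough illustration, the fields above can be captured in a single record per dataset. This is a minimal sketch, not a standard schema: the class name DatasetEntry, the field names, and the sample values are all hypothetical, and a real DBOM would map them onto the company's own classification scheme and contract language.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class DatasetEntry:
    """One DBOM row: a dataset plus the facts needed to govern it."""
    name: str
    source: str                  # e.g. "internal", "vendor", "public", "synthetic"
    license_terms: list[str]     # contractual limits, including vendor terms
    sensitivity: str             # value from the company's classification scheme
    processing_steps: list[str]  # transformations applied, in order
    training_stage: str          # e.g. "pretraining", "fine-tuning", "evaluation"
    consumers: list[str]         # models or agents that inherited this data
    owner: str                   # accountable person or team


# Hypothetical example of a procured dataset with a granularity restriction.
entry = DatasetEntry(
    name="vendor-transactions-2024",
    source="vendor",
    license_terms=["aggregate-level use only", "retain for 24 months"],
    sensitivity="confidential",
    processing_steps=["de-identification", "aggregation"],
    training_stage="fine-tuning",
    consumers=["fraud-scoring-model-v3"],
    owner="data-platform-team",
)

# Serializing to JSON keeps the DBOM reviewable and diffable alongside code.
dbom_json = json.dumps(asdict(entry), indent=2)
print(dbom_json)
```

Keeping the contract terms in the same record as the processing steps is the point of the design: a reviewer can see in one place whether the recorded transformations satisfy what the vendor agreement allows.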

Tainted Training Data: The Failure Mode DBOM Prevents

The clearest failure mode is tainted training data. A dataset enters the training phase before its anonymization step has run. Customer email addresses, account identifiers, or partial credit card numbers end up baked into the model weights. The model passes evaluation. It ships. Six months later, an attacker uses training data extraction techniques to surface the original PII.

This is not a theoretical risk. The same class of failure appears in published research on training data extraction across major commercial models. The DBOM is the control that catches it before the model is built.
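One way to picture the DBOM acting as a pre-training control is a gate that refuses to start a training job when a sensitive dataset is missing a required processing step. The sketch below is illustrative only: the REQUIRED_STEPS policy, the function name, and the sample datasets are hypothetical, and the check only verifies that a step was recorded, not that it was effective.

```python
# Hypothetical policy: processing steps that must be recorded before a dataset
# of a given sensitivity may enter training. Adjust to your own scheme.
REQUIRED_STEPS = {
    "confidential": {"de-identification"},
    "restricted": {"de-identification", "anonymization"},
    "regulated": {"de-identification", "anonymization"},
}


def find_untreated_datasets(dbom):
    """Return {dataset name: missing steps} for entries not yet training-ready."""
    problems = {}
    for entry in dbom:
        required = REQUIRED_STEPS.get(entry["sensitivity"], set())
        missing = required - set(entry["processing_steps"])
        if missing:
            problems[entry["name"]] = sorted(missing)
    return problems


# Hypothetical DBOM entries for one planned fine-tune.
dbom = [
    {"name": "billing-events", "sensitivity": "regulated",
     "processing_steps": ["sampling"]},
    {"name": "public-docs", "sensitivity": "public",
     "processing_steps": []},
    {"name": "support-tickets", "sensitivity": "confidential",
     "processing_steps": ["de-identification", "filtering"]},
]

blocked = find_untreated_datasets(dbom)
if blocked:
    print("Do not start training:", blocked)
```

In this example only billing-events is flagged, because it is classified as regulated but records neither anonymization nor de-identification.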

George Gerchow, CSO at Bedrock Data, framed why a CSO cares: "What's the first thing that pops into mind during a security incident? Did any sensitive data ever get out? Is that data being misused, mishandled?"

When the answer requires reconstructing the training pipeline from chat logs and Jira tickets, the answer arrives too late. When the answer is a DBOM query, it arrives in minutes.
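To make "a DBOM query" concrete, here is a minimal sketch of the incident-time question: which sensitive datasets fed a given model? The record layout, the SENSITIVE_LEVELS set, and the dataset and model names are assumptions for illustration; a production DBOM would more likely live in a database or catalog than in a Python list.

```python
# Hypothetical set of classification levels that count as sensitive.
SENSITIVE_LEVELS = {"confidential", "restricted", "regulated"}


def sensitive_inputs(dbom, model):
    """Names of sensitive datasets that the DBOM links to the given model."""
    return sorted(
        entry["name"]
        for entry in dbom
        if model in entry["consumers"] and entry["sensitivity"] in SENSITIVE_LEVELS
    )


# Hypothetical DBOM covering two models.
dbom = [
    {"name": "crm-export", "sensitivity": "regulated",
     "consumers": ["support-copilot"]},
    {"name": "product-docs", "sensitivity": "public",
     "consumers": ["support-copilot", "sales-assistant"]},
    {"name": "pricing-history", "sensitivity": "confidential",
     "consumers": ["sales-assistant"]},
]

exposed = sensitive_inputs(dbom, "support-copilot")
print(exposed)
```

The query is trivial once the inventory exists, which is the argument of this section: the hard part is recording the lineage ahead of time, not asking the question.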

The same logic applies to copyright and licensing exposure. A vendor dataset licensed for "research use only" that ends up in a production fine-tune is the kind of finding that ends careers. DBOM makes the provenance visible before the auditor asks.
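The licensing case can be sketched the same way. Assuming each DBOM entry records the uses its contract permits (a hypothetical permitted_uses field, with made-up dataset names), a check before a production fine-tune might look like this:

```python
def license_violations(dbom, intended_use):
    """Names of datasets whose recorded terms do not permit the intended use."""
    return [
        entry["name"]
        for entry in dbom
        if intended_use not in entry["permitted_uses"]
    ]


# Hypothetical inputs to a planned production fine-tune.
dbom = [
    {"name": "vendor-clinical-notes", "permitted_uses": {"research"}},
    {"name": "internal-chat-logs", "permitted_uses": {"research", "production"}},
]

violations = license_violations(dbom, "production")
print(violations)
```

Here the research-only vendor dataset is flagged before it reaches the production training run. Real contract terms are rarely this clean, so a field like this is best treated as a prompt for legal review rather than a final answer.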

From DBOM to Guardrails: The Gap Analysis Pattern

A DBOM on its own is documentation. It becomes a security control when it is paired with runtime guardrails on the model's output.

Adduri described the pattern Bedrock built into its ArgusAI product, launched in November 2025. "When you look at a model, on the left side compute the DBOM, what all went into it. On the right side, look at the guardrails to figure out what the guardrails enable, what they block, and what they allow. That gives you a gap analysis based on what's going into the model or the agent or the co-pilot. You have this potential of what might be coming out, what do the guardrails block, and then what are they still letting through."

The gap analysis pattern is simple. The DBOM tells you what kinds of sensitive content the model could possibly produce. The guardrails tell you what kinds of sensitive content the runtime filters will catch. The gap between those two is the exposure window. Anything sensitive that the DBOM shows was in training and that the guardrails do not actively block is a known leak path waiting to be triggered.

Without a DBOM, the gap analysis cannot be computed. Without the gap analysis, guardrails are an act of faith rather than a control.
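Reduced to its simplest form, the gap analysis is a set difference. The sketch below assumes both sides can be expressed as categories of sensitive content; the category names and the function are hypothetical, and this is not a description of how ArgusAI computes the result.

```python
def exposure_gap(trained_on, blocked_by_guardrails):
    """Sensitive categories present in training that no guardrail blocks."""
    return set(trained_on) - set(blocked_by_guardrails)


# Left side: sensitive categories the DBOM shows reached training (hypothetical).
trained_on = {"email_address", "account_id", "partial_card_number"}

# Right side: categories the runtime guardrails actively block (hypothetical).
blocked_by_guardrails = {"email_address", "partial_card_number", "ssn"}

gap = exposure_gap(trained_on, blocked_by_guardrails)
print(sorted(gap))
```

In this example account_id is the exposure: it was in the training data and nothing filters it on the way out. The guardrail on ssn is not a gap, since the DBOM does not show that category in training.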

This is the same shape as the failure pattern in vibe coding security, where the absence of a clear inventory of what an AI agent touched made the resulting vulnerabilities hard to catch.

How to Start Building a DBOM Without Boiling the Ocean

Companies do not need a perfect DBOM on day one. They need a minimum viable DBOM that covers the highest-risk models first.

A workable starting point looks like this:

  1. List the models and agents already in production. Most enterprises underestimate this number. A real census usually finds 30 to 100 percent more AI workloads than the security team knew about.

  2. Pick the three highest-risk workloads. Use exposure to customer data, regulated workflows, and external-facing surface as the ranking criteria. The same scoping discipline that separates winning AI adoption strategies from the rest applies here.

  3. Document the data feeding those three workloads first. Source, licensing, sensitivity, processing, training stage, owner. Do not try to harmonize formats across the whole company. Pick a template and use it.

  4. Connect the DBOM to the guardrails for those three workloads. Compute the gap. Close the highest-severity items.

  5. Expand only after the first three are in steady state. A DBOM that covers three models well is more useful than a DBOM that covers fifty models poorly.
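Steps 1 and 2 above can be sketched as a short ranking exercise. The workload names below are made up, and scoring the three criteria with equal weight is an assumption made for simplicity; most teams would weight regulated workflows or customer data more heavily.

```python
# The three ranking criteria from step 2, treated as yes/no flags.
CRITERIA = ("customer_data", "regulated", "external_facing")


def risk_score(workload):
    """Count how many of the three criteria a workload meets (equal weights)."""
    return sum(1 for key in CRITERIA if workload[key])


# Hypothetical census of models and agents in production.
workloads = [
    {"name": "support-copilot", "customer_data": True,
     "regulated": False, "external_facing": True},
    {"name": "claims-triage-agent", "customer_data": True,
     "regulated": True, "external_facing": True},
    {"name": "code-review-bot", "customer_data": False,
     "regulated": False, "external_facing": False},
    {"name": "marketing-summarizer", "customer_data": False,
     "regulated": False, "external_facing": True},
    {"name": "kyc-document-classifier", "customer_data": True,
     "regulated": True, "external_facing": False},
]

# Highest score first; the first three become the initial DBOM scope.
top_three = sorted(workloads, key=risk_score, reverse=True)[:3]
for workload in top_three:
    print(workload["name"], risk_score(workload))
```

The output of this step is only a scope. The DBOM entries and the guardrail gap for those three workloads still have to be filled in by the people who own the data.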

The pattern mirrors what works elsewhere in security. Inventory, classify, monitor, remediate. The novelty is that the inventoried asset is data, not endpoints. The same discipline applies.

It is also worth asking how the choice of AI architecture changes the DBOM effort. Neuro-symbolic and ensemble AI architectures often pull data from many small specialized stores rather than one giant training set, which can simplify the DBOM scope substantially. The choice of AI architecture and the difficulty of the DBOM are linked.

Listen to the Full Episode

In episode 95 of The Security Podcast of Silicon Valley, Pranava Adduri, Co-Founder and CTO of Bedrock Data, and George Gerchow, CSO of Bedrock Data, walk through how their team thinks about data security at AI scale. They cover the DBOM concept in detail, the gap analysis pattern behind ArgusAI, and why a security leader's job has shifted from blocker to enabler when data governance is done well.

Adduri spent years at AWS watching the largest data environments on the planet struggle with the same question: which data matters most, and how do we use it without losing control of it. Gerchow brings the operator perspective from prior CSO and CISO seats at Sumo Logic, MongoDB, and VMware.

The full conversation is worth a listen for any practitioner working on AI security, data governance, or model risk management today.

Meet the hosts

Jon McLachlan

Co-Founder, YSecurity & Cyberbase

Questions founders and engineers actually ask, with decisions not theater.

Sasha Sinkevich

Co-Founder, YSecurity & Cyberbase

Pushes past surface answers into architecture, tradeoffs, and what scales.

The Security Podcast of Silicon Valley

jon@thesecuritypodcastofsiliconvalley.com
