Vendoring AI-Generated Code: SBOM and License Implications

A senior engineer pastes a 12-line snippet from a Stack Overflow answer into your codebase. License: CC BY-SA 4.0. Your SBOM never hears about it. The function does its job for two years. Then your acquirer's diligence team runs Black Duck and finds a copyleft contamination warning.

A coding agent does the same thing one hundred times an hour, except it rewrites the snippet just enough that signature-based scanners miss it. The agent does not know the snippet's license. Your SBOM still does not hear about it. The contamination warning, when it comes, is harder to trace.

This article is about what your SBOM actually says when an autonomous agent paraphrases someone else's code into your repo, and how to keep license attribution chains intact in a world where the agent does not know — or care — what license the source carried.

Three Ways an Agent Vendors Code (and Why SBOMs Miss All Three)

A coding agent injects third-party code through three paths. Standard SBOM tooling sees zero of them.

1. Inline rewrite of a library function. The agent reads the source of lodash.debounce from its training data, paraphrases it into your repo as utils/debounce.ts, and never adds lodash to package.json. Syft scans package.json, sees no lodash, reports no lodash dependency. The SBOM is technically correct and operationally useless — your codebase contains MIT-licensed code from lodash with no attribution.

2. Copy-paste from a code-sharing site. The agent retrieves a Stack Overflow answer, GitHub Gist, or blog post snippet and inlines it. License is whatever the source carried — CC BY-SA 4.0 for Stack Overflow content, often unclear for blog posts. No SBOM tool tracks "code that came from a URL the agent retrieved during inference."

3. Substantial paraphrase of a known package. The agent rewrites enough of axios interceptors that signature-based scanners (Black Duck, FOSSA, ScanCode) miss the match. The function does what axios does, looks like axios at the abstract syntax tree level, but is just different enough to defeat naive comparison.

The common thread: in all three cases, the SBOM emitted by syft, cdxgen, or npm sbom reflects the manifest, not the source. If a dependency is not in package.json, pyproject.toml, pom.xml, go.mod, or Cargo.toml, it is not in the SBOM. Vendored code by definition is not in the manifest.
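To make the gap concrete, here is a minimal Python sketch. It assumes you already have a mapping from repo files to the source packages a similarity scan attributed them to; the `detected` mapping and the function name `undeclared_sources` are hypothetical. It reports the attributed packages that the manifest never declares, which are the ones a manifest-driven SBOM omits.

```python
import json

DEPENDENCY_SECTIONS = (
    "dependencies",
    "devDependencies",
    "optionalDependencies",
    "peerDependencies",
)

def undeclared_sources(package_json_text: str, detected: dict) -> dict:
    """Return {file_path: source_package} for vendored code whose source
    package is absent from every dependency section of package.json."""
    manifest = json.loads(package_json_text)
    declared = set()
    for section in DEPENDENCY_SECTIONS:
        declared.update(manifest.get(section, {}))
    return {path: pkg for path, pkg in detected.items() if pkg not in declared}

manifest_text = json.dumps({"name": "my-app", "dependencies": {"express": "^4.19.0"}})
detected = {
    "utils/debounce.ts": "lodash.debounce",  # hypothetical scan result
    "server/app.ts": "express",
}
missing = undeclared_sources(manifest_text, detected)
print(missing)  # {'utils/debounce.ts': 'lodash.debounce'}
```

Anything this returns is code in the repo that a manifest-based SBOM will not mention.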

What CycloneDX Lets You Do About It

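CycloneDX has no dedicated field for agent-authored code, but version 1.6 gives you enough to express it: custom properties, the license acknowledgement field, and component identity evidence with a confidence score. The relevant fields: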

{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "components": [{
    "type": "library",
    "bom-ref": "vendored:utils/debounce.ts",
    "name": "debounce-utility",
    "version": "0.0.0-vendored",
    "scope": "required",
    "licenses": [{
      "license": {
        "id": "MIT",
        "acknowledgement": "declared",
        "text": {
          "contentType": "text/plain",
          "encoding": "base64",
          "content": "Q29weXJpZ2h0IChjKSAyMDE2LTIwMjUgRGFu..."
        }
      }
    }],
    "properties": [
      { "name": "crashoverride:authored-by", "value": "autonomous_agent" },
      { "name": "crashoverride:agent-id", "value": "copilot-coding-agent" },
      { "name": "crashoverride:agent-model", "value": "gpt-4-turbo-2024-12" },
      { "name": "crashoverride:source-package", "value": "lodash.debounce@4.0.8" },
      { "name": "crashoverride:source-similarity", "value": "0.87" },
      { "name": "crashoverride:vendoring-method", "value": "inline-rewrite" }
    ],
    "evidence": {
      "identity": [{
        "field": "name",
        "confidence": 0.87,
        "methods": [{
          "technique": "ast-fingerprint",
          "confidence": 0.87,
          "value": "abstract syntax tree match against lodash.debounce@4.0.8"
        }]
      }]
    }
  }]
}

Two CycloneDX features earn their keep:

evidence.identity[].methods[] lets you record how you concluded this vendored chunk traces back to a known package, with a confidence score. AST similarity is the most defensible signal for paraphrased code.
properties[] with a vendor-specific namespace (crashoverride:) carries the AI-attribution metadata the CycloneDX SBOM CI/CD article introduced.

The SBOM now records: this function was authored by an agent, originally derived from a known MIT-licensed package, with 87% AST similarity. Your downstream license compliance scan can act on it.
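If you generate these records programmatically, a small builder keeps them consistent. The following is a minimal Python sketch, not a reference implementation: the function name `vendored_component` and its parameters are hypothetical, and it only emits the subset of fields discussed above (it leaves out `evidence.identity[].methods`). Property values are written as strings, which is what CycloneDX properties expect.

```python
import base64
import json

def vendored_component(path, source_package, source_version, license_id,
                       license_text, similarity, agent_id):
    """Build a CycloneDX component dict describing one vendored file."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    encoded = base64.b64encode(license_text.encode("utf-8")).decode("ascii")
    return {
        "type": "library",
        "bom-ref": f"vendored:{path}",
        "name": f"{source_package}-vendored",
        "version": "0.0.0-vendored",
        "licenses": [{"license": {
            "id": license_id,
            "text": {"contentType": "text/plain", "encoding": "base64", "content": encoded},
        }}],
        "properties": [
            {"name": "crashoverride:authored-by", "value": "autonomous_agent"},
            {"name": "crashoverride:agent-id", "value": agent_id},
            {"name": "crashoverride:source-package",
             "value": f"{source_package}@{source_version}"},
            {"name": "crashoverride:source-similarity", "value": f"{similarity:.2f}"},
        ],
        "evidence": {"identity": [{"field": "name", "confidence": similarity}]},
    }

component = vendored_component(
    path="utils/debounce.ts",
    source_package="lodash.debounce",
    source_version="4.0.8",
    license_id="MIT",
    license_text="(full license text goes here)",  # placeholder text
    similarity=0.87,
    agent_id="copilot-coding-agent",
)
print(json.dumps(component, indent=2))
```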

Building the Detection Pipeline

The detection workflow has four stages: scan agent-authored code, run AST-similarity matching, pull the source license, and inject vendored components into the SBOM.

Stage 1: Identify Agent-Authored Files

If you already tag and track agent-authored commits in git (for example, with an Agent-Authored: line in the commit message), you have the commit metadata you need. Pull the file list:

# Files touched by agent commits in the last release
git log "$PREV_TAG"..HEAD \
  --format='%H' \
  --grep '^Agent-Authored:' \
  | xargs -I {} git diff-tree --no-commit-id --name-only -r {} \
  | sort -u > agent-files.txt

Stage 2: AST-Similarity Scan

semgrep and comby work for syntactic patterns; for true AST similarity against a corpus of known open-source packages, the open-source nicad6 or commercial Black Duck Snippet Matching are the established tools. A minimal semgrep rule that catches a paraphrased lodash debounce:

rules:
  - id: vendored-lodash-debounce
    pattern-either:
      - pattern: |
          function $FN($FUNC, $WAIT, $OPTIONS) {
            let $TIMEOUT, $RESULT, $LASTARGS, $LASTTHIS, $LASTCALLTIME;
            ...
            function $INVOKE($TIME) {
              const $ARGS = $LASTARGS;
              ...
            }
            ...
          }
    message: "Likely paraphrase of lodash.debounce — record in SBOM"
    languages: [typescript, javascript]
    severity: WARNING
    metadata:
      source-package: lodash.debounce
      source-license: MIT

Run it across the agent-authored file set:

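# semgrep takes scan targets as positional arguments, so feed it the file list.
# With a very long list, xargs may split this into several runs.
xargs semgrep --config rules/vendored-paraphrase.yml \
  --json \
  --output vendored-matches.json \
  < agent-files.txt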

For deeper coverage, layer in a structural-similarity tool. The goal: every chunk the agent vendored from a known package gets a match with a source-package label and a similarity score.
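To illustrate what a structural-similarity score measures, here is a minimal sketch using Python's standard `ast` module on Python source. It is not how nicad6 or Black Duck work internally, and for TypeScript you would need a TypeScript parser instead. The helper names `node_shape` and `ast_similarity` are hypothetical. The idea: reduce each snippet to its sequence of AST node types, which discards identifier names and literal values, then compare the sequences.

```python
import ast
from difflib import SequenceMatcher

def node_shape(source: str) -> list:
    """Flatten source into a list of AST node type names.
    Identifier names and literal values are not part of the shape."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def ast_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the AST shapes of two snippets."""
    return SequenceMatcher(None, node_shape(a), node_shape(b), autojunk=False).ratio()

# The snippets are only parsed, never executed, so `schedule` need not exist.
original = """
def debounce(func, wait):
    timer = None

    def wrapper(*args):
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = schedule(wait, func, args)

    return wrapper
"""

paraphrase = """
def delay_calls(callback, delay_ms):
    pending = None

    def inner(*params):
        nonlocal pending
        if pending is not None:
            pending.cancel()
        pending = schedule(delay_ms, callback, params)

    return inner
"""

unrelated = """
def total(xs):
    return sum(x * 2 for x in xs)
"""

print(ast_similarity(original, paraphrase))  # 1.0: only the names changed
print(ast_similarity(original, unrelated) < 1.0)  # True
```

A pure rename scores 1.0 here even though a text diff would show every line changed, which is why structural comparison catches paraphrases that signature matching misses.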

Stage 3: Resolve Source License

Once you have a source-package label, resolve its license. The npm registry exposes this via the package-metadata endpoint:

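PACKAGE="lodash.debounce"
VERSION="4.0.8"

# -f makes curl fail on HTTP errors instead of capturing an error page
LICENSE=$(curl -fsS "https://registry.npmjs.org/${PACKAGE}/${VERSION}" \
  | jq -r '.license')

# Assumes the source repo has a tag named ${VERSION}; if it does not,
# take the LICENSE file from the published package tarball instead.
LICENSE_TEXT=$(curl -fsS "https://raw.githubusercontent.com/lodash/lodash/${VERSION}/LICENSE" \
  | base64 | tr -d '\n')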

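For PyPI, the same pattern works via https://pypi.org/pypi/{pkg}/{version}/json, where the license sits under the info key. For Go modules and Maven artifacts, the license is detected from the module's LICENSE file (as shown on pkg.go.dev) and declared in the artifact's POM on Maven Central, respectively.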
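Registry metadata is not uniform, so the extraction step benefits from some defensive parsing. The sketch below works on already-fetched JSON documents (no network calls). The function names are hypothetical. It handles the string form of npm's `license` field plus the two deprecated forms described in npm's package.json documentation (an object with `type`, and a `licenses` array); joining multiple legacy entries with `OR` is an assumption on my part, as is the presence of a `license_expression` key in some PyPI responses.

```python
from typing import Optional

def license_from_npm(version_doc: dict) -> Optional[str]:
    """Extract a license string from an npm per-version metadata document."""
    lic = version_doc.get("license")
    if isinstance(lic, str) and lic.strip():
        return lic.strip()
    if isinstance(lic, dict) and lic.get("type"):   # deprecated object form
        return lic["type"]
    legacy = version_doc.get("licenses")            # deprecated array form
    if isinstance(legacy, list):
        types = [e["type"] for e in legacy if isinstance(e, dict) and e.get("type")]
        if types:
            return " OR ".join(types)               # assumption: treat as a choice
    return None

def license_from_pypi(release_doc: dict) -> Optional[str]:
    """Extract a license string from a PyPI JSON API release document."""
    info = release_doc.get("info") or {}
    value = info.get("license_expression") or info.get("license")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None

print(license_from_npm({"name": "lodash.debounce", "license": "MIT"}))  # MIT
print(license_from_pypi({"info": {"license": ""}}))                     # None
```

A `None` result is worth surfacing rather than defaulting: an unresolved license should block the merge step, not pass through as "unknown".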

Stage 4: Inject into the SBOM

Take the existing manifest-based SBOM, walk the vendored-matches.json output, and merge:

cdxgen -o sbom-base.cdx.json .

jq --slurpfile matches vendored-matches.json '
  .components += [
    $matches[0].results[]
    | {
        "type": "library",
        # bom-ref must be unique within the BOM, so include the match start line
        "bom-ref": ("vendored:" + .path + ":" + (.start.line | tostring)),
        "name": (.extra.metadata."source-package" + "-vendored"),
        "version": "0.0.0-vendored",
        "scope": "required",
        "licenses": [{
          "license": {
            "id": .extra.metadata."source-license",
            "acknowledgement": "declared"
          }
        }],
        "properties": [
          { "name": "crashoverride:authored-by", "value": "autonomous_agent" },
          { "name": "crashoverride:source-package", "value": .extra.metadata."source-package" },
          { "name": "crashoverride:vendoring-method", "value": "inline-rewrite" }
        ],
        "evidence": {
          "identity": [{
            "field": "name",
            # fixed placeholder: a semgrep pattern match carries no similarity score
            "confidence": 0.85,
            "methods": [{
              "technique": "ast-fingerprint",
              "confidence": 0.85,
              "value": ("semgrep rule " + .check_id)
            }]
          }]
        }
      }
  ]
' sbom-base.cdx.json > sbom-enriched.cdx.json

The enriched SBOM now lists vendored components alongside manifest dependencies. Downstream license-policy enforcement sees them.
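One thing to watch in the merge step is idempotence: if the pipeline runs twice against an already enriched SBOM, appending blindly duplicates components, and CycloneDX expects each bom-ref to be unique. A minimal Python sketch of a merge that skips refs it has already seen (the function name `merge_vendored` is hypothetical):

```python
import copy

def merge_vendored(base_bom: dict, vendored: list) -> dict:
    """Return a copy of base_bom with vendored components appended,
    skipping any component whose bom-ref is already present."""
    merged = copy.deepcopy(base_bom)
    components = merged.setdefault("components", [])
    seen = {c["bom-ref"] for c in components if "bom-ref" in c}
    for component in vendored:
        ref = component["bom-ref"]
        if ref in seen:
            continue
        seen.add(ref)
        components.append(copy.deepcopy(component))
    return merged

base = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "components": [{"type": "library", "bom-ref": "pkg:npm/express@4.19.0", "name": "express"}],
}
vendored = [{"type": "library", "bom-ref": "vendored:utils/debounce.ts",
             "name": "lodash.debounce-vendored"}]

once = merge_vendored(base, vendored)
twice = merge_vendored(once, vendored)
print(len(once["components"]), len(twice["components"]))  # 2 2
```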

License Policy Enforcement

With vendored components in the SBOM, your existing license policy can act. A FOSSA or license-checker-style policy that fails the build on copyleft contamination:

# Extract every license from the SBOM
jq -r '.components[] | .licenses[]?.license.id // empty' sbom-enriched.cdx.json \
  | sort -u > seen-licenses.txt

# Compare against the allowed list (comm requires both inputs to be sorted)
DISALLOWED=$(sort -u allowed-licenses.txt | comm -23 seen-licenses.txt -)

if [ -n "$DISALLOWED" ]; then
  echo "FAIL: license policy violation, disallowed licenses present:"
  echo "$DISALLOWED"

  # Surface the agent-authored components carrying a disallowed license.
  # $DISALLOWED may hold several licenses, one per line, so split it first.
  jq -r --arg LICS "$DISALLOWED" '
    ($LICS | split("\n")) as $bad
    | .components[]
    | select(any(.licenses[]?.license.id; IN($bad[])))
    | select(any(.properties[]?; .name == "crashoverride:authored-by"))
    | "  " + .["bom-ref"] + " (agent: "
      + ((.properties[] | select(.name == "crashoverride:agent-id") | .value) // "unknown")
      + ")"
  ' sbom-enriched.cdx.json
  exit 1
fi

This flags Stack Overflow content inlined by an agent before it reaches production, provided the scan stage matched the snippet and recorded CC-BY-SA-4.0 as its license. A manifest-only SBOM scanner would never catch it.

The Stack Overflow Problem

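Stack Overflow user contributions are licensed under CC BY-SA; the version (2.5, 3.0, or 4.0) depends on when the content was posted, and recent content is under 4.0. The ShareAlike clause requires that adaptations be distributed under the same or a compatible license. A 12-line CC BY-SA snippet inlined into your proprietary backend is, on a strict reading, a license violation.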

The agent does not know this. It treats Stack Overflow text as plain training data and paraphrases freely. Detection is harder than for npm packages because there is no manifest reference.

The pragmatic mitigations:

Block agent retrieval-augmentation against Stack Overflow. If your agent uses retrieval to ground responses, exclude the Stack Overflow domain from the retrieval index. This eliminates the "agent fetched a recent answer and pasted it" path.
Require attribution comments on agent-authored code blocks longer than 8 lines. Add an instruction to the agent's prompt such as "if you derived this from a known source, cite it in a comment." This is imperfect, but it catches some cases.
Run periodic structural similarity scans against a Stack Overflow code corpus. Snyk Code, Black Duck Snippet Matching, and Veracode are commercial candidates; confirm with the vendor that the scanner's corpus actually includes Stack Overflow content before relying on it.
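The first mitigation, excluding a domain from retrieval, is easy to get subtly wrong with substring checks. Here is a minimal Python sketch of a host-based filter; the names `BLOCKED_DOMAINS`, `is_blocked`, and `filter_retrieval` are hypothetical, and where you hook this into your agent's retrieval layer depends on your stack.

```python
from urllib.parse import urlsplit

BLOCKED_DOMAINS = {"stackoverflow.com"}  # extend per your own policy

def is_blocked(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    candidate = url if "//" in url else "//" + url  # tolerate scheme-less URLs
    host = (urlsplit(candidate).hostname or "").lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in blocked)

def filter_retrieval(urls):
    """Drop candidate retrieval URLs that point at blocked domains."""
    return [u for u in urls if not is_blocked(u)]

candidates = [
    "https://stackoverflow.com/questions/1",
    "https://meta.stackoverflow.com/questions/2",
    "https://notstackoverflow.com/page",
    "https://example.com/?ref=stackoverflow.com",
]
print(filter_retrieval(candidates))
# ['https://notstackoverflow.com/page', 'https://example.com/?ref=stackoverflow.com']
```

Matching on the parsed hostname rather than on the raw string avoids blocking unrelated sites that merely mention the domain in a path or query parameter.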

None of these are bulletproof. The honest answer is that Stack Overflow content vendored by agents is a known-unknown in most enterprise codebases, and the right time to start measuring it was last quarter.

Vulnerability Triage on Vendored Code

Once vendored components are in the SBOM, they integrate with the SBOM-to-CVE workflow. Run grype:

grype sbom:sbom-enriched.cdx.json --output json > grype.json

# Look for CVEs against vendored components (assumes grype carries the
# CycloneDX properties through to .artifact.metadata; verify for your version)
jq '.matches[]
  | select(any(.artifact.metadata.properties[]?; .name == "crashoverride:authored-by"))
  | {cve: .vulnerability.id, package: .artifact.name, severity: .vulnerability.severity}' \
  grype.json

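Vendored code inherits the CVE timeline of its source package. If the source package version recorded in the SBOM has a published CVE, your inlined paraphrase is presumptively vulnerable. Note that grype matches on package identity, so a vendored component only produces a hit if it carries the source package's name and version (or purl) rather than a placeholder such as 0.0.0-vendored. The SBOM evidence record (AST similarity 0.87) is what justifies treating a hit as real rather than as a false positive.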
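Vulnerability scanners generally look packages up by identity, so it helps to derive a package URL (purl) for the source package from the vendored component's properties. The sketch below assumes the crashoverride:source-package property is recorded as name@version; the function name `source_purl` is hypothetical, and the npm default is only an example ecosystem.

```python
from typing import Optional
from urllib.parse import quote

def source_purl(component: dict, ecosystem: str = "npm") -> Optional[str]:
    """Derive a purl for the source package of a vendored component,
    or None if the property is missing or carries no version."""
    props = {p.get("name"): p.get("value") for p in component.get("properties", [])}
    source = props.get("crashoverride:source-package") or ""
    name, sep, version = source.rpartition("@")
    if not sep or not name or not version:
        return None
    # Percent-encode each path segment; a scope like "@babel" becomes "%40babel".
    path = "/".join(quote(part, safe="") for part in name.split("/"))
    return f"pkg:{ecosystem}/{path}@{quote(version, safe='')}"

component = {
    "bom-ref": "vendored:utils/debounce.ts",
    "properties": [
        {"name": "crashoverride:source-package", "value": "lodash.debounce@4.0.8"},
    ],
}
print(source_purl(component))  # pkg:npm/lodash.debounce@4.0.8
```

Whether you write this purl onto the vendored component itself or onto a separate "derived-from" entry is a design choice; the first makes scanners match directly, the second keeps the SBOM from claiming the vendored file is the upstream package.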

Operational Notes

Confidence scores matter. AST similarity above ~0.8 is strong evidence; below ~0.5 is noise. Tune thresholds per language.
Re-scan on every agent-touched commit. Agents introduce vendored code in small commits; a quarterly scan misses too much. Run incremental scans in CI.
Track the agent, not just the file. The CycloneDX properties field should record agent ID and model. When a model version is later flagged for a tendency to paraphrase from a specific source, you can query the SBOM corpus to find affected components.
Distinguish "vendored" from "regenerated." If an agent rewrites a function from scratch with no traceable source, that is original code. If it paraphrases a recognizable source, that is vendoring. The AST-similarity score is the dividing line.
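The thresholds in these notes translate into a small classification step. In this Python sketch the 0.8 and 0.5 cut-offs come from the guidance above, while the middle "needs-review" band and the function name `classify_match` are assumptions of mine; the defaults are parameters so they can be tuned per language.

```python
def classify_match(similarity: float,
                   vendored_at: float = 0.8,
                   noise_below: float = 0.5) -> str:
    """Map a similarity score to a triage label."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if similarity >= vendored_at:
        return "vendored"       # strong evidence: record it in the SBOM
    if similarity < noise_below:
        return "original"       # treat as regenerated or original code
    return "needs-review"       # ambiguous band: route to a human

for score in (0.87, 0.65, 0.30):
    print(score, classify_match(score))
# 0.87 vendored
# 0.65 needs-review
# 0.3 original
```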

What This Buys You

A scanner-flagged copyleft contamination, two years after the fact, is the worst-case outcome. It is also the default outcome for any team running coding agents without vendored-code SBOM enrichment.

The pipeline above — agent-file detection, AST scan, source-license resolution, SBOM injection, license-policy gate — turns the silent contamination problem into a CI failure on the PR that introduced it. The cost is one CI step. The benefit is that your SBOM stops lying about what is actually in your codebase.
