Ingestion
Overview
The ingestion layer is responsible for collecting and normalizing data that feeds the dependency graph and contributor statistics. For v0, ingestion focuses on three primary streams:
- SBOM submissions – explicit dependency declarations from project repos (verification layer).
- Reference graph bootstrapping – automated crawling of public package registries and OpenGrants to build an initial graph from known Stellar/Soroban PG roots.
- Git contributor logs – for pony factor calculation (separate but parallel ingestion).
All ingestion writes at repo resolution. Project vertices are primarily sourced from OpenGrants; Repo vertices are created/updated by SBOM ingestion and registry crawls. Dependencies outside the Stellar ecosystem are stored as ExternalRepo vertices — tracked for blast radius analysis only, with no project-level data maintained.
The goal is rapid bootstrapping of a meaningful graph while encouraging accurate, ongoing SBOM contributions. All ingestion pipelines must be idempotent, validate inputs, and handle incremental updates without full reprocessing.
SBOM Ingestion
Source: GitHub Action workflow run by project teams on PR/merge to main (or tagged releases). Each SBOM submission is associated with a specific Repo, not a project directly.
Workflow:
- Teams add a lightweight GitHub Action to their workflows. The action fetches the repo’s SPDX 2.3 dependency graph from the GitHub Dependency Graph API and submits it to the PG Atlas ingestion endpoint, authenticated via a GitHub OIDC token. Supports both public and private repos.
- Optional: allow non-GitHub SBOM submissions which are signed with a project key for provenance (deferred for v0).
Authentication:
The action requests a short-lived GitHub OIDC token (RS256-signed JWT issued by GitHub’s OIDC provider) with the PG Atlas API URL as the audience, and sends it in the Authorization: Bearer header of the submission request. No secrets need to be configured in the calling repository — the only caller-side requirement is id-token: write in the workflow’s permissions block.
The API verifies the token by:
- Fetching GitHub’s public JWKS from https://token.actions.githubusercontent.com/.well-known/jwks.
- Verifying the RS256 signature and standard claims (iss, exp, aud).
- Extracting the repository claim (owner/repo) to establish which repo submitted the SBOM, and recording the actor (triggering user) for audit purposes.
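A minimal sketch of the server-side verification using PyJWT’s JWKS client; the audience value is a placeholder for the real PG Atlas API URL:

import jwt  # PyJWT

GITHUB_JWKS_URL = "https://token.actions.githubusercontent.com/.well-known/jwks"
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"
AUDIENCE = "https://pg-atlas.example.org"  # placeholder: the PG Atlas API URL

jwks_client = jwt.PyJWKClient(GITHUB_JWKS_URL)

def verify_oidc_token(token: str) -> dict:
    # Resolve the signing key by the token's kid, then verify the signature
    # plus exp, iss, and aud in one call.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=GITHUB_ISSUER,
    )
    # claims["repository"] is "owner/repo"; claims["actor"] is the triggering user.
    return claims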
Both GitHub-hosted and self-hosted runners are supported. The OIDC token in both cases is signed by GitHub’s OIDC provider and contains a runner_environment claim (github-hosted or self-hosted).
Trust model: The OIDC token cryptographically proves the identity of the submitting repo — it guarantees that the submission originated from a workflow running in the context of owner/repo, authorized by a GitHub user with write access. It does not independently verify the content of the submitted SBOM: a workflow author controls the workflow YAML and could in principle modify the payload before submission. The principal mitigations are: (1) the reference graph cross-check (A8) flags declared dependencies that diverge from the inferred graph; (2) all submissions are logged with the repository and actor claims, making falsification an attributable act; (3) community review and the public leaderboard create social accountability.
Processing:
- Validate SPDX 2.3 format and schema.
- Extract dependencies (package name + version range).
- Map each dependency to a Repo (if within-ecosystem) or ExternalRepo (if external). Normalize ecosystem-specific names (e.g., soroban-sdk across crates/npm) to match the canonical_id format (ecosystem:package).
- Upsert the submitting Repo vertex. If its parent Project doesn’t exist, create it or flag for manual triage.
- Create/update depends_on edges from the submitting repo to each dependency (Repo or ExternalRepo). Mark confidence as verified-sbom.
- Flag conflicts with reference graph (e.g., missing declared deps) for manual review.
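To make the extraction step concrete, a minimal sketch that pulls direct dependencies out of an SPDX 2.3 JSON document via its DEPENDS_ON relationships (field names follow the SPDX JSON schema; validation and error handling omitted):

import json

def extract_dependencies(sbom_json: str) -> list[tuple[str, str]]:
    # Return (package name, version) pairs referenced by DEPENDS_ON relationships.
    doc = json.loads(sbom_json)
    packages = {p["SPDXID"]: p for p in doc.get("packages", [])}
    deps = []
    for rel in doc.get("relationships", []):
        if rel.get("relationshipType") == "DEPENDS_ON":
            pkg = packages.get(rel.get("relatedSpdxElement"))
            if pkg is not None:
                deps.append((pkg["name"], pkg.get("versionInfo", "")))
    return deps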
Incentives & Enforcement (v0):
- Soft: Bonus points in PG scoring for early/complete submissions.
- Planned: Tie to SCF Build testnet tranche release (preferred over mainnet to capture dependencies early).
Example workflow:
on:
  push:
    branches: [main]  # per above: run on merge to main (or tagged releases)

jobs:
  sbom:
    runs-on: ubuntu-latest
    permissions:
      contents: read   # for GitHub Dependency Graph API
      id-token: write  # for OIDC authentication to PG Atlas
    steps:
      - uses: SCF-Public-Goods-Maintenance/pg-atlas-sbom-action@<full-commit-hash>
The api-url input defaults to the production PG Atlas endpoint and does not need to be set. The calling repo must have the GitHub dependency graph enabled.
Open Questions:
- Mandatory vs. optional for v0? (Risk: low uptake → sparse graph; mitigation: strong reference graph bootstrapping).
Reference Graph Bootstrapping
Purpose: Address low initial SBOM uptake by proactively building a “reference graph” from public metadata, starting from curated root nodes.
Sources:
- OpenGrants — primary source for Project vertices and their metadata (name, status, organization URL).
- deps.dev gRPC API — cross-ecosystem dependency resolution for PyPI, npm, Cargo, Go, Maven, NuGet, and RubyGems packages.
- GitHub API — repository enumeration for organizations, release/tag discovery.
Architecture: Procrastinate task queue backed by the same hosted PostgreSQL instance (no separate broker), with workers running in a weekly GitHub Actions workflow. This provides free compute, built-in run history/logs, and higher GitHub API rate limits via GITHUB_TOKEN.
Task hierarchy:
sync_opengrants [opengrants queue]
└─ process_project [opengrants queue]
└─ crawl_github_repo [opengrants queue]
└─ crawl_package_deps [package-deps queue]
Workers run each queue sequentially so that all Repo vertices exist before the dependency crawl begins.
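A minimal sketch of the task wiring, assuming Procrastinate 2.x; the connection string, the fetch_scf_projects helper, and the task bodies are illustrative placeholders:

import procrastinate

app = procrastinate.App(
    connector=procrastinate.PsycopgConnector(conninfo="postgresql:///pg_atlas"),
)

@app.task(queue="opengrants")
def sync_opengrants():
    for project in fetch_scf_projects():  # hypothetical OpenGrants client call
        # queueing_lock deduplicates pending jobs per project (see Boundaries).
        process_project.configure(
            queueing_lock=f"project:{project['id']}",
        ).defer(project_id=project["id"])

@app.task(queue="opengrants")
def process_project(project_id: str):
    ...  # deps.dev metadata, repo discovery, defer crawl_github_repo

@app.task(queue="package-deps")
def crawl_package_deps(ecosystem: str, package: str):
    ...  # deps.dev GetPackage/GetRequirements walk

One way to get the sequential behavior in the Actions workflow is to run a worker with app.run_worker(queues=["opengrants"], wait=False) so it exits once that queue drains, then start a second worker on package-deps.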
Process:
- Bootstrap Project vertices from OpenGrants:
  - sync_opengrants fetches all SCF grant pools and their applications. Each application is mapped to an ScfProject containing the project ID, display name, GitHub URL (from the io.scf.code extension field), activity status, and metadata.
  - A manual project-git-mapping.yml supplements projects that lack an io.scf.code field (early rounds).
  - Deduplication is by project ID; the latest round’s data wins.
  - Populate activity_status from SCF Impact Survey data when available; default to non-responsive for pre-existing projects with no survey response (see Activity Status Update Logic).
  - Pre-survey data: we use tranche completion as a proxy (incomplete → in-dev, complete → live).
- Process each Project:
  - process_project fetches deps.dev project metadata (stars, forks, scorecard) via GetProjectBatch, determines project_type (public-good if packages are detected, scf-project otherwise), upserts the Project vertex, and discovers repos.
  - For organization URLs (github.com/org): enumerates all repos in the org.
  - For single-repo URLs (github.com/owner/repo): uses that repo directly.
- Crawl each repo: crawl_github_repo detects packages published by the repo (via deps.dev GetProjectBatch), fetches release/version history, upserts the Repo vertex (with pkg:github/owner/repo canonical ID), and defers crawl_package_deps for each detected package.
- Crawl dependencies: crawl_package_deps calls deps.dev GetPackage (default version) then GetRequirements to enumerate direct dependencies. For each dependency (see the recursion sketch after this list):
  - If linked to a known Project → upsert as Repo, create depends_on edge (confidence = inferred_shadow), and recurse.
  - Otherwise → upsert as ExternalRepo, create edge, no recursion.
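A condensed, cycle-safe sketch of that recursion. The deps.dev and graph-write calls are injected as placeholder callables, since the generated client’s exact signatures aren’t specified here:

from typing import Callable, Iterable, NamedTuple

class Dep(NamedTuple):
    ecosystem: str
    name: str

def crawl(
    root: Dep,
    direct_deps: Callable[[Dep], Iterable[Dep]],  # wraps GetPackage + GetRequirements
    in_ecosystem: Callable[[Dep], bool],          # linked to a known Project?
    record: Callable[[Dep, Dep], None],           # upsert vertex + depends_on edge
) -> None:
    seen: set[Dep] = {root}

    def visit(pkg: Dep) -> None:
        for dep in direct_deps(pkg):
            record(pkg, dep)  # confidence = inferred_shadow either way
            if in_ecosystem(dep) and dep not in seen:  # within-ecosystem Repo: recurse
                seen.add(dep)
                visit(dep)
            # external deps become ExternalRepo leaves; no recursion

    visit(root)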
Boundaries:
- Only include projects with clear Stellar/Soroban relevance — rooted in OpenGrants SCF data.
- Procrastinate queueing_lock prevents duplicate task execution per project/package.
- Respects registry rate limits; OpenGrants client retries on 429/5xx with exponential backoff.
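A sketch of that retry policy with tenacity and httpx; the attempt count and backoff bounds are illustrative, not the client’s actual configuration:

import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def _retryable(exc: BaseException) -> bool:
    # Retry only on 429 and 5xx responses; fail fast on anything else.
    return isinstance(exc, httpx.HTTPStatusError) and (
        exc.response.status_code == 429 or exc.response.status_code >= 500
    )

@retry(
    retry=retry_if_exception(_retryable),
    wait=wait_exponential(multiplier=1, max=60),  # 1s, 2s, 4s, ... capped at 60s
    stop=stop_after_attempt(5),
)
def fetch_opengrants(url: str) -> dict:
    resp = httpx.get(url, timeout=30)
    resp.raise_for_status()  # raises HTTPStatusError on 4xx/5xx
    return resp.json()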
Git Contributor Logs
Source: Direct git clone of target repositories (triggered on SBOM ingestion or manual curation). Cloned repos may be LRU-cached to avoid re-cloning on every refresh.
Process:
- Parse git log --format='%aN' | sort | uniq -c | sort -nr (or equivalent) over the last 12–24 months.
- Reuse patterns from Scientific Python devstats.
- Create/update Contributor vertices and contributed_to edges pointing to the Repo (not Project). Edge properties include number_of_commits, first_commit_date, last_commit_date.
- Store the computed pony factor on Repo.pony_factor (the smallest number of contributors responsible for ≥50% of commits; see the sketch below). Aggregate to Project.pony_factor by computing the pony factor over the union of unique contributors across all project repos (deduplicated by Contributor.email_hash).
- Update Repo.latest_commit_date from git log — feeds into activity status triangulation (see Activity Status Update Logic).
- Update on triggers (new release tag, quarterly refresh).
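A minimal sketch of the per-repo calculation, keyed on author name as in the git log recipe above (production code would deduplicate by email hash instead):

import subprocess
from collections import Counter

def pony_factor(repo_path: str, since: str = "24 months ago") -> int:
    # Smallest number of contributors covering >= 50% of commits in the window.
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}", "--format=%aN"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in out.splitlines() if line)
    total = sum(counts.values())
    covered = 0
    for factor, (_author, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered * 2 >= total:  # reached >= 50% of commits
            return factor
    return 0  # no commits in the window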
Open Questions:
- Time window for pony factor (12/24 months vs. all history)?
- Weight recent commits higher?
Validation & Reconciliation
- On SBOM ingest: Compare declared deps against reference graph → flag discrepancies for review.
- Deduplication: Canonical node IDs (ecosystem:package for repos, DAOIP-5 URIs for projects).
- Ecosystem boundary: Determine whether each dependency is within-ecosystem (Repo) or external (ExternalRepo). Criteria TBD — initial heuristic: presence in curated seed list or OpenGrants.
- Error handling: Queue failed ingests for manual triage; notify team (via GitHub issue or Sentry?).
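The SBOM-vs-reference-graph check reduces to set differences over canonical IDs; a minimal sketch, with normalization assumed upstream:

def reconcile(declared: set[str], inferred: set[str]) -> dict[str, set[str]]:
    # Both inputs are canonical IDs (ecosystem:package).
    return {
        # Inferred but not declared: likely an incomplete SBOM (flag per A8).
        "missing_from_sbom": inferred - declared,
        # Declared but not inferred: new/private deps, or a padded submission.
        "unconfirmed_declared": declared - inferred,
    }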
Implementation Notes (v0)
- Use a FastAPI endpoint for SBOM webhook ingestion (POST /ingest/sbom, OIDC auth, SPDX 2.3 parsing, 202 Accepted). Read-only list/detail endpoints: GET /ingest/sbom, GET /ingest/sbom/{id}. See the sketch below.
- Procrastinate task queue (PostgreSQL-backed) with GitHub Actions workers for reference graph bootstrapping and future periodic crawl jobs.
- deps.dev gRPC client (auto-generated via betterproto2) for cross-ecosystem dependency resolution.
- Store raw ingested artifacts (SBOM files, git log output) in artifact_store/ for auditability. We’re targeting Storacha as our decentralized artifact storage layer for the production Atlas.
- All writes target Repo, ExternalRepo, Contributor, and edge tables. Project vertices are bootstrapped from OpenGrants and updated via survey/OpenGrants pipelines (see Incremental Updates in Storage).
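A skeletal version of the ingestion endpoint, reusing verify_oidc_token from the authentication sketch above; process_sbom stands in for the deferred Procrastinate task:

from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()

@app.post("/ingest/sbom", status_code=202)
async def ingest_sbom(request: Request, authorization: str = Header(...)):
    token = authorization.removeprefix("Bearer ").strip()
    try:
        claims = verify_oidc_token(token)  # see the authentication sketch
    except Exception:
        raise HTTPException(status_code=401, detail="invalid OIDC token")
    sbom = await request.json()
    if sbom.get("spdxVersion") != "SPDX-2.3":  # cheap shape check up front
        raise HTTPException(status_code=422, detail="expected an SPDX 2.3 document")
    # Full schema validation and graph writes happen in the background;
    # process_sbom is a placeholder for the deferred task.
    await process_sbom.defer_async(repository=claims["repository"], sbom=sbom)
    return {"status": "accepted", "repository": claims["repository"]}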