FAQ
Questions that came up while actually onboarding an app onto the platform, grouped by topic. This is a first pass — expect it to grow and get corrected as more of the tutorial gets exercised for real.
Platform concepts
What is Workload Identity, in plain terms?
It lets a pod authenticate as a GCP service account without carrying any key or password. The pod's Kubernetes ServiceAccount is bound (via an IAM policy) to a GCP service account; GCP recognizes the pod on the spot and issues short-lived credentials. Nothing to leak, nothing to rotate.
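To make the binding a little more concrete, here is a minimal sketch of the two strings involved in a GKE Workload Identity setup: the IAM member that identifies the Kubernetes ServiceAccount, and the annotation that points the ServiceAccount at the GCP service account. The project, namespace, and account names are hypothetical placeholders, not values taken from the platform.

```python
def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """IAM member string that identifies a Kubernetes ServiceAccount to GCP."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def ksa_annotation(gsa_email: str) -> dict[str, str]:
    """Annotation placed on the Kubernetes ServiceAccount."""
    return {"iam.gke.io/gcp-service-account": gsa_email}


# All names below are hypothetical, for illustration only.
member = workload_identity_member("example-project", "myapp-dev", "myapp")
annotation = ksa_annotation("myapp@example-project.iam.gserviceaccount.com")

# The member is granted roles/iam.workloadIdentityUser on the GCP service account.
print(member)
print(annotation)
```

No key material appears anywhere in this: the binding is just a pair of names plus an IAM grant.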
What's the difference between chart/values.yaml and rootsync.yaml in u2i-infra?
chart/values.yaml holds the base defaults for every app (mostly disabled/empty — the safe starting point). k8s/{env}/rootsync.yaml holds the real, environment-specific overrides (actual namespaces, actual Workload Identity bindings, the real DNS zone). Helm deep-merges the two — you only need to write the deltas in rootsync.yaml, not repeat everything.
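The deep-merge behaviour can be approximated in a few lines of Python. This is only a sketch of the idea, not Helm's implementation (for instance, it ignores Helm's rule that a null override removes a key), and the value keys and their shapes are hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not merged.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys standing in for chart/values.yaml (safe defaults)...
chart_defaults = {"myapp": {"enabled": False, "replicas": 1, "dns": {"publicZone": None}}}
# ...and for the per-environment overrides in rootsync.yaml (deltas only).
env_overrides = {"myapp": {"enabled": True, "dns": {"publicZone": "myapp.u2i.dev"}}}

effective = deep_merge(chart_defaults, env_overrides)
print(effective)
```

Note that replicas survives from the defaults even though the override never mentions it, which is why only the deltas need to be written.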
Is rootsync.yaml just a values-override file?
Not primarily. It's a real Kubernetes resource (kind: RootSync) that Config Sync continuously reconciles — its job is to tell the cluster which OCI chart to pull and keep re-checking (every ~15s) that the cluster matches it. The big "per-app overrides" block is the content Config Sync is asked to enforce, not the resource's whole purpose.
What is build-lib?
A shared Cloud Build engine: a Docker image bundling the CLIs every pipeline needs (gcloud, kubectl, helm, buildx, crane) plus a Nushell library of typed commands (build image, release package-for-config-sync, kube ensure-namespace, …). It turns each app's deploy/cloudbuild/*.yaml into a short list of named steps instead of hand-rolled bash. See the build-lib page.
eg vs managed-lb — what's the difference?
eg means the app attaches to the shared, in-cluster Envoy Gateway (cluster-edge/cluster-gateway) via its own ListenerSet: no dedicated infrastructure, and cert-manager handles TLS. managed-lb means a real Google Cloud Load Balancer with its own static IP and a Certificate Manager cert-map. Only the nonprod cluster supports eg (no in-cluster gateway exists on prod), so prod uses managed-lb.
What does KEDA scale-to-zero actually do, and when should it be off?
In nonprod, an idle environment scales down to 0 pods and wakes on the first request (a brief splash screen covers the cold start), so it costs nothing while unused. It should be off in prod for two reasons: anything that must respond instantly cannot tolerate a cold start, and managed-lb can't render the splash screen, so a wake-from-zero would show a raw error instead.
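The relationship between gateway mode and scale-to-zero can be sketched as a small lookup. The environment names come from this FAQ, but the table and the function are illustrative only, not the platform's actual configuration schema.

```python
# Gateway mode per environment, as described in this FAQ:
# the shared Envoy Gateway (eg) in nonprod, a dedicated load balancer in prod.
GATEWAY_MODE = {"dev": "eg", "qa": "eg", "preview": "eg", "prod": "managed-lb"}


def scale_to_zero_allowed(env: str) -> bool:
    """Scale-to-zero relies on the splash screen, which only the eg path can render."""
    return GATEWAY_MODE[env] == "eg"


for env in GATEWAY_MODE:
    print(env, GATEWAY_MODE[env], scale_to_zero_allowed(env))
```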
Domain or subdomain — which is it?
u2i.dev is the root domain. {app}.u2i.dev is a subdomain of it, delegated to its own DNS zone that the app owns. Environment hostnames (dev.{app}.u2i.dev, qa.{app}.u2i.dev) are subdomains of that subdomain.
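The nesting can be spelled out with a short sketch. The app name myapp is a hypothetical placeholder; the domain layout is the one described above.

```python
ROOT_DOMAIN = "u2i.dev"


def app_zone(app: str) -> str:
    """The app's own delegated DNS zone, a subdomain of the root."""
    return f"{app}.{ROOT_DOMAIN}"


def env_hostname(env: str, app: str) -> str:
    """An environment hostname, a subdomain of the app's zone."""
    return f"{env}.{app_zone(app)}"


def parent_domain(name: str) -> str:
    """Drop the leftmost label to get the enclosing domain."""
    return name.split(".", 1)[1]


host = env_hostname("qa", "myapp")         # hypothetical app name
print(host)                                # qa.myapp.u2i.dev
print(parent_domain(host))                 # myapp.u2i.dev
print(parent_domain(parent_domain(host)))  # u2i.dev
```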
Naming & identifiers
Is there a length limit on the app key?
Yes — 20 characters. Foundation generates a {key}-deploy-sa service account, and GCP caps service-account IDs at 30 characters, so 30 minus "-deploy-sa" (10 chars) leaves 20 for the key. A key of 21+ characters fails Terraform apply with an "account_id must be between 6 and 30 characters" error.
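The arithmetic is easy to check before Terraform does it for you. A minimal sketch, assuming only the suffix and the 30-character cap stated above; the function name is hypothetical.

```python
SA_ID_MAX = 30                                # GCP cap on service-account IDs
DEPLOY_SUFFIX = "-deploy-sa"                  # suffix foundation appends to the key
KEY_MAX = SA_ID_MAX - len(DEPLOY_SUFFIX)      # 30 - 10 = 20


def deploy_sa_id(key: str) -> str:
    """Build the deploy service-account ID, rejecting keys that would overflow."""
    if len(key) > KEY_MAX:
        raise ValueError(
            f"app key {key!r} is {len(key)} chars; max is {KEY_MAX} "
            f"so that {key + DEPLOY_SUFFIX!r} fits in {SA_ID_MAX}"
        )
    return key + DEPLOY_SUFFIX


print(KEY_MAX)                 # 20
print(deploy_sa_id("myapp"))   # myapp-deploy-sa
```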
Does the app key have to match the GitHub repo name?
No — github_repo is a separate field in the foundation app entry. The key can (and often should) be shorter; it becomes namespaces, registry names, service accounts, and the hostname base, so keep it short and DNS-safe even if the repo name is longer.
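One way to read "DNS-safe" is sketched below. The pattern is an assumption on our part: lowercase letters, digits, and hyphens, starting with a letter and not ending in a hyphen. The starts-with-a-letter rule is stricter than a plain DNS label and is chosen because the key also prefixes a GCP service-account ID. The sample names are hypothetical.

```python
import re

# Lowercase letters, digits, hyphens; must start with a letter, must not end with a hyphen.
_DNS_SAFE = re.compile(r"[a-z]([a-z0-9-]*[a-z0-9])?")


def is_dns_safe(key: str) -> bool:
    """True if the key is usable as a hostname label and a name prefix."""
    return _DNS_SAFE.fullmatch(key) is not None


# A repo name can be long or mixed-case; the key should not be.
for candidate in ["myapp", "my-app", "My_Long_Repo_Name", "app-", "1app"]:
    print(candidate, is_dns_safe(candidate))
```

This checks the character set only; the length limit is a separate constraint.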
CI/CD & deployment flow
Does opening a PR against the foundation repo apply anything?
No. The PR check (pr-terraform-validation) only runs terraform plan and posts the diff — it never applies. The actual apply happens after merge, via a second Cloud Build trigger that creates a Cloud Deploy release on a delivery pipeline gated behind manual approval.
Does the same hold for u2i-infra (the Config Sync repo)?
No — and this was a real, confirmed problem, not a hypothetical: u2i-infra's PR-validation trigger runs the exact same cloudbuild.yaml as the push-to-main trigger, with no branch/event gating. Opening a PR there packages the chart, re-tags the shared "latest" OCI tag, and creates a real Cloud Deploy release that auto-applies to nonprod — before any review or merge. Confirmed end to end: build steps all succeeded, the Cloud Deploy rollout succeeded, and the real GCP resources existed on the live cluster before the PR was merged. Filed as a critical issue.
A brand-new app's first foundation apply failed on the preview trigger with an "impersonation permission" error — is that a real problem?
Usually not — it's IAM propagation lag. The {app}-ci service account was created moments earlier in the same apply and hasn't fully propagated when the next resource tries to reference it. Everything else in the apply already succeeded; retrying (a fresh release, or gcloud deploy rollouts retry-job) completes it. Retrying needs the clouddeploy.operator role, not just approver.
In what order should a new app actually be onboarded?
Foundation registration first (creates CI/CD scaffolding) → tenant infra (namespaces, Workload Identity, and the app's own DNS subzone) → root DNS delegation (reads the nameservers from that now-existing subzone) → first deploy. "Start with DNS" is a natural-sounding but wrong first instinct — the delegation step depends on the subzone existing, so it has to come after tenant infra, not before.
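The ordering can be written down as a dependency graph and sorted. This is a sketch: the steps are encoded as a simple chain that mirrors the order given above, and only the delegation-after-tenant-infra edge is explained in the text as a hard dependency.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it.
DEPENDS_ON = {
    "foundation registration": set(),
    "tenant infra": {"foundation registration"},
    "root DNS delegation": {"tenant infra"},   # needs the subzone to exist
    "first deploy": {"root DNS delegation"},   # mirrors the stated order
}


def ready(step: str, done: set[str]) -> bool:
    """A step can run once everything it depends on is done."""
    return DEPENDS_ON[step] <= done


order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(" -> ".join(order))

# "Start with DNS" fails the check: nothing has been done yet.
print(ready("root DNS delegation", set()))
```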
DNS
Does registering an app in foundation give it a working hostname automatically?
No. Foundation registration creates zero DNS resources. A hostname requires an explicit dns.publicZone entry in the app's tenant-infra values (which creates the subzone) and a root-zone delegation pointing at it — both are separate, deliberate steps.
Why does the root DNS delegation read the nameservers from a live zone instead of hardcoding them?
Cloud DNS assigns a random nameserver pool when a zone is created. A hardcoded pool goes stale if the zone is ever destroyed and recreated — this caused two real drift incidents on the platform before the fix. Reading via a Terraform data source makes that class of drift impossible. (Two older app entries still hardcode it and haven't been migrated yet.)
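The drift itself is simple to picture: the delegated nameserver set no longer equals the set on the live zone. The sketch below shows only that comparison, in Python rather than Terraform; the nameserver values are illustrative examples in the Cloud DNS naming style, not real records from the platform.

```python
def normalize(nameservers: list[str]) -> set[str]:
    """Compare case-insensitively and ignore the trailing dot."""
    return {ns.rstrip(".").lower() for ns in nameservers}


def delegation_is_stale(delegated: list[str], live: list[str]) -> bool:
    """True when the root-zone NS records no longer match the subzone's nameservers."""
    return normalize(delegated) != normalize(live)


# Illustrative values: a pool hardcoded when the zone was first created...
hardcoded = [f"ns-cloud-a{i}.googledomains.com." for i in range(1, 5)]
# ...and a different pool assigned after the zone was destroyed and recreated.
live_after_recreate = [f"ns-cloud-c{i}.googledomains.com." for i in range(1, 5)]

print(delegation_is_stale(hardcoded, live_after_recreate))
print(delegation_is_stale(live_after_recreate, live_after_recreate))
```

Reading the nameservers from the live zone at plan time means the two sides of this comparison are the same value by construction.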
Do I need DNS working to verify a new app actually runs?
No. kubectl port-forward reaches the pod directly, with no DNS or gateway involved. DNS/hostname setup is about public reachability, which is a separate, later concern from "does the pod come up."
Conventions we found across apps (not all consistent)
Is a shared 'gateway' always required, or can an app skip it?
An app that rides the shared nonprod Envoy gateway through its own workload chart (gatewayMode: eg) does not need a dedicated Gateway from the shared tenant-infra chart; set gateway.enabled: false there. This matches the pattern of the most recently added standalone app, and the gatewayMode split used by a real production app (eg for dev/qa/preview, managed-lb only for prod). Not every existing app follows this consistently.
Two apps that share a DNS zone/gateway (a "family") — is there one standard way to do it?
No — we found two different implementations for the same concept: one family uses a dedicated, separate shared subchart owning the DNS zone and gateway; another has its first app own the shared zone/gateway directly, with the second app riding on it. Same idea, two shapes.
Should a vendored shared-chart dependency (the resolved .tgz) be committed to git?
It depends on the repo, and the platform is inconsistent about it. Config Sync repos mostly commit the resolved tarball (for hermetic reconciliation without a network fetch every ~15s), while application repos mostly gitignore it and rely on Chart.lock + CI running a dependency build. Match whichever convention the specific repo you're in already uses — check its .gitignore and what similar, recently-touched entries do, rather than assuming one universal rule.