▶ dbt vs traditional ETL (like Informatica) – what's the difference?
dbt is an ELT framework (Extract, Load, Transform) – data lands in your warehouse first, then dbt transforms it using SQL. Traditional ETL transforms before loading. dbt's advantages: leverage the warehouse's compute (Snowflake, BigQuery), version-control transformations via Git, test data quality, document lineage, and iterate quickly. Drawback: it requires a centralized data warehouse (not cheap for small datasets). Use dbt if you have Snowflake/BigQuery/Redshift + >1GB data. Use traditional ETL if you're moving raw files between on-prem systems.
▶ What's a 'model' in dbt and how does it differ from a view or table?
A dbt model = a SQL file that defines a transformation. When you run dbt, each model becomes either a view or a table in your warehouse (you choose via the materialization config). View = the query re-runs on every read (no storage cost, slower reads); table = pre-computed and stored (build compute paid up front, fast reads). Incremental model = hybrid (update only new rows since the last run). Use tables for frequently queried fact tables, views for intermediate staging layers. `SELECT * FROM {{ ref('my_model') }}` pulls from upstream models and lets dbt infer the dependency graph.
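As a concrete sketch, a fact-table model might look like this (model, table, and column names are illustrative, not from a real project):

```sql
-- models/marts/fct_orders.sql
-- Hypothetical model: materialized as a table because dashboards query it often.
{{ config(materialized='table') }}

select
    o.order_id,
    o.customer_id,
    o.amount,
    c.signup_date
from {{ ref('stg_orders') }} as o          -- upstream staging model
left join {{ ref('stg_customers') }} as c  -- ref() lets dbt build the DAG
    on o.customer_id = c.customer_id
```

Switching `materialized='table'` to `'view'` or `'incremental'` changes only how dbt builds the object, not the SQL itself.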
▶ How do I test data quality in dbt and prevent bad data from reaching downstream dashboards?
dbt includes built-in generic tests (`unique`, `not_null`, `relationships`, `accepted_values`) declared in schema.yml, plus custom SQL tests. Example: give `user_id` the tests `unique` and `not_null` under its column entry in schema.yml. Run `dbt test` to validate. Fail fast: with `dbt build`, a failing test on a model skips its downstream dependents. Advanced: write custom tests (e.g., 'conversion rate between 0 and 1'). Test at the source level (raw data) and the model level (transformed). Monitor test coverage: aim for >80% of columns tested.
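A schema.yml entry declaring those tests might look like this sketch (model and column names are assumptions):

```yaml
# models/schema.yml
version: 2

models:
  - name: stg_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
      - name: plan
        tests:
          - accepted_values:
              values: ['free', 'paid']
```

`dbt test` then generates and runs one SQL query per test, failing if any query returns rows.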
▶ Snapshots and incremental models – when do I use each?
Incremental = transforms only NEW rows since last run (fast, cheap). Use for fact tables growing daily (events, transactions). Snapshot = captures a point-in-time view of a slowly-changing dimension (SCD). Example: snapshot user profile each day, track when 'plan' column changed from free to paid. Incremental fails if you need historical versions of a row; snapshot is designed for that. Rule of thumb: incremental for immutable events, snapshots for mutable dimensions.
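The incremental pattern can be sketched as follows (source, table, and column names are illustrative):

```sql
-- models/fct_events.sql
-- Hypothetical incremental model: only process rows newer than what's loaded.
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ source('analytics', 'raw_events') }}

{% if is_incremental() %}
  -- On incremental runs, {{ this }} refers to the already-built table,
  -- so we only pull events newer than the latest one stored.
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```

Snapshots use a separate `{% snapshot %}` block with a `strategy` (e.g. `check` on the `plan` column) so dbt records each historical version of a row instead of overwriting it.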
▶ How do I set up CI/CD for dbt in production (dbt Cloud vs self-hosted)?
dbt Cloud = managed solution: jobs run on dbt's infrastructure, with a built-in scheduler and PR previews (CI builds changed models into a temporary schema per pull request). Cost ~$100-300/mo. Self-hosted (Airflow/Dagster) = you manage a Python orchestrator and invoke the dbt CLI in tasks; more control but more ops. Modern best practice: dbt Cloud for small teams (<$10M ARR), self-hosted for large companies managing complex DAGs. dbt Cloud integration: open a PR, and dbt Cloud runs `dbt build` (which executes models and their tests) against the preview schema before merging to main.
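For the self-hosted route, a minimal CI pipeline might look like this sketch (GitHub Actions shown as one option; the adapter, target name, and secret are assumptions you'd swap for your own setup):

```yaml
# .github/workflows/dbt-ci.yml
name: dbt CI
on:
  pull_request:

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-snowflake   # swap for your warehouse adapter
      - run: dbt deps                    # install packages from packages.yml
      - run: dbt build --target ci       # runs models, tests, seeds, snapshots
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```

The `ci` target would point at a throwaway schema in profiles.yml so PR builds never touch production tables.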
▶ What's the dbt semantic layer and how does it change analytics?
Semantic layer = centralized metric definitions (one source of truth for KPIs). Instead of each BI tool defining 'revenue' differently, dbt semantic layer defines it once (SELECT SUM(amount) WHERE status='paid'), and Looker/Tableau/Metabase pull from it. Benefit: consistency, fast iterations (change metric definition once, all dashboards update). Drawback: new workflow, requires dbt Cloud (not Core), learning curve. Use if: multiple BI tools or teams defining metrics differently. Skip if: small team, single BI tool.
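With dbt's MetricFlow-based semantic layer, that single revenue definition might be sketched like this (model, entity, and column names are assumptions, not a real project's spec):

```yaml
# models/marts/orders.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: paid_amount
        agg: sum
        expr: "case when status = 'paid' then amount else 0 end"

metrics:
  - name: revenue
    label: Revenue
    type: simple
    type_params:
      measure: paid_amount
```

BI tools then query `revenue` through the semantic layer API instead of re-implementing the SUM themselves.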
▶ How do I debug dbt when transformations produce wrong results?
Use `dbt debug` to check the warehouse connection. Use `dbt run --select my_model` to run one model. Check the `target/compiled/` folder to see the actual SQL generated (Jinja templates expanded). Check `target/run_results.json` for execution times and test failures. Common bugs: missing WHERE clause (forgot to filter source data), wrong join logic (a many-to-many join creating duplicates), stale dependencies (an old ref() pointing at the wrong model). Always run `dbt test` after model changes. Use `dbt source freshness` to check source data staleness.
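A typical debugging pass chains these commands together (the project name `my_project` and model path are illustrative placeholders):

```shell
dbt debug                           # verify profiles.yml and warehouse connectivity
dbt run --select my_model           # rebuild just the suspect model
less target/compiled/my_project/models/staging/my_model.sql   # inspect Jinja-expanded SQL
dbt test --select my_model          # re-run that model's tests
dbt source freshness                # check whether upstream source data is stale
```

Running the compiled SQL from `target/compiled/` directly in the warehouse console is often the fastest way to spot a bad join or missing filter.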