▶Spark vs SQL warehouses (Snowflake/BigQuery) — when to use which?
Spark: complex transformations, ML, streaming, open-source, lower cost at massive scale (petabytes). SQL warehouses: structured analytics, dashboards, simple SQL, managed infra, faster queries on GB-TB scale. Sweet spot: Spark for ETL/pipelines, warehouse for BI. Many orgs use both: Spark for prep, warehouse for serving.
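A minimal sketch of that split, assuming hypothetical bucket paths, column names, and aggregation logic: Spark does the heavy transformation, then writes curated Parquet that the warehouse exposes (e.g. as an external table) for BI.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-prep").getOrCreate()

# Heavy, multi-stage transformation stays in Spark.
events = spark.read.json("s3://raw-bucket/events/2026/01/")
daily = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy(F.to_date("event_ts").alias("day"), "country")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

# Curated output for the warehouse; dashboards query Snowflake/BigQuery, not the Spark cluster.
daily.write.mode("overwrite").partitionBy("day").parquet("s3://curated-bucket/daily_revenue/")
```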
▶Databricks vs DIY Spark (EMR/Dataproc) — what's the real cost difference?
Databricks: managed, Delta Lake native, ~2-3x the opex of raw cloud compute but includes cluster mgmt/optimization/notebooks. EMR/Dataproc: cheaper per compute-hour, but you own the ops overhead, DevOps time, and monitoring. For teams <5: Databricks saves money. For teams >15 with Kubernetes expertise: DIY wins if you automate. 2026 trend: Databricks Serverless Compute removes cluster ops entirely.
▶When NOT to use Spark — when is it overkill?
Spark has overhead: cluster startup (2-5 min), memory footprint (4GB+ minimum). Avoid: <10GB datasets (Pandas faster), real-time <100ms latency (use Kafka/Flink), simple SQL (use warehouse), batch <1GB (local machine). Spark shines: >1TB, complex multi-stage, ML training, streaming millions/sec, diverse data sources.
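For scale-sense, a rough sketch of the small-data path (file name and columns are hypothetical): below ~10GB, plain Pandas on one machine skips the cluster spin-up entirely.

```python
import pandas as pd

# Fits in memory: no cluster, no 2-5 minute startup, no shuffle.
df = pd.read_parquet("daily_orders.parquet")
top = (
    df[df["status"] == "complete"]
    .groupby("country")["amount"]
    .sum()
    .nlargest(10)
)
print(top)
```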
▶PySpark vs Scala Spark — which should I learn?
PySpark: easier to learn, richer ML ecosystem (scikit-learn, TensorFlow), preferred by data scientists, slower UDFs (Python serialization overhead; vectorized pandas UDFs help, see the sketch below). Scala: native, best performance, used by Netflix/Uber backends, steeper learning curve. Start with PySpark. If you hit performance walls on UDFs, learn Scala for critical paths. Most jobs ask for PySpark.
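If the wall you hit is UDF speed, try a vectorized pandas UDF before switching languages. A minimal sketch (the column names and 1.21 tax rate are made up): it processes Arrow batches instead of one Python object per row.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 12.5), (3, 99.0)], ["id", "price"])

@pandas_udf("double")
def add_tax(price: pd.Series) -> pd.Series:
    # Vectorized: runs on whole Arrow batches, not row by row.
    return price * 1.21

df.withColumn("price_with_tax", add_tax("price")).show()
```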
▶Spark 4 changes — what broke and do I care?
Spark 3.5→4.0: PySpark type hints (opt-in), ANSI SQL mode on by default (stricter casts, errors on overflow instead of silent NULLs), minor API changes. Most Spark 3 code runs unchanged; test jobs that rely on lenient casting. Benefits: 30-50% faster on TPC-DS, better Kubernetes support, serverless improvements. Upgrade gradually. New projects: use 4.0 directly.
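The change most likely to bite is ANSI mode being on by default. A small sketch of how it shows up and the per-session escape hatch while you migrate:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

# Spark 3 default: returns NULL. Spark 4 (ANSI on): raises a cast error.
spark.sql("SELECT CAST('abc' AS INT)").show()

# Temporary opt-out for legacy jobs (fix the queries, don't leave this forever).
spark.conf.set("spark.sql.ansi.enabled", "false")
```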
▶How do I optimize Spark costs and performance?
Partition smartly (avoid >10k tiny partitions), use broadcast joins for tables <100MB, cache only hot DataFrames, tune shuffle parallelism (spark.sql.shuffle.partitions), prefer Parquet/Delta over CSV, push filters down to columnar formats, right-size executors (don't overprovision). Databricks Photon: up to 10-50x speedup on SQL workloads with no code changes. Monitor with the Spark UI (stages, executors, shuffle).
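Two of those knobs in one sketch (the paths, join key, and partition count of 64 are illustrative, not universal defaults): a broadcast hint for a small dimension table plus an explicit shuffle-partition setting.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Default is 200; size this to your data volume and core count.
spark.conf.set("spark.sql.shuffle.partitions", "64")

facts = spark.read.parquet("s3://curated-bucket/orders/")      # large fact table
dims = spark.read.parquet("s3://curated-bucket/countries/")    # small (<100MB) dimension

# Broadcast the small side: each executor gets a copy, so the big table isn't shuffled.
joined = facts.join(broadcast(dims), on="country_code", how="left")
joined.write.mode("overwrite").parquet("s3://curated-bucket/orders_enriched/")
```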
▶Spark for ML — when to use vs scikit-learn or TensorFlow?
MLlib (Spark's ML library): distributed training on petabytes, good for classification/regression/clustering at scale. Downsides: slower than TensorFlow/PyTorch, limited deep learning. Use MLlib: >1TB data, distributed CPU training. Use TensorFlow: deep learning, GPUs, <1TB. Hybrid: PySpark for prep, TensorFlow for training on sampled data, score back in Spark.
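A minimal MLlib sketch of the distributed-training case (the path, feature columns, and label are hypothetical); below ~1TB, single-node scikit-learn is usually simpler and faster.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
train = spark.read.parquet("s3://curated-bucket/training_data/")  # columns: f1, f2, f3, label

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)

model = Pipeline(stages=[assembler, lr]).fit(train)  # training runs across the cluster
model.transform(train).select("label", "prediction").show(5)
```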