Why You Shouldn’t Always Use dropDuplicates for Deduplication in PySpark
This article analyzes why dropDuplicates() is slow for deduplicating large PySpark datasets (due to its first() aggregation over all columns and its memory overhead), and shares an optimization case study d