DEV AnythinG

Why You Shouldn’t Always Use dropDuplicates for Deduplication in PySpark

This article analyzes why dropDuplicates() is slow for deduplicating large PySpark datasets (due to its first aggregation over all columns and memory overhead), and shares an optimization case study d
2025-06-17
BigData > engineering
#pyspark #spark #data-engineering #big-data #performance #optimization #deduplication #row_number #dropDuplicates #window-functions

Relationship Safety Also Needs a Checkup

To prevent repeated dating violence and relationship-related tragedies, I built a simple test that helps anyone check for subtle warning signs in their romantic relationships.
2025-06-01
SeriesHub > fixground notes
#experiment #self-assessment #relationship #dating-violence #test #social-issue

Terraform, How a Small Data Team Survived

This post shares the lessons learned and structural decisions we made while adopting Terraform in a small data team. We also highlight key choices that helped us reduce errors and improve maintainabil
2025-04-20
DevOps > terraform
#terraform #iac #infra #data engineering #small team #terraform cloud

ChatGPT as a Data Analyst with DuckDB + S3

In this post, I’ll walk you through how I connected ChatGPT to my own dataset stored in S3 using DuckDB, and had the AI analyze it like a real data analyst.
2025-03-30
MachineLearning > experiment
#DeepLearning #LLM #ChatGPT #openai #FastAPI #DuckDB #S3 #AI

Build Your Own Data Warehouse with DuckDB + S3

Introducing an easy way to store data in AWS S3 and query it directly from your local machine using DuckDB.
2025-03-23
BigData > engineering
#duckdb #datawarehouse #data #s3 #aws

Hexo Jess