mrpowers, Author at MungingData

Write sqlite tables to CSV / Parquet files

mrpowers September 18, 2020

This blog post explains how to write sqlite tables to CSV and Parquet files. It’ll also show how to output SQL queries to CSV files. It’ll even show how to […]
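As a rough illustration of the table-to-CSV idea, here is a minimal sketch that uses only Python's standard library. The `trees` table, its columns, and the helper name `write_query_to_csv` are hypothetical and are not taken from the post; the post itself may well use different tools, and Parquet output is not covered here.

```python
import csv
import io
import sqlite3


def write_query_to_csv(conn, query, out):
    """Run `query` and write a header row plus all result rows to `out`."""
    cursor = conn.execute(query)
    writer = csv.writer(out)
    # cursor.description holds one tuple per result column; item 0 is the name.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# Hypothetical in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (id INTEGER, species TEXT)")
conn.executemany("INSERT INTO trees VALUES (?, ?)", [(1, "oak"), (2, "pine")])

# A StringIO buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_query_to_csv(conn, "SELECT * FROM trees ORDER BY id", buffer)
csv_text = buffer.getvalue()
```

Because the helper takes any SQL string, the same function covers both "dump a whole table" and "export the result of a query".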

Creating a sqlite database from CSVs with Python

mrpowers September 16, 2020

This blog post demonstrates how to build a sqlite database from CSV files. Python is a perfect language for this task because it has great libraries for sqlite and CSV DataFrames. […]
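A minimal sketch of loading CSV data into sqlite, assuming only the standard library rather than the DataFrame libraries the post mentions. The table name, the sample rows, and the helper `load_csv_into_table` are made up for illustration.

```python
import csv
import io
import sqlite3


def load_csv_into_table(conn, table_name, csv_file):
    """Create `table_name` from the CSV header and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Identifiers cannot be bound as parameters, so they are quoted inline.
    # Only use trusted table and column names with this sketch.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    conn.commit()


# Hypothetical CSV content; a real script would open a file instead.
csv_file = io.StringIO("id,species\n1,oak\n2,pine\n")
conn = sqlite3.connect(":memory:")
load_csv_into_table(conn, "trees", csv_file)
rows = conn.execute('SELECT * FROM "trees" ORDER BY id').fetchall()
```

The columns are created without declared types, so every value is stored as the text read from the CSV. Adding type inference is one reason a DataFrame library can be more convenient for this job.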

Amazing Python Data Workflow with Poetry, Pandas, and Jupyter

mrpowers September 5, 2020

Poetry makes it easy to install Pandas and Jupyter to perform data analyses. Poetry is a robust dependency management system and makes it easy to make Python libraries accessible in […]

Writing Custom Metadata to Parquet Files and Columns with PyArrow

mrpowers August 28, 2020

Metadata can be written to Parquet files or columns. This blog post explains how to write Parquet files with metadata using PyArrow. Here are some powerful features that Parquet files […]

Analyzing Parquet Metadata and Statistics with PyArrow

mrpowers August 24, 2020

The PyArrow library makes it easy to read the metadata associated with a Parquet file. This blog post shows you how to create a Parquet file with PyArrow and review […]

Reading CSVs and Writing Parquet files with Dask

mrpowers August 23, 2020

Dask is a great technology for converting CSV files to the Parquet format. Pandas is good for converting a single CSV file to Parquet, but Dask is better when dealing […]

PySpark UDFs with Dictionary Arguments

mrpowers August 8, 2020

Passing a dictionary argument to a PySpark UDF is a powerful programming technique that’ll enable you to implement some complicated algorithms that scale. Broadcasting values and writing UDFs can be […]

Converting a PySpark DataFrame Column to a Python List

mrpowers July 28, 2020

There are several ways to convert a PySpark DataFrame column to a Python list, but some approaches are much slower than others, or more likely to fail with OutOfMemory exceptions! […]

Fetching Random Values from PySpark Arrays / Columns

mrpowers July 26, 2020

This post shows you how to fetch a random value from a PySpark array or from a set of columns. It’ll also show you how to add a column to […]

Building DAGs / Directed Acyclic Graphs with Python

mrpowers July 25, 2020

Directed Acyclic Graphs (DAGs) are a critical data structure for data science / data engineering workflows. DAGs are used extensively by popular projects like Apache Airflow and Apache Spark. This […]
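To make the data structure concrete, here is a small sketch of a DAG stored as a list of edges and ordered with Kahn's algorithm, a standard topological sort. The task names and the function `topological_order` are illustrative assumptions; the post may build its graphs with a dedicated library instead.

```python
from collections import deque


def topological_order(edges):
    """Return the nodes so that every edge (u, v) has u before v."""
    successors = {}
    in_degree = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
        in_degree[v] = in_degree.get(v, 0) + 1
        in_degree.setdefault(u, 0)

    # Start with the nodes that have no incoming edges.
    ready = deque(sorted(n for n, d in in_degree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Nodes left unprocessed mean some edge points back into a cycle.
    if len(order) != len(in_degree):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical pipeline: each pair is (upstream task, downstream task).
edges = [
    ("extract", "clean"),
    ("clean", "load"),
    ("extract", "validate"),
    ("validate", "load"),
]
order = topological_order(edges)
```

The cycle check is what enforces the "acyclic" part: a scheduler can only run tasks in dependency order if such an ordering exists.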