2020 - Page 2 of 5 - MungingData

Scala Templates with Scalate, Mustache, and SSP

mrpowers October 30, 2020 0

The scalate library makes it easy to use Mustache or SSP templates with Scala. This blog post shows how to use Mustache and SSP templates and compares the different […]

Expressively Typed Spark Datasets with Frameless

mrpowers October 26, 2020 0

frameless is a great library for writing Datasets with expressive types. The library helps users write correct code with descriptive compile-time errors instead of runtime errors with long stack […]

Write sqlite tables to CSV / Parquet files

mrpowers September 18, 2020 0

This blog post explains how to write sqlite tables to CSV and Parquet files. It’ll also show how to output SQL queries to CSV files. It’ll even show how to […]
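As a rough illustration of outputting a SQL query to CSV, here is a minimal sketch that uses only Python's standard library `sqlite3` and `csv` modules. The `trees` table, its columns, and the `query_to_csv` helper are hypothetical names made up for this example, and the approach in the full post may well differ. Writing Parquet needs a third-party library, so it is left out here.

```python
import csv
import io
import sqlite3

# Hypothetical in-memory database and table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (species TEXT, height_m REAL)")
conn.executemany(
    "INSERT INTO trees VALUES (?, ?)",
    [("oak", 20.5), ("pine", 35.0)],
)


def query_to_csv(conn, sql, out):
    """Run a SQL query and write a header row plus all result rows as CSV."""
    cursor = conn.execute(sql)
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per result column; item 0 is the name.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# A StringIO buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
query_to_csv(conn, "SELECT species, height_m FROM trees ORDER BY species", buffer)
csv_output = buffer.getvalue()
print(csv_output)
```

Swapping the buffer for a real file handle writes the same rows to disk, and passing `SELECT * FROM some_table` dumps a whole table rather than a filtered query.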

Creating a sqlite database from CSVs with Python

mrpowers September 16, 2020 1

This blog post demonstrates how to build a sqlite database from CSV files. Python is a perfect language for this task because it has great libraries for sqlite and CSV DataFrames. […]
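To give a feel for the idea, here is a minimal sketch that loads CSV rows into a sqlite table using only the standard library `csv` and `sqlite3` modules, rather than a DataFrame library. The `people` table, its columns, and the sample data are assumptions invented for this example, not taken from the post.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = "name,age\nalice,30\nbob,25\n"

# An in-memory database keeps the example self-contained;
# pass a file path instead to persist the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# DictReader uses the first CSV row as column names.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(row["name"], int(row["age"])) for row in reader]

# Parameterized inserts avoid building SQL strings by hand.
conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", rows)
conn.commit()

result = conn.execute("SELECT name, age FROM people ORDER BY age").fetchall()
print(result)
```

For a real file, replace the `io.StringIO(csv_text)` wrapper with `open(path, newline="")`. A DataFrame library can infer the column types for you, whereas this sketch declares them by hand.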

Amazing Python Data Workflow with Poetry, Pandas, and Jupyter

mrpowers September 5, 2020 0

Poetry makes it easy to install Pandas and Jupyter to perform data analyses. Poetry is a robust dependency management system and makes it easy to make Python libraries accessible in […]

Writing Custom Metadata to Parquet Files and Columns with PyArrow

mrpowers August 28, 2020 0

Metadata can be written to Parquet files or columns. This blog post explains how to write Parquet files with metadata using PyArrow. Here are some powerful features that Parquet files […]

Analyzing Parquet Metadata and Statistics with PyArrow

mrpowers August 24, 2020 0

The PyArrow library makes it easy to read the metadata associated with a Parquet file. This blog post shows you how to create a Parquet file with PyArrow and review […]

Reading CSVs and Writing Parquet files with Dask

mrpowers August 23, 2020 0

Dask is a great technology for converting CSV files to the Parquet format. Pandas is good for converting a single CSV file to Parquet, but Dask is better when dealing […]

PySpark UDFs with Dictionary Arguments

mrpowers August 8, 2020 2

Passing a dictionary argument to a PySpark UDF is a powerful programming technique that’ll enable you to implement some complicated algorithms that scale. Broadcasting values and writing UDFs can be […]

Converting a PySpark DataFrame Column to a Python List

mrpowers July 28, 2020 2

There are several ways to convert a PySpark DataFrame column to a Python list, but some approaches are much slower, or more likely to error out with OutOfMemory exceptions, than others! […]