Managing Dask Software Environments with Conda
This post shows you how to set up conda on your machine and explains why it’s the best way to manage software environments for Dask projects.
A popular blog post argues that Python projects should be set up with pyenv and Poetry in 2021. Those arguments apply to web-based Python projects, but not to data science projects, which often leverage low-level dependencies like CUDA and xgboost that are notoriously hard to install. Conda is the best way to manage complicated data science software environments.
This post will cover the following topics:
- Install Miniconda
- Install dependencies in base environment
- Difference between conda and conda-forge
- Create separate software environments
- Useful conda commands
- Dependency hell
- Apple M1 chip gotcha
Most data scientists have a hard time managing software environments and debugging the issues that come up. They hate the trial-and-error process that’s often required to get a local environment properly set up.
Read this post closely so you can better understand the process and train yourself to debug environment issues effectively.
Install Miniconda
Go to the conda page to download the installer that suits your machine. There are a plethora of options on the page; it’s easiest to pick the latest Miniconda installer link for your operating system.
I am using a Mac, so I use the Miniconda3 MacOSX 64-bit pkg link.
Open the downloaded package file and it’ll walk you through the installation process.
Close out your Terminal window, reopen it, and you should be ready to run conda commands. Make sure conda --version runs in your Terminal to verify that the installation completed successfully.
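The check looks like this (the version number reflects whichever release you installed; this post uses 4.10.3):

conda --version
# conda 4.10.3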
Install dependencies in base
The default conda environment is called “base”.
You can run conda list to see the libraries that are installed in base.
# packages in environment at /Users/powers/opt/miniconda3:
#
# Name Version Build Channel
brotlipy 0.7.0 py39h9ed2024_1003
ca-certificates 2021.7.5 hecd8cb5_1
certifi 2021.5.30 py39hecd8cb5_0
cffi 1.14.6 py39h2125817_0
chardet 4.0.0 py39hecd8cb5_1003
conda 4.10.3 py39hecd8cb5_0
conda-package-handling 1.7.3 py39h9ed2024_1
cryptography 3.4.7 py39h2fd3fbb_0
idna 2.10 pyhd3eb1b0_0
libcxx 10.0.0 1
libffi 3.3 hb1e8313_2
ncurses 6.2 h0a44026_1
openssl 1.1.1k h9ed2024_0
pip 21.1.3 py39hecd8cb5_0
pycosat 0.6.3 py39h9ed2024_0
pycparser 2.20 py_2
pyopenssl 20.0.1 pyhd3eb1b0_1
pysocks 1.7.1 py39hecd8cb5_0
python 3.9.5 h88f2d9e_3
python.app 3 py39h9ed2024_0
readline 8.1 h9ed2024_0
requests 2.25.1 pyhd3eb1b0_0
ruamel_yaml 0.15.100 py39h9ed2024_0
setuptools 52.0.0 py39hecd8cb5_0
six 1.16.0 pyhd3eb1b0_0
sqlite 3.36.0 hce871da_0
tk 8.6.10 hb0a8c7a_0
tqdm 4.61.2 pyhd3eb1b0_1
tzdata 2021a h52ac0ba_0
urllib3 1.26.6 pyhd3eb1b0_1
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h1de35cc_0
yaml 0.2.5 haf1e3a3_0
zlib 1.2.11 h1de35cc_3
Run conda install -c conda-forge dask to install Dask in the base environment.
This will install Dask and all of its transitive dependencies.
Run conda list again and you’ll see a ton of new dependencies in the environment, including pandas and Dask.
Dask depends on other libraries, so when you install Dask, conda installs both Dask itself and all of the libraries that Dask depends on (aka transitive dependencies).
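You can confirm this by filtering the package list for one of the transitive dependencies; conda list accepts a package name or regex:

# pandas shows up even though we only asked conda to install dask
conda list pandas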
Difference between conda and conda-forge
Conda hosts 720+ official packages in its default channels. Community-contributed packages are stored in conda-forge.
conda-forge is referred to as a “channel”.
Let’s inspect the Dask installation command we ran earlier: conda install -c conda-forge dask. The -c conda-forge part of the command instructs conda to fetch the Dask package from the conda-forge channel.
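If you find yourself typing -c conda-forge constantly, you can register the channel once in your conda configuration (the ~/.condarc file) and drop the flag going forward:

# put conda-forge at the top of the channel list in ~/.condarc
conda config --add channels conda-forge
# subsequent installs no longer need the -c flag
conda install dask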
Create separate software environments
You can specify a list of dependencies in a YAML file and run a command to create a software environment with all of those dependencies (and their transitive dependencies).
This workflow is more complicated, but easier to maintain in the long run and more reliable. It also allows your teammates to easily recreate your environment, which is key for collaboration.
You’re likely to have different projects with different sets of dependencies on your computer. Multiple environments allow you to switch the dependencies for the different projects you’re working on.
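You can write an environment YAML file by hand, as shown below, or have conda generate one from an existing environment. The --from-history flag records only the packages you explicitly requested, rather than every transitive dependency, which keeps the file portable across operating systems:

# snapshot the active environment into a YAML file
conda env export --from-history > environment.yml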
Take a look at the following YAML file with conda dependencies:
name: standard-coiled
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pandas
  - dask[complete]
  - pyarrow
  - jupyter
  - ipykernel
  - s3fs
  - coiled
  - python-blosc
  - lz4
  - nb_conda
  - jupyterlab
  - dask-labextension
We can run a single command to create the standard-coiled environment specified in the YAML file. Clone the coiled-resources repository and change into the project root directory to run this command on your machine.
conda env create -f envs/standard-coiled.yml
You can activate this environment and use all the software you just installed by running conda activate standard-coiled.
This is a completely different environment than the base environment. You can switch back and forth between the two environments to easily move between different sets of dependencies.
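You can verify which environment is active by checking which Python interpreter is on your PATH (the paths below match the Miniconda install from this post; yours will differ):

conda activate standard-coiled
which python
# /Users/powers/opt/miniconda3/envs/standard-coiled/bin/python

conda activate base
which python
# /Users/powers/opt/miniconda3/bin/python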
Useful conda commands
You can run conda env list to see all the environments on your machine.
# conda environments:
#
base /Users/powers/opt/miniconda3
standard-coiled * /Users/powers/opt/miniconda3/envs/standard-coiled
The star next to standard-coiled means it’s the active environment.
Change back to the base environment with conda activate base.
Delete the standard-coiled environment with conda env remove --name standard-coiled.
If an environment gets into a weird state, you can easily delete it and recreate it from the YAML file. Easy recreation is a big advantage of building environments from YAML files, and the YAML file also serves as a record of how the environment was originally created.
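The full reset cycle is just a few commands:

# leave the environment if it’s currently active
conda deactivate
# delete it, then rebuild it from the YAML file
conda env remove --name standard-coiled
conda env create -f envs/standard-coiled.yml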
Dependency hell
It takes a while for conda to create an environment because it needs to perform dependency resolution and download the right packages onto your machine.
Dependency resolution is when conda figures out the set of dependency versions that’ll satisfy the version requirement of each dependency / transitive dependency for the environment.
Dependency hell is the uncomfortable situation you end up in when the dependencies cannot be resolved. Luckily, conda is a mature technology and is good at resolving dependencies whenever possible, saving you from dependency hell.
There are times when conda won’t be able to resolve an environment. That’s when you should try relaxing version constraints or installing all the packages at the same time. Conda has a harder time working out compatible versions when dependencies are installed one by one on the command line. It’s best to put all the dependencies in a YAML file and install them in one shot, so conda can perform the full dependency resolution process.
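Concretely, prefer the second pattern here over the first:

# harder for the solver: constraints arrive one at a time
conda install -c conda-forge pandas
conda install -c conda-forge dask
conda install -c conda-forge pyarrow

# easier for the solver: all constraints are resolved in a single pass
conda env create -f envs/standard-coiled.yml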
Apple M1 Chip gotcha
If you are using an Apple computer with an M1 chip, you may want to try mambaforge instead of Miniconda. See this blog post on using conda with Mac M1 machines for more details.
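If you’d rather keep your existing Miniconda install, you can also try mamba, the faster drop-in replacement for conda that mambaforge is built around:

# install mamba from conda-forge into the base environment
conda install -c conda-forge mamba
# mamba mirrors the conda CLI, so commands translate directly
mamba env create -f envs/standard-coiled.yml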
Type in conda info and look at the results.
active environment : base
active env location : /Users/powers/opt/miniconda3
shell level : 1
user config file : /Users/powers/.condarc
populated config files :
conda version : 4.10.3
conda-build version : not installed
python version : 3.9.5.final.0
virtual packages : __osx=10.16=0
__unix=0=0
__archspec=1=x86_64
base environment : /Users/powers/opt/miniconda3 (writable)
conda av data dir : /Users/powers/opt/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/osx-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/osx-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /Users/powers/opt/miniconda3/pkgs
/Users/powers/.conda/pkgs
envs directories : /Users/powers/opt/miniconda3/envs
/Users/powers/.conda/envs
platform : osx-64
user-agent : conda/4.10.3 requests/2.25.1 CPython/3.9.5 Darwin/20.3.0 OSX/10.16
UID:GID : 501:20
netrc file : None
offline mode : False
The platform is osx-64, which is not optimized for Mac M1 chips. The libraries downloaded with this setup are run through an emulator, which can cause a performance drag.
Conclusion
Conda is a great package manager for Python data science because it can handle dependencies that are difficult to install, like xgboost and CUDA.
Data science libraries like xgboost contain a lot of C++ code, which is why they’re hard for package managers to handle properly.
Python has other package managers, like Poetry, that work fine for simple Python builds that rely on pure Python libraries. Data scientists don’t have the luxury of working with pure Python libraries, so Poetry isn’t a good option for data scientists. Conda is the best option for Python data workflows.
This post taught you how to install conda, run basic commands, and manage multiple software environments. Keep practicing until you’ve memorized the commands in this post and conda comes naturally to you. Installing software is a common data science pain point, and it’s worth investing time in the basics so you’re able to debug complex installation issues.
Now is a good time to read the post on the Dask JupyterLab workflow that’ll teach you a great conda-powered Dask development workflow.