Hard-won Research Workflow Tips
📌 Number Seeds Starting at 1,000
Ensures that the alphabetical ordering of seeds' string representations matches their numerical ordering. Tip credit: @voidptr.
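For instance, in Python:

```python
# Seeds below 1,000 sort differently as strings than as numbers...
print(sorted(str(s) for s in (1, 2, 10)))          # ['1', '10', '2']
# ...but four-digit seeds keep string and numeric order in agreement
# (at least until you pass 9,999).
print(sorted(str(s) for s in (1001, 1002, 1010)))  # ['1001', '1002', '1010']
```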
📌 Have low-numbered seeds run quick, pro-forma versions of your workflow
Good for debugging. Your workflow won't work the first or second time. The first time you run your workflow is NEVER to generate actual data. It is to debug the workflow.
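A minimal sketch, assuming seeds start at 1,000 as above; the cutoff and config keys are illustrative, not prescriptive:

```python
SEED_FLOOR = 1000
N_DEBUG_SEEDS = 10  # seeds 1000-1009 are quick, pro-forma runs

def make_config(seed: int) -> dict:
    # full-scale settings (placeholder values)
    config = {"seed": seed, "n_updates": 1_000_000, "population_size": 10_000}
    if seed < SEED_FLOOR + N_DEBUG_SEEDS:
        # pro-forma run: exercises the whole workflow with a tiny workload
        config["n_updates"] = 100
        config["population_size"] = 10
    return config
```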
Spot check all data files manually.
📌 Divide seeds into completely-independent "endeavors"
No data collation between endeavors. Allows completely independent instantiations of the same workflow.
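One possible convention (purely illustrative): give each endeavor its own block of 1,000 seeds so replicate seeds never collide across endeavors.

```python
def endeavor_of(seed: int) -> int:
    # seeds 1000-1999 belong to endeavor 1, 2000-2999 to endeavor 2, etc.
    return seed // 1000

assert endeavor_of(1042) == 1
assert endeavor_of(2042) == 2  # same workflow, completely separate instantiation
```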
📌 Do you have all the data columns you'll want to join on?
If you gzip, redundant or always-identical columns shouldn't take up much extra space.
📌 How will you manually re-start or intervene in the middle of your workflow when things go wrong?
📌 How will you weed out bad or corrupted data from good data?
Ideally, different runs will be completely isolated so this is not necessary.
📌 How will you play detective when you find something weird?
📌 Use JSON, YAML, etc.
Don't invent your own serialization format.
📌 Save the SLURM job ID and the hard pin with all data
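For example, SLURM exposes the job ID to the running job via the SLURM_JOB_ID environment variable (the SOURCE_PIN variable, column names, and placeholder data below are illustrative):

```python
import os
import pandas as pd

# placeholder results; in practice, whatever your simulation produces
df = pd.DataFrame({"replicate": [0, 1, 2], "fitness": [0.7, 0.9, 0.8]})

# stamp every row with provenance metadata
df["slurm_job_id"] = os.environ.get("SLURM_JOB_ID", "none")
df["source_pin"] = os.environ.get("SOURCE_PIN", "none")  # hypothetical env var for the hard pin

df.to_csv("results.csv.gz", index=False)
```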
📌 Save SLURM logs to one folder, named with the JOB ID
Consider using symlinks
📌 Also save a copy of SLURM scripts to a separate folder, named with the JOB ID
This makes it easy to debug, rerun, or make small manual changes and rerun.
📌 A 1% or 0.1% failure rate == major headache at scale
Request very generous extra time on jobs or, better yet, use time-aware experimental conditions (e.g., run for 3 hours instead of n updates).
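A sketch of a time-aware condition; the budget and the update stub are placeholders:

```python
import time

BUDGET_SECONDS = 3 * 60 * 60  # run for 3 hours, not for a fixed number of updates

def do_one_update() -> None:
    pass  # stand-in for one simulation update

start = time.monotonic()
n_updates = 0
while time.monotonic() - start < BUDGET_SECONDS:
    do_one_update()
    n_updates += 1

print(f"completed {n_updates} updates within the time budget")
```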
📌 Use Jinja to instantiate batches of job scripts
📌 Hard-pin your source as a Jinja variable
Subsequent jobs should inherit the parent's hard pin.
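A minimal sketch of both of the above with jinja2; the template contents, SBATCH options, and file naming are illustrative, and the hash is just an example pin:

```python
from jinja2 import Template

template = Template("""\
#!/bin/bash
#SBATCH --job-name=run-seed-{{ seed }}
#SBATCH --time=4:00:00

# hard-pinned source revision; pass it along so child jobs inherit it
export SOURCE_PIN={{ source_pin }}
./run_experiment --seed {{ seed }} --revision "${SOURCE_PIN}"
""")

source_pin = "bdac574424570fb7dc85ef01ba97464c6a8737cc"  # e.g., a specific commit hash
for seed in range(1000, 1100):
    with open(f"job_seed_{seed}.slurm", "w") as f:
        f.write(template.render(seed=seed, source_pin=source_pin))
```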
📌 Embed large numbers of jobs into a script as a text-variable representation of .tar.gz data
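One way to do this (a hedged sketch; the directory and output names are assumptions): tar up the generated job scripts in memory, base64-encode the archive, and write a single wrapper script that decodes and extracts it on the node.

```python
import base64
import io
import pathlib
import tarfile

# bundle a directory of generated job scripts into an in-memory .tar.gz
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    tar.add("generated_jobs/", arcname="jobs")  # hypothetical directory

# represent the archive as plain text so it can live inside a single script
payload = base64.b64encode(buffer.getvalue()).decode("ascii")

wrapper = f"""#!/bin/bash
# decode and unpack the embedded job scripts
echo "{payload}" | base64 --decode | tar -xz
"""
pathlib.Path("submit_all.sh").write_text(wrapper)
```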
📌 Re-clone and re-compile source for every job
Use automatic caching instead of manual caching
- ccache
- https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/gitget.sh
📌 Get push notifications of job failures
Automatically upload logs to transfer.sh so they can be viewed via an embedded link.
Running experiments equals pager duty.
📌 Get push notifications of your running job count
📌 Have every job check file quota on startup and send a push warning if the quota is nearing capacity
📌 Have all failed jobs try rerunning themselves once
📌 If it happens once (i.e., not in a loop), log it
📌 Log the full configuration and save an asconfigured.cfg file
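A minimal sketch, assuming a dict-shaped configuration; the keys and the JSON-on-disk format are illustrative:

```python
import json
import logging
import os

logging.basicConfig(level=logging.INFO)

# in practice this comes from your argument parser or config framework
config = {"seed": 1042, "n_updates": 1_000_000, "treatment": "control"}

# log the full as-run configuration (and, while you're at it, the environment)
logging.info("configuration: %s", json.dumps(config, sort_keys=True))
logging.info("environment: %s", json.dumps(dict(os.environ), sort_keys=True))

# also write it next to the data so the exact settings travel with the results
with open("asconfigured.cfg", "w") as f:
    json.dump(config, f, indent=2, sort_keys=True)
```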
📌 Log ALL Bash variables
📌 Log RNG state
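For Python's built-in generator, for example, the state is a picklable tuple:

```python
import logging
import pickle
import random

logging.basicConfig(level=logging.INFO)

seed = 1042
random.seed(seed)

# log the seed and the full generator state so the run can be reproduced exactly
logging.info("rng seed: %s", seed)
logging.info("rng state: %s", random.getstate())

# the state is also picklable, so it can be checkpointed alongside the data
with open("rng_state.pkl", "wb") as f:
    pickle.dump(random.getstate(), f)
```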
📌 Timestamp log entries
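With Python's logging module, for instance, this is a one-line format change:

```python
import logging

# prepend a timestamp to every log record
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
logging.info("simulation started")  # e.g., "2021-01-01 12:00:00,000 INFO simulation started"
```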
📌 Put everything that might fail (e.g., downloads) in an automatic retry loop
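A sketch of such a loop; the attempt count and backoff are illustrative:

```python
import time
import urllib.request

def download_with_retry(url: str, dest: str, max_attempts: int = 5) -> None:
    # retry transient failures with exponential backoff instead of dying outright
    for attempt in range(1, max_attempts + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except OSError as error:
            if attempt == max_attempts:
                raise
            print(f"attempt {attempt} failed ({error}); retrying...")
            time.sleep(2 ** attempt)
```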
📌 Log digests and row counts of all data read in from file
Both in notebooks and in simulation code.
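For example (file name illustrative):

```python
import hashlib
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

path = "results.csv.gz"  # illustrative
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

df = pd.read_csv(path)
logging.info("read %s: %d rows, sha256 %s", path, len(df), digest)
```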
📌 gzip log files and save them, named by SLURM job ID
Your data includes the SLURM job ID, so it is easy to find the relevant log when questions come up.
📌 You can open URLs directly inside pandas.read_csv()
e.g., OSF URLs
📌 You can open .csv.gz and .csv.xz files directly inside pandas.read_csv()
Consider using a library to save to a gzipped filehandle directly.
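For example (the URL and file names are placeholders):

```python
import pandas as pd

# read directly from a URL (e.g., an OSF download link)...
df_remote = pd.read_csv("https://osf.io/xxxxx/download")  # hypothetical URL

# ...or directly from a compressed file; compression is inferred from the name
df_local = pd.read_csv("results.csv.gz")

# writing with compression works the same way
df_local.to_csv("collated.csv.gz", index=False)
```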
📌 Notebooks
- commit bare notebooks
  - consider adding a git commit hook or CI script to enforce this
- run notebooks inside GitHub Actions
  - consistent, known environment
  - enforces EVERY asset MUST be committed or downloaded from URL
  - keeps generated cruft off of main branch history
- use `teeplot` to generate and save copies of all plots with descriptive, systematic filenames
- GitHub Actions pushes new executed notebooks and generated assets to a `binder` or `notebooks` branch
  - be sure to push so that history on that branch is preserved
- include graphics in LaTeX source as a git submodule
  - get new versions of graphics with `git pull`
  - be sure to use `--recursive --depth=1 --jobs=n` while cloning
📌 Overleaf doesn't like git submodules
AWS Cloud9 is a passable alternative, but it's hard to get collaborator buy-in and annoying to manage logins and permissions.
📌 Use symlinks to access your in-repository Python library
📌 Create supplementary materials as a LaTeX appendix to your paper
Makes moving content in/out of supplement trivial.
You can use `pdftk dump_data_utf8` to automatically split the supplement off into a separate document as the last step of the build.
📌 Everything MUST print a stack trace WHEN it crashes
https://github.com/bombela/backward-cpp
Consider adding similar features to your job script with a bash error trap https://unix.stackexchange.com/a/504829
📌 Be sure to log the node name on crash
Some nodes are bad. You can exclude them so that your jobs don't run on them.
📌 Use `bump2version` to release software
https://github.com/c4urself/bump2version
📌 Use `pip-tools` to ensure a consistent Python environment
https://pypi.org/project/pip-tools/
📌 Don't upload to the OSF directly from HPCC jobs
Their service isn't robust enough and you'll cause them issues. AWS S3 is an easy alternative that integrates well.
📌 Make it clear what can be deleted, or just delete it immediately
Otherwise you will end up with a bunch of stuff and not remember what's important.
Put things in `/tmp` so they're deleted automatically.
📌 OnDemand
https://ondemand.hpcc.msu.edu/
Great for browsing and downloading files. Also for viewing the queue and canceling individual jobs.