πŸ”— Number Seeds Starting at 1,000

Ensures that alphabetical ordering of string representations matches numerical ordering. (Tip credit: @voidptr.)
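A quick demonstration of the failure mode this avoids:

```bash
# naive numbering: lexicographic order disagrees with numeric order
printf '%s\n' 2 10 | sort    # prints 10 before 2
# seeds starting at 1,000 share a fixed width (up to 9,999), so orders agree
printf '%s\n' 1002 1010 | sort    # prints 1002, then 1010
```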

πŸ”— Have low-numbered seeds run quick, pro-forma versions of your workflow

Good for debugging. Your workflow won’t work the first or second time. The first time you run your workflow is NEVER to generate actual data. It is to debug the workflow.
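A minimal sketch of the idea, with hypothetical variable names and cutoff:

```bash
# reserve the lowest-numbered seeds for quick, pro-forma smoke tests
if (( SEED < 1010 )); then
  NUM_UPDATES=10       # just exercise the workflow end to end
else
  NUM_UPDATES=1000000  # full-scale run that generates actual data
fi
```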

Spot check all data files manually.

πŸ”— Divide seeds into completely-independent β€œendeavors”

No data is collated between endeavors, which allows completely independent instantiations of the same workflow.

πŸ”— Do you have all the data columns you’ll want to join on?

If you gzip your data, redundant or always-identical columns shouldn’t take up much extra space.

πŸ”— How will you manually re-start or intervene in the middle of your workflow when things go wrong?

πŸ”— How will you weed out bad or corrupted data from good data?

Ideally, different runs will be completely isolated so this is not necessary

πŸ”— How will you play detective when you find something weird?

πŸ”— Use JSON, YAML, etc.

Don’t invent your own serialization format.

πŸ”— Save the SLURM job ID and source hard pin with all data
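A minimal sketch, assuming a hypothetical REVISION value injected at template time (see the Jinja tips below):

```bash
# stamp every run's output with the job ID and exact source revision
{
  echo "slurm_job_id=${SLURM_JOB_ID}"
  echo "source_hard_pin=${REVISION}"
} >> asconfigured.cfg
```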

πŸ”— Save SLURM logs to one folder, named with the JOB ID

Consider using symlinks
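SLURM’s %j filename pattern gives you this directly; for example, at the top of a job script:

```bash
#SBATCH --output=logs/id=%j.out    # %j expands to the SLURM job ID
#SBATCH --error=logs/id=%j.err     # note: the logs/ directory must already exist
```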

πŸ”— Also save a copy of SLURM scripts to a separate folder, named with the JOB ID

This makes it easy to debug, rerun, or make small manual changes before rerunning.
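A minimal sketch; inside a batch job, $0 points at SLURM’s spooled copy of the submitted script:

```bash
# archive the exact script that launched this job, keyed by job ID
mkdir -p scripts
cp "$0" "scripts/id=${SLURM_JOB_ID}.slurm.sh"
```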

πŸ”— A 1% or 0.1% failure rate == major headache at scale

Across 10,000 jobs, even a 0.1% failure rate means roughly ten failed jobs to chase down. Request very generous extra time on jobs or, better yet, use time-aware experimental conditions (e.g., run for 3 hours instead of for n updates).

πŸ”— Use Jinja to instantiate batches of job scripts
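A sketch using the jinja2-cli package (an assumption; any Jinja renderer, including a short Python script, works):

```bash
# render one job script per seed from a shared template (hypothetical filenames)
pip install jinja2-cli
mkdir -p jobs
for seed in $(seq 1000 1019); do
  jinja2 job.slurm.sh.jinja -D seed="${seed}" > "jobs/seed=${seed}.slurm.sh"
done
```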

πŸ”— Hard-pin your source as a Jinja variable

Subsequent jobs should inherit the parent’s hard pin.
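For example, pinning the current commit at template time (hypothetical variable and template names):

```bash
# bake the exact source revision into every generated job script
REVISION="$(git rev-parse HEAD)"
jinja2 job.slurm.sh.jinja -D revision="${REVISION}" > "jobs/job.slurm.sh"
# inside the job: git clone "${REPO_URL}" source && git -C source checkout "{{ revision }}"
```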

πŸ”— Embed large numbers of jobs into a script as a text variable representation of .tar.gz data

https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker_kickoff.sh

https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/slurm_stoker.slurm.sh.jinja
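A hedged sketch of the general technique; the linked scripts are the production version:

```bash
# pack a directory of generated job scripts into a single text variable...
PAYLOAD="$(tar -czf - jobs/ | base64 -w 0)"
# ...template it into a kickoff script, which unpacks and submits at run time
echo "${PAYLOAD}" | base64 --decode | tar -xzf -
for script in jobs/*.slurm.sh; do sbatch "${script}"; done
```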

πŸ”— Re-clone and re-compile source for every job

Use automatic caching (e.g., ccache for compilation) instead of manual caching

πŸ”— Get push notifications of job failures

Automatically upload logs to transfer.sh so they can be viewed via an embedded link

Running experiments equals pager duty.
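A sketch using ntfy.sh as the push provider (an assumption; any push service with an HTTP API works):

```bash
# push a notification when anything in the job script fails
notify () { curl -s -d "$*" "https://ntfy.sh/my-experiment-topic" > /dev/null; }
trap 'notify "job ${SLURM_JOB_ID} failed on $(hostname)"' ERR
```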

πŸ”— Get push notifications of your running job count
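Reusing the hypothetical notify helper above:

```bash
# report how many of your jobs are currently queued or running
notify "$(squeue -u "${USER}" -h | wc -l) jobs in queue"
```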

πŸ”— Have every job check file quota on startup and send push warning if quota is nearing capacity

https://github.com/mmore500/dishtiny/blob/bdac574424570fb7dc85ef01ba97464c6a8737cc/script/check_quota.sh

πŸ”— Have all failed jobs try rerunning themselves once
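A minimal sketch; SLURM sets SLURM_RESTART_COUNT on requeued jobs, which prevents an endless retry loop:

```bash
# on failure, requeue this job exactly once (job must be requeue-able,
# e.g., submitted with sbatch --requeue)
on_error () {
  if (( ${SLURM_RESTART_COUNT:-0} < 1 )); then
    scontrol requeue "${SLURM_JOB_ID}"
  fi
}
trap on_error ERR
```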

πŸ”— If it happens once (i.e., not in a loop), log it

πŸ”— Log full configuration and save an asconfigured.cfg file

πŸ”— Log ALL Bash variables
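In Bash, declare -p dumps every variable in one shot (the log filename here is a hypothetical convention):

```bash
# snapshot every Bash variable into the job log
declare -p >> "id=${SLURM_JOB_ID}.log"
```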

πŸ”— Log RNG state

πŸ”— Timestamp log entries
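A minimal pattern: route all output through a timestamping helper.

```bash
log () { echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $*"; }
log "beginning run"    # -> 2021-01-01T12:00:00Z beginning run
```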

πŸ”— Put everything that might fail (i.e., downloads) in an automatic retry loop
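For example, with a hypothetical download URL:

```bash
# retry a flaky download up to five times, with linear backoff
for attempt in {1..5}; do
  curl -fL "https://example.com/data.csv.gz" -o data.csv.gz && break
  echo "download attempt ${attempt} failed; retrying" >&2
  sleep $(( attempt * 10 ))
done
```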

πŸ”— Log digests and row count of all data read in from file

in notebooks and in simulation
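From the shell side, a minimal sketch (the same idea applies inside notebooks and simulation code):

```bash
# record a digest and row count for every input file before using it
md5sum data.csv | tee -a "id=${SLURM_JOB_ID}.log"
wc -l data.csv | tee -a "id=${SLURM_JOB_ID}.log"
```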

πŸ”— gzip log files and save them, named by SLURM job ID

Your data records the SLURM job ID, so it is easy to find the relevant log when questions come up

πŸ”— You can open URLs directly inside pandas.read_csv()

e.g., OSF URLs

πŸ”— You can open .csv.gz, .csv.xz directly inside pandas.read_csv()

Consider using a library to write to a gzip filehandle directly.

πŸ”— Notebooks

  • commit bare notebooks (i.e., with outputs cleared)
    • consider adding a git commit hook or CI script to enforce this
  • run notebooks inside GitHub Actions
    • consistent, known environment
    • enforces that EVERY asset MUST be committed or downloaded from a URL
    • keeps generated cruft off of main branch history
    • use teeplot to generate and save copies of all plots with descriptive, systematic filenames
    • GitHub Actions pushes newly executed notebooks and generated assets to a binder or notebooks branch
    • be sure to push so that history on that branch is preserved
  • include graphics in LaTeX source as a git submodule
    • get new versions of graphics via git pull
    • be sure to use --recursive --depth=1 --jobs=n when cloning (see the sketch below)
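For example, with a hypothetical repository URL (--depth=1 also makes the submodule clones shallow):

```bash
# shallow, parallel clone so graphics submodules don't dominate clone time
git clone --recursive --depth=1 --jobs=4 \
  https://github.com/example-user/example-paper.git
```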

πŸ”— Overleaf doesn’t like git submodules

AWS Cloud9 is a passable alternative, but it’s hard to get collaborator buy-in, and managing logins and permissions is annoying

πŸ”— Create supplementary materials as a LaTeX appendix to your paper

Makes moving content in/out of the supplement trivial. You can use pdftk dump_data_utf8 (to locate where the supplement begins) together with pdftk cat to automatically split the supplement off into a separate document as the last step of the build.
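A hedged sketch, assuming the supplement begins at a bookmark titled β€œSupplemental Material”:

```bash
# find the supplement's first page from the PDF bookmarks, then split
START="$(
  pdftk paper.pdf dump_data_utf8 \
    | grep -A2 'BookmarkTitle: Supplemental Material' \
    | awk '/BookmarkPageNumber/ { print $2 }'
)"
pdftk paper.pdf cat 1-$(( START - 1 )) output main.pdf
pdftk paper.pdf cat "${START}-end" output supplement.pdf
```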

πŸ”— Everything MUST print a stack trace WHEN it crashes

https://github.com/bombela/backward-cpp

Consider adding similar features to your job script with a bash error trap https://unix.stackexchange.com/a/504829
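A minimal version of the linked approach:

```bash
set -eE    # -E so the ERR trap also fires inside functions and subshells
trap 'echo "FAILED: ${BASH_COMMAND} at ${BASH_SOURCE[0]}:${LINENO}" >&2' ERR
```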

πŸ”— Be sure to log node name on crash

Some nodes are bad. You can exclude them so that your jobs don’t run on them.
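Logging the node is one line, and sbatch’s --exclude flag keeps future jobs off a blacklist (hypothetical node names):

```bash
# inside the job: know where you died
trap 'echo "crashed on node $(hostname)" >&2' ERR
# at submission time: steer clear of known-bad nodes
sbatch --exclude=node-042,node-117 job.slurm.sh
```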

πŸ”— Use bump2version to release software

https://github.com/c4urself/bump2version
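Typical usage:

```bash
pip install bump2version
bump2version patch    # e.g., 1.2.3 -> 1.2.4; can also commit and tag, per .bumpversion.cfg
```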

πŸ”— Use pip-tools to ensure a consistent Python environment

https://pypi.org/project/pip-tools/
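Typical usage: declare loose requirements in requirements.in, then pin and sync.

```bash
pip install pip-tools
pip-compile requirements.in    # pin every dependency to an exact version
pip-sync requirements.txt      # make the current environment match exactly
```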

πŸ”— Don’t upload to the OSF directly from HPCC jobs

Their service isn’t robust enough and you’ll cause them issues. AWS S3 is an easy alternative that integrates well.
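For example, with the AWS CLI and a hypothetical bucket:

```bash
# upload results from inside the job; batch-migrate from S3 to OSF later
aws s3 cp "id=${SLURM_JOB_ID}.tar.gz" s3://my-experiment-bucket/raw/
```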

πŸ”— Make it clear what can be deleted or just delete immediately

Otherwise you will end up with a pile of files and won’t remember what’s important

Put things in /tmp so they’re deleted automatically.

πŸ”— Open OnDemand

https://ondemand.hpcc.msu.edu/

Great for browsing and downloading files, as well as for viewing the queue and canceling individual jobs.