Module: github.com/eternal-flame-AD/batch-vs-runner
Version: 0.0.0-20201007215938-5c095b95967f
Repository: https://github.com/eternal-flame-ad/batch-vs-runner.git

# README

batch-vs-runner

Description

CLI tool for running virtual screening (or other batch processing on chemical structures) on multiple processors by creating batches and running them in parallel.

Usage of batch-vs-runner: batch-vs-runner [FLAGS] [SD|PDB|PDBQT|MOL2|DIRECTORY]...
  -batchEnd int
        end at Nth molecule, 0 means all molecules
  -batchSize int
        batch size (default 100)
  -batchStart int
        start from Nth molecule (cumulative across all input files) (default 1)
  -delay int
        delay a certain amount of time (in ms) between spawning the next process, useful for programs that periodically do heavy IO
  -enableSlurm
        detect slurm allocations based on environment variable and use srun to run jobs (default true)
  -exec string
        command to execute in worker (default "./job.sh")
  -lineBreak string
        linebreak for output structure: unix, dos, or mac (default "unix")
  -np int
        no. of worker processes (does not apply if slurm mode is in use) (default 1)
  -prefix string
        prefix on individual job work directory (default "job")
  -slurmNodeTaskOverride string
        override how many tasks to distribute to each node from the env received from slurm
  -verbose
        pass through worker script output to terminal
  -workspace string
        path to job setup files (can be a directory or single file) (default ".")
  -workspaceOnly
        generate workspace only but do not execute any job, you can use anything to execute the job once the workspace has been compiled

Get Started

  1. Create a folder as the "template" for each batch's workspace. At runtime, the program automatically generates a workspace for each batch from this template. Put any files you want copied into every workspace (configuration files, batch scripts, etc.) here. Additionally, the batch of molecules for each job is generated automatically by the program, named "job.sd", "job.sdf", "job.mol2", etc., depending on the input file extension. See the Execution Environment section for details on how to write the template workspace.

  2. Execute "batch-vs-runner" with the corresponding flags; the format is Go-standard-library style, -key=value. Examples:

    • -workspace=path/to/my_workspace Workspace template is at path path/to/my_workspace
    • -np=20 20 parallel processes
    • -verbose=true pass through worker script output to terminal
    • -delay=1000 delay 1000ms before starting the next process during initialization.
    • -batchSize=10 override batch size to 10 molecules
    • -batchEnd=100 end at the 100th molecule (cumulative across all input files specified)

Full examples:

  • ./batch-vs-runner -np=30 -workspace=my_dock_job_template -batchSize=50 my_library.sdf Split my_library.sdf into 50-molecule batches and generate a workspace just like my_dock_job_template folder for each batch. Run job.sh in each batch with 30 parallel processes.
  • ./batch-vs-runner -workspace=my_dock_job_template -batchEnd=100 -batchSize=100 -workspaceOnly=true my_library.sdf Split my_library.sdf into 100-molecule batches, ending at the 100th molecule, and generate a workspace just like the my_dock_job_template folder for each batch. Only generate the workspaces; do not execute job.sh. You can then cd into a work directory and do whatever you want. Mainly used for testing and debugging.
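The steps above can be sketched as a short shell session. This is a minimal setup, not a prescribed layout: the folder name my_dock_job_template is illustrative, and only job.sh carries meaning (it is the default -exec target).

```shell
# Build a minimal template workspace containing a job.sh.
mkdir -p my_dock_job_template
cat > my_dock_job_template/job.sh <<'EOF'
#!/usr/bin/env bash
# Runs once per batch, inside the generated batch directory, where
# batch-vs-runner has already written job.sdf for this batch.
echo "processing batch in $(pwd)"
EOF
chmod +x my_dock_job_template/job.sh
# Then, assuming the binary is built in the current directory:
#   ./batch-vs-runner -np=4 -workspace=my_dock_job_template -batchSize=50 my_library.sdf
```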

Execution Environment

Template folder

  • Files preserve their path relative to the template folder when they are compiled, so workspace/some_dir/file.txt will be copied to job_*_*/some_dir/file.txt upon execution. File modes are copied as well; as an exception, common executable file types such as .sh, .bash, and .run are automatically granted executable permission when compiled into the workspace.
  • Files with the .tpl extension are processed through the Go text/template system and executed with the template context (.) filled with the batch definition for each batch job. See example/gold/example.txt.tpl as an example.
  • A job.<ext> file will automatically be generated containing the molecules belonging to the batch. <ext> is mol2 sd sdf pdb pdbqt depending on input molecule format.
  • I recommend not leaving empty folders in the template directory. If you want an empty folder explicitly, create it with mkdir in job.sh, or add a placeholder file: touch template/empty_dir/.keep
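The template-folder rules above can be illustrated with a small layout sketch (all file and folder names here are illustrative, not required by the tool):

```shell
# Relative paths are preserved when the workspace is compiled:
#   template/some_dir/file.txt -> job_*_*/some_dir/file.txt
mkdir -p template/some_dir template/empty_dir
echo 'example config' > template/some_dir/file.txt
touch template/empty_dir/.keep   # placeholder so the "empty" folder survives
printf '#!/usr/bin/env bash\necho batch done\n' > template/job.sh
chmod +x template/job.sh         # optional: .sh files get +x automatically anyway
```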

HPC environment with slurm

This program can automatically parse the environment variables set by slurm and distribute jobs across the allocated nodes. (Files are not transferred automatically for now, so it must be run on shared storage.) No extra configuration is needed.

To explicitly disable this behavior (i.e., do not use srun; run all job shell files on the master node), use -enableSlurm=false.

Use the -slurmNodeTaskOverride flag to override how many tasks to distribute to each node. The format is a comma-separated list of numbers, or a number followed by (xN), where N denotes the same configuration repeated for N nodes.

job.sh file

The default command executed for each batch job is bash -c ./job.sh. Thus, just add a script called job.sh to the template folder and it will be run automatically at runtime. Call your docking software in job.sh and have it dock job.sdf, job.mol2, etc., depending on your input molecule type.

NOTE: The working directory for each batch script is the batch folder, so if you have a software.conf or software.conf.tpl in your job template, the correct way to refer to that file in the job script is simply software.conf or ./software.conf. If you want to override this behavior, use cd in your job.sh
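A standalone sketch of what a job.sh body might look like. In real use, batch-vs-runner writes job.sdf into the batch directory; here we fabricate a tiny two-molecule file so the sketch runs on its own, and the docking command (my_docking_tool) is a placeholder for your actual software.

```shell
# Fabricated input, only so this sketch is runnable by itself;
# normally job.sdf already exists in the batch directory.
printf 'mol1\n$$$$\nmol2\n$$$$\n' > job.sdf
count=$(grep -c '^\$\$\$\$' job.sdf)   # SDF records are terminated by $$$$
echo "docking $count molecules"
# my_docking_tool --ligands ./job.sdf --conf ./software.conf > docking.log
```

Note that job.sdf and software.conf are referenced with plain relative paths, since the script runs inside the batch folder.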

Examples

See examples/ folder for some example workspace templates.