Changing the job queue from default to urgent

I was wondering which files of the workflow I should update to change the job queue from default to urgent. Thanks

You can set that in config.sh (e.g., QUEUE_FCST="urgent"), or change the variable's machine-dependent default in the file

ufs-srweather-app/regional_workflow/ush/setup.sh
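
For example (assuming your platform actually has a QOS named urgent), adding a line like this to config.sh before generating the experiment would send the forecast job to that queue:

    # Hypothetical override in config.sh: send the run_fcst task to the urgent QOS
    QUEUE_FCST="urgent"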


Hi Johana,

The workflow has the following three variables that you can set in your experiment configuration file (config.sh):

QUEUE_DEFAULT

QUEUE_HPSS

QUEUE_FCST

QUEUE_DEFAULT applies to the make_grid, make_orog, make_sfc_climo, make_ics, make_lbcs, and the various run_post_f... tasks; QUEUE_HPSS applies to the get_extrn_ics and get_extrn_lbcs tasks; and QUEUE_FCST applies only to the run_fcst task.  The default values of these variables depend on the machine you're on.  For example, on Hera (which uses the slurm job scheduler), they are set as follows:

    QUEUE_DEFAULT=${QUEUE_DEFAULT:-"batch"}
    QUEUE_HPSS=${QUEUE_HPSS:-"batch"}
    QUEUE_FCST=${QUEUE_FCST:-"batch"}

So you can see that if you haven't specified these in your config.sh file, all three will default to "batch".  Related to these are the slurm partitions in which the jobs are run.  For Hera, these are set as follows:

    PARTITION_DEFAULT=${PARTITION_DEFAULT:-"hera"}
    PARTITION_HPSS=${PARTITION_HPSS:-"service"}
    PARTITION_FCST=${PARTITION_FCST:-"hera"}

So by default, all the rocoto tasks except get_extrn_ics and get_extrn_lbcs run in the partition named hera, while the latter two run in the service partition.  The PARTITION_... variables are also configurable in the config.sh file.  The QUEUE_... variables specify the QOSs under these respective partitions.  On Hera, a list of the QOSs allowed under each partition is given here.
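
If you want to check which QOSs exist and which ones your account can use on a Slurm machine like Hera, a couple of standard Slurm commands along these lines should work (the available format fields can vary with the Slurm version):

    # List the QOSs defined on the system with their priorities and wall clock limits
    sacctmgr show qos format=Name,Priority,MaxWall

    # Show which QOSs your user/account associations allow
    sacctmgr show assoc user=$USER format=Account,Partition,QOS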

Note that if you change one or more of the QUEUE_... variables, you may also have to change the wallclock, maxtries, and other task resources in config.sh using the appropriate workflow variables, e.g. for the make_ics task:

NNODES_MAKE_ICS
PPN_MAKE_ICS
WTIME_MAKE_ICS
MAXTRIES_MAKE_ICS

This may have to be done, for example, if you change QUEUE_... from the batch to the debug QOS and the default wall clock time for the task is greater than 30 minutes (since the maximum wall clock limit in the debug QOS is 30 minutes).
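
As a rough sketch (the values below are just illustrations, not recommended settings), switching to the debug QOS and adjusting the make_ics resources in config.sh might look like:

    # Illustrative config.sh overrides, assuming a Slurm QOS named "debug" with a 30-minute limit
    QUEUE_DEFAULT="debug"
    WTIME_MAKE_ICS="00:30:00"   # keep the requested wall clock within the debug limit
    MAXTRIES_MAKE_ICS="2"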

If you've already created your experiment (so you already have the FV3LAM_wflow.xml rocoto workflow file) and you find that a task/job is sitting in the queue too long, you can cancel it (on Hera, use scancel), change QUEUE_DEFAULT, QUEUE_FCST, or QUEUE_HPSS (whichever applies to that task) in FV3LAM_wflow.xml to "debug" (I haven't tried "urgent"), and relaunch the workflow, either using the launch_FV3LAM_wflow.sh script in your experiment directory or directly using a rocotorun or rocotoboot command.  As above, you may also have to change the wall clock time, maxtries, and other task resources in FV3LAM_wflow.xml.  Note that canceling a job counts as one (failed) try, so if maxtries was set to 1 in FV3LAM_wflow.xml, you'll need to increase it; otherwise, the job scheduler will reject the job.
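
For reference, the cancel-and-relaunch sequence on Hera might look roughly like this (the job ID is a placeholder, and rocotorun needs the workflow's .xml and .db files from your experiment directory):

    # Cancel the job that is stuck in the queue (this counts as one failed try)
    scancel <jobid>

    # After editing the QUEUE_..., wall clock, and maxtries entries in FV3LAM_wflow.xml,
    # relaunch from the experiment directory...
    ./launch_FV3LAM_wflow.sh

    # ...or call rocoto directly
    rocotorun -w FV3LAM_wflow.xml -d FV3LAM_wflow.db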