Anyone have suggestions for a cron substitute on cheyenne?

Hi-

I spent the weekend trying to get a job to finish on cheyenne because I don't have cron access. I have tried various solutions for repeatedly calling the launch command, such as long command-line sequences with sleep calls, and bash and python scripts that sleep between launch calls, but I seem to get logged off cheyenne after a short time. Even commands I run in the background seem to get terminated. Then my SRW jobs time out due to wtime limits. I'm sure I'm missing something obvious here, but does anyone have any first-hand experience reliably maintaining the workflow on cheyenne without access to cron? I'm looking for best-practice suggestions so that I don't have to keep experimenting.

Thank you!

-Paddy McCarthy.

Hi Paddy,

I'd like to get more details about the issue. If you are looking for a way to create job dependencies, one suggestion is to use the dependency options of the sbatch or qsub commands.

1) sbatch runtime option:  --dependency=afterok:

2) qsub runtime option: -W depend=afterokarray: or -W depend=after:

On a different machine, I have used another Slurm option, "--begin=", to submit jobs at regular intervals. There may be a similar option for the PBS/Torque scheduler. Let me know if you want to follow up more. I just got my Cheyenne access a few weeks ago.
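To make the dependency options above concrete, here is a minimal sketch of a helper that builds the flag for a follow-on job. The function name, the job IDs, and the job scripts in the usage comments are hypothetical; only the `afterok` dependency type is covered.

```shell
#!/bin/bash
# Build the dependency flag for a follow-on job.
# scheduler: "slurm" or "pbs"; job_id: the ID printed when the first job was submitted.
dependency_flag() {
    local scheduler="$1" job_id="$2"
    case "$scheduler" in
        slurm) printf '%s\n' "--dependency=afterok:${job_id}" ;;
        pbs)   printf '%s\n' "-W depend=afterok:${job_id}" ;;
        *)     echo "unknown scheduler: $scheduler" >&2; return 1 ;;
    esac
}

# Hypothetical usage (first.sh and second.sh are placeholder job scripts):
#   first_id=$(sbatch --parsable first.sh)
#   sbatch $(dependency_flag slurm "$first_id") second.sh
#
#   first_id=$(qsub first.sh)
#   qsub $(dependency_flag pbs "$first_id") second.sh
```

With `afterok`, the second job should only become eligible to run once the first one has finished successfully.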

--Jong Kim

 

Hi Jong-

Thanks for the reply. For the Short-Range Weather App, the way to advance the workflow is to repeatedly run the script 'launch_FV3LAM_wflow.sh' in the experiment area. The recommendation is to set up a cron job to perform this task. Since cron is not available on cheyenne, I'm looking for alternatives. In this case I'm not submitting jobs to the queue directly.
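For reference, the kind of sleep-based loop described above might look like the sketch below. The function name, the interval, the run count, and the file name `repeat.sh` are all hypothetical. Whether a detached loop survives on a login node depends on site policy, so treat this as an illustration of the approach rather than a tested recipe for cheyenne.

```shell
#!/bin/bash
# Run a command a fixed number of times, pausing between runs.
# interval: seconds to sleep; count: number of runs; remaining args: the command.
repeat_command() {
    local interval="$1" count="$2" i
    shift 2
    for ((i = 1; i <= count; i++)); do
        # Report a failed run but keep looping.
        "$@" || echo "run $i exited with status $?" >&2
        if (( i < count )); then sleep "$interval"; fi
    done
}

# Hypothetical usage from the experiment directory, detached from the terminal
# (assumes this function is saved in a file named repeat.sh):
#   nohup bash -c 'source repeat.sh; repeat_command 180 480 ./launch_FV3LAM_wflow.sh' \
#       > launch_loop.log 2>&1 &
```

Bounding the number of runs keeps the loop from lingering after the experiment is done.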

Thanks, Jong-

My answer is going to expose my ignorance. The SRW launch script does use rocoto. And your suggestion makes me think about what's really going on. When I get kicked off cheyenne (and the launch script doesn't run for a while), I was thinking that the jobs (such as run_fcst) were not finishing. But that doesn't make sense. It's likely the run_fcst job completes and the rocoto database just isn't updating, and any subsequent jobs (such as the post jobs that are kicked off after each timestep of the forecast is complete) do not start. But when I try to resume the calls to the launch script at that point, things are out of whack and never get going again. The run_fcst job, for example, gets resubmitted and is stuck in the rocoto "SUBMITTING" state indefinitely. 

I don't really know enough about how everything works together to debug. That's why I was initially asking for a solution to reliably run the launch script without using cron. But I am definitely open to using rocoto directly if that would work.

Thanks again for your ideas, 

-Paddy.

I usually find the following four rocoto commands useful:

rocotorun -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10

rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10

rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c [CYCLE] -t [TASKID]

rocotocomplete -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10 -c [CYCLE] -t [TASKID]

Sometimes I check each task's log directly to see whether I have to rewind the task or mark it as complete.
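To find the tasks that might need a rewind, one option is to filter the status listing by state. The sketch below is a hypothetical helper. It assumes that each data row of the rocotostat output starts with a numeric cycle, followed by the task name, the job ID, and the state as whitespace-separated columns; check that against your own output before relying on it.

```shell
#!/bin/bash
# Print "cycle task" for every row whose STATE column equals the given state.
# Reads rocotostat-style text on stdin.
# Assumed column order: CYCLE TASK JOBID STATE ...
tasks_in_state() {
    local state="$1"
    # Only rows that begin with a numeric cycle are data rows; headers are skipped.
    awk -v want="$state" '$1 ~ /^[0-9]+$/ && $4 == want { print $1, $2 }'
}

# Hypothetical usage:
#   rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db | tasks_in_state DEAD
```

Each line it prints gives a cycle and task name to look up in the task logs.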

--Jong

Thank you. Super-helpful!

I think the core of my problem may be that the forecast is taking so long to run. I will post about that in a separate topic.

Hi Paddy, 

I use cron on Cheyenne. I wonder why that is not an option for you?

If you try "crontab -e" and include your command, e.g.:

*/03 * * * * cd /glade/p/path-to-/expt_dirs/CONUS_13km_GFSv15p2 && ./launch_FV3LAM_wflow.sh

does that work? 

I am testing a 13km and 25km run now on Cheyenne to do some timing tests. Mike is also feverishly working on trying to figure out why develop is having problems on Cheyenne. Hopefully more feedback soon.

Jamie

I'll check with support. I get this:

paddy@cheyenne3:~$ crontab -l
You (paddy) are not allowed to access to (crontab) because of pam configuration.
paddy@cheyenne3:~$ 

Pam doesn't like me. ;)

Use the correct crontab!

CISL kindly pointed me to the correct crontab to use on cheyenne:

/glade/u/apps/ch/opt/usr/bin/crontab

 

(I was using /usr/bin/crontab, which upsets pam)
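For anyone who prefers to add the entry without opening an editor, here is a minimal sketch that appends a line to the existing crontab using the binary named above. The function name and `EXPT_DIR` are hypothetical, and the sketch assumes this crontab accepts `-` to read the new table from standard input, as common cron implementations do.

```shell
#!/bin/bash
# Path to the crontab binary reported above for cheyenne.
CRONTAB=/glade/u/apps/ch/opt/usr/bin/crontab

# Read an existing crontab on stdin and print it with `entry` appended,
# unless an identical line is already present.
merge_cron_entry() {
    local entry="$1" existing
    existing=$(cat)
    if [ -n "$existing" ]; then printf '%s\n' "$existing"; fi
    # -F: fixed string, -x: whole-line match, so "*" is not treated as a pattern.
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Hypothetical usage (EXPT_DIR is a placeholder for your experiment directory):
#   entry="*/3 * * * * cd $EXPT_DIR && ./launch_FV3LAM_wflow.sh"
#   "$CRONTAB" -l 2>/dev/null | merge_cron_entry "$entry" | "$CRONTAB" -
```

Running it twice leaves a single copy of the entry, since an identical line is not appended again.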

Great! Glad that was easily solved. There is a known bug when running the launch script from cron on Cheyenne, and the fix is not yet included in develop. Please modify

ufs-srweather-app/regional_workflow/ush/launch_FV3LAM_wflow.sh

to use: 

#!/bin/bash -l

in the very first line. This will be fixed in develop going forward, but we want to do a bit more testing to make sure it is safe, so for now this needs to be handled manually.
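The manual edit above can also be scripted. Below is a minimal sketch with a hypothetical function name. The `-l` flag makes bash behave as a login shell, so it reads the profile files at startup; presumably that is what gives a cron-launched script the environment it needs, although the thread does not state the reason.

```shell
#!/bin/bash
# Make '#!/bin/bash -l' the first line of a script: replace an existing
# shebang, or prepend one if the file has none.
set_login_shebang() {
    local script="$1" tmp
    tmp=$(mktemp) || return 1
    if head -n 1 "$script" | grep -q '^#!'; then
        { printf '%s\n' '#!/bin/bash -l'; tail -n +2 "$script"; } > "$tmp"
    else
        { printf '%s\n' '#!/bin/bash -l'; cat "$script"; } > "$tmp"
    fi
    cat "$tmp" > "$script"   # overwrite in place so the mode bits are preserved
    rm -f "$tmp"
}

# Hypothetical usage:
#   set_login_shebang ufs-srweather-app/regional_workflow/ush/launch_FV3LAM_wflow.sh
```

Check the result with `head -n 1` on the script before relying on it.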

Thanks!