SRW forecasting model performance on Jet


Has anyone tested the performance of the SRW model on Jet? I ran a test and found that the 4-thread configuration uses more resources and is much slower than the 1-thread configuration (see attachment; left is 4 threads, right is 1 thread; both runs use the vjet partition).

Furthermore, the 1-thread configuration finishes within about 1 hour 10 minutes, but the 4-thread configuration times out (3-hour wallclock limit) no matter whether vjet, sjet, or kjet is used.

Hi Yunheng,

I have been trying to get some simulations running on Jet for the past few days and have also been running into similar issues... Which physics suite are you using? I just did a quick test using the GFSv15p2 physics suite and get about 7-8 s of wallclock time per 18 s physics timestep (202 procs requested, vjet). I have been trying the RRFS_v1_alpha suite, and this is where I get something on the order of 30 s of wallclock time for an 18 s timestep... Previously, I used GSDv0 physics on Jet and got reasonable processing times (3-4 s wallclock per 18 s timestep).

This doesn't exactly solve your problem, but hopefully it gives another data point toward a solution.



In reply to dmwright

I am using the RRFS_v1_alpha suite.

Surprisingly, if I submit NEMS.exe directly with OMP_NUM_THREADS=1, it works well. If the job is submitted through the regional workflow, it gets stuck. I suspect that the slow performance on Jet is related to the wrapper scripts in the regional workflow.
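For anyone who wants to try the direct-submission route, a batch script along these lines is what I mean (partition, task count, and time limit here are illustrative placeholders, not the exact values from my run):

```shell
#!/bin/bash
# Sketch of a direct Slurm submission of NEMS.exe, bypassing the
# regional-workflow wrapper scripts. Adjust partition, ntasks, and
# account to your own allocation; these values are placeholders.
#SBATCH --partition=vjet
#SBATCH --ntasks=202
#SBATCH --time=03:00:00

export OMP_NUM_THREADS=1   # the single-threaded setting that ran well
srun ./NEMS.exe            # run from the directory containing NEMS.exe and its inputs
```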


There is a hard-coded setting of OMP_NUM_THREADS=4 for Jet in regional_workflow/scripts/ (line 127). I just did a test, and changing that to 1 speeds things up from ~30 s to ~6-7 s per timestep.
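The edit itself is a one-liner. The exact script name and line number vary by release, so the file below is a mock-up standing in for the real wrapper script; the sed command is the part that carries over:

```shell
# Mock-up of the wrapper script (the real one lives under
# regional_workflow/scripts/; name and contents here are illustrative).
cat > run_fcst_wrapper.sh <<'EOF'
# ... workflow boilerplate ...
export OMP_NUM_THREADS=4   # hard-coded Jet setting that slows things down
# ... launch NEMS.exe ...
EOF

# Keep a backup, then flip the hard-coded thread count from 4 to 1.
cp run_fcst_wrapper.sh run_fcst_wrapper.sh.bak
sed -i 's/OMP_NUM_THREADS=4/OMP_NUM_THREADS=1/' run_fcst_wrapper.sh

# Confirm the change took effect.
grep OMP_NUM_THREADS run_fcst_wrapper.sh
```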



Ah, good catch on the script setting of OMP threads. That takes my time per step from about 8 s down to about 2.3 s for the dx=25-km test. That's with 48 cores (12 cores per node, if that makes any difference), running on ujet.

-- Ted

Great catch, Yunheng. Running a test now on Jet: setting OMP_NUM_THREADS=1 and changing to mpiexec together cut the processing time to under 2 s of wallclock time per timestep using RRFS_v1_alpha! Making just one of those changes reduces it to around 7-8 s of wallclock time.
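For anyone reproducing this, the launcher change amounts to swapping the launch line; the flags below are a sketch (exact options depend on the MPI stack loaded on Jet), not the literal workflow-generated command:

```shell
# Before (illustrative): the workflow launches via srun.
#   srun ./NEMS.exe
# After: launch with mpiexec instead, keeping a single OpenMP thread.
export OMP_NUM_THREADS=1
mpiexec -np 202 ./NEMS.exe   # 202 ranks to match the earlier test; adjust to your layout
```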


Are there any concerns with running it this way?