SRW user-specific domain forecast failure

I'm a user on Stampede 2. After successfully running the tutorial 25-km CONUS simulation, I tried to run a test simulation using a model domain that is not pre-configured (following https://ufs-srweather-app.readthedocs.io/en/ufs-v1.0.0/LAMGrids.html). All the pre-processing steps (make_grid, make_orog, make_sfc, make_ics, make_lbcs) finished successfully, but the forecast was not able to integrate forward. The error message is:

FATAL from PE     2: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 2033281) from all PEs.

My gut feeling is that this might be associated with the number of processors, but I was not able to find any information regarding how to set the number of processors or the domain decomposition.

Thanks!

There is a script to calculate the recommended number of processors:

regional_workflow/ush/get_layout.sh

You may also need to increase the stack size (e.g., unlimited).
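
For example, something along these lines (just a sketch; use your own nx and ny, and pick a domains stack value larger than the number requested in the error message):

cd regional_workflow/ush
./get_layout.sh <nx> <ny>     # prints suggested layout_x, layout_y
ulimit -s unlimited           # shell stack size

and, if your input.nml has an &fms_nml block, you can raise the FMS domains stack there, e.g.:

&fms_nml
  domains_stack_size = 3000000
/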

Thanks,

Linlin


In reply to Linlin.Pan

Linlin, thanks for the script! But how can I set a specific layout_x and layout_y to run the forecast? For example, after some searching, I changed to "layout=16,12" in input.nml and "PE_MEMBER01:192" in model_configure (I also specified 192 cores when executing NEMS.exe), but that does not seem to be enough, and the following error message appears for many different PE numbers:

FATAL from PE     0: mpp_domains_define.inc: not all the pe_end are in the pelist

Thanks again!

Yunji

Hi, Yunji,

The script does not account for the CPUs used by quilting. If quilting is enabled, the number of write (quilting) tasks needs to be added to the total number of CPUs.
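
For example (illustrative numbers only, based on the layout Yunji mentioned; check the values in your own files): with layout = 16,12 in the &fv_core_nml block of input.nml, there are 16 x 12 = 192 compute tasks. If model_configure contains something like

quilting:                .true.
write_groups:            1
write_tasks_per_group:   2

then PE_MEMBER01 should be 192 + 1 x 2 = 194, and the job needs to request at least 194 MPI tasks.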

Thanks,

Linlin


In reply to Linlin.Pan

Hi Linlin,

I ran into this issue too. I am using a user-specific domain for my simulation (407 x 563 grid points at 3-km resolution).

I first used 10 nodes with 240 processors, and the Slurm log file shows (also attached):

FATAL from PE     1: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 2004931) from all PEs.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

FATAL from PE     2: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 2004931) from all PEs.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2

FATAL from PE     8: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 1998775) from all PEs.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8

FATAL from PE     6: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 1998775) from all PEs.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6

FATAL from PE     3: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 2004931) from all PEs.

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3

FATAL from PE     7: set_group_update: mpp_domains_stack overflow, call mpp_domains_set_stack_size( 1998775) from all PEs.
 

I increased to 30 nodes with 720 processors, but the same error occurred.

I have tried ulimit -m unlimited, but it didn't change anything.

When I ran "./get_layout.sh 406 563", I got:
nx= 406, ny= 563
suggested layout_x= 13, layout_y=14, total= 182
suggested nx= 416, ny= 574
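
If I understand the earlier reply about quilting correctly, I would then set layout = 13,14 in &fv_core_nml (13 x 14 = 182 compute tasks) and, assuming for example one write group with two write tasks, PE_MEMBER01 = 182 + 2 = 184 in model_configure, requesting at least 184 MPI tasks in the Slurm job. Is that the right way to apply the script's output?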

 

Can you help me with this? 

 

Regards,

Haochen
 
