3-km resolution in the cloud

I am in the process of creating documentation for running a containerized version of the SRW GST in the cloud. However, as in my previous post about Orion, I am having trouble running the 3-km experiment. I have been able to run the 25-km control run in the cloud and have gotten through the get_ics, get_lbcs, make_grid, make_sfc_climo, and make_orog tasks of the 3-km run, but make_ics is failing with a chgres_cube error. I have been unable to figure out the issue from the logs. The problem could be similar to the one I'm having on Orion, or very different.

Attached are the end of the make_ics.log file and a PDF documenting the steps I have taken so far.

Thanks in advance for your help!

Hi Sam,

Can you please check the output from the largest of your PET* files, which can be found in the INPUT/tmp_ICS directory? Since this failure is related to ESMF, that file will contain information about it.
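In case it helps, one way to find and inspect that file (assuming the PET* logs sit under INPUT/tmp_ICS in your run directory, as in a standard experiment layout):

# List the ESMF PET* logs largest-first, then show the tail of the biggest one
ls -lS INPUT/tmp_ICS/PET*
tail -n 50 "$(ls -S INPUT/tmp_ICS/PET* | head -n 1)"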

Thanks.

Sam,

How many nodes and cores are you requesting for chgres_cube?  For a 3-km domain, you need (on average) at least 2 nodes with 6 cores each on NOAA HPC, and sometimes much more than that, depending on the size of the domain.  It looks like only two MPI processes were killed during this run of chgres_cube.  Can you try increasing the process count?
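One quick way to confirm how many MPI tasks chgres_cube actually ran with: ESMF writes one PET* log per MPI rank, so counting those files (same assumed path as above) gives the task count:

# Each MPI rank writes its own PET* log, so this count equals the number of tasks
ls INPUT/tmp_ICS/PET* | wc -l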

Does this mean I need to change my cloud resource configuration since I'm only using a single c5n.4xlarge instance? Or just the config file (shown below)?

config.sh

MACHINE=LINUX
ACCOUNT="an_account"
EXPT_BASEDIR=$DOCKER_TEMP_DIR/experiment
EXPT_SUBDIR="test_CONUS_3km_GFSv15p2"

VERBOSE="TRUE"

RUN_ENVIR="community"
PREEXISTING_DIR_METHOD="rename"

STMP=$DOCKER_TEMP_DIR/stmp
PTMP=$DOCKER_TEMP_DIR/ptmp

PREDEF_GRID_NAME="RRFS_CONUS_3km"
GRID_GEN_METHOD="ESGgrid"
QUILTING="TRUE"
CCPP_PHYS_SUITE="FV3_GFS_v15p2"
FCST_LEN_HRS="12"
LBC_SPEC_INTVL_HRS="3"

TOPO_DIR=$DOCKER_TEMP_DIR/fix_orog
SFC_CLIMO_INPUT_DIR=$DOCKER_TEMP_DIR/fix_sfc_climo

DATE_FIRST_CYCL="20190615"
DATE_LAST_CYCL="20190615"
CYCL_HRS=( "18" )

EXTRN_MDL_NAME_ICS="FV3GFS"
EXTRN_MDL_NAME_LBCS="FV3GFS"

FV3GFS_FILE_FMT_ICS="grib2"
FV3GFS_FILE_FMT_LBCS="grib2"

WTIME_RUN_FCST="01:30:00"

#
# Uncomment the following line in order to use user-staged external model 
# files with locations and names as specified by EXTRN_MDL_SOURCE_BASEDIR_ICS/
# LBCS and EXTRN_MDL_FILES_ICS/LBCS.
#
USE_USER_STAGED_EXTRN_FILES="TRUE"
#
# The following paths were originally written for Hera and have been adapted
# for this Docker setup.  They will have to be modified if on another
# platform, using other dates, other external models, etc.
#
EXTRN_MDL_SOURCE_BASEDIR_ICS="$DOCKER_TEMP_DIR/model_data/FV3GFS"
EXTRN_MDL_FILES_ICS=( "gfs.pgrb2.0p25.f000" )
EXTRN_MDL_SOURCE_BASEDIR_LBCS="$DOCKER_TEMP_DIR/model_data/FV3GFS"
EXTRN_MDL_FILES_LBCS=( "gfs.pgrb2.0p25.f003" "gfs.pgrb2.0p25.f006" "gfs.pgrb2.0p25.f009" "gfs.pgrb2.0p25.f012" )
FIXgsm=$DOCKER_TEMP_DIR/fix_am

RUN_CMD_FCST="mpirun -np \${PE_MEMBER01}"

# Twelve (12) core machines
RUN_CMD_UTILS="mpirun -np 12"
RUN_CMD_POST="mpirun -np 12"

# Comment out the next five lines if you want the 12 core settings
# Four (4) core machines
LAYOUT_X="1"
LAYOUT_Y="3"
RUN_CMD_UTILS="mpirun -np 4"
RUN_CMD_POST="mpirun -np 4"

Sam,

Yes, you need to request more than a single instance to run this task at 3-km resolution.  We typically use 4 nodes x 12 cores, so please try at least 48 processes.
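For the community workflow on a generic Linux machine, a minimal sketch of the corresponding config.sh change (RUN_CMD_UTILS is the launch command the workflow uses for the pre-processing utilities, chgres_cube included; the exact mpirun syntax depends on your MPI stack and available slots):

# Launch chgres_cube and the other pre-processing utilities with 48 MPI tasks
RUN_CMD_UTILS="mpirun -np 48"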

Ok, I just attempted to run the SRW GST with the attached config file on a Parallelworks cluster with 4 compute nodes (I had previously been using 1 node); however, I am still getting the same error. I also tried running "ulimit -u unlimited" before starting the model, as suggested by Jebb Stewart.
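For reference, this is what I ran in the container before starting the workflow (the stack-size line is my own addition, since it is often suggested alongside the process limit; I have not confirmed it makes a difference here):

# Raise the per-user process limit, as suggested
ulimit -u unlimited
# Also raising the stack size, on the guess that it may help ESMF (untested)
ulimit -s unlimited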

I did get past line 3000 in the make_ics log file, which I think is further than I have gotten previously.

Hi Sam,

Sorry for the delay as I was out on vacation this week.  Your config file shows that you're running 12 processes, which is still too low for a 3-km domain.  Can you please try 48?  ESMF requires a large memory footprint.

Thanks.

Ok, yeah, I can try that. I also reached out to some other people working on the UFS SRW App, Christopher Harrop and Christina Holt, and we came to the conclusion that since I am running the application in a standard Docker container, I am only using a single node even when 4 are available, and that I would need a service such as Docker Swarm to be successful.

Sam,

Oh, OK, yes, running in a Docker container will definitely constrain your ability to run chgres_cube, since ESMF is so memory intensive.  I hope using Docker Swarm will resolve your issue!

Ok, I probably won't have time to do that since my internship ends next week, but whoever picks up where I left off will be able to use this info. Thanks for all your help!