3km resolution segfault

Hi,

I am in the process of running the SRW GST on Orion. While I have been able to run the 25-km control run, I have been having difficulty running the first experiment, which uses 3-km resolution. Do you know what the issue might be?

The task that is seg-faulting is make_grid in the experiment directory /work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2, using the configurations set in /work/noaa/epic-ps/sephraim/ufs-srweather-app/regional_workflow/ush/. I've attached some screenshots of the rocotostat output along with the log file for the make_grid step.

Thanks

Hi Sam,

I was able to run a test and produce a 3-km CONUS grid on Orion without a problem using the release branch of the SRW App.  You can find the test directory here:

/work/noaa/hmtb/beck/expt_dirs/test_grid_release

Can you please try again from scratch (reclone/rebuild in a new directory)?

Thanks.

Sam,

I found one of your old directories instead.  Can you please try changing your nodes and cores in the MAKE_GRID task in FV3LAM_wflow.xml as follows:

from:

<nodes>1:ppn=24</nodes>

to:

<nodes>2:ppn=24</nodes>

Then try running it again.
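In case it's useful, relaunching just that task with Rocoto would look roughly like this (the cycle timestamp passed to -c is a placeholder; use the cycle that rocotostat reports for your run):

```shell
# From the experiment directory: rewind the failed make_grid task, then
# kick the workflow again so Rocoto resubmits it with the new node count.
cd /work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2
rocotorewind -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -c 201906150000 -t make_grid
rocotorun   -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
```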

Thanks.

I just tried that, but unfortunately I'm still getting the same error. I've just shared the new /work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2 directory.

Can you recursively open up global execute access to your ufs-srweather-app directory here:

/work/noaa/epic-ps/sephraim/ufs-srweather-app

I'm unable to "cd" to that directory to check things inside.
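One way to do that (the capital X grants execute only where it makes sense, i.e., on directories and files that are already executable for the owner; add r if you'd also like the contents to be readable):

```shell
# Recursively let other users traverse (and read) the tree.
chmod -R o+rX /work/noaa/epic-ps/sephraim/ufs-srweather-app
```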

Also, can you please try running the make_grid task with <nodes>4:ppn=1</nodes>?  I suspect this is a memory issue of some kind.

Finally, if that doesn't work, can you try building the ufs-srweather-app using the devbuild.sh script?  You'd use it as follows after cloning the ufs-srweather-app repo: "devbuild.sh orion intel"
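The full sequence would look something like this (a sketch; the checkout_externals step pulls in the component repositories, and the clone directory name is up to you):

```shell
# Fresh clone and build using the all-in-one build script.
git clone https://github.com/ufs-community/ufs-srweather-app.git
cd ufs-srweather-app
./manage_externals/checkout_externals   # fetch regional_workflow and components
./devbuild.sh orion intel               # platform and compiler, as above
```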

Thanks.

Unfortunately, neither of those changes worked; both ended with the same error. The new directory where I built the ufs-srweather-app using the devbuild.sh script is called ufs-srweather-app-new.

One of my mentors suggested submitting the batch jobs manually instead of going through the workflow, so that I could generate more helpful error messages. Do you think that would help?

Hi Sam,

Unfortunately, I'm still unable to reproduce this error, so we're continuing to run some tests.  Hopefully we'll have a suggestion for you soon.  You can go ahead and try to run the batch job manually as you mentioned.  It's possible that there is some environment setting that's causing a problem, which could become apparent in that case.
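A rough sketch of what a manual submission might look like (the account, walltime, node counts, and J-job path below are illustrative; take the real values from the MAKE_GRID entry in your FV3LAM_wflow.xml, and note the J-job expects the environment variables the XML sets):

```shell
# Illustrative only: submit the make_grid J-job directly with Slurm so the
# raw stdout/stderr land in a log file you control.
cd /work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2
sbatch --account=epic-ps --nodes=2 --ntasks-per-node=24 --time=00:20:00 \
       --output=make_grid_manual.log \
       /work/noaa/epic-ps/sephraim/ufs-srweather-app/regional_workflow/jobs/JREGIONAL_MAKE_GRID
```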

Thanks for trying everything suggested so far.

OK, I'll try that. Here are my environment variables from the env command, in case they help:

MANPATH=/apps/lmod/lmod/share/man:/usr/share/man:/apps/share/man:/apps/man:/opt/slurm/share/man
XDG_SESSION_ID=3794265
_ModuleTable003_=VEgiXT0iL2FwcHMvbW9kdWxlZmlsZXMvY29yZSIsfQ==
HOSTNAME=Orion-login-1.HPC.MsState.Edu
__LMOD_REF_COUNT_MODULEPATH=/apps/modulefiles/core:1;/apps/contrib/modulefiles:1;/apps/contrib/miniconda3-noaa-gsl/modulefiles:2
TERM=xterm-256color
SHELL=/bin/bash
LMOD_ROOT=/apps/lmod
HISTSIZE=100
MODULEPATH_ROOT=/apps/modulefiles
SSH_CLIENT=130.18.14.112 39952 22
CONDA_SHLVL=1
CONDA_PROMPT_MODIFIER=(regional_workflow)
LMOD_PKG=/apps/lmod/lmod
FPATH=/apps/lmod/lmod/init/ksh_funcs
QTDIR=/usr/lib64/qt-3.3
OLDPWD=/work/noaa/epic-ps/sephraim/ufs-srweather-app/regional_workflow/ush
LMOD_VERSION=8.3.17
QTINC=/usr/lib64/qt-3.3/include
SSH_TTY=/dev/pts/60
__LMOD_REF_COUNT_LOADEDMODULES=contrib/0.1:1;rocoto/1.3.3:1;miniconda3/3.8:1
QT_GRAPHICSSYSTEM_CHECKED=1
HISTFILESIZE=0
USER=sephraim
LMOD_sys=Linux
LD_LIBRARY_PATH=/apps/contrib/miniconda3-noaa-gsl/3.8/lib:/lib64:/usr/lib64:/lib:/usr/lib:/usr/lib64/qt-3.3/lib:/opt/slurm/lib:/opt/slurm/lib/slurm
CONDA_EXE=/apps/contrib/miniconda3-noaa-gsl/3.8/bin/conda
__LMOD_REF_COUNT__LMFILES_=/apps/modulefiles/core/contrib/0.1:1;/apps/contrib/modulefiles/rocoto/1.3.3:1;/apps/contrib/miniconda3-noaa-gsl/modulefiles/miniconda3/3.8:1
_CE_CONDA=
_ModuleTable001_=X01vZHVsZVRhYmxlXz17WyJNVHZlcnNpb24iXT0zLFsiY19yZWJ1aWxkVGltZSJdPWZhbHNlLFsiY19zaG9ydFRpbWUiXT1mYWxzZSxkZXB0aFQ9e30sZmFtaWx5PXt9LG1UPXtjb250cmliPXtbImZuIl09Ii9hcHBzL21vZHVsZWZpbGVzL2NvcmUvY29udHJpYi8wLjEiLFsiZnVsbE5hbWUiXT0iY29udHJpYi8wLjEiLFsibG9hZE9yZGVyIl09MSxwcm9wVD17fSxbInN0YWNrRGVwdGgiXT0wLFsic3RhdHVzIl09ImFjdGl2ZSIsWyJ1c2VyTmFtZSJdPSJjb250cmliIix9LG1pbmljb25kYTM9e1siZm4iXT0iL2FwcHMvY29udHJpYi9taW5pY29uZGEzLW5vYWEtZ3NsL21vZHVsZWZpbGVzL21pbmljb25kYTMvMy44IixbImZ1bGxOYW1lIl09Im1pbmljb25kYTMvMy44IixbImxv
PATH=/apps/contrib/miniconda3-noaa-gsl/3.8/envs/regional_workflow/bin:/apps/contrib/miniconda3-noaa-gsl/3.8/bin:/apps/contrib/rocoto/1.3.3/bin:/apps/contrib/miniconda3-noaa-gsl/3.8/condabin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/apps/sbin:/apps/bin:/opt/slurm/bin
CONDA_PREFIX=/apps/contrib/miniconda3-noaa-gsl/3.8/envs/regional_workflow
PWD=/work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2
_LMFILES_=/apps/modulefiles/core/contrib/0.1:/apps/contrib/modulefiles/rocoto/1.3.3:/apps/contrib/miniconda3-noaa-gsl/modulefiles/miniconda3/3.8
LANG=en_US.UTF-8
MODULEPATH=/apps/modulefiles/core:/apps/contrib/modulefiles:/apps/contrib/miniconda3-noaa-gsl/modulefiles
LOADEDMODULES=contrib/0.1:rocoto/1.3.3:miniconda3/3.8
_ModuleTable_Sz_=3
KDEDIRS=/usr
LMOD_CMD=/apps/lmod/lmod/libexec/lmod
_CE_M=
HISTCONTROL=ignoredups
__LMOD_SET_FPATH=1
SHLVL=1
HOME=/home/sephraim
__LMOD_REF_COUNT_PATH=/apps/contrib/miniconda3-noaa-gsl/3.8/bin:1;/apps/contrib/rocoto/1.3.3/bin:1;/apps/contrib/miniconda3-noaa-gsl/3.8/envs/regional_workflow/bin:1;/apps/contrib/miniconda3-noaa-gsl/3.8/condabin:1;/sbin:1;/usr/sbin:1;/bin:1;/usr/bin:1;/usr/lib64/qt-3.3/bin:1;/apps/sbin:1;/apps/bin:1;/opt/slurm/bin:1
_ModuleTable002_=YWRPcmRlciJdPTMscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0ibWluaWNvbmRhMyIsfSxyb2NvdG89e1siZm4iXT0iL2FwcHMvY29udHJpYi9tb2R1bGVmaWxlcy9yb2NvdG8vMS4zLjMiLFsiZnVsbE5hbWUiXT0icm9jb3RvLzEuMy4zIixbImxvYWRPcmRlciJdPTIscHJvcFQ9e30sWyJzdGFja0RlcHRoIl09MCxbInN0YXR1cyJdPSJhY3RpdmUiLFsidXNlck5hbWUiXT0icm9jb3RvIix9LH0sbXBhdGhBPXsiL2FwcHMvbW9kdWxlZmlsZXMvY29yZSIsIi9hcHBzL2NvbnRyaWIvbW9kdWxlZmlsZXMiLCIvYXBwcy9jb250cmliL21pbmljb25kYTMtbm9hYS1nc2wvbW9kdWxlZmlsZXMiLH0sWyJzeXN0ZW1CYXNlTVBB
BASH_ENV=/apps/lmod/lmod/init/bash
CONDA_PYTHON_EXE=/apps/contrib/miniconda3-noaa-gsl/3.8/bin/python
LOGNAME=sephraim
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
XDG_DATA_DIRS=/home/sephraim/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
SSH_CONNECTION=130.18.14.112 39952 130.18.14.111 22
MODULESHOME=/apps/lmod/lmod
CONDA_DEFAULT_ENV=regional_workflow
__LMOD_REF_COUNT_LD_LIBRARY_PATH=/apps/contrib/miniconda3-noaa-gsl/3.8/lib:1;/lib64:1;/usr/lib64:1;/lib:1;/usr/lib:1;/usr/lib64/qt-3.3/lib:1;/opt/slurm/lib:1;/opt/slurm/lib/slurm:1
LMOD_SETTARG_FULL_SUPPORT=no
LESSOPEN=||/usr/bin/lesspipe.sh %s
XDG_RUNTIME_DIR=/run/user/9005
QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins
LMOD_DIR=/apps/lmod/lmod/libexec
BASH_FUNC_module()=() {  eval $($LMOD_CMD bash "$@") && eval $(${LMOD_SETTARG_CMD:-:} -s sh)
}
BASH_FUNC_ml()=() {  eval $($LMOD_DIR/ml_cmd "$@")
}
_=/bin/env

Sam,

I can't read the directories/logs for your most recent 3-km tests, so can you please take a look at the following directory where I've successfully tested the make_grid task with both the release and develop branches of the UFS SRW App?

/work/noaa/hmtb/beck/expt_dirs

Take a look at the modules loaded in the make_grid.log file and compare them to what you're using.  Also, please compare the config.sh and var_defns.sh files in each directory to see if you can spot any differences.
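For example (WORKING_EXPT is a placeholder for whichever of my experiment directories you compare against, and the log path is where the SRW App normally writes task logs):

```shell
# Compare the failing experiment's settings against a known-good one.
cd /work/noaa/epic-ps/sephraim/expt_dirs/test_CONUS_3km_GFSv15p2
diff config.sh    /work/noaa/hmtb/beck/expt_dirs/WORKING_EXPT/config.sh
diff var_defns.sh /work/noaa/hmtb/beck/expt_dirs/WORKING_EXPT/var_defns.sh
# See which modules the successful make_grid run loaded.
grep -i "module" /work/noaa/hmtb/beck/expt_dirs/WORKING_EXPT/log/make_grid.log
```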

Thanks.

I got a permission-denied error when trying to access /work/noaa/hmtb/beck/expt_dirs. Are you able to share it? I've just tried re-sharing my experiment directory with you too.

Hi Sam, I just corrected the permissions on my expt_dirs directory.  Nothing jumped out at me in your environment settings, and another team member ran the latest code and successfully created a 3-km CONUS grid today, so I think it may be time to get the Orion admins involved.

Would you be willing to send them an email (rdhpcs.orion.help@noaa.gov) describing the problem and providing the directories of your failed runs as well as those of the successful runs that I've completed?  Since this appears to be a very specific system issue, they should be in a better position to determine where the problem is coming from.

Once you've heard back from them, please feel free to update this thread so that our team can follow up with you on any necessary recommendations or changes to the code.  Sorry that this has not yet been resolved.