Segmentation Fault for NCEP-provided Simple Test Case

Hello!

I am encountering segmentation faults each time I attempt to run the Simple Test Case provided by NCEP (https://ftp.emc.ncep.noaa.gov/EIB/UFS/simple-test-case.tar.gz), and I am unsure how to troubleshoot the issue. I have reconfigured the job several times from the original sbatch (SLURM) submission script to make sure the processes are not running out of memory, but the problem persists and always occurs at the same point at the start of the run.

For background, I have built all dependencies and UFS from source using the UFS-v1.1.0 GitHub releases of the dependencies and of the UFS weather model itself. This is the first test run I have attempted since compiling everything. Compiling from source was necessary because I am building the model on the local university-owned supercomputer, which runs CentOS 8. All dependencies were built with gcc 10.1.0 from the source code in the GitHub repos, with the exception of cmake (3.17.3) and Open MPI (4.0.4), which were already installed on the system.

All dependencies (NCEPLIBS-external, NCEPLIBS, ufs-weather-model) compiled with gcc 10.1.0 without errors, aside from general compilation warnings and notes. I also checked several times that each component was linking against the libraries I compiled from source and not other versions that may already be on the system, to rule out a mismatch in compilers or libraries across the build.

Attached are the console output, console error, and ufs-generated logfile from the ufs-weather-model simple test case run showing the segmentation fault. The backtrace seems to indicate that the issue may be happening in ESMF. Also, at the start of the run there is this line: "Error: coll_hcoll_module.c:301 - mca_coll_hcoll_comm_query() Hcol library init failed". A web search suggests it is related to the UCX init failure that is also listed, but I'm unfamiliar with UCX and unsure whether it is related to the segmentation fault.

Any help is appreciated, and thank you in advance.

Tim

Hi Tim.

I am not sure, but that sounds like it could be a ulimit issue. Can you try setting your stack size to unlimited, or as high as you are allowed? The command is:

ulimit -s unlimited

After that, try running the model again.
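
In case it helps, here is a minimal sketch (assuming a bash-like shell) for checking the limit before and after the change. The `stack_limits` helper is just a name made up for illustration. If the system refuses `unlimited`, the fallback raises the soft limit to the hard limit, which is the highest value an unprivileged user can set.

```shell
# Hypothetical helper: print the soft and hard stack limits
# (values are in KiB, or the word "unlimited").
stack_limits() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -s)" "$(ulimit -H -s)"
}

stack_limits

# Try for unlimited first; if that is not permitted, fall back to
# raising the soft limit as far as the hard limit allows.
ulimit -s unlimited 2>/dev/null || ulimit -S -s "$(ulimit -H -s)"

stack_limits
```

If the second line of output still shows a small soft value, the limit did not change and the cluster administrators may need to raise the hard limit.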

If that doesn't work, we'll have to try something else.

-Mark

Hi Mark,

I added the ulimit -s unlimited to my SLURM submission script, but the result was the same. I even made sure to request a full node (128 processor cores, 512GB memory) to make sure my job was the only one running at the time so the command could have maximum effect.
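
For what it's worth, here is a sketch of how one could confirm that a launched process actually sees the raised limit, rather than only the shell that ran `ulimit`. `with_max_stack` is a hypothetical helper, not part of SLURM or UFS, and the launch line in the trailing comment is an assumption about a typical setup, not the actual script from this run.

```shell
# Hypothetical helper: run a command with the soft stack limit raised to
# the hard limit. The subshell keeps the change from leaking to the caller.
with_max_stack() {
    (
        ulimit -S -s "$(ulimit -H -s)" || exit 1
        exec "$@"
    )
}

# Report the limit as seen from inside a child process.
with_max_stack sh -c 'echo "stack limit seen by child: $(ulimit -s)"'

# In a batch script, the same idea could wrap the launch line so that every
# rank sets the limit itself (executable name is a placeholder):
#   srun bash -c 'ulimit -s unlimited; exec ./model_executable'
```

If the child reports the expected value and the run still crashes at the same point, the stack limit is probably not the cause.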

Open to any further suggestions. I'm currently rebuilding the entire dependency stack in a separate location from what I currently have built to see if perhaps something didn't link together correctly in my initial build.

Thanks,

Tim

What version of the ufs-weather-model are you using? Did you clone it from the head of develop or are you using a release version?

I've been using version 1.1.0 and have tried to remain consistent throughout the build.

For NCEPLIBS and NCEPLIBS-external, I've made sure to use -b ufs-v1.1.0 when doing a 'git clone' of the repositories, as referenced here: https://github.com/NOAA-EMC/NCEPLIBS-external#get-and-build-the-code

For UFS itself, I used the instructions here (https://ufs-weather-model.readthedocs.io/en/ufs-v1.1.0/BuildingAndRunning.html#downloading-the-weather-model-code) to switch to the v1.1.0 branch and to perform the submodule update.
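
A quick consistency check along these lines can confirm that each clone really sits on the intended release and that no submodule was left uninitialized. This is only a sketch: `checkout_matches_ref` and `submodules_initialized` are hypothetical helper names, and the directory names in the usage comment are assumptions about where the repositories were cloned.

```shell
# Hypothetical helper: succeed if the checkout in directory $1 has HEAD at
# the same commit as the tag or branch named $2.
checkout_matches_ref() {
    cmr_head=$(git -C "$1" rev-parse -q --verify HEAD) || return 1
    cmr_want=$(git -C "$1" rev-parse -q --verify "${2}^{commit}") || return 1
    [ "$cmr_head" = "$cmr_want" ]
}

# Hypothetical helper: succeed if no submodule under directory $1 is
# uninitialized ("git submodule status" marks those with a leading "-").
submodules_initialized() {
    ! git -C "$1" submodule status --recursive | grep -q '^-'
}

# Example usage (directory names are assumed, adjust to the real paths):
#   for d in NCEPLIBS-external NCEPLIBS ufs-weather-model; do
#       checkout_matches_ref "$d" ufs-v1.1.0 || echo "$d: not at ufs-v1.1.0"
#       submodules_initialized "$d" || echo "$d: submodule not initialized"
#   done
```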

Unfortunately, that version is now really outdated, and to be honest, so is the simple test case. It might make more sense for you to take a look at building the hpc-stack (https://github.com/NOAA-EMC/hpc-stack) and then building the ufs-weather-model from the head of develop using that. If you can build that, you will be able to run any of the regression tests that are currently being used and will be able to build the model with just the atmosphere or with other components like Ocean, Ice, aerosols, etc. We are currently working to get a release of the Short Range Weather application done in the next two months, and that will have much better (and updated) documentation on how to build not only the hpc-stack, but also the application and the weather model. I'll see if I can get you a link to the draft documentation.

 

Fair enough. My complete rebuild resulted in the exact same SEGFAULT despite no compilation, library, or linking errors, so it's probably futile at this point and best to try something new given how out of date that release is. I'll look into the hpc-stack next and see what I can do with that. I was hoping to work through the graduate student test (GST) with a few students here to prepare them for models beyond WRF, which they already know well. That is why I was trying to use the older version: it seemed to be a "stable" release option for doing the GST. I look forward to the updated SRW App and documentation when they're available.

Thank you.

Hi Tim,

The most updated hpc-stack documentation is here, and the most updated SRW App documentation is here (for now). The documentation in the docs folder of the ufs-srweather-app develop branch is also up-to-date and can be built locally, but the "ReadTheDocs" link provided in the README.md file and the wiki is not pointing to the head of develop as it should (we're working on resolving this). In other words, if you build the docs locally from your clone of the SRW App develop branch, those docs will remain up-to-date, but if you click on the "ReadTheDocs" link provided on GitHub, you'll get an old version. 

Best,

Gillian