Numerical instability/overflow in mpp_reproducing_sum with a multithreaded run

Hi,
I am trying to run the MR weather model example with 32 MPI ranks and 4 OpenMP threads per MPI rank.
The run fails with the following error:

[0]  Rayleigh friction E-folding time (days):
[0]            1  0.379150775374218        10.8096140887264
[0]            2  0.963871677296582        16.5818681310322
[0]            3   1.76542623475949        29.0560344329854
[0]            4   2.67225797307616        53.4602794273955
[0]            5   3.70625064534251        110.544757762893
[0]            6   4.88725381108638        293.610300159162
[0]            7   6.23670999273840        1568.14167700556
[20]
[20] FATAL from PE    20: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
[20]
[20] Abort(1) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
[15]
[15] FATAL from PE    15: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
[15]
[15] Abort(1) on node 15 (rank 15 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
[18]
[18] FATAL from PE    18: Overflow in mpp_reproducing_sum(_2d) conversion of   4.73530E+57
[18]
[18] Abort(1) on node 18 (rank 18 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 18
[22]
[22] FATAL from PE    22: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
[22]
[22] Abort(1) on node 22 (rank 22 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22


I used the following settings in user_nl_ufsatm:
layout = 2,2
write_groups = 8
write_tasks_per_group = 1


Do I need to use different layout/write_groups/write_tasks_per_group values to make hybrid (MPI+OpenMP) runs work?
Also, if the values in user_nl_ufsatm could be triggering this issue, are there any recommendations for choosing them? (A quick task-count check for both of my configurations is included after the 128-rank settings below.)
Please advise.

So far I have successfully tested a run with 128 MPI ranks x 1 OpenMP thread with the following user_nl_ufsatm:
layout = 2,4
write_groups = 80
write_tasks_per_group = 1
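
As a sanity check on the task counts (this is my understanding of the FV3 decomposition and may be off, so please correct me): the compute component uses layout(1) x layout(2) ranks on each of the 6 cube-sphere tiles, and the write component uses write_groups x write_tasks_per_group ranks, so the total MPI ranks should be the sum of the two. A minimal Python sketch with my two configurations plugged in (total_tasks is just a hypothetical helper for the arithmetic):

# Hypothetical helper to check the FV3 MPI task budget
# (assumes 6 cube-sphere tiles plus a quilting write component).
def total_tasks(layout_x, layout_y, write_groups, write_tasks_per_group, ntiles=6):
    compute = layout_x * layout_y * ntiles            # forecast/compute ranks
    write = write_groups * write_tasks_per_group      # write/IO ranks
    return compute + write

print(total_tasks(2, 2, 8, 1))    # 32  -> matches my 32-rank hybrid case
print(total_tasks(2, 4, 80, 1))   # 128 -> matches my 128-rank MPI-only case

Both totals match the rank counts I am launching with, so the task budget itself looks consistent.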
 

On the same setup (layout = 2,2, write_groups = 8, write_tasks_per_group = 1) with 32 MPI ranks and OMP_NUM_THREADS=2 I get:

[0]  vgw done
[0]    9.19228845590341       -5.14752887173209      vgw ax
[0]    5.19630711310237       -170.264975069893      vgw ay
[0]   0.786986461401865       0.000000000000000E+000 vgw keddy m2/sec
[0]   5.172914476558549E-002 -0.636876931324140      vgw eps
[0]  PASS: fcstRUN phase 1, na =            0  time is    4.15954065136611
[0]  in fcst run phase 2, na=           0
[22]
[22] FATAL from PE    22: compute_qs: saturation vapor pressure table overflow, nbad=    254
[22]
[22] Abort(1) on node 22 (rank 22 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22

With OMP_NUM_THREADS=1 the application works fine and runs to completion successfully.
The application was compiled with the Intel compiler (version 2019) and the operating system is RHEL 8.
Please let me know if more information is needed from my end on this issue.

I was supplying the OpenMP flag via CFLAGS. After setting the BUILD_THREADED flag, the application works fine.
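In case it helps anyone else: in a CIME-based case, I believe the equivalent is setting the case variable BUILD_THREADED to TRUE (e.g. running ./xmlchange BUILD_THREADED=TRUE in the case directory) and then doing a clean rebuild, rather than passing the Intel -qopenmp flag by hand through CFLAGS. Please correct me if the recommended workflow differs.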