Debugging

Debugging MESA requires effort and tenacity. There is no single approach that is applicable in all circumstances. This page collects information on frequently useful tools and techniques.

General Techniques

Navigating through source code

Tools like grep — or more advanced variants like ack, ag, or rg — allow you to rapidly search through the code base for occurrences of a given string or pattern. These tools are helpful for finding where a particular function/subroutine is defined or called or where a particular variable is used.

Your text editor (or an IDE) may also provide capabilities for semantic navigation (e.g., jump to definition).

There is auto-generated documentation from Doxygen available at http://mesa.sourceforge.net/dox/. This provides searchable lists of files, function/subroutines, and shows call graphs.

Using a debugger

The MESA SDK contains the GNU debugger gdb. Detailed instructions for using gdb are beyond the scope of the documentation, but it allows you to interactively set breakpoints, step through code, and inspect values.

Using pgstar

The built in pgstar plotting is itself an invaluable a debugging tool. Something may jump out from a plot as strange behavior before it shows up as a problem in the run. You may notice that an issue begins when the model reaches certain conditions and that can provide a useful guide to the problem. After you identify an issue and configure some useful plots, consider setting pgstar_interval = 1 and pause = .true. so that you can step through the model timestep-by-timestep.

Using a profiler

Troubleshooting performance bugs or memory leaks benefits from specialized tools (see Profiling).

Diagnosing Solver Struggles

Identifying when the solver is having difficulties and understanding the root cause is an important subset of MESA debugging. When you see a MESA model experiencing problems (e.g., taking unexpectedly small timesteps due to many retries), you should at the least report the issue to the developers’ list and ideally investigate it a bit yourself. Paying attention to such problems has proved a valuable way of identifying errors and inconsistencies within MESA.

Diagnosing solver struggles primarily relies on internal MESA tooling. The most common cause of solver struggles is incorrect/inaccurate derivatives. This document will illustrate how to go from a failing model down into the depths of the code in order to identify and fix the problem.

Step -1: Introduce a bug

For the purpose of this tutorial, we will introduce a bug in MESA. In more typical circumstances, you or someone else will have kindly already taken care of this step.

Navigate to the subroutine Get_kap_Results in kap/private/kap_eval.f90. After the call to Get_kap_Results_blend_T, add

dlnkap_rad_dlnRho = -dlnkap_rad_dlnRho

thereby making the derivatives of the radiative opacity with respect to density incorrect.

After you’ve make this change, in kap do

./mk
./export

to recompile and export the module.

Step 0: Notice your model has problems

There are many ways in which a MESA model may have problems. It might fail to run with gold tolerances or it might take unexpectedly small timesteps. The timestep is often driven downwards by repeated retries. Information about the retry trigger is printed to the terminal. Or, if the solver requires more iterations than a specified limit (solver_iters_timestep_limit and related controls), it also reduces the timestep and indicates the dt_limit as solver iters.

Note

The debugging steps that follow are not generally applicable when the solver is performing well. For other kinds of bugs (e.g., an option doesn’t produce the expected behavior) one often has clearer idea where the problem may be and so you can head straight to that part of the code.

We will use the 1.3M_ms_high_Z test suite case to illustrate the effect of the bug that we introduced.

In r14460, this case took 377 steps, required 0 retries, and there were a total of 1850 iterations of the newton solver. With the bug, this test case stops when it reaches max_model_number after 500 steps, having done 12 retries and a total of 8145 newton iterations.

Navigate to this test (in star/test_suite) and run it. In the terminal output, you can see that the timesteps are often set by solver iters.

       step    lg_Tmax     Teff     lg_LH      lg_Lnuc     Mass       H_rich     H_cntr     N_cntr     Y_surf   eta_cntr   zones  retry
   lg_dt_yr    lg_Tcntr    lg_R     lg_L3a     lg_Lneu     lg_Mdot    He_core    He_cntr    O_cntr     Z_surf   gam_cntr   iters
     age_yr    lg_Dcntr    lg_L     lg_LZ      lg_Lphoto   lg_Dsurf   C_core     C_cntr     Ne_cntr    Z_cntr   v_div_cs       dt_limit
__________________________________________________________________________________________________________________________________________________

         10   7.186162   6026.290   0.310381   0.310381   1.300000   1.300000   0.639691   0.003250   0.320000  -2.349937    805      0
   4.820702   7.186162   0.119685 -44.790170  -1.003419 -99.000000   0.000000   0.320132   0.018721   0.040000   0.071396     16
 8.4252E+05   1.739919   0.314263 -14.875409 -99.000000  -7.019694   0.000000   0.005826   0.004199   0.040176  0.000E+00  solver iters
                               rel_E_err    5.8461078509513597D-13
                       log_rel_run_E_err      -11.0597023512447645

         13   7.185990   6017.566   0.300802   0.300802   1.300000   1.300000   0.639670   0.003358   0.320000  -2.355541    805      0
   4.873378   7.185990   0.113850 -44.803173  -1.013273 -99.000000   0.000000   0.320139   0.018721   0.040000   0.071285     15
 1.0438E+06   1.737310   0.300077 -14.750470 -99.000000  -7.010156   0.000000   0.005733   0.004199   0.040192  0.000E+00    varcontrol
                               rel_E_err    7.3563886572871659D-13
                       log_rel_run_E_err      -10.9707092147706380

save LOGS/profile2.data for model 13
         20   7.185235   5994.060   0.274258   0.274258   1.300000   1.300000   0.639610   0.003646   0.320000  -2.362919    806      0
   4.851161   7.185235   0.101141 -44.848473  -1.043016 -99.000000   0.000000   0.320157   0.018721   0.040000   0.071188     18
 1.5910E+06   1.733084   0.267860 -15.352530 -99.000000  -6.988274   0.000000   0.005486   0.004199   0.040233  0.000E+00  solver iters
                               rel_E_err    5.1727204553814077D-13
                       log_rel_run_E_err      -10.1566890117973099

Step 1: Activate debugging options

There are standard inlist commands available that can reveal more about the struggle that the solver is going through. They are stored in the aptly-named debugging_stuff_for_inlists which lives in star/test_suite directory. The full list of commands there is

      
      !photo_interval = 1
      !profile_interval = 1
      !history_interval = 1
      !terminal_interval = 1

! FOR DEBUGGING

      !report_solver_progress = .true. ! set true to see info about solver iterations
      !report_ierr = .true. ! if true, produce terminal output when have some internal error
      !stop_for_bad_nums = .true.
      !trace_evolve = .true.
      !fill_arrays_with_NaNs = .true.

      !solver_save_photo_call_number = 0
         ! Saves a photo when solver_call_number = solver_save_photo_call_number - 1
         ! e.g., useful for testing partials to set solver_call_number = solver_test_partials_call_number - 1

      !solver_test_partials_call_number = 4142
      !solver_test_partials_k = 2616
      !solver_test_partials_iter_number = 4
      !solver_test_partials_dx_0 = 1d-6
      !solver_test_partials_var_name = 'o16' ! 'all' or 'lnd', 'lnT', 'lnR', 'L', 'v', etc.    '' means code sets 
      !solver_test_partials_equ_name = 'lnP' ! 'all' or 'dlnE_dt', 'dlnd_dt', 'dlnR_dt', 'equL', etc   '' means code sets 
      !solver_test_partials_sink_name = 'si28' ! iso name to use for "sink" to keep sum = 1
      !solver_test_partials_show_dx_var_name = 'h1'

         ! equ name can also be one of these
            ! 'lnE', 'lnP', 'grad_ad' to test eos
            ! 'eps_nuc' to test net
            ! 'non_nuc_neu' to test neu
            ! 'gradT', 'mlt_vc' to test mlt
            ! 'opacity' to test kap

      !solver_test_partials_write_eos_call_info = .true.
      
      !solver_test_partials_k_low = -1
      !solver_test_partials_k_high = -1
      
      !solver_test_eos_partials = .true.
      !solver_test_kap_partials = .true.
      !solver_test_net_partials = .true.
      !solver_test_atm_partials = .true.

      !report_all_dt_limits = .true.
      !report_solver_dt_info = .true.
      
      !show_mesh_changes = .true.
      !mesh_dump_call_number = 5189
      !okay_to_remesh = .false.
      
      !energy_conservation_dump_model_number = -1
      
      !use_DGESVX_in_bcyclic = .true.
      !use_equilibration_in_DGESVX = .true.
      !report_min_rcond_from_DGESXV = .true.

      ! solver debugging
      !solver_check_everything = .true.
      
      !solver_epsder_struct = 1d-6
      !solver_epsder_chem = 1d-6
      
      !report_solver_dt_info = .true.
      !report_dX_nuc_drop_dt_limits = .true.
      !report_bad_negative_xa = .true.

Copy the contents of debugging_stuff_for_inlists to the controls section of your inlist. Note that many test suite inlists already contain these options.

Begin by uncommenting the following lines to get info about the newton solver progress:

report_solver_progress = .true. ! set true to see info about newton iterations
report_ierr = .true. ! if true, produce terminal output when have some internal error

Step 2: Run the model and find the bad spot

Running the model again will now produce a lot more output, giving information about each and every call to the solver. That output will look something like this:

                                 solver_call_number          17
                              gold tolerances level           2
                   correction tolerances: norm, max    3.0000000000000001D-05    3.0000000000000001D-03
                tol1 residual tolerances: norm, max    9.9999999999999994D-12    1.0000000000000001D-09
  1  coeff  1.0000     avg resid  0.518E-05      max resid dv_dt     1  0.111E-01  mix type xx000     avg corr  0.474E-03    max corr lnd   556  0.710E-02  mix type 00000     avg+max corr+resid
  2  coeff  1.0000     avg resid  0.561E-06      max resid dv_dt     1  0.275E-02  mix type xx000     avg corr  0.140E-03    max corr lnd   493  0.140E-02  mix type 11100     avg corr, avg+max resid
  3  coeff  1.0000     avg resid  0.384E-06      max resid dv_dt     1  0.521E-02  mix type xx000     avg corr  0.262E-04    max corr lnd   494  0.253E-03  mix type 11000     avg+max resid
  4  coeff  1.0000     avg resid  0.681E-06      max resid dv_dt     1  0.819E-02  mix type xx000     avg corr  0.560E-04    max corr lnd   494 -0.555E-03  mix type 11000     avg corr, avg+max resid
                tol2 residual tolerances: norm, max    1.0000000000000000D-08    1.0000000000000001D-05
  5  coeff  1.0000     avg resid  0.104E-05      max resid dv_dt     1  0.121E-01  mix type xx000     avg corr  0.780E-04    max corr lnd   494  0.758E-03  mix type 11000     avg corr, avg+max resid
  6  coeff  1.0000     avg resid  0.159E-05      max resid dv_dt     1  0.177E-01  mix type xx000     avg corr  0.113E-03    max corr lnd   494 -0.110E-02  mix type 11000     avg corr, avg+max resid
  7  coeff  1.0000     avg resid  0.251E-05      max resid dv_dt     1  0.263E-01  mix type xx000     avg corr  0.166E-03    max corr lnd   494  0.161E-02  mix type 11000     avg corr, avg+max resid
  8  coeff  1.0000     avg resid  0.403E-05      max resid dv_dt     1  0.383E-01  mix type xx000     avg corr  0.246E-03    max corr lnd   494 -0.238E-02  mix type 11000     avg corr, avg+max resid
  9  coeff  0.5000     avg resid  0.185E-05      max resid dv_dt     1  0.936E-02  mix type xx000     avg corr  0.359E-03    max corr lnd   494  0.348E-02  mix type 11000     avg+max corr+resid
 10  coeff  0.5000     avg resid  0.777E-06      max resid dv_dt     1  0.227E-02  mix type xx000     avg corr  0.857E-04    max corr lnd   494 -0.827E-03  mix type 11000     avg corr, avg+max resid
 11  coeff  1.0000     avg resid  0.262E-06      max resid dv_dt     1  0.329E-02  mix type xx000     avg corr  0.226E-04    max corr lnd   494  0.220E-03  mix type 11000     avg+max resid
 12  coeff  0.5000     avg resid  0.660E-07      max resid dv_dt     1  0.754E-03  mix type xx000     avg corr  0.306E-04    max corr lnd   494 -0.296E-03  mix type 11000     avg corr, avg+max resid
 13  coeff  0.5000     avg resid  0.190E-07      max resid dv_dt     1  0.185E-03  mix type xx000     avg corr  0.712E-05    max corr lnd   494  0.691E-04  mix type 11000     avg+max resid
 14  coeff  0.5000     avg resid  0.540E-08      max resid dv_dt     1  0.409E-04  mix type xx000     avg corr  0.172E-05    max corr lnd   494 -0.166E-04  mix type 11000     avg+max resid
 15  coeff  0.5000     avg resid  0.200E-08      max resid dv_dt     1  0.106E-04  mix type xx000     avg corr  0.382E-06    max corr lnd   494  0.369E-05  mix type 11000     max resid
 16  coeff  0.5000     avg resid  0.754E-09      max resid dv_dt     1  0.216E-05  mix type xx000     avg corr  0.103E-06    max corr lnd   494 -0.100E-05  mix type 11000     okay!

The first line gives solver_call_number, which uniquely identifies a particular solver call. (It will differ from the model number after retries have occurred.) Then, there is an indication of what set of solver tolerances are being used and what those tolerances are. After that, there is a series of lines that gives iteration-by-iteration summaries of what the solver did. The iteration number is the second column. After certain numbers of iterations (user-defined, see e.g., gold_iter_for_resid_tol2), the solver relaxes the residual tolerances and the new tolerances are indicated.

Other columns show information about the residuals and corrections. The “residual” is the left over difference between the left and right hand sides of the equation we are trying to solve. We do iterations to reduce that, but we are limited by the non-linearity of the problem and the quality of the estimates for the derivatives. The “correction” is the change in the primary variable, calculated using Newton’s method, so the Jacobian and residuals give a correction that would make the next residual vanish if the problem were linear and the Jacobian were exact, neither of which are true. So the best we can hope for is that the corrections will get smaller next iteration. When the maximum residual/correction is shown, the equation/variable and its cell index are indicated. The mixing types for the cells around these locations are also indicated. The 1’s are the convection and the 0’s are for no-mixing. (The full list of mixing types is in const_def.)

The last column is the simplest to understand. It tells us that for the early iterations, the solution was rejected because most everything was unsatisfactory. Then, as the iterations proceed and the residuals and corrections shrink, eventually only the maximum residual is unsatisfactory. Finally, the solver decides the solution is good enough.

Note

Less artificial bugs may behave somewhat differently than this toy problem. In this example, the solver is pretty much always struggling, so you could pick almost any step as the bad one. In practice, models often only experience problems under specific conditions and so a much smaller subset of steps are bad. Generally, you want to pick a step just as things start to go wrong. In this example, the solver also eventually succeeded. In practice, one often see cases where the newton iterations will completely fail to drive the max residual below the tolerances and the solver will give up, forcing a retry and reducing the timestep.

Step 3: Check the partial derivatives

From the terminal output in the previous part, we can see that the max residual of dv_dt equation is difficult to reduce and is located in cell 1.

We want to find the source of this bad residual by checking correctness of the partial derivatives of the corresponding equation. MESA can compare the analytic derivative that it has been given with a numerical estimate of the derivative based on finite differences and Ridder’s method of polynomial extrapolation (aka dfridr). To do so, we must ask for the derivatives at a particular cell at a particular iteration number of a particular solver call. This can be achieved with the inlist options:

solver_test_partials_call_number = 17
solver_save_photo_call_number = 17
solver_test_partials_iter_number = 8
solver_test_partials_k = 1
solver_test_partials_equ_name = 'dv_dt'
solver_test_partials_var_name = 'all'
solver_test_partials_dx_0 = 1d-6

Setting solver_save_photo_call_number = 17 produces a photo at the preceding step (x016) and so re-running the bad step becomes as easy as doing ./re. Debugging is itself an iterative process, so you want to make sure you can rapidly and consistently reproduce the problem. A little time invested up front letting you easily and rapidly trigger the problem pays dividends when you have repeat the procedure a dozen times.

Now when you do ./rn (or ./re after the photo exists) and the model will stop after the specified solver call and you’ll see

                                         equ name dv_dt
*****                 log dfridr rel_diff partials wrt  lnd(k)   of dv_dt(k)      1     0.677   log uncertainty    -9.668   analytic   -2.9597633549276736E-01   numeric   -1.7033292388187307E+00
                      log dfridr rel_diff partials wrt  lnd(k+1) of dv_dt(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

                      log dfridr rel_diff partials wrt  lnT(k)   of dv_dt(k)      1   -10.058   log uncertainty   -10.644   analytic   -6.9708999322145324E+00   numeric   -6.9708999316051248E+00
                      log dfridr rel_diff partials wrt  lnT(k+1) of dv_dt(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

?????                   log dfridr rel_diff partials wrt  L(k)   of dv_dt(k)      1    -4.255   log uncertainty    -4.272   analytic    6.3131382934316702E-39   numeric    6.3127875526588140E-39
                        log dfridr rel_diff partials wrt  L(k+1) of dv_dt(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

                      log dfridr rel_diff partials wrt  lnR(k)   of dv_dt(k)      1   -11.110   log uncertainty   -10.064   analytic   -2.0080506394821409E+00   numeric   -2.0080506394665569E+00
                      log dfridr rel_diff partials wrt  lnR(k+1) of dv_dt(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

This output shows the derivatives in the cell for each equation with respect to each variable. The column after the cell number shows the relative difference between the numerical (dfridr) and analytic derivatives. When things are working well, the relative differences go down to < 1d-12, but don’t panic if you see 1d-7. That can still work. But 1d-4 or more is a bad sign. Here, our bad derivative jumps out with an order unity relative error and is highlighted by the **** on the left side. So we now know the bad derivative is the one with respect to density.

The relative uncertainty in the numerical estimate of the derivative is also indicated. When this is comparable to the relative error between the numerical and analytic derivatives, then you’re probably not learning anything about the quality of the analytic derivative. (See for example the line marked with ???? above.) Sometimes that sort of result can be a false alarm caused by the particular choice of starting difference for the dfridr search. This is set by solver_test_partials_dx_0 and is typically something around 1d-6. Usually that works, but if you get a large err that makes no sense, you should try both larger and smaller values of to see if the dfridr result changes.

It is often the case that bad derivatives in an equation come from bad derivatives in the microphysical inputs that enter into the equation. Since this is common, MESA supports easily checking the basic eos, kap, and net quantities. In particular, the value of solver_test_partials_equ_name can be set to 'lnE' or 'lnP' (testing eos), to 'eps_nuc' (testing net), or 'opacity' (testing kap).

Apply your knowledge about which equation showed the issue. If the problem were in the energy equation, you would probably want to test lnE and eps_nuc. Since this issue appeared in the momentum equation, we will want to test lnP and since it was in the surface cell, we will want to test opacity as that enters into the atmosphere boundary condition.

We already found the issue was in the density derivatives, so we can now restrict our attention to that derivative by setting solver_test_partials_var_name = 'lnd'.

Setting solver_test_partials_equ_name == 'lnP' and restarting shows no problem

log dfridr rel_diff partials wrt  lnd(k)   of lnP(k)      1    -9.791   log uncertainty   -12.139   analytic    9.9965278727210283E-01   numeric    9.9965278743368469E-01
log dfridr rel_diff partials wrt  lnd(k+1) of lnP(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

but solver_test_partials_equ_name == 'opacity' and restarting shows

*****               log dfridr rel_diff partials wrt  lnd(k)   of opacity(k)      1     0.301   log uncertainty    -9.108   analytic   -2.9110358120328123E-01   numeric    2.9110358102038758E-01
                    log dfridr rel_diff partials wrt  lnd(k+1) of opacity(k)      1   -99.000   log uncertainty   -99.000   analytic    0.0000000000000000E+00   numeric    0.0000000000000000E+00

Comparing the bad analytic derivative to the numerical one reveals the sign error that we introduced.

We’ve reached the end of the part of debugging that can be done with inlists only. We will need to edit the code in order to MESA to dive deeper into a module or to check individual terms in the equation derivatives.

As a test case maintainer, this may be the point at which you stop, report the issue, and turn things over to others. If the issue is in the eos, you might dig in a bit more to see which component of the eos is active by setting solver_test_partials_write_eos_call_info = .true. and include that info in your report.

Step 5: Descend into an individual module

If the bug has already been found, this step isn’t necessary, but often one must continue deeper into a module. For the modules atm, eos, kap, net, and rates, it is possible to continue to do some of this debugging from star. This is extremely helpful because it means that the precise properties of the troublesome cell from star are passed into the module.

In the inlist set:

solver_test_kap_partials = .true.

(Note that using this option automatically sets MESA to run in single-thread mode.)

Since will specify which quantity to check in the code, we should stop selecting the equation in the inlist, but can continue to specify which derivative of that quantity to check:

solver_test_partials_equ_name = ''
solver_test_partials_var_name = 'lnd'

The variables kap_test_partials_val and kap_test_partials_dval_dx can now be set within kap and thereby communicate their values into star and dfridr.

At the end of Get_kap_results in kap/private/kap_eval.f90 — near where we introduced our bug — add

kap_test_partials_val = log(kap_rad)
kap_test_partials_dval_dx = dlnkap_rad_dlnRho

Again, the key point of this step is that we’re passing up a value (kap_rad) that we wouldn’t normally have access to from star. This procedure allows you to debug deep within a module from star.

We changed code in kap, so we need to do ./mk && ./export in star and then in the work directory also recompile before we restart ./mk && ./re.

The output should clearly illustrate that this derivative has the wrong sign. Before making a change that corrects the value of dlnkap_rad_dlnRho, consider putting what you think the right answer is in kap_test_partials_dval_dx and re-running to check that your new expression is OK. (If you first fix the bad derivative at the location of the bug, that change will mean that the newton iterations won’t be the same any more and you may not hit the same conditions and or trigger dfridr.)

After you’ve checked, make the bug fix, turn off the debugging options, run your model, and hopefully see improved performance.

Advanced: Investigate the bad cell and equation in detail

Note

We will now pretend we didn’t find the bug and re-use this setup to illustrate how to dig in at an even deeper level.

MESA equations and structure variables have integer indices. The full list is in star_data/public/star_data.inc, but the most common ones are

! index definitions for the variables (= 0 if variable not in use)
   integer :: i_lnd ! ln(cell density average by mass)
   integer :: i_lnT ! ln cell temperature average by mass
   integer :: i_lnR ! ln radius at outer face of cell
   integer :: i_lum ! luminosity at outer face of cell
   integer :: i_v ! Lagrangian velocity at outer face of cell
   integer :: i_u ! Lagrangian velocity at center of cell
   integer :: i_lnPgas ! ln cell gas pressure average by mass

! index definitions for the equations (= 0 if equation not in use)
   integer :: i_dlnR_dt ! dlnR/dt = v/r
   integer :: i_dv_dt ! momentum conservation using v
   integer :: i_du_dt ! momentum conservation using u
   integer :: i_dlnd_dt ! mass conservation
   integer :: i_dlnE_dt ! energy conservation
   integer :: i_equL ! luminosity

In Step 3, we found that we want to investigate the lnd derivative of the dv_dt equation in cell k = 1. If we hadn’t been able to so easily narrow things down to kap, then we would need to make a more detailed investigation of the equation in this cell.

To do so, go to star/private/hydro_eqns.f90 and follow the flow subroutine eval_equ_for_solver until you reach the equation that you want investigate.

The surface cell (k = 1) is special and so we see that the dv_dt equation for this cell is provided by PT_eqns_surf. Go to that subroutine. Continue to follow the logic until you get the place that defines the residual of the desired equation. In this case, that is yet one more level down in set_Psurf_BC.

There you’ll see the code that indicates that we’ve reached the dv_dt equation in cell 1.

if (s% u_flag) then
   i_eqn = i_du_dt
else
   i_eqn = i_dv_dt
end if

equ(i_eqn,1) = lnP_bc - s% lnP(1)

This is an extremely simple equation. By driving the residual to zero, MESA requires that the surface cell pressure match the imposed boundary condition.

Now, we want to tell MESA to test the partial derivatives. Near the start of the routine, set

test_partials = (s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.

Almost all the equation routines will already have test_partials code like this. And once you reach one routine that has test_partials, most of the routines it calls are usually also set up to check partials. (But if you are in uncharted territory you might have to add them yourself.)

Note

Since PT_eqns_surf and set_Psurf_BC are only called for k=1, we do not test for zone number. In other equations that are evaluated for multiple k, the analogous test will usually look like

test_partials = (k == s% solve_test_partials_k .and. s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.

Set solver_test_partials_val to a variable that is used in calculating the equation residual. Let’s begin with the residual itself, which should reproduce the bad entry from the previous step in this tutorial.

if (test_partials) s% solver_test_partials_val = equ(i_eqn,1)

Find the call on e00 for i_lnd to find the relevant partial with respect to density.

call e00(s, xscale, i_eqn, i_lnd, 1, nvar, dlnP_bc_dlnd - s% chiRho_for_partials(1))

Note

Recall the MESA Jacobian has a block tridiagonal structure (Fig. 47 in MESA II). The e00 idiom refers to forming the block corresponding to the derivatives of the equations in cell k with respect to the variables in cell k. Similarly, em1 and ep1 refer to the the derivatives of the equations in cell k with respect to the variables in cell k-1 and in cell k+1 respectively. The 3rd and 4th arguments to these subroutines indicate that this call is setting the derivative of a specific equation (recall i_eqn = i_dv_dt earlier) with respect to a specific variable (here i_lnd).

Set solver_test_partials_var to the thing you want to take a derivative with respect to and solver_test_partials_dval_dx to the analytic derivative.

if (test_partials) then
   s% solver_test_partials_var = i_lnd
   s% solver_test_partials_dval_dx = dlnP_bc_dlnd - s% chiRho_for_partials(1)
   write(*,*) 'set_Psurf_BC', s% solver_test_partials_var
end if

We changed code in star, so we need to do ./mk && ./export in star and then in the work directory also recompile before we restart ./mk && ./re. We will repeat this recompilation/restart process many times, so do it in a way that makes it easy for you do over and over.

Since we are now specifying which derivative to check in the code, we must also stop selecting the equations/variables in the inlist.

solver_test_partials_equ_name = ''
solver_test_partials_var_name = ''

solver_test_kap_partials = .false.

After that edit, once you recompile and restart you should see

    set_Psurf_BC           1


                  analytic and numeric partials wrt lnd   -2.9597633549276736D-01   -1.7033292388187307D+00
    log dfridr relative uncertainty for numeric partial   -9.6678068357602811D+00
                                               rel_diff    4.7549507665296238D+00

STOP done solver_test_partials

Again, this looks bad! In the final line, we can see that the analytic estimate of the derivative (what goes in the Jacobian) is order-unity different from the numerically evaluated derivative.

Now that we’ve confirmed which derivative is the problem, we must look for which term or terms are responsible. There are only two in this equation, lnP_bc - s% lnP(1), so you can test them one by one

s% solver_test_partials_val = lnP_bc

and

s% solver_test_partials_var = i_lnd
s% solver_test_partials_dval_dx = dlnP_bc_dlnd

Recompile and restart to confirm that this is the bad term.

You can repeat this with the other term to confirm that lnP(1) and its derivative s% chiRho_for_partials(1) are OK. If they weren’t, then it would be off to the EOS to figure out why. We’ll discuss how to dive deeply into a specific module in the next step.

Once you locate the bad term, find where that term is defined, so that you can investigate what is causing it to be in error. On our way down, we saw that dlnP_bc and its derivatives are set in PT_eqns_surf. So now turn off the partial tests in set_Psurf_BC, switch to PT_eqns_surf, turn on the partial tests there.

Note

When you finish testing a partial in one place, don’t forget to turn off the test in that place before moving on to the new place. Typically the code that reports results by setting solver_test_partials_val, solver_test_partials_var, and solver_test_partials_dval_dx, will also have a write that says who it is. This can be a lifesaver if/when you do forget to turn off one routine before turning on another. The terminal output will show the problem and tell you where it is.

We can start by testing the partial of lnP_bc to confirm our previous finding. (Sample code omitted so you can practice setting the variables yourself.) That should give

   set_Psurf_BC           1

                  analytic and numeric partials wrt lnd    7.0157923462638694D-01   -7.0157923451592796D-01
        dfridr relative uncertainty for numeric partial   -1.6481349835630704D-12
                                               rel_diff    2.0000000001574434D+00

STOP done solver_test_partials

and at this point you can clearly see the sign error we introduced. With that confirmed, we want to start dissecting

dlnP_bc_dlnd = dlnP_bc_dlnPsurf*dlnPsurf_dlnkap*dlnkap_dlnd

For more complex expressions involving many terms, you might need to test term by term.

Note

The dfridr machinery in MESA tests the derivatives with respect to structure variables, so it can’t magically check every derivative. To check dlnP_bc_dlnPsurf you can look at the expression relating lnP_bc and lnP_surf in this routine and confirm it with pencil and paper. To check dlnPsurf_dlnkap you would have to follow the code down into atm. Once there, you could check by hand or set up your own dfridr code that passes in varying values of kap and exercises the module independent of star.

Here, since there’s only one dlnd derivative, dlnkap_dlnd is the clear place to start (even if we didn’t already know it was the issue). Set

s% solver_test_partials_val = log(s% opacity(1))

s% solver_test_partials_var = s% i_lnd
s% solver_test_partials_dval_dx = dlnkap_dlnd

and dfridr should confirm that the bad derivative is the density derivative of the opacity.

The surface cell momentum equation is one of the simpler equations. Two other types of derivatives often arise: derivatives with respect to quantities in neighboring cells and derivatives with respect to composition. The energy conservation equation (get1_energy_equation in hydro_equ_l.f90) provides a good example of an equation with both these features.

When testing a derivative like d_esum_dlnTm1, that is a partial of a quantity esum(k) with respect to lnT(k-1), note that the dfridr code perturbs the structure of cell k = solver_test_partials_k. That is, it can test the partials of equations in cell k-1, k, and k+1 with respect to the structure variables in cell k. (In practice if there is a problem with partials of a quantity in cell k then there are also problems in cells k+1 and k-1. But if you really need to test one of these derivatives in a specific cell, you can always adjust solver_test_partials_k in the inlist.)

Therefore, to check d_esum_dlnTm1 one can do

test_partials = (k-1 == s% solver_test_partials_k .and. s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.

and

s% solver_test_partials_val = esum
s% solver_test_partials_var = s% i_lnT
s% solver_test_partials_dval_dx = d_esum_dlnTm1

This will check the derivative of esum in cell solver_test_partials_k+1 with respect to lnT in cell solver_test_partials_k.

Checking composition derivatives requires setting a sink isotope such that the dfridr perturbations to the chosen isotope are offset by changes in a “sink” isotope to keep the mass fraction sum = 1. The sink needs to be inert in the current situation and have enough abundance to absorb the changes.

When debugging from via the inlist, this isotope can be specified using the control solver_test_partials_sink_name. So for example, addition to setting the usual k, call_number, iter_number, and dx_0, to check partials of the energy equation with respect he4, you might set the following

solver_test_partials_equ_name = 'dlnE_dt'
solver_test_partials_var_name = 'he4'
solver_test_partials_sink_name = 'fe56'

If solver_test_partials_dx_0 is too large, you’ll get an error if the mg24 mass fraction is too small to deal with the desired increase in the targeted isotope.

When you need to check composition derivatives of specific terms, you must pick the various test_partials values in the source code. This can be tricky in that in some places the abundance is given as an index in xa and in others as a variable number for the solver. This offset is s% nvar_hydro, that is the index of the first chem equation is s% nvar_hydro + 1.

If you set solver_test_partials_write_eos_call_info = .true., the output includes also includes composition information like the following so it is easy to find the needed numbers.

xa(j,k) neut           1           6        3312    1.0010960558047912D-99
  xa(j,k) h1           2           7        3312    0.0000000000000000D+00
xa(j,k) prot           3           8        3312    1.0010960558047908D-99
 xa(j,k) he3           4           9        3312    0.0000000000000000D+00
 xa(j,k) he4           5          10        3312    6.7423592032960378D-01
 xa(j,k) c12           6          11        3312    1.0630746138489515D-01
 xa(j,k) n14           7          12        3312    6.1988280539418319D-03
 xa(j,k) o16           8          13        3312    2.0390257320339500D-01
xa(j,k) ne20           9          14        3312    5.7660956389211995D-03
...
xa(j,k) fe56          20          25        3312    1.4153093511237100D-03

So then the in the code you would do something like

if (test_partials) then
   s% solver_test_partials_dx_sink = 20 ! fe56 (index in xa)
   s% solver_test_partials_var = 10 ! he4 (variable number)
   s% solver_test_partials_dval_dx = s% d_epsnuc_dx(5,k) ! he4 (index in xa)
   write(*,*) 'get1_energy_eqn', s% solver_test_partials_var
end if

Diagnosing Meshing Problems

Note

This was the file DEBUG_mesh.notes. It needs modernized.

to get info about why the mesh is doing what it is doing,

in controls section of inlist:

show_mesh_changes = .true.

run to get current mesh call number, n

in controls section of inlist:

mesh_dump_call_number = n

run to get plot data

mesh_plan.rb shows input data that was used to form the plan

mesh.rb shows new mesh (with some old values interpolated to the new mesh for comparison)

Floating point exceptions

Floating point exceptions (fpe) occur when the code attempts to perform an illegal math operation, for instance x/0, sqrt(-1), or log(0). In these cases any number that uses the result of one of these expressions is also undefined.

To test MESA for floating point exceptions we can turn on the compiler checks that will flag when an issue occurs:

Set the environment variable

MESA_FPE_CHECKS_ON=1

Then run ./clean and ./install in MESA_DIR.

This setting adds the following options in utils/makefile_header

   FCbasic += $(FCtrapNANs) -fbounds-check -finit-derived -Wuninitialized -Warray-bounds -fstack-protector-all -D_FORTIFY_SOURCE=2
   FCopt = -O0 -ftree-vectorize 
   SKIP_NAN_TRAPS = NO

and acts as if you set the controls inlist option

fill_arrays_with_nans = .true.

to make sure you catch any uninitialized array accesses.

Step -1: Introduce a bug (again)

Lets add a fpe in a test case to see what happens:

Alter star/test_suite/1.3M_ms_high_Z/src/run_star_extras.f90 and in extras_finish_step add:

write(*,*) log(-1d0*s% model_number)

Step 0: Notice your code has a problem

Run ./mk and ./rn to run the test case and you’ll get the following output:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7efca94096af in ???
#1  0x7efca9cc6330 in ???
#2  0x405f10 in __run_star_extras_MOD_extras_finish_step
      at ../src/run_star_extras.f90:155
#3  0x425372 in __run_star_support_MOD_after_step_loop
      at ../job/run_star_support.f90:634
#4  0x427d63 in __run_star_support_MOD_run1_star
      at ../job/run_star_support.f90:162
#5  0x407076 in __run_star_MOD_do_run_star
      at ../../../../star/job/run_star.f90:26
#6  0x4070eb in run
      at ../src/run.f90:13
#7  0x407137 in main
      at ../src/run.f90:2
./rn1: line 6: 2166962 Floating point exception(core dumped) ./star

Step 1 Diagnose the issue

The important line is the first line that does not include ???, in this case it is

#2  0x405f10 in __run_star_extras_MOD_extras_finish_step
      at ../src/run_star_extras.f90:155

This is telling us that the problem occurs at line 155 in ../src/run_star_extras.f90

Note

The ../src is because the file path is relative to the make/ folder in your work folder

At this point you should go to that line and work out why the number is undefined.

Uninitialised variables

Uninitialised variables occur when we use a variable before it has been defined, for example

real(dp) :: x,y
y = 2 * x

If you are lucky then x will be zero, if you are unlucky it can be any random number that happened to be in the memory location that x occupies. This can lead to results that change either each time the model runs or gives different values between different machines.

By default MESA compiles with the -finit-real=snan option, this will set all scalar floating point variables to a NaN when they are declared.

Note

This does not protect integers

As the variable will be a NaN we can use the floating point exceptions mechanism to track them down, so follow the instructions for debugging floating point exceptions.

Arrays of floating point numbers

These can also be protected by wrapping the array after its declared with:

call fill_with_NaNs(array)

use utils_lib, only:: fill_with_NaNs
real(dp),dimension(:),allocatable :: x

allocate(x(1:10))
call fill_with_NaNs(x)

Note

Never try to fill the array yourself with a NaN, say by setting each element to 1/0. We want to track down the location where an array element is being used before being set, not the point where we set the element to a NaN. If you set the array yourself the fpe checking will warn you about that location, not where the array is being before being set. fill_with_NaNs does some clever tricks to set each element to the bit pattern that corresponds to a NaN, which does not trigger the fpe checks.

Debugging

General Techniques

Navigating through source code

Using a debugger

Using pgstar

Using a profiler

Diagnosing Solver Struggles

Step -1: Introduce a bug

Step 0: Notice your model has problems

Step 1: Activate debugging options

Step 2: Run the model and find the bad spot

Step 3: Check the partial derivatives

Step 5: Descend into an individual module

Advanced: Investigate the bad cell and equation in detail

Diagnosing Meshing Problems

Floating point exceptions

Step -1: Introduce a bug (again)

Step 0: Notice your code has a problem

Step 1 Diagnose the issue

Uninitialised variables

Arrays of floating point numbers

Memory leaks