# Debugging

Debugging MESA requires effort and tenacity. There is no single approach that is applicable in all circumstances. This page collects information on frequently useful tools and techniques.

## General Techniques

### Using a debugger

The MESA SDK contains the GNU debugger gdb. Detailed instructions for using `gdb`

are beyond the scope of the documentation, but it allows you to interactively set breakpoints, step through code, and inspect values.

### Using pgstar

The built in pgstar plotting is itself an invaluable a debugging tool. Something may jump out from a plot as strange behavior before it shows up as a problem in the run. You may notice that an issue begins when the model reaches certain conditions and that can provide a useful guide to the problem. After you identify an issue and configure some useful plots, consider setting `pgstar_interval = 1`

and `pause = .true.`

so that you can step through the model timestep-by-timestep.

### Using a profiler

Troubleshooting performance bugs or memory leaks benefits from specialized tools (see Profiling).

## Diagnosing Solver Struggles

Identifying when the solver is having difficulties and understanding the root cause is an important subset of MESA debugging. When you see a MESA model experiencing problems (e.g., taking unexpectedly small timesteps due to many retries), you should at the least report the issue to the developers’ list and ideally investigate it a bit yourself. Paying attention to such problems has proved a valuable way of identifying errors and inconsistencies within MESA.

Diagnosing solver struggles primarily relies on internal MESA tooling. The most common cause of solver struggles is incorrect/inaccurate derivatives. This document will illustrate how to go from a failing model down into the depths of the code in order to identify and fix the problem.

### Step -1: Introduce a bug

For the purpose of this tutorial, we will introduce a bug in MESA. In more typical circumstances, you or someone else will have kindly already taken care of this step.

Navigate to the subroutine `Get_kap_Results`

in `kap/private/kap_eval.f90`

.
After the call to `Get_kap_Results_blend_T`

, add

```
dlnkap_rad_dlnRho = -dlnkap_rad_dlnRho
```

thereby making the derivatives of the radiative opacity with respect to density incorrect.

After you’ve make this change, in `kap`

do

```
./mk
./export
```

to recompile and export the module.

### Step 0: Notice your model has problems

There are many ways in which a MESA model may have problems. It might
fail to run with gold tolerances or it might take unexpectedly small
timesteps. The timestep is often driven downwards by repeated
retries. Information about the retry trigger is printed to the
terminal. Or, if the solver requires more iterations than a specified
limit (`solver_iters_timestep_limit`

and related controls), it also
reduces the timestep and indicates the `dt_limit`

as `solver iters`

.

Note

The debugging steps that follow are not generally applicable when the solver is performing well. For other kinds of bugs (e.g., an option doesn’t produce the expected behavior) one often has clearer idea where the problem may be and so you can head straight to that part of the code.

We will use the `1.3M_ms_high_Z`

test suite case to illustrate the
effect of the bug that we introduced.

In r14460, this case took 377 steps, required 0 retries, and there
were a total of 1850 iterations of the newton solver. With the bug,
this test case stops when it reaches `max_model_number`

after 500
steps, having done 12 retries and a total of 8145 newton iterations.

Navigate to this test (in `star/test_suite`

) and run it. In the
terminal output, you can see that the timesteps are often set by
`solver iters`

.

```
step lg_Tmax Teff lg_LH lg_Lnuc Mass H_rich H_cntr N_cntr Y_surf eta_cntr zones retry
lg_dt_yr lg_Tcntr lg_R lg_L3a lg_Lneu lg_Mdot He_core He_cntr O_cntr Z_surf gam_cntr iters
age_yr lg_Dcntr lg_L lg_LZ lg_Lphoto lg_Dsurf C_core C_cntr Ne_cntr Z_cntr v_div_cs dt_limit
__________________________________________________________________________________________________________________________________________________
10 7.186162 6026.290 0.310381 0.310381 1.300000 1.300000 0.639691 0.003250 0.320000 -2.349937 805 0
4.820702 7.186162 0.119685 -44.790170 -1.003419 -99.000000 0.000000 0.320132 0.018721 0.040000 0.071396 16
8.4252E+05 1.739919 0.314263 -14.875409 -99.000000 -7.019694 0.000000 0.005826 0.004199 0.040176 0.000E+00 solver iters
rel_E_err 5.8461078509513597D-13
log_rel_run_E_err -11.0597023512447645
13 7.185990 6017.566 0.300802 0.300802 1.300000 1.300000 0.639670 0.003358 0.320000 -2.355541 805 0
4.873378 7.185990 0.113850 -44.803173 -1.013273 -99.000000 0.000000 0.320139 0.018721 0.040000 0.071285 15
1.0438E+06 1.737310 0.300077 -14.750470 -99.000000 -7.010156 0.000000 0.005733 0.004199 0.040192 0.000E+00 varcontrol
rel_E_err 7.3563886572871659D-13
log_rel_run_E_err -10.9707092147706380
save LOGS/profile2.data for model 13
20 7.185235 5994.060 0.274258 0.274258 1.300000 1.300000 0.639610 0.003646 0.320000 -2.362919 806 0
4.851161 7.185235 0.101141 -44.848473 -1.043016 -99.000000 0.000000 0.320157 0.018721 0.040000 0.071188 18
1.5910E+06 1.733084 0.267860 -15.352530 -99.000000 -6.988274 0.000000 0.005486 0.004199 0.040233 0.000E+00 solver iters
rel_E_err 5.1727204553814077D-13
log_rel_run_E_err -10.1566890117973099
```

### Step 1: Activate debugging options

There are standard inlist commands available that can reveal more about
the struggle that the solver is going through. They are stored in the
aptly-named `debugging_stuff_for_inlists`

which lives in
`star/test_suite`

directory. The full list of commands there is

```
!photo_interval = 1
!profile_interval = 1
!history_interval = 1
!terminal_interval = 1
! FOR DEBUGGING
!report_solver_progress = .true. ! set true to see info about solver iterations
!report_ierr = .true. ! if true, produce terminal output when have some internal error
!stop_for_bad_nums = .true.
!trace_evolve = .true.
!fill_arrays_with_NaNs = .true.
!solver_save_photo_call_number = 0
! Saves a photo when solver_call_number = solver_save_photo_call_number - 1
! e.g., useful for testing partials to set solver_call_number = solver_test_partials_call_number - 1
!solver_test_partials_call_number = 4142
!solver_test_partials_k = 2616
!solver_test_partials_iter_number = 4
!solver_test_partials_dx_0 = 1d-6
!solver_test_partials_var_name = 'o16' ! 'all' or 'lnd', 'lnT', 'lnR', 'L', 'v', etc. '' means code sets
!solver_test_partials_equ_name = 'lnP' ! 'all' or 'dlnE_dt', 'dlnd_dt', 'dlnR_dt', 'equL', etc '' means code sets
!solver_test_partials_sink_name = 'si28' ! iso name to use for "sink" to keep sum = 1
!solver_test_partials_show_dx_var_name = 'h1'
! equ name can also be one of these
! 'lnE', 'lnP', 'grad_ad' to test eos
! 'eps_nuc' to test net
! 'non_nuc_neu' to test neu
! 'gradT', 'mlt_vc' to test mlt
! 'opacity' to test kap
!solver_test_partials_write_eos_call_info = .true.
!solver_test_partials_k_low = -1
!solver_test_partials_k_high = -1
!solver_test_eos_partials = .true.
!solver_test_kap_partials = .true.
!solver_test_net_partials = .true.
!solver_test_atm_partials = .true.
!report_all_dt_limits = .true.
!report_solver_dt_info = .true.
!show_mesh_changes = .true.
!mesh_dump_call_number = 5189
!okay_to_remesh = .false.
!energy_conservation_dump_model_number = -1
!use_DGESVX_in_bcyclic = .true.
!use_equilibration_in_DGESVX = .true.
!report_min_rcond_from_DGESXV = .true.
! solver debugging
!solver_check_everything = .true.
!solver_epsder_struct = 1d-6
!solver_epsder_chem = 1d-6
!report_solver_dt_info = .true.
!report_dX_nuc_drop_dt_limits = .true.
!report_bad_negative_xa = .true.
```

Copy the contents of `debugging_stuff_for_inlists`

to the controls
section of your inlist. Note that many test suite inlists already
contain these options.

Begin by uncommenting the following lines to get info about the newton solver progress:

```
report_solver_progress = .true. ! set true to see info about newton iterations
report_ierr = .true. ! if true, produce terminal output when have some internal error
```

### Step 2: Run the model and find the bad spot

Running the model again will now produce a lot more output, giving information about each and every call to the solver. That output will look something like this:

```
solver_call_number 17
gold tolerances level 2
correction tolerances: norm, max 3.0000000000000001D-05 3.0000000000000001D-03
tol1 residual tolerances: norm, max 9.9999999999999994D-12 1.0000000000000001D-09
17 1 coeff 1.0000 avg resid 0.518E-05 max resid dv_dt 1 0.111E-01 mix type xx000 avg corr 0.474E-03 max corr lnd 556 0.710E-02 mix type 00000 avg+max corr+resid
17 2 coeff 1.0000 avg resid 0.561E-06 max resid dv_dt 1 0.275E-02 mix type xx000 avg corr 0.140E-03 max corr lnd 493 0.140E-02 mix type 11100 avg corr, avg+max resid
17 3 coeff 1.0000 avg resid 0.384E-06 max resid dv_dt 1 0.521E-02 mix type xx000 avg corr 0.262E-04 max corr lnd 494 0.253E-03 mix type 11000 avg+max resid
17 4 coeff 1.0000 avg resid 0.681E-06 max resid dv_dt 1 0.819E-02 mix type xx000 avg corr 0.560E-04 max corr lnd 494 -0.555E-03 mix type 11000 avg corr, avg+max resid
tol2 residual tolerances: norm, max 1.0000000000000000D-08 1.0000000000000001D-05
17 5 coeff 1.0000 avg resid 0.104E-05 max resid dv_dt 1 0.121E-01 mix type xx000 avg corr 0.780E-04 max corr lnd 494 0.758E-03 mix type 11000 avg corr, avg+max resid
17 6 coeff 1.0000 avg resid 0.159E-05 max resid dv_dt 1 0.177E-01 mix type xx000 avg corr 0.113E-03 max corr lnd 494 -0.110E-02 mix type 11000 avg corr, avg+max resid
17 7 coeff 1.0000 avg resid 0.251E-05 max resid dv_dt 1 0.263E-01 mix type xx000 avg corr 0.166E-03 max corr lnd 494 0.161E-02 mix type 11000 avg corr, avg+max resid
17 8 coeff 1.0000 avg resid 0.403E-05 max resid dv_dt 1 0.383E-01 mix type xx000 avg corr 0.246E-03 max corr lnd 494 -0.238E-02 mix type 11000 avg corr, avg+max resid
17 9 coeff 0.5000 avg resid 0.185E-05 max resid dv_dt 1 0.936E-02 mix type xx000 avg corr 0.359E-03 max corr lnd 494 0.348E-02 mix type 11000 avg+max corr+resid
17 10 coeff 0.5000 avg resid 0.777E-06 max resid dv_dt 1 0.227E-02 mix type xx000 avg corr 0.857E-04 max corr lnd 494 -0.827E-03 mix type 11000 avg corr, avg+max resid
17 11 coeff 1.0000 avg resid 0.262E-06 max resid dv_dt 1 0.329E-02 mix type xx000 avg corr 0.226E-04 max corr lnd 494 0.220E-03 mix type 11000 avg+max resid
17 12 coeff 0.5000 avg resid 0.660E-07 max resid dv_dt 1 0.754E-03 mix type xx000 avg corr 0.306E-04 max corr lnd 494 -0.296E-03 mix type 11000 avg corr, avg+max resid
17 13 coeff 0.5000 avg resid 0.190E-07 max resid dv_dt 1 0.185E-03 mix type xx000 avg corr 0.712E-05 max corr lnd 494 0.691E-04 mix type 11000 avg+max resid
17 14 coeff 0.5000 avg resid 0.540E-08 max resid dv_dt 1 0.409E-04 mix type xx000 avg corr 0.172E-05 max corr lnd 494 -0.166E-04 mix type 11000 avg+max resid
17 15 coeff 0.5000 avg resid 0.200E-08 max resid dv_dt 1 0.106E-04 mix type xx000 avg corr 0.382E-06 max corr lnd 494 0.369E-05 mix type 11000 max resid
17 16 coeff 0.5000 avg resid 0.754E-09 max resid dv_dt 1 0.216E-05 mix type xx000 avg corr 0.103E-06 max corr lnd 494 -0.100E-05 mix type 11000 okay!
```

The first line gives `solver_call_number`

, which uniquely identifies
a particular solver call. (It will differ from the model number after
retries have occurred.) Then, there is an indication of what set of
solver tolerances are being used and what those tolerances are. After
that, there is a series of lines that gives iteration-by-iteration
summaries of what the solver did. The iteration number is the second
column. After certain numbers of iterations (user-defined, see e.g.,
`gold_iter_for_resid_tol2`

), the solver relaxes the residual
tolerances and the new tolerances are indicated.

Other columns show information about the residuals and corrections.
The “residual” is the left over difference between the left and right
hand sides of the equation we are trying to solve. We do iterations to
reduce that, but we are limited by the non-linearity of the problem
and the quality of the estimates for the derivatives. The
“correction” is the change in the primary variable, calculated using
Newton’s method, so the Jacobian and residuals give a correction that
would make the next residual vanish if the problem were linear and the
Jacobian were exact, neither of which are true. So the best we can
hope for is that the corrections will get smaller next iteration.
When the maximum residual/correction is shown, the equation/variable
and its cell index are indicated. The mixing types for the cells
around these locations are also indicated. The 1’s are the convection
and the 0’s are for no-mixing. (The full list of mixing types is in
`const_def`

.)

The last column is the simplest to understand. It tells us that for the early iterations, the solution was rejected because most everything was unsatisfactory. Then, as the iterations proceed and the residuals and corrections shrink, eventually only the maximum residual is unsatisfactory. Finally, the solver decides the solution is good enough.

Note

Less artificial bugs may behave somewhat differently than this toy problem. In this example, the solver is pretty much always struggling, so you could pick almost any step as the bad one. In practice, models often only experience problems under specific conditions and so a much smaller subset of steps are bad. Generally, you want to pick a step just as things start to go wrong. In this example, the solver also eventually succeeded. In practice, one often see cases where the newton iterations will completely fail to drive the max residual below the tolerances and the solver will give up, forcing a retry and reducing the timestep.

### Step 3: Check the partial derivatives

From the terminal output in the previous part, we can see that the max
residual of `dv_dt`

equation is difficult to reduce and is located
in cell 1.

We want to find the source of this bad residual by checking
correctness of the partial derivatives of the corresponding equation.
MESA can compare the analytic derivative that it has been given with a
numerical estimate of the derivative based on finite differences and
Ridder’s method of polynomial extrapolation (aka `dfridr`

). To do
so, we must ask for the derivatives at a particular cell at a
particular iteration number of a particular solver call. This can be
achieved with the inlist options:

```
solver_test_partials_call_number = 17
solver_save_photo_call_number = 17
solver_test_partials_iter_number = 8
solver_test_partials_k = 1
solver_test_partials_equ_name = 'dv_dt'
solver_test_partials_var_name = 'all'
solver_test_partials_dx_0 = 1d-6
```

Setting `solver_save_photo_call_number = 17`

produces a
photo at the preceding step (`x016`

) and so re-running the bad step
becomes as easy as doing `./re`

. Debugging is itself an iterative
process, so you want to make sure you can rapidly and consistently
reproduce the problem. A little time invested up front letting you
easily and rapidly trigger the problem pays dividends when you have
repeat the procedure a dozen times.

Now when you do `./rn`

(or `./re`

after the photo exists) and the
model will stop after the specified solver call and you’ll see

```
equ name dv_dt
***** log dfridr rel_diff partials wrt lnd(k) of dv_dt(k) 1 0.677 log uncertainty -9.668 analytic -2.9597633549276736E-01 numeric -1.7033292388187307E+00
log dfridr rel_diff partials wrt lnd(k+1) of dv_dt(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
log dfridr rel_diff partials wrt lnT(k) of dv_dt(k) 1 -10.058 log uncertainty -10.644 analytic -6.9708999322145324E+00 numeric -6.9708999316051248E+00
log dfridr rel_diff partials wrt lnT(k+1) of dv_dt(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
????? log dfridr rel_diff partials wrt L(k) of dv_dt(k) 1 -4.255 log uncertainty -4.272 analytic 6.3131382934316702E-39 numeric 6.3127875526588140E-39
log dfridr rel_diff partials wrt L(k+1) of dv_dt(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
log dfridr rel_diff partials wrt lnR(k) of dv_dt(k) 1 -11.110 log uncertainty -10.064 analytic -2.0080506394821409E+00 numeric -2.0080506394665569E+00
log dfridr rel_diff partials wrt lnR(k+1) of dv_dt(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
```

This output shows the derivatives in the cell for each equation with
respect to each variable. The column after the cell number shows the
relative difference between the numerical (`dfridr`

) and analytic
derivatives. When things are working well, the relative differences
go down to < 1d-12, but don’t panic if you see 1d-7. That can still
work. But 1d-4 or more is a bad sign. Here, our bad derivative jumps
out with an order unity relative error and is highlighted by the
`****`

on the left side. So we now know the bad derivative is the
one with respect to density.

The relative uncertainty in the numerical estimate of the derivative
is also indicated. When this is comparable to the relative error
between the numerical and analytic derivatives, then you’re probably
not learning anything about the quality of the analytic derivative.
(See for example the line marked with `????`

above.) Sometimes that
sort of result can be a false alarm caused by the particular choice of
starting difference for the dfridr search. This is set by
`solver_test_partials_dx_0`

and is typically something around 1d-6.
Usually that works, but if you get a large err that makes no sense,
you should try both larger and smaller values of to see if the dfridr
result changes.

It is often the case that bad derivatives in an equation come from bad
derivatives in the microphysical inputs that enter into the equation.
Since this is common, MESA supports easily checking the basic `eos`

,
`kap`

, and `net`

quantities. In particular, the value of
`solver_test_partials_equ_name`

can be set to `'lnE'`

or `'lnP'`

(testing `eos`

), to `'eps_nuc'`

(testing `net`

), or
`'opacity'`

(testing `kap`

).

Apply your knowledge about which equation showed the issue. If the
problem were in the energy equation, you would probably want to test
`lnE`

and `eps_nuc`

. Since this issue appeared in the momentum
equation, we will want to test `lnP`

and since it was in the surface
cell, we will want to test `opacity`

as that enters into the
atmosphere boundary condition.

We already found the issue was in the density derivatives, so we can
now restrict our attention to that derivative by setting
`solver_test_partials_var_name = 'lnd'`

.

Setting `solver_test_partials_equ_name == 'lnP'`

and restarting
shows no problem

```
log dfridr rel_diff partials wrt lnd(k) of lnP(k) 1 -9.791 log uncertainty -12.139 analytic 9.9965278727210283E-01 numeric 9.9965278743368469E-01
log dfridr rel_diff partials wrt lnd(k+1) of lnP(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
```

but `solver_test_partials_equ_name == 'opacity'`

and restarting shows

```
***** log dfridr rel_diff partials wrt lnd(k) of opacity(k) 1 0.301 log uncertainty -9.108 analytic -2.9110358120328123E-01 numeric 2.9110358102038758E-01
log dfridr rel_diff partials wrt lnd(k+1) of opacity(k) 1 -99.000 log uncertainty -99.000 analytic 0.0000000000000000E+00 numeric 0.0000000000000000E+00
```

Comparing the bad analytic derivative to the numerical one reveals the sign error that we introduced.

We’ve reached the end of the part of debugging that can be done with inlists only. We will need to edit the code in order to MESA to dive deeper into a module or to check individual terms in the equation derivatives.

As a test case maintainer, this may be the point at which you stop,
report the issue, and turn things over to others. If the issue is in
the `eos`

, you might dig in a bit more to see which component of the
eos is active by setting ```
solver_test_partials_write_eos_call_info =
.true.
```

and include that info in your report.

### Step 5: Descend into an individual module

If the bug has already been found, this step isn’t necessary, but
often one must continue deeper into a module. For the modules
`atm`

, `eos`

, `kap`

, `net`

, and `rates`

, it is possible to
continue to do some of this debugging from star. This is extremely
helpful because it means that the precise properties of the
troublesome cell from star are passed into the module.

In the inlist set:

```
solver_test_kap_partials = .true.
```

(Note that using this option automatically sets MESA to run in single-thread mode.)

Since will specify which quantity to check in the code, we should stop selecting the equation in the inlist, but can continue to specify which derivative of that quantity to check:

```
solver_test_partials_equ_name = ''
solver_test_partials_var_name = 'lnd'
```

The variables `kap_test_partials_val`

and
`kap_test_partials_dval_dx`

can now be set within `kap`

and
thereby communicate their values into star and dfridr.

At the end of `Get_kap_results`

in `kap/private/kap_eval.f90`

—
near where we introduced our bug — add

```
kap_test_partials_val = log(kap_rad)
kap_test_partials_dval_dx = dlnkap_rad_dlnRho
```

Again, the key point of this step is that we’re passing up a value
(`kap_rad`

) that we wouldn’t normally have access to from star.
This procedure allows you to debug deep within a module from star.

We changed code in `kap`

, so we need to do `./mk && ./export`

in
`star`

and then in the work directory also recompile before we
restart `./mk && ./re`

.

The output should clearly illustrate that this derivative has the
wrong sign. Before making a change that corrects the value of
`dlnkap_rad_dlnRho`

, consider putting what you think the right
answer is in `kap_test_partials_dval_dx`

and re-running to check
that your new expression is OK. (If you first fix the bad derivative
at the location of the bug, that change will mean that the newton
iterations won’t be the same any more and you may not hit the same
conditions and or trigger `dfridr`

.)

After you’ve checked, make the bug fix, turn off the debugging options, run your model, and hopefully see improved performance.

### Advanced: Investigate the bad cell and equation in detail

Note

We will now pretend we didn’t find the bug and re-use this setup to illustrate how to dig in at an even deeper level.

MESA equations and structure variables have integer indices. The full
list is in `star_data/public/star_data.inc`

, but the most common
ones are

```
! index definitions for the variables (= 0 if variable not in use)
integer :: i_lnd ! ln(cell density average by mass)
integer :: i_lnT ! ln cell temperature average by mass
integer :: i_lnR ! ln radius at outer face of cell
integer :: i_lum ! luminosity at outer face of cell
integer :: i_v ! Lagrangian velocity at outer face of cell
integer :: i_u ! Lagrangian velocity at center of cell
integer :: i_lnPgas ! ln cell gas pressure average by mass
! index definitions for the equations (= 0 if equation not in use)
integer :: i_dlnR_dt ! dlnR/dt = v/r
integer :: i_dv_dt ! momentum conservation using v
integer :: i_du_dt ! momentum conservation using u
integer :: i_dlnd_dt ! mass conservation
integer :: i_dlnE_dt ! energy conservation
integer :: i_equL ! luminosity
```

In Step 3, we found that we want to investigate the `lnd`

derivative
of the `dv_dt`

equation in cell `k = 1`

. If we hadn’t been able
to so easily narrow things down to `kap`

, then we would need to make
a more detailed investigation of the equation in this cell.

To do so, go to `star/private/hydro_eqns.f90`

and follow the flow
subroutine `eval_equ_for_solver`

until you reach the equation that
you want investigate.

The surface cell (`k = 1`

) is special and so we see that the
`dv_dt`

equation for this cell is provided by `PT_eqns_surf`

. Go
to that subroutine. Continue to follow the logic until you get the
place that defines the residual of the desired equation. In this
case, that is yet one more level down in `set_Psurf_BC`

.

There you’ll see the code that indicates that we’ve reached the
`dv_dt`

equation in cell 1.

```
if (s% u_flag) then
i_eqn = i_du_dt
else
i_eqn = i_dv_dt
end if
equ(i_eqn,1) = lnP_bc - s% lnP(1)
```

This is an extremely simple equation. By driving the residual to zero, MESA requires that the surface cell pressure match the imposed boundary condition.

Now, we want to tell MESA to test the partial derivatives. Near the start of the routine, set

```
test_partials = (s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.
```

Almost all the equation routines will already have `test_partials`

code like this. And once you reach one routine that has
`test_partials`

, most of the routines it calls are usually also set
up to check partials. (But if you are in uncharted territory you
might have to add them yourself.)

Note

Since `PT_eqns_surf`

and `set_Psurf_BC`

are only called for
`k=1`

, we do not test for zone number. In other equations that
are evaluated for multiple `k`

, the analogous test will usually
look like

```
test_partials = (k == s% solve_test_partials_k .and. s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.
```

Set `solver_test_partials_val`

to a variable that is used in
calculating the equation residual. Let’s begin with the residual
itself, which should reproduce the bad entry from the previous step in
this tutorial.

```
if (test_partials) s% solver_test_partials_val = equ(i_eqn,1)
```

Find the call on `e00`

for `i_lnd`

to find the relevant partial with respect to density.

```
call e00(s, xscale, i_eqn, i_lnd, 1, nvar, dlnP_bc_dlnd - s% chiRho_for_partials(1))
```

Note

Recall the MESA Jacobian has a block tridiagonal structure (Fig. 47
in MESA II). The `e00`

idiom refers to forming the block
corresponding to the derivatives of the equations in cell `k`

with respect to the variables in cell `k`

. Similarly, `em1`

and `ep1`

refer to the the derivatives of the equations in cell
`k`

with respect to the variables in cell `k-1`

and in cell
`k+1`

respectively. The 3rd and 4th arguments to these
subroutines indicate that this call is setting the derivative of a
specific equation (recall `i_eqn = i_dv_dt`

earlier) with respect
to a specific variable (here `i_lnd`

).

Set `solver_test_partials_var`

to the thing you want to take a
derivative with respect to and `solver_test_partials_dval_dx`

to the
analytic derivative.

```
if (test_partials) then
s% solver_test_partials_var = i_lnd
s% solver_test_partials_dval_dx = dlnP_bc_dlnd - s% chiRho_for_partials(1)
write(*,*) 'set_Psurf_BC', s% solver_test_partials_var
end if
```

We changed code in star, so we need to do `./mk && ./export`

in
`star`

and then in the work directory also recompile before we
restart `./mk && ./re`

. We will repeat this recompilation/restart
process many times, so do it in a way that makes it easy for you do
over and over.

Since we are now specifying which derivative to check in the code, we must also stop selecting the equations/variables in the inlist.

```
solver_test_partials_equ_name = ''
solver_test_partials_var_name = ''
solver_test_kap_partials = .false.
```

After that edit, once you recompile and restart you should see

```
set_Psurf_BC 1
analytic and numeric partials wrt lnd -2.9597633549276736D-01 -1.7033292388187307D+00
log dfridr relative uncertainty for numeric partial -9.6678068357602811D+00
rel_diff 4.7549507665296238D+00
STOP done solver_test_partials
```

Again, this looks bad! In the final line, we can see that the analytic estimate of the derivative (what goes in the Jacobian) is order-unity different from the numerically evaluated derivative.

Now that we’ve confirmed which derivative is the problem, we must look
for which term or terms are responsible. There are only two in this
equation, `lnP_bc - s% lnP(1)`

, so you can test them one by one

```
s% solver_test_partials_val = lnP_bc
```

and

```
s% solver_test_partials_var = i_lnd
s% solver_test_partials_dval_dx = dlnP_bc_dlnd
```

Recompile and restart to confirm that this is the bad term.

You can repeat this with the other term to confirm that `lnP(1)`

and
its derivative `s% chiRho_for_partials(1)`

are OK. If they weren’t,
then it would be off to the EOS to figure out why. We’ll discuss how
to dive deeply into a specific module in the next step.

Once you locate the bad term, find where that term is defined, so that
you can investigate what is causing it to be in error. On our way
down, we saw that `dlnP_bc`

and its derivatives are set in
`PT_eqns_surf`

. So now turn off the partial tests in
`set_Psurf_BC`

, switch to `PT_eqns_surf`

, turn on the partial
tests there.

Note

When you finish testing a partial in one place, don’t forget to
turn off the test in that place before moving on to the new place.
Typically the code that reports results by setting
`solver_test_partials_val`

, `solver_test_partials_var`

, and
`solver_test_partials_dval_dx`

, will also have a write that says
who it is. This can be a lifesaver if/when you do forget to turn
off one routine before turning on another. The terminal output
will show the problem and tell you where it is.

We can start by testing the partial of `lnP_bc`

to confirm our
previous finding. (Sample code omitted so you can practice setting
the variables yourself.) That should give

```
set_Psurf_BC 1
analytic and numeric partials wrt lnd 7.0157923462638694D-01 -7.0157923451592796D-01
dfridr relative uncertainty for numeric partial -1.6481349835630704D-12
rel_diff 2.0000000001574434D+00
STOP done solver_test_partials
```

and at this point you can clearly see the sign error we introduced. With that confirmed, we want to start dissecting

```
dlnP_bc_dlnd = dlnP_bc_dlnPsurf*dlnPsurf_dlnkap*dlnkap_dlnd
```

For more complex expressions involving many terms, you might need to test term by term.

Note

The `dfridr`

machinery in MESA tests the derivatives with
respect to structure variables, so it can’t magically check every
derivative. To check `dlnP_bc_dlnPsurf`

you can look at the
expression relating `lnP_bc`

and `lnP_surf`

in this routine
and confirm it with pencil and paper. To check
`dlnPsurf_dlnkap`

you would have to follow the code down into
`atm`

. Once there, you could check by hand or set up your own
`dfridr`

code that passes in varying values of `kap`

and
exercises the module independent of star.

Here, since there’s only one `dlnd`

derivative, `dlnkap_dlnd`

is
the clear place to start (even if we didn’t already know it was the
issue). Set

```
s% solver_test_partials_val = log(s% opacity(1))
s% solver_test_partials_var = s% i_lnd
s% solver_test_partials_dval_dx = dlnkap_dlnd
```

and `dfridr`

should confirm that the bad derivative is the density
derivative of the opacity.

The surface cell momentum equation is one of the simpler equations.
Two other types of derivatives often arise: derivatives with respect
to quantities in neighboring cells and derivatives with respect to
composition. The energy conservation equation
(`get1_energy_equation`

in `hydro_equ_l.f90`

) provides a good
example of an equation with both these features.

When testing a derivative like `d_esum_dlnTm1`

, that is a partial of
a quantity `esum(k)`

with respect to `lnT(k-1)`

, note that the
`dfridr`

code perturbs the structure of cell ```
k =
solver_test_partials_k
```

. That is, it can test the partials of
equations in cell `k-1`

, `k`

, and `k+1`

with respect to the
structure variables in cell `k`

. (In practice if there is a problem
with partials of a quantity in cell `k`

then there are also problems
in cells `k+1`

and `k-1`

. But if you really need to test one of
these derivatives in a specific cell, you can always adjust
`solver_test_partials_k`

in the inlist.)

Therefore, to check `d_esum_dlnTm1`

one can do

```
test_partials = (k-1 == s% solver_test_partials_k .and. s% solver_iter == s% solver_test_partials_iter_number)
!test_partials = .false.
```

and

```
s% solver_test_partials_val = esum
s% solver_test_partials_var = s% i_lnT
s% solver_test_partials_dval_dx = d_esum_dlnTm1
```

This will check the derivative of `esum`

in cell
`solver_test_partials_k+1`

with respect to `lnT`

in cell
`solver_test_partials_k`

.

Checking composition derivatives requires setting a sink isotope such that the dfridr perturbations to the chosen isotope are offset by changes in a “sink” isotope to keep the mass fraction sum = 1. The sink needs to be inert in the current situation and have enough abundance to absorb the changes.

When debugging from via the inlist, this isotope can be specified
using the control `solver_test_partials_sink_name`

. So for example,
addition to setting the usual k, call_number, iter_number, and dx_0,
to check partials of the energy equation with respect he4, you might
set the following

```
solver_test_partials_equ_name = 'dlnE_dt'
solver_test_partials_var_name = 'he4'
solver_test_partials_sink_name = 'fe56'
```

If `solver_test_partials_dx_0`

is too large, you’ll get an error if
the mg24 mass fraction is too small to deal with the desired increase
in the targeted isotope.

When you need to check composition derivatives of specific terms, you
must pick the various `test_partials`

values in the source code.
This can be tricky in that in some places the abundance is given as an
index in xa and in others as a variable number for the solver. This
offset is `s% nvar_hydro`

, that is the index of the first chem
equation is `s% nvar_hydro + 1`

.

If you set `solver_test_partials_write_eos_call_info = .true.`

, the
output includes also includes composition information like the
following so it is easy to find the needed numbers.

```
xa(j,k) neut 1 6 3312 1.0010960558047912D-99
xa(j,k) h1 2 7 3312 0.0000000000000000D+00
xa(j,k) prot 3 8 3312 1.0010960558047908D-99
xa(j,k) he3 4 9 3312 0.0000000000000000D+00
xa(j,k) he4 5 10 3312 6.7423592032960378D-01
xa(j,k) c12 6 11 3312 1.0630746138489515D-01
xa(j,k) n14 7 12 3312 6.1988280539418319D-03
xa(j,k) o16 8 13 3312 2.0390257320339500D-01
xa(j,k) ne20 9 14 3312 5.7660956389211995D-03
...
xa(j,k) fe56 20 25 3312 1.4153093511237100D-03
```

So then the in the code you would do something like

```
if (test_partials) then
s% solver_test_partials_dx_sink = 20 ! fe56 (index in xa)
s% solver_test_partials_var = 10 ! he4 (variable number)
s% solver_test_partials_dval_dx = s% d_epsnuc_dx(5,k) ! he4 (index in xa)
write(*,*) 'get1_energy_eqn', s% solver_test_partials_var
end if
```

## Diagnosing Meshing Problems

Note

This was the file DEBUG_mesh.notes. It needs modernized.

to get info about why the mesh is doing what it is doing,

in controls section of inlist:

```
show_mesh_changes = .true.
```

run to get current mesh call number, n

in controls section of inlist:

```
mesh_dump_call_number = n
```

run to get plot data

mesh_plan.rb shows input data that was used to form the plan

mesh.rb shows new mesh (with some old values interpolated to the new mesh for comparison)

## Floating point exceptions

Floating point exceptions (fpe) occur when the code attempts to perform an illegal math operation, for instance x/0, sqrt(-1), or log(0). In these cases any number that uses the result of one of these expressions is also undefined.

To test MESA for floating point exceptions we can turn on the compiler checks that will flag when an issue occurs:

Set the environment variable

```
MESA_FPE_CHECKS_ON=1
```

Then run ./clean and ./install in MESA_DIR.

This setting adds the following options in `utils/makefile_header`

```
FCbasic += $(FCtrapNANs) -fbounds-check -finit-derived -Wuninitialized -Warray-bounds -fstack-protector-all -D_FORTIFY_SOURCE=2
FCopt = -O0 -ftree-vectorize
SKIP_NAN_TRAPS = NO
```

and acts as if you set the controls inlist option

```
fill_arrays_with_nans = .true.
```

to make sure you catch any uninitialized array accesses.

### Step -1: Introduce a bug (again)

Lets add a fpe in a test case to see what happens:

Alter star/test_suite/1.3M_ms_high_Z/src/run_star_extras.f90 and in extras_finish_step add:

```
write(*,*) log(-1d0*s% model_number)
```

### Step 0: Notice your code has a problem

Run ./mk and ./rn to run the test case and you’ll get the following output:

```
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7efca94096af in ???
#1 0x7efca9cc6330 in ???
#2 0x405f10 in __run_star_extras_MOD_extras_finish_step
at ../src/run_star_extras.f90:155
#3 0x425372 in __run_star_support_MOD_after_step_loop
at ../job/run_star_support.f90:634
#4 0x427d63 in __run_star_support_MOD_run1_star
at ../job/run_star_support.f90:162
#5 0x407076 in __run_star_MOD_do_run_star
at ../../../../star/job/run_star.f90:26
#6 0x4070eb in run
at ../src/run.f90:13
#7 0x407137 in main
at ../src/run.f90:2
./rn1: line 6: 2166962 Floating point exception(core dumped) ./star
```

### Step 1 Diagnose the issue

The important line is the first line that does not include ???, in this case it is

```
#2 0x405f10 in __run_star_extras_MOD_extras_finish_step
at ../src/run_star_extras.f90:155
```

This is telling us that the problem occurs at line 155 in ../src/run_star_extras.f90

Note

The ../src is because the file path is relative to the make/ folder in your work folder

At this point you should go to that line and work out why the number is undefined.

## Uninitialised variables

Uninitialised variables occur when we use a variable before it has been defined, for example

```
real(dp) :: x,y
y = 2 * x
```

If you are lucky then x will be zero, if you are unlucky it can be any random number that happened to be in the memory location that x occupies. This can lead to results that change either each time the model runs or gives different values between different machines.

By default MESA compiles with the -finit-real=snan option, this will set all scalar floating point variables to a NaN when they are declared.

Note

This does not protect integers

As the variable will be a NaN we can use the floating point exceptions mechanism to track them down, so follow the instructions for debugging floating point exceptions.

### Arrays of floating point numbers

These can also be protected by wrapping the array after its declared with:

call fill_with_NaNs(array)

```
use utils_lib, only:: fill_with_NaNs
real(dp),dimension(:),allocatable :: x
allocate(x(1:10))
call fill_with_NaNs(x)
```

Note

Never try to fill the array yourself with a NaN, say by setting each element to 1/0. We want to track down the location where an array element is being **used** before being set, not the point where we set the element to a NaN. If you set the array yourself the fpe checking will warn you about that location, not where the array is being before being set. fill_with_NaNs does some clever tricks to set each element to the bit pattern that corresponds to a NaN, which does not trigger the fpe checks.