---
title:  'Adaptive, tools for adaptive parallel sampling of mathematical functions'
journal: 'PeerJ'
author:
- name: Tinkerer
  affiliation:
    - Kavli Institute of Nanoscience, Delft University of Technology, P.O. Box 4056, 2600 GA Delft, The Netherlands
  email: not_anton@antonakhmerov.org
abstract: |
  Large scale computer simulations are time-consuming to run and often require sweeps over input parameters to obtain a qualitative understanding of the simulation output.
  These sweeps of parameters can potentially make the simulations prohibitively expensive.
  Therefore, when evaluating a function numerically, it is advantageous to sample it more densely in the interesting regions (called adaptive sampling) instead of evaluating it on a manually-defined homogeneous grid.
  Such adaptive algorithms exist within the machine learning field.
  These methods can suggest a new point to calculate based on *all* existing data at that time; however, this is an expensive operation.
  An alternative is to use local algorithms---in contrast to the previously mentioned global algorithms---which can suggest a new point based only on the data in the immediate vicinity of that point.
  This approach works well, even when using hundreds of computers simultaneously because the point suggestion algorithm is cheap (fast) to evaluate.
  We provide a reference implementation in Python and show its performance.
acknowledgements: |
  We'd like to thank ...
contribution: |
  Bla
...

# Introduction

#### Simulations are costly and often require sampling a region in parameter space.
In the computational sciences, one often does costly simulations---represented by a function $f$---where a certain region in parameter space $X$ is sampled, mapping to a codomain $Y$: $f \colon X \to Y$.
Frequently, the different points in $X$ can be independently calculated.
Even though it is suboptimal, one usually resorts to sampling $X$ on a homogeneous grid because of its simple implementation.

#### Choosing new points based on existing data improves the simulation efficiency.
<!-- This should convey the point that it is advantageous to do this. -->
An alternative, which improves the simulation efficiency, is to choose new potentially interesting points in $X$, based on existing data [@Gramacy2004; @Figueiredo1995; @Castro2008; @Chen2017].
Bayesian optimization works well for high-cost simulations where one needs to find a minimum (or maximum) [@Takhtaganov2018].
However, if the goal of the simulation is to approximate a continuous function using the fewest points, the continuity of the approximation is achieved by a greedy algorithm that samples mid-points of intervals with the largest distance or curvature [@Wolfram2011].
Such a sampling strategy (illustrated in Fig. @fig:algo) would trivially speed up many simulations.
Here, the complexity arises when parallelizing this algorithm because this requires a lot of bookkeeping and planning.
![Visualization of a 1-D point choosing algorithm for a black-box function (grey).
We start by calculating the two boundary points.
Two consecutive existing data points (black) $\{x_i, y_i\}$ define an interval.
Each interval has a loss $L_{i,i+1}$ associated with it that can be calculated from the points of the interval, $L_{i,i+1}(x_i, x_{i+1}, y_i, y_{i+1})$, and optionally from $N$ neighbouring intervals.
At each iteration the interval with the largest loss is indicated (red), with its corresponding candidate point (green) picked in the middle of the interval.
The loss function in this example is the curvature loss.
](figures/algo.pdf){#fig:algo}

#### We describe a class of algorithms relying on local criteria for sampling, which allow for easy parallelization and have a low overhead.
To handle many parallel workers that calculate the function values and request new points, the algorithm needs to have a low computational overhead.
Requiring that, when a new point has been calculated, the information updates are local (only in a region around the newly calculated point) will reduce the time complexity of the algorithm.
A simple example is greedily optimizing continuity of the sampling by selecting points according to the distance to the largest gaps in the function values, as in Fig. @fig:algo.
For a one-dimensional function with three points known (its boundary points and a point in the center), such a simple algorithm consists of the following steps:
(1) keep all points $x$ sorted, where two consecutive points define an interval,
(2) calculate the distance for each interval $L_{i, i+1}=\sqrt{(x_{i+1}-x_{i})^{2}+(y_{i+1}-y_{i})^{2}}$,
(3) pick a new point $x_\textrm{new}$ in the middle of the interval with the largest $L$, creating two new intervals around that point,
(4) calculate $f(x_\textrm{new})$,
(5) repeat the previous steps, without redoing calculations for unchanged intervals.
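As an illustration, these five steps can be written down in a few lines of Python; the following is a minimal, self-contained sketch (not the Adaptive implementation itself), where `f` is the black-box function and `n_points` is the evaluation budget.
```python
import math

def sample_greedy_1d(f, a, b, n_points):
    # Step (1): keep the points sorted; consecutive points define intervals.
    xs = [a, (a + b) / 2, b]
    ys = [f(x) for x in xs]
    while len(xs) < n_points:
        # Step (2): the Euclidean length of every interval is its loss.
        losses = [math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
                  for i in range(len(xs) - 1)]
        # Step (3): bisect the interval with the largest loss.
        i = max(range(len(losses)), key=losses.__getitem__)
        x_new = (xs[i] + xs[i + 1]) / 2
        # Step (4): evaluate the function at the new point.
        xs.insert(i + 1, x_new)
        ys.insert(i + 1, f(x_new))
        # Step (5): repeat; a real implementation would only recompute the
        # losses of the two intervals created by the new point.
    return xs, ys

xs, ys = sample_greedy_1d(lambda x: x + 0.01**2 / (0.01**2 + x**2), -1, 1, 100)
```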

In this paper, we present a class of algorithms that rely on local criteria for sampling, such as in the above example.
Here we associate a *local loss* to each interval and pick a *candidate point* inside the interval with the largest loss.
For example, in the case of the integration algorithm, the loss is the error estimate.
The advantage of these *local* algorithms is that they allow for easy parallelization and have a low computational overhead.
![Comparison of homogeneous sampling (top) with adaptive sampling (bottom) for different one-dimensional functions (red) where the number of points in each column is identical.
We see that when the function has a distinct feature---such as with the peak and tanh---adaptive sampling performs much better.
When the features are homogeneously spaced, such as with the wave packet, adaptive sampling is not as effective as in the other cases.](figures/Learner1D.pdf){#fig:Learner1D}
![Comparison of homogeneous sampling (top) with adaptive sampling (bottom) for different two-dimensional functions where the number of points in each column is identical.
On the left is the function $f(x) = x + a ^ 2 / (a ^ 2 + (x - x_\textrm{offset}) ^ 2)$.
In the middle, a topological phase diagram from \onlinecite{Nijholt2016}, where the function can take the values -1 or 1.
On the right, we plot level crossings for a two-level quantum system.
In all cases using Adaptive results in a higher fidelity plot.
](figures/Learner2D.pdf){#fig:Learner2D}
#### We provide a reference implementation, the Adaptive package, and demonstrate its performance.
We provide a reference implementation, the open-source Python package called Adaptive [@Nijholt2019], which has previously been used in several scientific publications [@Vuik2018; @Laeven2019; @Bommer2019; @Melo2019].
It has algorithms for $f \colon \mathbb{R}^N \to \mathbb{R}^M$, where $N, M \in \mathbb{Z}^+$ but which work best when $N$ is small; integration in $\mathbb{R}$; and the averaging of stochastic functions.
Most of our algorithms allow for a customizable loss function with which one can adapt the sampling algorithm to work optimally for different classes of functions.
It integrates with the Jupyter notebook environment as well as popular parallel computation frameworks such as `ipyparallel`, `mpi4py`, and `dask.distributed`.
It provides auxiliary functionality such as live-plotting, inspecting the data as the calculation is in progress, and automatic saving and loading of the data.
The raw data and source code that produces all plots in this paper is available at [@papercode].

# Review of adaptive sampling {#sec:review}
Optimal sampling and planning based on data is a mature field with different communities providing their own context, restrictions, and algorithms to solve their problems.
To explain the relation of our approach with prior work, we discuss several existing contexts.
This is not a systematic review of all these fields, but rather, we aim to identify the important traits and design considerations.

#### Experiment design uses Bayesian sampling because the computational costs are not a limitation.
Optimal experiment design (OED) is a field of statistics that minimizes the number of experimental runs needed to estimate specific parameters, thereby reducing the cost of experimentation [@Emery1998].
It works with many degrees of freedom and can consider constraints, for example, when the sample space contains regions that are infeasible for practical reasons.
One form of OED is response-adaptive design [@Hu2006], which concerns the adaptive sampling of designs for statistical experiments.
Here, the acquired data (i.e., the observations) are used to estimate the uncertainties of a certain desired parameter.
It then suggests further experiments that will optimally reduce these uncertainties.
Bayesian statistics is frequently used in this step of the calculation, as it naturally provides tools for answering such questions; however, because closed-form solutions are usually unavailable, Markov chain Monte Carlo (MCMC) sampling is the standard tool for determining the most promising samples. <!-- references missing! -->
In a typical non-adaptive experiment, decisions on which experiments to perform are made in advance.
#### Plotting and low-dimensional integration use local sampling.
Plotting a low dimensional function in between bounds requires one to evaluate the function on sufficiently many points such that when we interpolate values in between data points, we get an accurate description of the function values that were not explicitly calculated.
In order to minimize the number of function evaluations, one can use adaptive sampling routines.
For example, for one-dimensional functions, Mathematica [@WolframResearch] implements a `FunctionInterpolation` class that takes the function, $x_\textrm{min}$, and $x_\textrm{max}$, and returns an object that samples the function more densely in regions with high curvature; however, details on the algorithm are not published.
Subsequently, we can query this object for points in between $x_\textrm{min}$ and $x_\textrm{max}$, and get the interpolated value, or we can use it to plot the function without specifying a grid.
Another application for adaptive sampling is numerical integration.
It works by estimating the integration error of each interval and then minimizing the sum of these errors greedily.
For example, the `CQUAD` algorithm [@Gonnet2010] in the GNU Scientific Library [@Galassi1996] implements a more sophisticated strategy and is a doubly-adaptive general-purpose integration routine which can handle most types of singularities.
In general, it requires more function evaluations than the integration routines in `QUADPACK` [@Galassi1996]; however, it works more often for difficult integrands.
It is doubly-adaptive because it can decide to either subdivide intervals into more intervals or refine an interval by adding more points---that do not lie on a regular grid---to each interval.
#### PDE solvers and computer graphics use adaptive meshing.
Hydrodynamics [@Berger1989; @Berger1984] and astrophysics [@Klein1999] use an adaptive refinement of the triangulation mesh on which a partial differential equation is discretized.
By providing smaller mesh elements in regions with a higher variation of the solution, they reduce the amount of data and calculation needed at each step of time propagation.
The remeshing at each time step happens globally, and this is an expensive operation.
Therefore, mesh optimization does not fit our workflow because expensive global updates should be avoided.
Computer graphics uses similar adaptive methods, where a smooth surface is represented by a coarser piecewise linear polygon mesh, called a subdivision surface [@DeRose1998].
An example of such a polygonal remeshing method is one where the polygons align with the curvature of the space or field; this is called anisotropic meshing [@Alliez2003].
# Design constraints and the general algorithm

#### We aim to sample low to intermediate cost functions in parallel.
The general algorithm that we describe in this paper works best for low to intermediate cost functions.
The point suggestion step happens in a single sequential process while the function executions can be in parallel.
This means that, to benefit from an adaptive sampling algorithm, the time it takes to suggest a new point $t_\textrm{suggest}$ must be much smaller than the average function execution time $t_f$ divided by the number of parallel workers $N$: $t_f / N \gg t_\textrm{suggest}$.
Extremely fast functions can be calculated on a dense grid, and extremely slow functions might benefit from full-scale Bayesian optimization where $t_\textrm{suggest}$ is large.
We are interested in the intermediate case, where one cannot afford to fit all available data at each step; still, a large class of functions falls within the regime where adaptive sampling is beneficial.
#### We propose to use a local loss function as a criterion for choosing the next point.
Because we aim to keep the suggestion time $t_\textrm{suggest}$ small, we propose to use the following approach, which operates on a constant-size subset of the data to determine which point to suggest next.
We keep track of the subdomains in a priority queue, where each subdomain is assigned a priority called the "loss".
To suggest a new point, we remove the subdomain with the largest loss from the priority queue and select a new point $x_\textrm{new}$ from within it (typically in the centre).
This splits the subdomain into several smaller subdomains $\{S_i\}$ that each contain $x_\textrm{new}$ on their boundaries.
After evaluating the function at $x_\textrm{new}$ we must then recompute the losses using the new data.
We choose to consider loss functions that are "local", i.e. the loss for a subdomain depends only on the points contained in that subdomain and possibly a (small) finite number of neighboring subdomains.
This means that we need only recalculate the losses for subdomains that are "close" to $x_\textrm{new}$.
Having computed the new losses we must then insert the $\{S_i\}$ into the priority queue, and also update the priorities of the neighboring subdomains, if their loss was recalculated.
After these insertions and updates we are ready to suggest the next point to evaluate.
Due to the local nature of this algorithm and the sparsity of space in higher dimensions, we will suffer from the curse of dimensionality.
The algorithm, therefore, works best in low dimensional space; typically calculations that can reasonably be plotted, so with 1, 2, or 3 degrees of freedom.

#### We summarize the algorithm with pseudocode

The algorithm described above can be summarized by the following pseudocode.
In the following, `queue` is the priority queue of subdomains; `domain` is an object that allows efficient queries for the neighbors of a subdomain and for all subdomains containing a point $x$; `data` is a hashmap storing the points and their values; and `loss` is the loss function, with `loss.n_neighbors` being the degree of neighboring subdomains that the loss function uses.

```python
first_subdomain, = domain.subdomains()
for x in domain.points(first_subdomain):
  data[x] = f(x)

queue.insert(first_subdomain, priority=loss(domain, first_subdomain, data))

while queue.max_priority() >= target_loss:
  _, subdomain = queue.pop()

  new_points, new_subdomains = domain.split(subdomain)
  for x in new_points:
    data[x] = f(x)

  for subdomain in new_subdomains:
    queue.insert(subdomain, priority=loss(domain, subdomain, data))

  if loss.n_neighbors > 0:
    subdomains_to_update = reduce(
      set.union,
      (domain.neighbors(d, loss.n_neighbors) for d in new_subdomains),
      set(),
    )
    subdomains_to_update -= set(new_subdomains)
    for subdomain in subdomains_to_update:
      queue.update(subdomain, priority=loss(domain, subdomain, data))
```

#### As an example, the interpoint distance is a good loss function in one dimension.
An example of such a local loss function for a one-dimensional function is the interpoint distance, i.e. given a subdomain (interval) $(x_\textrm{a}, x_\textrm{b})$ with values $(y_\textrm{a}, y_\textrm{b})$ the loss is $\sqrt{(x_\textrm{a} - x_\textrm{b})^2 + (y_\textrm{a} - y_\textrm{b})^2}$.
A more complex loss function that also takes the first neighboring intervals into account is one that approximates the second derivative using a Taylor expansion.
Figure @fig:Learner1D shows a comparison between a result using this loss and a function that is sampled on a grid.

#### This algorithm has a logarithmic overhead when combined with an appropriate data structure
The key data structure in the above algorithm is the priority queue of subdomains.
It must support efficiently finding and removing the maximum priority element, as well as updating the priority of arbitrary elements whose priority is unknown (when updating the loss of neighboring subdomains).
Such a data structure can be achieved with a combination of a hashmap (mapping elements to their priority) and a red--black tree or a skip list [@Cormen2009] that stores `(priority, element)` pairs.
This has average complexity of $\mathcal{O}(\log{n})$ for all the required operations.
In the reference implementation, we use the SortedContainers Python package [@Jenks2014], which provides an efficient implementation of such a data structure optimized for realistic sizes, rather than asymptotic complexity.
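A minimal sketch of such a loss queue, built on `SortedList` from SortedContainers and mirroring the `insert`, `pop`, `update`, and `max_priority` operations used in the pseudocode above, could look as follows; the class and method names are illustrative and do not correspond to Adaptive's internal API.
```python
from sortedcontainers import SortedList

class LossQueue:
    """Priority queue with O(log n) insert, pop-max, and update."""

    def __init__(self):
        self._entries = SortedList()  # sorted (priority, element) pairs
        self._priorities = {}         # element -> current priority

    def insert(self, element, priority):
        # Elements must be hashable; in this sketch ties in priority fall
        # back to comparing the elements themselves.
        self._priorities[element] = priority
        self._entries.add((priority, element))

    def max_priority(self):
        return self._entries[-1][0]

    def pop(self):
        # Remove and return the entry with the largest priority.
        priority, element = self._entries.pop()  # pops the last (largest) entry
        del self._priorities[element]
        return priority, element

    def update(self, element, priority):
        # Look up the old priority in the hashmap, then replace the entry.
        old = self._priorities[element]
        self._entries.remove((old, element))
        self.insert(element, priority)
```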
Additionally, the algorithm requires efficient queries for subdomains that contain a point $x$, and the neighbors of a given subdomain.
For the one-dimensional case this could be achieved by using a red--black tree to keep the points $x$ in ascending order.
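For example, with the points stored in a `SortedList` (which offers the same $\mathcal{O}(\log{n})$ queries as a balanced tree), the containing interval and its neighbors can be found by bisection; again, this is an illustrative sketch rather than Adaptive's internal code.
```python
from sortedcontainers import SortedList

points = SortedList([-1.0, -0.5, 0.0, 0.5, 1.0])

def containing_interval(x):
    # The interval (points[i - 1], points[i]) that contains x.
    i = points.bisect_left(x)
    return points[i - 1], points[i]

def neighbor_intervals(x_left, x_right):
    # The intervals directly to the left and right of (x_left, x_right).
    i = points.index(x_left)
    j = points.index(x_right)
    left = (points[i - 1], points[i]) if i > 0 else None
    right = (points[j], points[j + 1]) if j < len(points) - 1 else None
    return left, right
```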

#### With many points, due to the loss being local, parallel sampling incurs no additional cost.
So far, the description of the general algorithm did not include parallelism.
In order to include parallelism we need to allow for points that are "pending", i.e. whose value has been requested but is not yet known.
In the sequential algorithm subdomains only contain points on their boundaries.
In the parallel algorithm *pending* points are placed in the interior of subdomains, and the loss of the subdomain is reduced to take these pending points into account.
Later, when a pending point $x$ is finally evaluated, we *split* the subdomain that contains $x$ such that it is on the boundary of new, smaller, subdomains.
We then calculate the losses of these new subdomains, insert them into the priority queue, and update the losses of neighboring subdomains if required.

#### We summarize the parallel algorithm with pseudocode
The parallel version of the algorithm can be summarized by the following pseudocode.
In the following, `queue` is the priority queue of subdomains; `domain` is an object that allows efficient queries for the neighbors of a subdomain and the creation of new subdomains by adding a point $x$; `data` is a hashmap storing the points and their values; `executor` allows offloading the evaluation of a function `f` to external computing resources; and `loss` is the loss function, with `loss.n_neighbors` being the degree of neighboring subdomains that the loss function uses.
```python
def scaled_loss(domain, subdomain, data):
  # Pending points subdivide `subdomain`; scale its loss by the largest
  # relative subvolume so that subdomains already full of pending points
  # are deprioritized.
  volumes = [subdomain.volume(s) for s in subdomain.subdomains()]
  max_relative_subvolume = max(volumes) / sum(volumes)
  L_0 = loss(domain, subdomain, data)
  return max_relative_subvolume * L_0

first_subdomain, = domain.subdomains()
for x in domain.points(first_subdomain):
  data[x] = f(x)

# Seed the executor with as many pending points as there are compute cores
new_points = first_subdomain.insert_points(executor.ncores)
for x in new_points:
  data[x] = None
  executor.submit(f, x)

queue.insert(first_subdomain, priority=scaled_loss(domain, first_subdomain, data))

while executor.n_outstanding_points > 0:
  x, y = executor.get_one_result()
  data[x] = y

  # Split into smaller subdomains with `x` at a subdomain boundary
  # and calculate the losses for these new subdomains
  old_subdomains, new_subdomains = domain.split_at(x)
  for subdomain in old_subdomains:
    queue.remove(subdomain)
  for subdomain in new_subdomains:
    queue.insert(subdomain, priority=scaled_loss(domain, subdomain, data))

  if loss.n_neighbors > 0:
    subdomains_to_update = reduce(
      set.union,
      (domain.neighbors(d, loss.n_neighbors) for d in new_subdomains),
      set(),
    )
    subdomains_to_update -= set(new_subdomains)
    for subdomain in subdomains_to_update:
      queue.update(subdomain, priority=scaled_loss(domain, subdomain, data))

  # If it looks like we're done, don't send more work
  if queue.max_priority() < target_loss:
    continue

  # Send as many new points for evaluation as there are free compute cores
  for _ in range(executor.ncores - executor.n_outstanding_points):
    _, subdomain = queue.pop()
    new_point, = subdomain.insert_points(1)
    data[new_point] = None
    executor.submit(f, new_point)
    queue.insert(subdomain, priority=scaled_loss(domain, subdomain, data))
```
# Loss function design

#### Sampling in different problems pursues different goals
Not all goals are achieved by using an identical sampling strategy; the specific problem determines the goal.
For example, integration based on quadrature rules requires a denser sampling of the sections where the uncertainty of the interpolation is highest; plotting (or function approximation) requires continuity of the approximation; maximization only cares about finding an optimum; and isoline or isosurface sampling aims to sample regions near the iso level more densely.

#### Different loss functions tailor sampling performance to different goals
To plot a function, the interpoint-distance-minimizing loss function mentioned previously works well for many functions; however, it is easy to write down a function where it will fail.
For example, $1/x^2$ has a singularity at $x=0$ and would be sampled too densely around that singularity when using a distance-minimizing loss.
We can avoid this by defining additional logic inside the loss function.
#### Adding loss functions allows for balancing between multiple priorities.
Different loss functions prioritize sampling different features.
Adding loss functions allows for balancing between the multiple desired priorities.
For example, combining a loss function that calculates the curvature with a distance loss function will sample regions with high curvature more densely, while ensuring continuity.
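As a sketch, for the one-dimensional interval losses of the kind used in the Implementation section below, such a combination could be a weighted sum; `curvature_loss` and `distance_loss` here are placeholders for the individual loss functions.
```python
def combined_loss(xs, ys, w=0.5):
    # Weighted sum of two priorities: w steers between resolving high
    # curvature and keeping the sampled curve continuous.
    return w * curvature_loss(xs, ys) + (1 - w) * distance_loss(xs, ys)
```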

#### Loss function regularization avoids singularities
To avoid indefinitely sampling the function based on a distance loss alone, we can regularize the loss.
A simple (but not optimal) strategy is to limit the size of each interval in the $x$ direction using,

\begin{equation*}
L_{i, i+1}^\textrm{dist}=\sqrt{(x_{i+1}-x_{i})^{2}+(y_{i+1}-y_{i})^{2}},
\end{equation*}
\begin{equation*}
L_{i,i+1}^\textrm{reg}=\begin{cases}
0 & \textrm{if} \; x_{i+1}-x_{i}<\epsilon,\\
L_{i, i+1}^\textrm{dist}(x_i, x_{i+1}, y_i, y_{i+1}) & \textrm{else,}
\end{cases}
\end{equation*}

where $\epsilon$ is the smallest resolution we want to sample.
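Written as an interval loss function of the kind used in the Implementation section below, this regularization might look as follows; the value of $\epsilon$ is arbitrary here.
```python
import numpy as np

def regularized_distance_loss(xs, ys, epsilon=1e-3):
    dx = xs[1] - xs[0]
    dy = ys[1] - ys[0]
    # Return zero loss once the interval is narrower than epsilon,
    # so it will never be subdivided further.
    return 0 if dx < epsilon else np.hypot(dx, dy)
```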
#### Asymptotically dense sampling is achieved by adding subdomain volume to the loss
In two dimensions (2D), subdomains are triangles, whose vertices are the known data points.
Losses are therefore calculated for each triangle but, unlike the 1D case, candidate points can be chosen at the center of the triangle or in the middle of the longest edge, if the triangulation becomes better as a result.
A distance loss equivalent in 2D is the area spanned by the three-dimensional (3D) vectors of the vertices of the triangle.
Using this loss function, some narrow features in otherwise flat regions might not be discovered initially.
It is therefore beneficial if a loss function has a property that eventually, all points should be sampled.
A loss function that ensures this is a homogeneous loss function that returns the 2D area spanned by the $x, y$ coordinates.
However, this loss function does not use the function values and is therefore, by itself, not an efficient solution.
Ideally, interesting regions are sampled more densely, while simultaneously new potentially interesting regions are also discovered.
By adding the two loss functions, we can combine the 3D area loss to exploit interesting regions, while the 2D area loss explores less densely sampled regions that might contain interesting features.

# Examples

## Line simplification loss

#### The line simplification loss is based on an inverse Visvalingam’s algorithm.
Inspired by a method commonly employed in digital cartography for coastline simplification, Visvalingam's algorithm, we construct a loss function that does its reverse [@Visvalingam1990].
Here, at each point (ignoring the boundary points), we compute the effective area associated with its triangle, see Fig. @fig:line_loss(b).
The loss then becomes the average area of two adjacent triangles.
By Taylor expanding $f$ around $x$ it can be shown that the area of the triangles relates to the contributions of the second derivative.
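For example, for three equally spaced points $x_{i\pm1} = x_i \pm h$, the (unsigned) area of the triangle at $x_i$ reduces to a finite-difference estimate of the second derivative,
$$
A_i = \frac{|h|}{2}\left|y_{i+1} - 2y_i + y_{i-1}\right| \approx \frac{|h|^3}{2}\left|f''(x_i)\right|.
$$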
We can generalize this loss to $N$ dimensions, where the triangle is replaced by an $(N+1)$-dimensional simplex.

![Line loss visualization.
In this example, we start with 6 points (a) on the function (grey).
Ignoring the endpoints, the effective area of each point is determined by its associated triangle (b).
The loss of each interval can be computed by taking the average area of the adjacent triangles.
Subplots (c), (d), and (e) show the subsequent iterations following (b).](figures/line_loss.pdf){#fig:line_loss}
In order to compare sampling strategies, we need to define some error.
We construct a linear interpolation function $\tilde{f}$, which is an approximation of $f$.
We calculate the error in the $L^{1}$-norm, defined as,
$$
\text{Err}_{1}(\tilde{f})=\left\Vert \tilde{f}-f\right\Vert _{L^{1}}=\int_{a}^{b}\left|\tilde{f}(x)-f(x)\right|\text{d}x.
$$
This error approaches zero as the approximation becomes better.
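In practice, this error can be estimated by evaluating the interpolant and the original function on a dense reference grid; the following is a sketch of such an estimate (the grid size is arbitrary).
```python
import numpy as np

def l1_error(xs, ys, f, a, b, n_ref=10_001):
    # Dense reference grid on [a, b]; f is assumed to accept numpy arrays.
    x_ref = np.linspace(a, b, n_ref)
    # Linear interpolation through the sampled points (xs must be sorted).
    f_tilde = np.interp(x_ref, xs, ys)
    # Approximate the integral of |f_tilde - f| with the trapezoidal rule.
    return np.trapz(np.abs(f_tilde - f(x_ref)), x_ref)
```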

![The $L^{1}$-norm error as a function of number of points $N$ for the functions in Fig. @fig:Learner1D (a,b,c).
The interrupted lines correspond to homogeneous sampling and the solid line to the sampling with the line loss.
In all cases adaptive sampling performs better, where the error is a factor 1.6-20 lower for $N=10000$.
](figures/line_loss_error.pdf){#fig:line_loss_error}

Figure @fig:line_loss_error shows this error as a function of the number of points $N$.
Here, we see that for homogeneous sampling to get the same error as sampling with a line loss, a factor $\approx 1.6-20$ times more points are needed, depending on the function.

## A parallelizable adaptive integration algorithm based on cquad

#### The `cquad` algorithm belongs to a class that is parallelizable.
In @sec:review we mentioned the doubly-adaptive integration algorithm `CQUAD` [@Gonnet2010].
This algorithm uses Clenshaw-Curtis quadrature rules of increasing degree $d$ in each interval [@Clenshaw1960].
The error estimate is $\sqrt{\int{\left(f_0(x) - f_1(x)\right)^2 \textrm{d}x}}$, where $f_0$ and $f_1$ are two successive interpolations of the integrand.
To reach the desired total error, intervals with the maximum absolute error are improved.
Either (1) the degree of the rule is increased or (2) the interval is split if either the function does not appear to be smooth or a rule of maximum degree ($d=4$) has been reached.
All points inside the intervals can be trivially calculated in parallel; however, when there are more resources available than points, Adaptive needs to guess whether (1) the degree of an interval's rule should be increased or (2) the interval should be split.
Here, we choose to always increase until $d=4$, after which the interval is split.
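In Adaptive, this strategy is available through the `IntegratorLearner`; a minimal usage sketch is given below (the tolerance is arbitrary, and API details may differ slightly between versions).
```python
import numpy as np
from adaptive import IntegratorLearner, Runner

def gauss(x):
    return np.exp(-x**2)

learner = IntegratorLearner(gauss, bounds=(0, 1), tol=1e-8)
# Run until the learner reports that the requested tolerance is reached.
runner = Runner(learner, goal=lambda l: l.done())
```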
## Isoline and isosurface sampling
We can find isolines or isosurfaces using a loss function that prioritizes intervals that are closer to the function values that we are interested in.
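A hedged sketch of such a loss for a one-dimensional learner could weight each interval by its proximity to the target level; the level and width used here are arbitrary, and the function is illustrative rather than Adaptive's built-in isoline loss.
```python
import numpy as np

def isoline_loss_1d(xs, ys, level=0.1, width=0.1):
    dx = xs[1] - xs[0]
    # Does the interval cross the target level, and if not, how far away is it?
    crosses = (ys[0] - level) * (ys[1] - level) < 0
    distance = min(abs(ys[0] - level), abs(ys[1] - level))
    weight = 1.0 if crosses else np.exp(-(distance / width) ** 2)
    return dx * (1 + weight)
```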
See Fig. @fig:isoline.
![Comparison of isoline sampling of $f(x,y)=x^2 + y^3$ at $f(x,y)=0.1$ using homogeneous sampling (left) and adaptive sampling (right) with the same amount of points $n=17^2=289$.
We plot the function interpolated on a grid (color) with the triangulation on top (white) where the function is sampled on the vertices.
The solid line (black) indicates the isoline at $f(x,y)=0.1$.
The isoline in the homogeneous case consists of 62 line segments and the adaptive case consists of 147 line segments.
](figures/isoline.pdf){#fig:isoline}

# Implementation and benchmarks

#### The learner abstracts a loss based priority queue.
We will now introduce Adaptive's API.
The object that can suggest points based on existing data is called a *learner*.
The learner abstracts a loss based priority queue.
We can either *ask* it for points or *tell* the *learner* new data points.
We can define a *learner* as follows
```python
from adaptive import Learner1D

def peak(x): # pretend this is a slow function
    a = 0.01
    return x + a**2 / (a**2 + x**2)
learner = Learner1D(peak, bounds=(-1, 1))

```

#### The runner orchestrates the function evaluation.
To drive the learner manually (not recommended) and sequentially, we can do
```python
def goal(learner):
    # learner.loss() = max(learner.losses)
    return learner.loss() < 0.01

while not goal(learner):
    points, loss_improvements = learner.ask(n=1)
    for x in points:  # len(points) == 1
        y = peak(x)
        learner.tell(x, y)
```
To do this automatically (recommended) and in parallel (by default on all cores available) use
```python
from adaptive import Runner
runner = Runner(learner, goal)
```
This will return immediately because the calculation happens in the background.
That also means that as the calculation is in progress, `learner.data` is accessible and can be plotted with `learner.plot()`.
Additionally, in a Jupyter notebook environment, we can call `runner.live_info()` to display useful information.
To change the loss function for the `Learner1D` we pass a loss function, like
```python
import numpy as np

def distance_loss(xs, ys):  # used by default
    dx = xs[1] - xs[0]
    dy = ys[1] - ys[0]
    return np.hypot(dx, dy)

learner = Learner1D(peak, bounds=(-1, 1), loss_per_interval=distance_loss)
```
Creating a homogeneous loss function is as simple as
```python
def uniform_loss(xs, ys):
    dx = xs[1] - xs[0]
    return dx

learner = Learner1D(peak, bounds=(-1, 1), loss_per_interval=uniform_loss)
```
We have also implemented a `LearnerND` with a similar API
```python
from adaptive import LearnerND

def ring(xy): # pretend this is a slow function
    x, y = xy
    a = 0.2
    return x + np.exp(-(x**2 + y**2 - 0.75**2)**2/a**4)

learner = LearnerND(ring, bounds=[(-1, 1), (-1, 1)])
runner = Runner(learner, goal)
```

Again, it is possible to specify a custom loss function using the `loss_per_simplex` argument.
#### The BalancingLearner can run many learners simultaneously.
Frequently, more than one function (learner) needs to run at once; to do this, we have implemented the `BalancingLearner`, which does not take a function, but a list of learners.
This learner internally asks all child learners for points and will choose the point of the learner that maximizes the loss improvement; thereby, it balances the resources over the different learners.
We can use it like
```python
from functools import partial
from adaptive import BalancingLearner

def f(x, pow):
    return x**pow

learners = [Learner1D(partial(f, pow=i), bounds=(-10, 10)) for i in range(2, 10)]
bal_learner = BalancingLearner(learners)
runner = Runner(bal_learner, goal)

```
For more details on how to use Adaptive, we recommend reading the tutorial inside the documentation [@Nijholt2018].

# Possible extensions

#### Anisotropic triangulation would improve the algorithm.
The current implementation of choosing the candidate point inside a simplex (triangle in 2D) with the highest loss, for the `LearnerND`, works by either picking a point (1) in the center of the simplex or (2) by picking a point on the longest edge of the simplex.
The choice depends on the shape of the simplex, where the algorithm tries to create regular simplices.
Alternatively, a good strategy is choosing points somewhere on the edge of a triangle such that the simplex aligns with the gradient of the function; creating an anisotropic triangulation [@Dyn1990].
This is a similar approach to the anisotropic meshing techniques mentioned in the literature review.
#### Learning stochastic functions is a promising direction.
Stochastic functions frequently appear in numerical sciences.
Currently, Adaptive has an `AverageLearner` that samples a stochastic function with no degrees of freedom until a certain standard error of the mean is reached.
This is advantageous because no predetermined number of samples has to be set before starting the simulation.
Extending this learner to be able to deal with more dimensions would be a useful addition.
#### Experimental control needs to deal with noise, hysteresis, and the cost for changing parameters.
Finally, there is the potential to use Adaptive for experimental control.
Experiments often deal with noise, which could be solved by taking multiple measurements and averaging over the outcomes, such as the (not yet existing) `AverageLearnerND` will do.
Another challenge in experiments is that changing parameters can be slow.
Sweeping over one dimension might be faster than in others; for example, in condensed matter physics experiments, sweeping the magnetic field is much slower than sweeping frequencies.
Additionally, some experiments exhibit hysteresis, which means that the sampling direction has to be restricted to certain paths.
All these factors have to be taken into account to create a general-purpose sampler that can be used for experiments.
However, Adaptive can already be used in experiments that are not affected by these complications.

<!-- We can include things like:
* Asymptotically complexity of algorithms
* Setting of the problem, which classes of problems can be handled with Adaptive
* Loss-functions examples (maybe include [Adaptive quantum dots](https://chat.quantumtinkerer.tudelft.nl/chat/channels/adaptive-quantum-dots))
* Trials, statistics (such as measuring timings)
* Line simplification algorithm as a general criterium
* Desirable properties of loss-functions
* List potential applications