Feasibility-based Fixed Point Networks

Inverse problems consist of recovering a signal from a collection of noisy measurements. These problems can often be cast as feasibility problems; however, additional regularization is typically necessary to ensure accurate and stable recovery with respect to data perturbations. Hand-chosen analytic regularization can yield desirable theoretical guarantees, but such approaches have limited effectiveness recovering signals due to their inability to leverage large amounts of available data. To this end, this work fuses data-driven regularization and convex feasibility in a theoretically sound manner. This is accomplished using feasibility-based fixed point networks (F-FPNs). Each F-FPN defines a collection of nonexpansive operators, each of which is the composition of a projection-based operator and a data-driven regularization operator. Fixed point iteration is used to compute fixed points of these operators, and weights of the operators are tuned so that the fixed points closely represent available data. Numerical examples demonstrate performance increases by F-FPNs when compared to standard TV-based recovery methods for CT reconstruction and a comparable neural network based on algorithm unrolling.


Introduction
Inverse problems arise in numerous applications such as medical imaging [1][2][3][4], phase retrieval [5][6][7], geophysics [8][9][10][11][12][13], and machine learning [14][15][16][17][18]. The goal of inverse problems is to recover a signal [2] u d from a collection of indirect noisy measurements d. These quantities are typically related by a linear mapping A via where ε is measurement noise. Inverse problems are often ill-posed, making recovery of the signal u d unstable for noise-affected data d. To overcome this, traditional approaches estimate the signal u d by a solutionũ d to the variational problem min u (Au, d) + J(u), where is a fidelity term that measures the discrepancy between the measurements and the application of the forward operator A to the signal estimate (e.g. least [1] Codes are available on Github: github.com/howardheaton/feasibility fixed point networks [2] While we refer to signals, this phrase is generally meant to describe objects of interest that can be represented mathematically (e.g. images, parameters of a differential equation, and points in a Euclidean space).
An underlying theme of regularization is that signals represented in high dimensional spaces often exhibit a common structure. Although hand picked regularizers may admit desirable theoretical properties leveraging a priori knowledge, they are typically unable to leverage available data. An ideal regularizer will leverage available data to best capture the core properties that should be exhibited by output reconstruction estimates of true signals. Neural networks have demonstrated great success in this regard, achieving state of the art results [33,34]. However, purely data-driven machine learning approaches do little to leverage the underlying physics of a problem, which can lead to poor compliance with data [35]. On the other hand, fast feasibility-seeking algorithms (e.g. see [36][37][38][39][40] and references therein) efficiently leverage known physics to solve inverse problems, being able to handle massive-scale sets of constraints [36,[41][42][43]. Thus, a relatively untackled question remains: Is it possible to fuse feasibility-seeking algorithms with data-driven regularization in a manner that improves reconstructions and yields convergence?
This work answers the above inquiry affirmatively. The key idea is to use machine learning techniques to create a mapping T Θ , parameterized by weights Θ. For fixed measurement data d, T Θ (· ; d) forms an operator possessing standard properties used in feasibility algorithms. Fixed point iteration is used to find fixed points of T Θ (· ; d) and the weights Θ are tuned such that these fixed points both resemble available signal data and are consistent with measurements (up to the noise level).
Contribution The core contribution of this work is to connect powerful feasibilityseeking algorithms to data-driven regularization in a manner that maintains theoretical guarantees. This is accomplished by presenting a feasibility-based fixed point network (F-FPN) framework that solves a learned feasibility problem. Numerical examples are provided that demonstrate notable performance benefits to our proposed formulation when compared to TV-based methods and fixed-depth neural networks formed by algorithm unrolling.
Outline We first overview convex feasibility problems (CFPs) and a learned feasibility problem (Section 2). Relevant neural network material is discussed next (Section 3), followed by our proposed F-FPN framework (Section 4). Numerical examples are then provided with discussion and conclusions (Sections 5 and 6).

Feasibility Problem
Convex feasibility problems (CFPs) arise in many real-world applications, e.g. imaging, sensor networks, radiation therapy treatment planning (see [36,44,45] and the references therein). We formalize the CFP setting and relevant methods as follows.
Let U and D be finite dimensional Hilbert spaces, referred to as the signal and data spaces, respectively. [3] Given additional knowledge about a linear inverse problem, measurement data d ∈ D can be used to express a CFP solved by the true signal u d ∈ U when measurements are noise-free. That is, data d can be used to define a collection {C d, } m =1 of closed convex subsets of D (e.g. hyperplanes) such that the true signal u d is contained in their intersection, i.e. u d solves the problem A common approach to solving (CFP), inter alia, is to use projection algorithms [46], which utilize orthogonal projections onto the individual sets C d, . For a closed, convex, and nonempty set C ⊆ U, the projection P C : U → C onto C, is defined by Projection algorithms are iterative in nature and each update uses combinations of projections onto each set C d, , relying on the principle that it is generally much easier to project onto the individual sets than onto their intersection. These methods date back to the 1930s [47,48] and have been adapted to now handle huge-size problems of dimensions for which more sophisticated methods cease to be efficient or even applicable due to memory requirements [36]. Computational simplicity derives from the fact the building bricks of a projection algorithm are the projections onto individual sets. Memory efficiency occurs because the algorithmic structure is either sequential or simultaneous (or hybrid) as in the block-iterative projection methods [49,50] and string-averaging projection methods [36,[51][52][53]. These algorithms generate sequences that solve (CFP) asymptotically, and the update operations can be iteration dependent (e.g. cyclic projections). We let A k d be the update operator for the k-th step of a projection algorithm solving (CFP). Consequently, each projection algorithm generates a sequence {u k } k∈N via the fixed point iteration A common assumption for such methods is the intersection of all the algorithmic operators' fixed point sets [4] contains or forms the desired set C d , i.e.
which automatically holds when {A k d } k∈N cycles over a collection of projections. [3] The product and norm are denoted by ·, · and · respectively. Although we use the same notation for each space, it will be clear from context which one is used. [4] For an operator T , its fixed point set is fix(T ) {u : u = T (u)}.

Data-Driven Feasibility Problem
As noted previously, inverse problems are often ill-posed, making (CFP) insufficient to faithfully recover the signal u d . Additionally, when noise is present, it can often be the case that the intersection is empty (i.e. C d = ∅). This calls for a different model to recover u d . To date, projection methods have limited inclusion of regularization (e.g. superiorization [54][55][56][57][58], sparsified Kaczmarz [59,60]). Beyond sparsity via 1 minimization, such approaches typically do not yield guarantees beyond feasibility (e.g. it may be desirable to minimize a regularizer over C d ). We propose composing a projection algorithm and a data-driven regularization operator in a manner so each update is analogous to a proximal-gradient step. This is accomplished via a parameterized mapping R Θ : U → U, with weights [5] denoted by Θ. This mapping directly leverages available data (explained in Section 3) to learn features shared among true signals of interest. We augment (CFP) by using operators {A k } k∈N for solving (CFP) and instead solve the learned common fixed points (L-CFP) problem Loosely speaking, when R Θ is chosen well, the signalũ d closely approximates u d .
We utilize classic operator results to solve (L-CFP). An operator T : Also, T is averaged if there exists α ∈ (0, 1) and a nonexpansive operator Q : U → U such that T (u) = (1 − α)u + αQ(u) for all u ∈ U. For example, the projection P S defined in (3) is averaged along with convex combinations of projections [61]. Our method utilizes the following standard assumptions, which are typically satisfied by projection methods (in the noise-free setting with R Θ as the identity).
When a finite collection of update operations are used and applied (essentially) cyclically, the previous assumption automatically holds (e.g. setting A k d P C d,i k and i k k mod(m) + 1). We use the learned fixed point iteration to solve (L-CFP) (L-FPI) [5] Operator weights are also commonly called parameters.
Affine map σ Nonexpansive nonlinearity Here R Θ is comprised of a finite sequence of applications of (possibly) distinct affine mappings (e.g. convolutions), and nonlinearities (e.g. projections on the nonnegative orthant, i.e. ReLUs). For each k ∈ N, we let A k d be a projection-based algorithmic operator. The parameters Θ of R Θ are tuned in an offline process by solving (9) to ensure signals are faithfully recovered.
Justification of the (L-FPI) iteration is provided by the following theorems, which are rewritten from their original form to a manner that matches the present context.
and contains a fixed point, then, for any Theorem 2.2 (Cegieslki [61]) If Assumptions 2.1 and 2.2 hold, and if {u k } k∈N is a sequence generated by the iteration (L-FPI) satisfying u k+1 − u k → 0, then {u k } k∈N converges to a limit u ∞ ∈ C Θ,d .

Fixed Point Networks Overview
One of the most promising areas in artificial intelligence is deep learning, a form of machine learning that uses neural networks containing many hidden layers [64,65]. Deep learning tasks in the context of this work can be cast as follows. Given measurements d drawn from a distribution P D and corresponding signals u d drawn from a distribution P U , we seek a mapping N Θ : D → U that approximates a oneto-one correspondence between the measurements and signals, i.e.
Depending on the nature of the given data, the task at hand can be regression or classification. In this work, we focus on solving regression problems where the learning is supervised, i.e. the loss function explicitly uses a correspondence between input and output data pairings. When the loss function does not use this correspondence (or when not all data pairings are available), the learning is semi-supervised if partial pairings are used and unsupervised if no pairings are used.

Recurrent Neural Networks
A common model for N Θ is given by Recurrent Neural Networks (RNNs) [66], which have enjoyed a great deal of success in natural language processing (NLPs) [67], time series [68], and classification [68]. An N -layer RNN takes observed data d as input and can be modeled as the N -fold composition of a mapping T Θ via Here T Θ (u; d) is an operator comprised of a finite sequence of applications of (possibly) distinct affine mappings and nonlinearities, and u is initialized to some fixed value (e.g. the zero vector). To identify a faithful mapping N Θ as in (7), we solve a training problem. This is modeled as finding weights that minimize an expected loss, which is typically solved using optimization methods like SGD [69] and Adam [70].
In particular, we solve the training problem where : U × U → R models the discrepancy between the prediction N Θ (d) of the network and the training data u d . In practice, the expectation in (9) is approximated using a finite subset of data, which is referred to as training data. In addition to minimizing over the training data, we aim for (7) to hold for a set of testing data that was not used during training (which tests the network's ability to generalize).
Remark 3.1 We emphasize a time intensive offline process is used to find a solution Θ to (9) (as is common in machine learning). After this, in the online setting, we apply N Θ (d) to recover a signal u d from its previously unseen measurements d, which is a much faster process.
Remark 3.2 If we impose a particular structure to T Θ , as shown in Figure 1, an N -layer RNN can be interpreted as an unrolled fixed point (or optimization) algorithm that runs for N iterations. Our experiments compare our proposed method to such an unrolled scheme.

Fixed-Point Networks
Increasing neural network depth leads to more expressibility [71,72]. A recent trend in deep learning seeks to inquires: what happens when the number of recurrent layers N goes to infinity? Due to ever growing memory requirements (growing linearly with N ) to train networks, directly unrolling a sequence generated by successively applying T Θ is, in general, intractable for arbitrarily large N . However, the sequence limit can be modeled using a fixed point equation. In this case, evaluating a fixed point network (FPN) [73] is equivalent to finding the unique fixed point of an averaged operator T Θ (· ; d), i.e. an FPN N Θ is defined by [6] [6] The presented definition is a slight variation of the original work, adapted to this setting.
Standard results [74][75][76] can be used to guarantee existence [7] of fixed points of nonexpansive T Θ . Iteratively applying T Θ produces a convergent sequence (Theorem 2.1). However, for different d, this procedure the number of steps to converge may vary, and so these models belong to the class of implicit depth models. As mentioned, it is computationally intractable to differentiate with respect to Θ by applying the chain rule through each of the N layers (when N is sufficiently large). Instead, the gradient d /dΘ is computed via the implicit function theorem [77]. Specifically, the gradient is obtained by solving the Jacobian-inverse equation (e.g. see [78][79][80]) Recent works that solve the Jacobian-inverse equation in (11) to train neural networks include Deep Equilibrium Networks [78,81] and Monotone Equilibrium Networks [79]. A key difficulty arises when computing the gradient via (11), especially when the signal space has large dimensions (e.g. when u d is a high-resolution image). Namely, a linear system involving the Jacobian term J Θ must be approximately solved to estimate the gradient of . Recently, a new framework for training implicit depth models, called Jacobian-Free Backpropagation (JFB) [73], was presented in the context of FPNs, which avoids the intensive linear system solves at each step. The idea is to replace gradient d /dΘ updates with ∂T /∂Θ, which is equivalent to a preconditioned gradient (since J −1 Θ is coercive [73, Lemma A.1]). JFB provides a descent direction and was found to be effective and competitive for training implicit-depth neural networks at substantially reduced computational costs. Since the present work solves inverse problems where the signal space has very high dimension, we leverage FPNs and JFB to solve (9) for our proposed method.

Learning to Optimize
An emerging field in machine learning is known as "learning to optimize" (L2O) (e.g. see the survey works [82,83]). As a paradigm shift away from conventional optimization algorithm design, L2O uses machine learning to improve an optimization method. Two approaches are typically used for model-based algorithms. Plug-and-Play (PnP) methods learn a denoiser in the form of a neural network and then plug this denoiser into an optimization algorithm (e.g. to replace a proximal for total variation). Here training of the denoiser is separate from the task at hand. On the other hand, unrolling methods incorporate tunable weights into an algorithm that is truncated to a fixed number of iterations, forming a neural network. Unrolling the iterative soft thresholding algorithm (ISTA), the authors in [84] obtained the first major L2O scheme Learned ISTA (LISTA) by letting each matrix in the updates be tunable. Follow-up papers also demonstrate empirical success in various applications, include compressive sensing [85][86][87][88][89][90][91][92][93], denoising [88,[93][94][95][96][97][98][99][100], and deblurring [88,93,95,[101][102][103][104][105][106]. L2O schemes are related to our method, but no L2O scheme has, to our knowledge, used a fixed point model as in (L-CFP). Additionally, our JFB training regime differs from the L2O unrolling and PnP schemes. [7] The original FPN paper used a more restrictive contraction condition to guarantee uniqueness and justify how the weights are updated during training. However, we use their method in our more general setting since the contraction factor can be arbitrarily close to unity.

Proposed Method
Herein we present the feasibility-based FPN (F-FPN). Although based on FPNs, here we replace the single operator of FPNs by a sequence of operators, each taking the form of a composition. Namely, we use updates in the iteration (L-FPI). The assumptions necessary for convergence can be approximately ensured (e.g. see Subsection 7.4 in the Appendix). This iteration yields the F-FPN N Θ , defined by assuming the intersection is unique. [8] This is approximately implemented via Algorithm 1. The weights Θ of the network N Θ are tuned by solving the training problem (9). In an ideal situation, the optimal weights Θ solving (9) would yield feasible outputs (i.e. N Θ (d) ∈ C d for all data d ∈ C) that also resemble the true signals u d . However, measurement noise in practice makes it unlikely that N Θ (d) is feasible, let alone that C d is nonempty. In the noisy setting, this is no longer a concern since we augment (CFP) via (L-CFP) and are ultimately concerned with recovering a signal u d , not solving a feasibility problem. In summary, our model is based on the underlying physics of a problem (via the convex feasibility structure), but is also steered by available data via the training problem (9). Illustrations of the efficacy of this approach are provided in Section 5.

Experiments
Experiments in this section demonstrate the relative reconstruction quality of F-FPNs and comparable schemes -in particular, filtered backprojection (FBP) [107], total variation (TV) minimization (similarly to [108,109]), total variation superiorization (based on [110,111]), and an unrolled L2O scheme with an RNN structure.

Experimental Setup
Comparisons are provided for two low-dose CT examples: a synthetic dataset, consisting of images of random ellipses, and the LoDoPab dataset [112], which consists of human phantoms. For both datasets, CT measurements are simulated with a parallel beam geometry with a sparse-angle setup of only 30 angles and 183 projection beams, resulting in 5,490 equations and 16,384 unknowns. Additionally, we add 1.5% Gaussian noise corresponding to each individual beam measurement. Moreover, the [8] Uniqueness is unlikely in practice; however, this assumption is justified since we use the same initial iterate u 1 for each initialization. This makes recovery of the same signal is stable with respect to changes in Θ.    images have a resolution of 128×128 pixels. The quality of the image reconstructions are determined using the Peak Signal-To-Noise Ratio (PSNR) and structural similarity index measure (SSIM). We use the PyTorch deep learning framework [113] and the ADAM [70] optimizer. We also use the Operator Discretization Library (ODL) python library [114] to compute the filtered backprojection solutions. The CT experiments are run on a Google Colab notebook. For all methods, we use a single diagonally relaxed orthogonal projections (DROP) [37] operator for A d (i.e. A k d = A d for all k), noting DROP is nonexpansive with respect to a norm dependent on A [115]. The loss function used for training is the mean squared error between reconstruction estimates and the corresponding true signals. We use a synthetic dataset consisting of random phantoms of combined ellipses as in [116]. The ellipse training and test sets contain 10,000 and 1,000 pairs, respectively. We also use phantoms derived from actual human chest CT scans via the benchmark Low-Dose Parallel Beam dataset (LoDoPaB) [112]. The LoDoPab training and test sets contain 20,000 and 2,000 pairs, respectively.

Experiment Methods
TV Superiorization Sequences generated by successively applying the operator A d are known to converge even in the presence of summable perturbations, which can be intentionally added to lower a regularizer value (e.g. TV) without compromising convergence, thereby giving a "superior" feasible point. Compared to minimization methods, superiorization typically only guarantees feasibility, but is often able to do so at reduced computational cost. This scheme, denoted as TVS, generates updates where D − and D + are the forward and backward differencing operators, ε > 0 is added for stability, and 20 iterations are used as early stopping to avoid overfitting to noise. The differencing operations yield a derivative of isotropic TV (e.g. see [117]). The scalars α > 0 and β ∈ (0, 1) are chosen to minimize training mean squared error. See the superiorization bibliography [118] for further TVS materials.
TV Minimization For a second analytic comparison method, we use anisotropic TV minimization (TVM). In this case, we solve the constrained problem where ε > 0 is a hand-chosen scalar reflecting the level of measurement noise and the box constraints on u are included since all signals have pixel values in the interval    [0, 1]. We use linearized ADMM [119] to solve (TVM) and refer to this model as TV minimization (TVM). Implementation details are in the Appendix.

F-FPN Structure
The architecture of the operator R Θ is modeled after the seminal work [120] on residual networks. The F-FPN and unrolled scheme both leverage the same structure R Θ and DROP operator for A d . The operator R Θ is the composition of four residual blocks. Each residual block takes the form of the identity mapping plus the composition of a leaky ReLU activation function and convolution (twice). The number of network weights in R Θ for each setup was 96,307, a small number by machine learning standards. Further details are provied in the Appendix.

Experiment Results
Our results show that F-FPN outperforms all classical methods as well as the unrolled data-driven method. We show the result on an individual reconstruction via wide and zoomed-in images from the ellipse and LoDoPab testing datasets in   Tables 1 and 2. We emphasize the type of noise depends on each individual ray in a similar manner to [121], making the measurements more noisy than some related works. This noise and ill-posedness of our underdetermined setup are illustrated by the poor quality of analytic method reconstructions. (However, we note improvement by using TV over FBP and further improvement by TV minimization over TV superiorization.) Although nearly identical in structure to F-FPNs, these results show the unrolled method to be inferior to F-FPNs in these experiments. We hypothesize this is due to the large memory requirements of unrolling (unlike F-FPNs), which limits the number of unrolled steps (∼ 20 steps versus 100+ steps of F-FPNs), and F-FPNs are tuned to optimize a fixed point condition rather than a fixed number of updates.

Conclusion
This work connects feasibility-seeking algorithms and data-driven algorithms (i.e. neural networks). The F-FPN framework leverages the elegance of fixed point methods while using state of the art training methods for implicit-depth deep learning. This results in a sequence of learned operators {A k d • R Θ } k∈N that can be repeatedly applied until convergence is obtained. This limit point is expected to be nearly compatible with provided constraints (up to the level of noise) and resemble the collection of true signals. The provided numerical examples show improved performance obtained by F-FPNs over both classic methods and an unrolling-based network. Future work will extend FPNs to a wider class of optimization problems and further establish theory connecting machine learning to fixed point methods.

Network Structure
For our neural network architecture, we set R Θ to be a composition of four convolutions: the first takes in one channel and outputs 44 channels. The second and third convolutions have 44 input and output channels. The final convolution maps the 44 channels back to one channel. Prior to each convolution, we use the leaky rectified linear activation function (ReLU) as the nonlinear activation function between layers. The leaky ReLU function, denoted by φ, is defined as where a is a number determined by the user. Exact implementation details can be found in https://github.com/howardheaton/feasibility_fixed_point_ networks.

Training Setup
To generate the FBP reconstructions, we use FBP operator from the Operator Discretization Library (ODL). Since the ODL FBP operator is a built-in operator whose rows are not normalized (unlike in the remainder of the methods where DROP is used), we scale the observed data accordingly. In particular, we multiply each row of the observed data d by the rows of the original, unnormalized matrix A. For all other methods, we normalized the rows of A and scaled the measurements accordingly.
For the ellipse dataset, we train the unrolled network using a batch size of 15 for 60 epochs. The F-FPN network training used a batch-size of 15 for 50 epochs. The unrolled network architecture contains 20 total layers (i.e. update steps) -the number of layers was chosen on the memory capacity of the GPU.
For the LoDoPab dataset, we train the unrolled and F-FPN networks using a batchsize of 50 for 50 epochs total. The unrolled network architecture contains 14 total layers (i.e. update steps) -the number of layers was chosen on the memory capacity of the GPU.

TVS parameters
The TVS parameters were trained by unrolling the method indicated in (13) for 20 steps into the structure of a neural network. This unrolled network contained 2 parameters: α and β. We initialized α to 0.05 and β to 0.99. Then we used Adam to tune the parameters with the training data. For the ellipse experiment, the learned parameters were α = 0.0023 and β = 0.968. For the LoDoPab experiment, the learned parameters were α = 0.0013 and β = 9607. Note the training to tune the parameters optimized performance with respect to mean squared error.

Approximate Lipschitz Enforcement
Herein we overview our technique for ensuring the composition (A d • R Θ ) is γ-Lipschitz in our experiments (with γ ∈ (0, 1]). This is accomplished in an approximate manner using batches of computed fixed points after each forward pass in training. Let B denote a set of indices corresponding to a collection of fixed points u d and let {ζ i } i∈B be Gaussian random vectors. Letting |B| denote the cardinality of B, we check whether the following equality holds: If the network is γ-Lipschitz, then C 1 ≤ γC 2 for any provided batch B of samples. Now suppose the inequality does not hold, the case where action must be taken. First assume A d is 1-Lipschitz. Then it suffices to make R Θ γ-Lipschitz. As noted previously, R Θ takes the form of a composition of ResNet blocks. For simplicity, suppose for a matrix W and vector b defined in terms of the weights Θ. Let C 3 γC 1 /C 2 .
To make (15) hold, it would be sufficient to replace R Θ by C 3 · R Θ . Furthermore, where the first approximation is an equality when W u + b ≥ 0 (and approximately equal when a is small), and the second approximation holds whenever the inequality (15) is "close" to hold, i.e. C 3 ≈ 1. This shows that Thus, to ensure R Θ is approximately γ-Lipschitz, we may do the following. After each forward pass in training (i.e. computing N Θ (d) for a batch B of data d), we compute C 1 and C 2 as above. If (15) holds, then no action is taken. If (15) does not hold, then multiply the weights W and b by C 3 , making (15) hold.
In our experiments, the structure of R Θ was a more complicated variation of the above case (namely, the residual portion was the composition of convolutions). However, we used the same normalization factor, which forces R Θ to be slightly more contractive than needed. And, in the general case where R Θ is the composition of mappings of the form identity plus residual, it suffices to multiply the weights by the normalization constant C 3 raised to one over the number of layers in the residual mapping (i.e. C 1/ 3 ).
Remark 7.1 An important note must be made with respect to normalization. Namely, R Θ was almost never updated by the procedure above. Because of the initialization of the weights Θ, R Θ appears to have been roughly 1-Lipschitz. And, because the weights are tuned to improve the performance of R Θ , it appears that this typically resulted in updates that did not make R Θ less contractive. Consequently, the above is an approximate safeguard, but did not appear necessary in practice to obtain our results.