Stochastic approximation method using diagonal positive-definite matrices for convex optimization with fixed point constraints

This paper proposes a stochastic approximation method for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. The proposed method is based on the existing adaptive learning rate optimization algorithms that use certain diagonal positive-definite matrices for training deep neural networks. This paper includes convergence analyses and convergence rate analyses for the proposed method under specific assumptions. Results show that any accumulation point of the sequence generated by the method with diminishing step-sizes almost surely belongs to the solution set of a stochastic optimization problem in deep learning. Additionally, we apply the learning methods based on the existing and proposed methods to classifier ensemble problems and conduct a numerical performance comparison showing that the proposed learning methods achieve high accuracies faster than the existing learning method.


Introduction
Convex stochastic optimization problems in which the objective function is the expectation of convex functions are considered important due to their occurrence in practical applications, such as machine learning and deep learning.
The classical method for solving these problems is the stochastic approximation (SA) method [1, (5.4.1)], [2, Algorithm 8.1], [3], which is applicable when unbiased estimates of (sub)gradients of an objective function are available. Modified versions of the SA method, such as the mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3], and algorithms developed for training deep neural networks, such as AdaGrad [6], RMSProp [7], Adam [8], and AMSGrad [9], have been proposed for solving convex stochastic optimization problems in deep neural networks. These algorithms use the inverses of diagonal positive-definite matrices at each iteration to adapt the learning rates of all model parameters. Hence, these algorithms are called adaptive learning rate optimization algorithms.
The above-mentioned methods commonly assume that the metric projection onto a given constraint set is computationally tractable. However, although the metric projection onto a simple convex set, such as an affine subspace, half-space, or hyperslab, can be easily computed, the projection onto a complicated set, such as the intersection of simple convex sets, the set of minimizers of a convex function, or the solution set of a monotone variational inequality, cannot be easily computed. Accordingly, it is difficult to apply the above-mentioned methods to stochastic optimization problems with complicated constraints.
In order to solve a stochastic optimization problem over a complicated constraint set, we define a computable quasinonexpansive mapping whose fixed point set coincides with the constraint set, which is possible for the above-mentioned complicated convex sets (see Sect. 3.1 and Example 4.1 for examples of computable quasinonexpansive mappings). Accordingly, the present paper deals with a convex stochastic optimization problem over the fixed point set of a computable quasinonexpansive mapping.
Since useful fixed point algorithms have already been reported [10, Chap. 5], [11], [12-16], we can find fixed points of quasinonexpansive mappings, which are feasible points of the convex stochastic optimization problem. By combining the SA method with an existing fixed point algorithm, we can obtain algorithms [17, Algorithms 1 and 2] for solving convex stochastic optimization problems that can be applied to classifier ensemble problems [18, 19] (Example 4.1(ii)), which arise in the field of machine learning. However, the existing algorithms converge slowly [17] due to being stochastic first-order methods. In this paper, we propose an algorithm (Algorithm 1) for solving a convex stochastic optimization problem (Problem 3.1) that performs better than the algorithms in [17, Algorithms 1 and 2]. The algorithm proposed herein is based on useful adaptive learning rate optimization algorithms, such as Adam and AMSGrad, that use certain diagonal positive-definite matrices. The first contribution of the present study is an analysis of the convergence of the proposed algorithm (Theorem 5.1). This analysis finds that, if sufficiently small constant step-sizes are used, then the proposed algorithm approximates a solution to the problem (Theorem 5.2). Moreover, for sequences of diminishing step-sizes, the convergence rates of the proposed algorithm can be specified (Theorem 5.3 and Corollary 5.1).
We compare the proposed algorithm with the existing adaptive learning rate optimization algorithms for a constrained convex stochastic optimization problem in deep learning (Example 4.1(i)). Although the existing adaptive learning rate optimization algorithms achieve low regret, they cannot solve the problem. The second contribution of the present study is to show that, unlike the existing adaptive learning rate optimization algorithms, the proposed algorithm can solve this problem (Corollaries 5.2 and 5.3) (see Sect. 5.2 for details). The third contribution is to show that the proposed algorithm can solve classifier ensemble problems and that the learning methods based on the proposed algorithm perform better numerically than the existing learning method based on the algorithms in [17]. In particular, the numerical results indicate that the learning methods based on the proposed algorithm with constant step-sizes run faster than the existing learning method.

Suppose that x ∈ Fix(Q_{f,H}) satisfies f(x) > 0. Then the definition (1) of Q_{f,H} yields (f(x)/‖H^{-1}G(x)‖²_H) H^{-1}G(x) = 0, which implies that H^{-1}G(x) = 0. From (1) and x ∈ Fix(Q_{f,H}), the subgradient inequality also provides a lower bound which, together with f(x) > 0, gives H^{-1}G(x) ≠ 0, which is a contradiction. Hence, we have that lev_{≤0} f ⊃ Fix(Q_{f,H}), i.e., lev_{≤0} f = Fix(Q_{f,H}). Accordingly, (i) ensures that Fix(Q_{f,H}) = lev_{≤0} f = Fix(Q_f). For all x ∈ R^N \ lev_{≤0} f and all y ∈ lev_{≤0} f, the definition of Q_{f,H}, together with (2), implies that Q_{f,H} is firmly quasinonexpansive under the H-norm.
A mapping Q : R^N → R^N is said to be quasinonexpansive if ‖Q(x) − y‖ ≤ ‖x − y‖ for all x ∈ R^N and all y ∈ Fix(Q), and nonexpansive if ‖Q(x) − Q(y)‖ ≤ ‖x − y‖ for all x, y ∈ R^N. Any nonexpansive mapping satisfies the quasinonexpansivity condition. The metric projection [10, Subchapter 4.2, Chap. 28] onto a nonempty, closed convex set C (⊂ R^N), denoted by P_C, is defined for all x ∈ R^N by P_C(x) ∈ C and ‖x − P_C(x)‖ = d(x, C) := inf_{y∈C} ‖x − y‖. P_C is firmly nonexpansive, i.e., ‖P_C(x) − P_C(y)‖² + ‖(Id − P_C)(x) − (Id − P_C)(y)‖² ≤ ‖x − y‖² for all x, y ∈ R^N.
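For concreteness, the following is a minimal NumPy sketch (ours, not from the paper) of the metric projection P_C for the simple sets mentioned above; the function names are hypothetical.

```python
import numpy as np

def proj_ball(x, center, radius):
    """P_C for the closed ball C = B(center, radius)."""
    d = x - center
    nd = np.linalg.norm(d)
    return x.copy() if nd <= radius else center + radius * d / nd

def proj_halfspace(x, a, b):
    """P_C for the half-space C = {y : <a, y> <= b}."""
    viol = a @ x - b
    return x.copy() if viol <= 0 else x - (viol / (a @ a)) * a

def proj_hyperslab(x, a, lo, hi):
    """P_C for the hyperslab C = {y : lo <= <a, y> <= hi}."""
    t = a @ x
    if t < lo:
        return x + ((lo - t) / (a @ a)) * a
    if t > hi:
        return x - ((t - hi) / (a @ a)) * a
    return x.copy()
```

Each of these closed-form projections is firmly nonexpansive, as the definition above guarantees for any metric projection.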

Convex stochastic optimization problem over fixed point set
This paper considers the following problem.
Problem 3.1 Suppose that C (⊂ R^N) is a nonempty, closed convex set onto which the projection can be easily computed, and that f(x) := E[F(x, ξ)] defines a convex function, where ξ is a random vector whose probability distribution P is supported on a set Ξ ⊂ R^M and F : R^N × Ξ → R. Then,

minimize f(x) := E[F(x, ξ)] subject to x ∈ X := ∩_{n∈N} Fix(Q_{H_n}) (⊂ C),

where one assumes that X is nonempty.
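The stochastic oracle assumed in Problem 3.1 can be modeled as follows. This is a toy sketch with an assumed least-squares integrand F and Gaussian ξ, not the paper's setting; sample_xi, F, and G are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xi(batch=8):
    """Draw xi = (a, b) from a toy distribution P supported on R^{5+1}."""
    return rng.normal(size=(batch, 5)), rng.normal(size=batch)

def F(x, xi):
    """Toy integrand: F(x, (a, b)) = mean of 0.5 * (a @ x - b)^2."""
    a, b = xi
    return 0.5 * np.mean((a @ x - b) ** 2)

def G(x, xi):
    """Unbiased estimate of a (sub)gradient of f(x) = E[F(x, xi)]."""
    a, b = xi
    r = a @ x - b                 # residuals, shape (batch,)
    return a.T @ r / len(r)       # averaged minibatch gradient
```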
Examples of Q H n satisfying (A0) and (A1) are described in Sect. 3.1 and Example 4.1.

Related problems and their algorithms
Here, let us consider the following convex stochastic optimization problem [5, (1.1)]:

minimize f(x) := E[F(x, ξ)] subject to x ∈ C,   (3)

where C ⊂ R^N is nonempty, bounded, closed, and convex. The classical method for solving problem (3) is the SA method [3], defined as follows: given x_0 ∈ R^N and (λ_n)_{n∈N} ⊂ (0, +∞),

x_{n+1} := P_C(x_n − λ_n G(x_n, ξ_n)).   (4)

The SA method requires the metric projection onto C, and hence can be applied only to cases where C is simple in the sense that P_C can be efficiently computed (e.g., C is a closed ball, half-space, or hyperslab [10, Chap. 28]). When C is not simple, the SA method requires solving the following subproblem at each iteration n: find x_{n+1} ∈ C such that ‖x_n − λ_n G(x_n, ξ_n) − x_{n+1}‖ = min_{y∈C} ‖x_n − λ_n G(x_n, ξ_n) − y‖. The mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3] is useful for solving problem (3) and has been analyzed for the case of step-sizes that are constant or diminishing. For example, the mirror descent SA method [5, (2.32), (2.38), and (2.47)] with a constant step-size policy generates the following sequence (x_n)_{n∈N}: given x_1 ∈ C and λ > 0,

x_{n+1} := argmin_{y∈C} {λ⟨G(x_n, ξ_n), y⟩ + V(x_n, y)},   (5)

where ω : C → R is differentiable and convex, and V : C × C → R is the Bregman distance defined by V(x, y) := ω(y) − ω(x) − ⟨∇ω(x), y − x⟩. When ω(·) = (1/2)‖·‖², x_{n+1} in (5) coincides with x_{n+1} in (4). Under certain assumptions, convergence rate analyses of the mirror descent SA method are possible (see [5, (2.57)] for the rate of convergence of the mirror descent SA method with a diminishing step-size policy).
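A minimal sketch of the projected SA update (4) follows, reusing the toy oracle sketched earlier; sa_method is a hypothetical name, and proj_C can be any of the closed-form projections above.

```python
import numpy as np

def sa_method(x0, proj_C, G, sample_xi, lam, n_iters):
    """Classical projected SA (4): x_{n+1} = P_C(x_n - lam(n) * G(x_n, xi_n))."""
    x = np.asarray(x0, dtype=float).copy()
    for n in range(n_iters):
        x = proj_C(x - lam(n) * G(x, sample_xi()))
    return x

# Usage with C a unit ball and a diminishing step-size (assumed setup):
# x_hat = sa_method(np.zeros(5), lambda y: proj_ball(y, np.zeros(5), 1.0),
#                   G, sample_xi, lam=lambda n: 1.0 / np.sqrt(n + 1), n_iters=1000)
```

With ω(·) = (1/2)‖·‖² the mirror descent update (5) reduces to exactly this iteration, so the sketch also covers that special case.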
As the field of deep learning has developed, it has produced some useful stochastic optimization algorithms, such as AdaGrad [6], RMSProp [7], Adam [8], and AMSGrad [9], for solving problem (3). The AdaGrad algorithm is based on the mirror descent SA method (5) (see also [7, (4)]), and the RMSProp algorithm is a variant of AdaGrad. The Adam algorithm is based on a combination of RMSProp and the momentum method [26, (9)], as follows: given x_0 ∈ R^N and m_{−1} := v_{−1} := 0,

m_n := β_1 m_{n−1} + (1 − β_1) G(x_n, ξ_n),
v_n := β_2 v_{n−1} + (1 − β_2) G(x_n, ξ_n) ⊙ G(x_n, ξ_n),
x_{n+1} := P_C(x_n − λ_n m_n / (√v_n + ε)),   (6)

where β_i > 0 (i = 1, 2), ε > 0, (λ_n)_{n∈N} ⊂ (0, 1) is a diminishing step-size, and A ⊙ B denotes the Hadamard product of matrices A and B. If we define the matrix H_t as

H_t := diag(√v_{t,i} + ε),   (7)

then the Adam algorithm (6) can be expressed as

x_{t+1} := P_C(x_t − λ_t H_t^{-1} m_t).   (8)

Unfortunately, there exists an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution [9, Theorem 2]. To guarantee convergence while preserving the practical benefits of Adam, AMSGrad [9, Algorithm 2] was proposed as follows: for (β_{1,t})_{t∈N} ⊂ (0, +∞),

m_t := β_{1,t} m_{t−1} + (1 − β_{1,t}) G(x_t, ξ_t),
v_t := β_2 v_{t−1} + (1 − β_2) G(x_t, ξ_t) ⊙ G(x_t, ξ_t),
v̂_t := (max{v̂_{t−1,i}, v_{t,i}})_{i=1}^N,
x_{t+1} := P_{C,H_t}(x_t − λ_t H_t^{-1} m_t), where H_t := diag(√v̂_{t,i}).   (9)

The existing SA methods (4), (5), (6), and (9) (see also [6, 27], [2, Sect. 8.5], and [5, Sect. 2.3]) require minimizing a certain convex function over C at each iteration. Therefore, when C has a complicated form (e.g., C is expressed as the set of all minimizers of a convex function over a closed convex set, the solution set of a monotone variational inequality, or the intersection of closed convex sets), it is difficult to compute the point x_{n+1} generated by any of (4), (5), (6), and (9) at each iteration.
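A hedged sketch of the AMSGrad iteration (9) with the diagonal matrix H_t = diag(√v̂_t) follows; amsgrad and proj_C_H are hypothetical names, the 1e-8 safeguard is our addition for numerical safety, and proj_C_H(y, h) is assumed to compute the projection onto C under the norm induced by diag(h).

```python
import numpy as np

def amsgrad(x0, proj_C_H, G, sample_xi, lam,
            beta1=0.9, beta2=0.999, n_iters=1000):
    """Sketch of AMSGrad (9): the diagonal H_t = diag(sqrt(vhat_t)) never decreases."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    vhat = np.zeros_like(x)
    for t in range(n_iters):
        g = G(x, sample_xi())
        m = beta1 * m + (1 - beta1) * g         # momentum term m_t
        v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate v_t
        vhat = np.maximum(vhat, v)              # vhat_t := max(vhat_{t-1}, v_t)
        h = np.sqrt(vhat) + 1e-8                # diagonal of H_t (eps for safety)
        x = proj_C_H(x - lam(t) * m / h, h)     # projection under the H_t-norm
    return x

# For a box C = [lo, hi]^N and diagonal H, the H-norm projection reduces to
# componentwise clipping: proj_C_H = lambda y, h: np.clip(y, lo, hi).
```

The max operation in vhat is precisely what distinguishes AMSGrad from Adam and what restores the convergence guarantee discussed above.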
Meanwhile, fixed point theory [10], [28-30] enables us to define a computable quasinonexpansive mapping whose fixed point set is equal to the complicated set. For example, let lev_{≤0} f_i (i = 1, 2, ..., I) be the level set of a convex function f_i : R^N → R, and let X be the intersection of the lev_{≤0} f_i, i.e.,

X := ∩_{i=1}^I lev_{≤0} f_i.   (10)

Let n ∈ N be fixed arbitrarily, and let H_n ∈ S^N_{++} (see (A0)). Let Q_{f_i,H_n} : R^N → R^N (i = 1, 2, ..., I) be the subgradient projection defined by (1) with f := f_i and H := H_n. Accordingly, Proposition 2.2 implies that Q_{f_i,H_n} is firmly quasinonexpansive under the H_n-norm and Fix(Q_{f_i,H_n}) = lev_{≤0} f_i (i = 1, 2, ..., I). Under the condition that the subgradients of f_i can be efficiently computed (see, e.g., [10, Chap. 16] for examples of convex functions with computable subgradients), Q_{f_i,H_n} can also be computed. Here, let us define Q_{H_n} : R^N → R^N as

Q_{H_n} := Σ_{i=1}^I ω_i Q_{f_i,H_n},   (11)

where (ω_i)_{i=1}^I ⊂ (0, +∞) satisfies Σ_{i=1}^I ω_i = 1. Then Q_{H_n} is quasinonexpansive under the H_n-norm [10, Exercise 4.11]. Moreover, we have that

X = ∩_{i=1}^I lev_{≤0} f_i = ∩_{i=1}^I Fix(Q_{f_i}) = ∩_{i=1}^I Fix(Q_{f_i,H_n}) = Fix(Q_{H_n}),   (12)

where the second equality comes from Proposition 2.2(i) (i.e., Fix(Q_{f_i}) = lev_{≤0} f_i (i = 1, 2, ..., I)), the third equality comes from Proposition 2.2(ii) (i.e., Fix(Q_{f_i}) = Fix(Q_{f_i,H_n}) for all n ∈ N), and the fourth equality comes from [10, Proposition 4.34]. Therefore, (10), (11), and (12) imply that we can define a computable mapping Q_{H_n} satisfying (A1) whose fixed point set is equal to the intersection of the level sets. In the case where C is simple in the sense that P_C = P_{C,I} can be easily computed, H_n := I (the identity matrix) and Q_{H_n} := P_C obviously satisfy (A0) and (A1) with Fix(P_C) = C =: X. Accordingly, Problem 3.1 with Q := P_C coincides with problem (3), which implies that Problem 3.1 is a generalization of problem (3).

Fixed point algorithms exist for searching for a fixed point of a nonexpansive mapping [10, Chap. 5], [11], [12-16]. The sequence (x_n)_{n∈N} generated by the Halpern fixed point algorithm [11, Subchapter 6.5], [12, 16] is defined as follows: for all n ∈ N,

x_{n+1} := α_n x_0 + (1 − α_n) Q(x_n),   (13)

where x_0 ∈ R^N, (α_n)_{n∈N} ⊂ (0, 1) satisfies lim_{n→+∞} α_n = 0 and Σ_{n=0}^{+∞} α_n = +∞, and Q : R^N → R^N is nonexpansive with Fix(Q) ≠ ∅. The sequence (x_n)_{n∈N} in (13) converges to the minimizer of the specific convex function f_0(x) := (1/2)‖x − x_0‖² (x ∈ R^N) over Fix(Q) (see, e.g., [11, Theorem 6.19]). From ∇f_0(x) = x − x_0 (x ∈ R^N), the Halpern algorithm (13) can be expressed as follows (see [31, 32] for algorithms optimizing a general convex function):

x_{n+1} := Q(x_n) − α_n ∇f_0(Q(x_n)).   (14)

The following algorithm, obtained by combining the SA method (4) with (14), for solving Problem 3.1 follows naturally from the above discussion: for all n ∈ N,

x_{n+1} := Q_α(x_n − λ_n G(x_n, ξ_n)),   (15)

where Q_α := αId + (1 − α)Q (α ∈ (0, 1)). A convergence analysis of this algorithm for different step-size rules was performed in [17]. For example, algorithm (15) with a diminishing step-size was shown to converge in probability to a solution to Problem 3.1 with X = Fix(Q) [17, Theorem III.2]. The advantage of algorithm (15) is that it allows convex stochastic optimization problems with complicated constraints to be solved (see also (12)). From the fact stated in [17, Problem II.1] that the classifier ensemble problem [18, 19], which is a central issue in machine learning, can be formulated as a convex stochastic optimization problem with complicated constraints, the classifier ensemble problem can be regarded as an example of Problem 3.1. This result implies that algorithm (15) can solve the classifier ensemble problem.
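When the subgradients of the f_i are available, the mapping (11) is directly implementable. The following is a minimal NumPy sketch (ours, not the paper's code) of the subgradient projection (1) and the convex combination (11); Q_subgrad_proj and Q_weighted are hypothetical names, and h holds the diagonal of H_n.

```python
import numpy as np

def Q_subgrad_proj(x, f, subgrad_f, h):
    """Subgradient projection Q_{f,H} of (1) with H = diag(h):
    returns x if f(x) <= 0, else steps along -H^{-1} g for g in the
    subdifferential of f at x."""
    fx = f(x)
    if fx <= 0.0:
        return x
    g = subgrad_f(x)
    d = g / h                        # H^{-1} g
    return x - (fx / (d @ g)) * d    # ||H^{-1}g||_H^2 = g' H^{-1} g = d @ g

def Q_weighted(x, constraints, weights, h):
    """Convex combination (11): Q_{H_n} := sum_i w_i Q_{f_i,H_n};
    its fixed point set is the intersection of the level sets, cf. (12)."""
    return sum(w * Q_subgrad_proj(x, f, sg, h)
               for w, (f, sg) in zip(weights, constraints))
```

Applying Q_weighted once costs only I subgradient evaluations, which is what makes the fixed point formulation computable even when the exact projection onto the intersection (10) is not.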
However, algorithm (15) suffers from slow convergence, as shown in [17]. Specifically, although the learning methods based on algorithm (15) have higher accuracies than the previously proposed learning methods, they have longer elapsed times. Accordingly, we should consider developing stochastic optimization techniques to accelerate algorithm (15). This paper proposes an algorithm (Algorithm 1) based on useful stochastic gradient descent algorithms, such as Adam [8, Algorithm 1] and AMSGrad [9, Algorithm 2], for solving Problem 3.1, as a replacement for the existing stochastic first-order method [17].
5: Find d_n ∈ R^N that solves H_n d = −m_n.
6: y_n := Q_{H_n}(x_n + λ_n d_n).
7: x_{n+1} := P_{C,H_n}(α_n x_n + (1 − α_n) y_n).

Proof Let x ∈ X ⊂ C and n ∈ N be fixed arbitrarily. The definition of x_{n+1} and the firm nonexpansivity of P_{C,H_n} guarantee that, almost surely, the inequality (16) holds. The definition of y_n and (A1) ensure that, almost surely, ‖y_n − x‖_{H_n} ≤ ‖x_n + λ_n d_n − x‖_{H_n}. The definitions of d_n and m_n in turn allow the inner product term to be expanded. Hence, (16) implies that, almost surely, (17) holds. Moreover, the condition x_n = x_n(ξ_{[n−1]}) (n ∈ N) and (C1) guarantee that the expectation of the stochastic (sub)gradient term can be evaluated conditionally on ξ_{[n−1]}, which, together with (C2), implies a bound in expectation. Therefore, taking the expectation in (17) gives the first assertion of Lemma 4.1.
The definition of m_n and (C3), together with the convexity of ‖·‖², guarantee that, for all n ∈ N, E[‖m_n‖²] ≤ β_n E[‖m_{n−1}‖²] + (1 − β_n) E[‖G(x_n, ξ_n)‖²]. Induction thus ensures that, for all n ∈ N,

E[‖m_n‖²] ≤ M̂² := max{‖m_{−1}‖², M²},   (18)

where M is the bound in (C2). From (18) and the bounds on H_n = diag(h_{n,i}) ensured by (A3), we have that, for all n ∈ N, E[‖d_n‖²_{H_n}] ≤ h²M̂². This completes the proof.
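Putting the pieces together, a hedged sketch of one possible implementation of Algorithm 1 is given below, assuming the AMSGrad-type choice of H_n (see (19) below) and steps 5-7 as listed above; algorithm1, Q_H, and proj_C_H are hypothetical names, both Q_H(y, h) and proj_C_H(y, h) act under the norm of H_n = diag(h), and the 1e-8 constant is our numerical safeguard rather than part of the paper's method.

```python
import numpy as np

def algorithm1(x0, Q_H, proj_C_H, G, sample_xi,
               alpha, beta, lam, n_iters, beta2=0.999):
    """Hedged sketch of Algorithm 1 with an AMSGrad-type H_n, cf. (19)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    vhat = np.zeros_like(x)
    for n in range(n_iters):
        g = G(x, sample_xi())
        m = beta(n) * m + (1 - beta(n)) * g       # m_n
        v = beta2 * v + (1 - beta2) * g * g       # v_n, cf. (19)
        vhat = np.maximum(vhat, v)                # vhat_n, cf. (19)
        h = np.sqrt(vhat) + 1e-8                  # H_n = diag(h)
        d = -m / h                                # step 5: solve H_n d = -m_n
        y = Q_H(x + lam(n) * d, h)                # step 6: y_n
        x = proj_C_H(alpha(n) * x + (1 - alpha(n)) * y, h)  # step 7: x_{n+1}
    return x
```

Because H_n is diagonal, step 5 is a componentwise division rather than a linear solve, which keeps the per-iteration cost at O(N).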
The convergence analyses of Algorithm 1 in Sect. 5 depend on the following assumption:

(A4) There exists a simple, bounded, closed convex set C ⊂ R^N such that X ⊂ C.

Let us consider the case where H_n and v_n are defined for all n ∈ N by

v_n := βv_{n−1} + (1 − β) G(x_n, ξ_n) ⊙ G(x_n, ξ_n), v̂_n := (max{v̂_{n−1,i}, v_{n,i}})_{i=1}^N, H_n := diag(√v̂_{n,i}),   (19)

where β ∈ (0, 1) and v_{−1} = v̂_{−1} = 0 ∈ R^N (see also (9)), and discuss the relationship between (A3) and (A4). Assumption (A4) implies that (x_n)_{n∈N} ⊂ C generated by Algorithm 1 is almost surely bounded. In the standard case where the components of G(x_n, ξ_n) ⊙ G(x_n, ξ_n) are almost surely bounded above by some M_1, induction shows that, for all n ∈ N, almost surely, v_n ≤ M_1 < +∞. Accordingly, (19) leads to the almost sure boundedness of (v_n)_{n∈N} and hence of (v̂_n)_{n∈N}. Hence, h := sup{max_{i=1,2,...,N} v̂_{n,i} : n ∈ N} < +∞, which implies that (A3) holds. The above discussion shows that (A4) implies (A3) when H_n and v_n are defined by (19) and also when they are defined as follows (see also (6) and (7)):

v_n := βv_{n−1} + (1 − β) G(x_n, ξ_n) ⊙ G(x_n, ξ_n), H_n := diag(√v_{n,i} + ε).   (20)

We provide some examples of Problem 3.1 with (A0)-(A4) that can be solved by Algorithm 1 under (C1)-(C3).
Example 4.1 (i) Deep learning problem [9, p. 2]: At each time step t, stochastic optimization algorithms used in training deep networks pick a point x_t ∈ X with the parameters of the model to be learned, where X ⊂ R^N is the simple, nonempty, bounded, closed convex feasible set of points, and then incur the loss f_t(x_t), where f_t : R^N → R is a convex loss function representing the loss of the model with the chosen parameters on the next minibatch. Accordingly, the stochastic optimization problem in deep networks can be formulated as follows:

minimize Σ_{t=1}^T f_t(x) subject to x ∈ X,   (21)

where T is the total number of rounds in the learning process. (H_n)_{n∈N} ⊂ S^N_{++} ∩ D^N defined by each of (19) and (20) satisfies (A0), Q_{H_n} := P_{X,H_n} (n ∈ N) satisfies (A1), and f(·) = E[f_ξ(·)] := (1/T) Σ_{t=1}^T f_t(·) satisfies (A2). Setting C := X ensures (A4), which implies (A3). Algorithm 1 for solving problem (21) is as follows:

x_{n+1} := P_{X,H_n}(α_n x_n + (1 − α_n) P_{X,H_n}(x_n + λ_n d_n)).   (22)

(ii) Classifier ensemble problem [18, 19] (see also [17, Problem II.1]): Here, z_n^m denotes the measure corresponding to the mth sample in the sample set and the nth classifier in an ensemble. The classifier ensemble problem with sparsity learning is the following:

minimize f(x) subject to x ∈ X := R^N_+ ∩ {x ∈ R^N : ‖x‖_1 ≤ t_1},   (23)

where ‖·‖_1 denotes the ℓ_1-norm and t_1 is the sparsity control parameter. Suppose that H_n is defined by each of (19) and (20), which satisfies (A0), and define a mapping Q_{H_n} : R^N → R^N by

Q_{H_n} := P_{R^N_+,H_n} P_{{x∈R^N : ‖x‖_1 ≤ t_1},H_n}.   (24)

Since the projections P_{R^N_+,H_n} and P_{{x∈R^N : ‖x‖_1 ≤ t_1},H_n} can be easily computed [34, Lemma 1.1], Q_{H_n} defined by (24) can also be computed. Moreover, Q_{H_n} defined by (24) is nonexpansive with X = ∩_{n∈N} Fix(Q_{H_n}), i.e., (A1) holds. Since {x ∈ R^N : ‖x‖_1 ≤ t_1} is bounded, we can set a simple, bounded set C such that X ⊂ C, i.e., (A4) holds. Moreover, f in problem (23) satisfies (A2).
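Under the Euclidean metric (H_n = Id), both projections appearing in (24) have well-known closed forms; the sketch below assumes H_n = Id and uses the standard sorting-based ℓ_1-ball projection (Duchi et al., 2008), while [34, Lemma 1.1] covers the general diagonal-metric case. proj_l1_ball and Q_ensemble are hypothetical names.

```python
import numpy as np

def proj_l1_ball(x, t):
    """Euclidean projection onto {x : ||x||_1 <= t}, t > 0 (sorting method)."""
    if np.abs(x).sum() <= t:
        return x
    u = np.sort(np.abs(x))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > (css - t))[0][-1]
    theta = (css[rho] - t) / (rho + 1.0)         # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def Q_ensemble(x, t1):
    """Sketch of (24) with H_n = Id: compose P_{R^N_+} with the l1-ball projection."""
    return np.maximum(proj_l1_ball(x, t1), 0.0)
```

Note that Q_ensemble is the composition (24), whose fixed point set is the intersection X; it is not itself the metric projection onto X, which has no simple closed form.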
The classifier ensemble problem with both sparsity and diversity learning is as follows:

minimize f(x) subject to x ∈ X := R^N_+ ∩ {x ∈ R^N : ‖x‖_1 ≤ t_1} ∩ D,   (25)

where t_2 is the diversity control parameter and D denotes the corresponding diversity constraint set, a level set of a convex function. From the discussion regarding (10), (11), and (12), a mapping Q_{H_n} of the form (26), built from subgradient projections, with (H_n)_{n∈N} ⊂ S^N_{++} ∩ D^N defined by each of (19) and (20), is quasinonexpansive under the H_n-norm and satisfies X = ∩_{n∈N} Fix(Q_{H_n}), i.e., (A1) holds. The discussion in the previous paragraph implies that (A0), (A2), and (A4) again hold.
Algorithm 1 for solving each of problems (23) and (25) is represented as follows:

x_{n+1} := P_{C,H_n}(α_n x_n + (1 − α_n) Q_{H_n}(x_n + λ_n d_n)), where H_n d_n = −m_n.   (27)

In contrast to Adam (6) and AMSGrad (9), which can solve a convex stochastic optimization problem with a simple constraint (3) (see also problem (21)), algorithm (27) can be applied to a convex stochastic optimization problem with complicated constraints, such as problems (23) and (25).
(iii) Network utility maximization problem [35, (6), (7)] (see also [36, Problem II.1]): The network resource allocation problem is to determine the source rates that maximize the utility aggregated over all sources subject to the link capacity constraints and source constraints. This problem can be formulated as the following network utility maximization problem:

maximize Σ_{s∈S} u_s(x_s) subject to x ∈ X := ∩_{l∈L} C_l ∩ ∩_{s∈S} C_s,   (28)

where x_s denotes the transmission rate of source s ∈ S, u_s is a concave utility function of source s, S(l) denotes the set of sources that use link l ∈ L, C_l is the capacity constraint set of link l having capacity c_l ∈ R_+ defined by C_l := {x = (x_s)_{s∈S} : Σ_{s∈S(l)} x_s ≤ c_l}, and C_s is the constraint set of source s having the maximum allowed rate M_s defined by C_s := {x = (x_s)_{s∈S} : x_s ∈ [0, M_s]}. Since each C_l is a half-space and each C_s is a hyperslab, the projections P_{C_l,H_n} and P_{C_s,H_n} are easily computed, where (H_n)_{n∈N} ⊂ S^N_{++} ∩ D^N is defined by each of (19) and (20). For example, we can define a nonexpansive mapping Q_{H_n} := Π_{l∈L} P_{C_l,H_n} Π_{s∈S} P_{C_s,H_n} satisfying X = ∩_{n∈N} Fix(Q_{H_n}). The boundedness of ∩_{s∈S} C_s allows us to set a simple, bounded set C satisfying C ⊃ ∩_{s∈S} C_s ⊃ X. Algorithm (27) with G(x_n, ξ_n) ∈ ∂(−u_{ξ_n})(x_n) can be applied to problem (28).
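Since C_l is a half-space, its projection under the H_n-norm has a closed form obtainable from the KKT conditions of the quadratic subproblem min_y ‖y − x‖²_H subject to ⟨a, y⟩ ≤ c. The following is a minimal sketch (ours) for diagonal H = diag(h); proj_halfspace_H and proj_box_H are hypothetical names.

```python
import numpy as np

def proj_halfspace_H(x, a, c, h):
    """H-norm projection onto {y : <a, y> <= c}, H = diag(h):
    x - (<a,x> - c)_+ / (a' H^{-1} a) * H^{-1} a."""
    viol = a @ x - c
    if viol <= 0:
        return x.copy()
    Hinv_a = a / h                              # H^{-1} a
    return x - (viol / (a @ Hinv_a)) * Hinv_a

def proj_box_H(x, lo, hi):
    """For a box and any diagonal H, the H-projection decouples into
    componentwise clipping, so it does not depend on h."""
    return np.clip(x, lo, hi)
```

Composing these projections over all links and sources gives the mapping Q_{H_n} used above, at a cost linear in the number of constraints.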

Convergence analyses of Algorithm 1
For convergence analyses of Algorithm 1, we prove the following theorem.
Theorem 5.1 Suppose that the step-size conditions (29) and (30) hold and that H_n = diag(h_{n,i}) satisfies the bounds in (A3). Then Algorithm 1 satisfies the following estimates for all n ≥ 1.

Proof Let x ∈ X be fixed arbitrarily. Lemma 4.1 guarantees that, for all k ∈ N, the inequality (31) holds. Summing (31) ensures that, for all n ≥ 1, (32) holds, where (29) implies that b > 0 exists such that, for all n ∈ N, β_n ≤ b < 1, and b̄ := 1 − b. Accordingly, for all k ∈ N and all x, the bound (33) holds. Hence, (33) ensures that, for all n ∈ N, (34) holds. From γ_k ≤ γ_{k−1} (k ≥ 1) (see (29)) and (30), the weighted sums in (32) can be bounded for all n ∈ N. Hence, (32), together with E[‖x_1 − x‖²_{H_1}] < +∞ and the existence of a > 0 such that, for all n ∈ N, α_n ≤ a < 1 (by (29)), with ā := 1 − a, implies (35). The Cauchy-Schwarz inequality, together with D := max_{i=1,2,...,N} sup{(x_{n,i} − x_i)² : n ∈ N} < +∞ and E[‖m_n‖] ≤ M̂ := max{‖m_{−1}‖, M} (n ∈ N) (by Lemma 4.1), guarantees that, for all n ∈ N, (36) holds. Since E[‖d_n‖²_{H_n}] ≤ h²M̂² (n ∈ N) holds (by Lemma 4.1), we have that, for all n ∈ N, (37) holds. Therefore, (31), (35), (36), and (37), together with the convexity of f, imply that, for all n ∈ N, (38) holds. Lemma 4.1 ensures that, for all n ∈ N, (39) holds. A discussion similar to the one for obtaining (35) applies here as well. The continuity of f (see (A2)) and (A4) mean that M̃ := sup{E[f(x) − f(x_n)] : n ∈ N} < +∞. Accordingly, an argument similar to the one for obtaining (36) and (37) yields the corresponding bounds for all n ∈ N. From (29), there exists c > 0 such that, for all n ∈ N, c ≤ α_n. Setting c̄ := 1 − c, it follows that, for all n ∈ N, (40) holds. A discussion similar to the one for obtaining (38) then gives the second estimate for all n ∈ N. Suppose that (A1)' holds. Then we have that, for all k ∈ N, almost surely, ‖y_k − Q_{H_k}(x_k)‖_{H_k} can be bounded in the same way. Accordingly, (38) and (39) guarantee the final estimate for all n ∈ N, which completes the proof.

Constant step-size rule
The following theorem indicates that sufficiently small constant step-sizes β n := β and λ n := λ allow a solution to the problem to be approximated.
Theorem 5.2 Suppose that the assumptions of Theorem 5.1 hold. Then Algorithm 1 with α_n := α, β_n := β, and λ_n := λ (n ∈ N) satisfies the estimates (41) and (42), where x̄_n := (1/n) Σ_{k=1}^n x_k and ᾱ := 1 − α. Under (A1)', the corresponding estimate also holds.

Proof We first show that, for all ε > 0, (46) holds. If (46) does not hold, then there exists ε_0 > 0 such that (47) holds. Let x ∈ X and χ_n := E[‖x_n − x‖²_{H_n}] for all n ∈ N. Lemma 4.1, together with the proofs of (36) and (37), implies that, for all n ∈ N, (48) holds. From (34) and (A4), for all n ∈ N, (49) holds. Accordingly, (30) and (40) ensure that there exists n_0 ∈ N such that, for all n ≥ n_0, the error terms in (48) are sufficiently small. Hence, (48) implies that, for all n ≥ n_0, χ_{n+1} is bounded by χ_n minus a positive term. From (47), there exists n_1 ∈ N such that, for all n ≥ n_1, the expected objective gap remains at least ε_0. Therefore, for all n ≥ n_2 := max{n_0, n_1}, the telescoped inequality is violated, which is a contradiction since its right-hand side approaches minus infinity as n increases. Hence, (46) holds for all ε > 0, which implies that (41) holds. A discussion similar to the one for showing (46) leads to (42). We next show that, for all ε > 0, (50) holds. If (50) does not hold for all ε > 0, then there exist ε_0 > 0 and n_3 ∈ N such that, for all n ≥ n_3, the gap remains at least ε_0. Lemma 4.1, together with (48) and (49), ensures that, for all n ≥ n_0, the corresponding estimate holds. Accordingly, for all n ≥ n_4 := max{n_0, n_3}, a contradiction again follows.

Theorem 5.3 Suppose that the assumptions of Theorem 5.1 hold. Then Algorithm 1 satisfies (52). Let (β_n)_{n∈N} and (λ_n)_{n∈N} satisfy the following conditions (51): Σ_{n=0}^{+∞} λ_n = +∞, Σ_{n=0}^{+∞} β_n λ_n < +∞, and lim_{n→+∞} λ_n = lim_{n→+∞} β_n = 0. Then the sequence (x̄_n)_{n∈N} defined by x̄_n := (1/n) Σ_{k=1}^n x_k satisfies (53). Moreover, if (A1)' holds, then the corresponding property also holds.

Proof We first show (52). Lemma 4.1, together with (36), (37), and (48), implies that, for all n ∈ N, (55) holds, where χ_n(x) := E[‖x_n − x‖²_{H_n}] for all x ∈ X and all n ∈ N. Consider (Case 1): for all x ∈ X, there exists m_0 ∈ N such that, for all n ∈ N, n ≥ m_0 implies χ_{n+1}(x) ≤ χ_n(x). This case guarantees the existence of lim_{n→+∞} χ_n(x) for all x ∈ X. From (30) and (40), the remaining terms in (55) vanish. Moreover, (51) ensures that lim_{n→+∞} β_n = lim_{n→+∞} λ_n = 0. Accordingly, (55) and 0 < lim inf_{n→+∞} α_n ≤ lim sup_{n→+∞} α_n < 1 (by (29)) imply (56). Consider (Case 2): there exists x_0 ∈ X such that, for all m ∈ N, there exists n ∈ N with n ≥ m and χ_{n+1}(x_0) > χ_n(x_0). In this case, there exists a subsequence (x_{n_i})_{i∈N} of (x_n)_{n∈N} such that, for all i ∈ N, χ_{n_i+1}(x_0) > χ_{n_i}(x_0). From (55), we have the corresponding inequality for all i ∈ N. A discussion similar to the one for showing (56) guarantees the same conclusion. Therefore, we have (52). If (A1)' holds, then Lemma 4.1 implies that, for all n ∈ N, E[‖y_n − Q_{H_n}(x_n)‖_{H_n}] is controlled, which implies that lim_{n→+∞} E[‖y_n − Q_{H_n}(x_n)‖_{H_n}] = 0. In (Case 1), (56) and the triangle inequality mean that lim_{n→+∞} E[‖x_n − y_n‖_{H_n}] = 0. Accordingly, the triangle inequality yields the assertion. Next, we show (53). Lemma 4.1, together with (36) and (37), ensures that, for all x ∈ X and all k ∈ N, a per-iteration inequality holds, where χ_n := χ_n(x) for all x ∈ X and all n ∈ N. Summing this inequality from k = 0 to k = n gives a bound for all n ∈ N, which, together with (40) and (51), implies that the weighted objective gaps are summable. If (53) does not hold, then there exist ζ > 0 and m_1 ∈ N such that, for all k ≥ m_1, the expected gap is at least ζ; this contradicts lim sup_{n→+∞} α_n < 1, Σ_{n=0}^{+∞} λ_n = +∞, and Σ_{n=0}^{+∞} β_n λ_n < +∞ (by (29) and (51)). Since we have a contradiction, (53) holds. Theorem 5.1, together with (40) and (54), then ensures the corresponding assertion with the convergence rate given in Theorem 5.3.
Theorem 5.3 leads to the following corollary.

Corollary 5.2 Consider problem (21) and suppose that the assumptions in Theorem 5.2 hold. Then algorithm (22) with constant step-sizes α_n := α, β_n := β, and λ_n := λ (n ∈ N) satisfies the estimates corresponding to (41) and (42), where x̄_n := (1/n) Σ_{k=1}^n x_k and (x_n)_{n∈N} ⊂ X is the sequence generated by algorithm (22).
In contrast to Adam and AMSGrad with diminishing step-sizes, Corollary 5.2 indicates that algorithm (22) with constant step-sizes may approximate a solution of problem (21).
Corollary 5.1 implies the following corollary.

Corollary 5.3
Suppose that the assumptions in Corollary 5.1 hold, λ_n := 1/n^η (η ∈ [1/2, 1]), and (β_n)_{n∈N} is such that Σ_{n=1}^{+∞} β_n < +∞. Under η ∈ (1/2, 1], algorithm (22) satisfies lim_{n→+∞} E[f(x̄_n) − f*] = 0, where f* denotes the optimal value of problem (21). Moreover, under η ∈ [1/2, 1), any accumulation point of (x̄_n := (1/n) Σ_{k=1}^n x_k)_{n∈N} almost surely belongs to the solution set of problem (21), and algorithm (22) achieves an O(1/n^{1−η}) rate of convergence.

Proof For problem (21), Corollary 5.1 provides the two convergence rate inequalities. The second inequality guarantees that lim_{n→+∞} E[f(x̄_n) − f*] = 0. Let x̂ ∈ X be an arbitrary accumulation point of (x̄_n)_{n∈N} ⊂ X. Since there exists a subsequence (x̄_{n_i})_{i∈N} of (x̄_n)_{n∈N} that converges almost surely to x̂ ∈ X, the continuity of f ensures that f(x̂) = f* almost surely, i.e., x̂ belongs to the solution set of problem (21). The rate of convergence of (x̄_n)_{n∈N} is obtained from Corollary 5.1.
It is not guaranteed that x_T defined by AMSGrad with λ_t := α/√t optimizes Σ_{t=1}^T f_t over X, since the regret bound (59) depends on the given parameter T. Meanwhile, Corollary 5.3 implies that any accumulation point of (x̄_n)_{n∈N} defined by algorithm (22) with λ_n := 1/√n almost surely belongs to the set of minimizers of Σ_{t=1}^T f_t over X, and (x̄_n)_{n∈N} achieves an O(1/√n) convergence rate.
The experiments used a Mac Pro (Late 2013) with a 3.5 GHz 6-core Intel Xeon E5 CPU, 32 GB 1866 MHz DDR3 memory, and the macOS Catalina version 10.15.1 operating system. The algorithms used in the experiments were written in Python 3.7.5 with the NumPy 1.17.4 package. The experiments used datasets from LIBSVM [37] and the UCI Machine Learning Repository [38], for which information is shown in Table 1. In these experiments, stratified 10-fold cross-validation for the datasets was performed. The p-value computed by the ANOVA test for the accuracies of the proposed methods (based on C1, C2, C3, C4, D3, D4, and D6) and the existing learning method based on SG was greater than 0.05. This implies that the existing and proposed methods based on C1, C2, C3, C4, D3, D4, and D6 had almost the same performances in the sense of accuracy. The p-value computed by the ANOVA test for the elapsed time was about 2.67 × 10^{−29} (< 0.05). Table 5 indicates that there is a significant difference in the sense of the elapsed time between each of the proposed methods and the existing method based on SG. Therefore, the proposed methods ran significantly faster than the existing method based on SG.

Table 4 Multiple comparison of the accuracies of the sparsity learning methods applied to the datasets in Table 1 using the Tukey-Kramer HSD test at the 5% significance level ("meandiffs" indicates the pairwise mean differences between Groups 1 and 2, "p-adj" indicates the adjusted p-value, and "Lower" (resp. "Upper") indicates the lower (resp. upper) bound of the confidence interval for the pairwise mean differences).

Table 5 Multiple comparison of the elapsed times of the sparsity learning methods applied to the datasets in Table 1 using the Tukey-Kramer HSD test at the 5% significance level (column meanings as in Table 4).
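The statistical comparison described above can be reproduced in outline with standard Python tooling; the sketch below (with placeholder data, not the paper's results) uses scipy.stats.f_oneway for the one-way ANOVA test and statsmodels' pairwise_tukeyhsd for the Tukey-Kramer HSD multiple comparison.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
methods = ["SG", "C1", "C2", "D3"]                          # subset of method labels
acc = {m: rng.normal(0.9, 0.02, size=10) for m in methods}  # placeholder 10-fold scores

# One-way ANOVA across methods (p > 0.05 => no significant difference).
print(f_oneway(*acc.values()))

# Tukey-Kramer HSD multiple comparison at the 5% significance level.
scores = np.concatenate(list(acc.values()))
groups = np.repeat(methods, 10)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```

The pairwise_tukeyhsd output reports exactly the "meandiffs", "p-adj", "Lower", and "Upper" columns referenced in Tables 4 and 5.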

Conclusion
In this paper, we proposed a stochastic approximation method based on adaptive learning rate optimization algorithms for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. We also presented convergence analyses of the proposed method with constant and diminishing step-sizes. The analyses confirm that any accumulation point of the sequence generated by the proposed method almost surely belongs to the solution set of the stochastic optimization problem in deep learning. We also compared the proposed algorithm with the existing adaptive learning rate optimization algorithms and showed that the proposed algorithm achieves an O(1/√n) convergence rate, which is not achieved by the existing adaptive learning rate optimization algorithms. Numerical results for the classifier ensemble problems demonstrated that the proposed learning methods achieve high accuracies faster than the existing learning method based on the first-order algorithm. In particular, the proposed methods with constant step-sizes or Armijo line search step-sizes solve the classifier ensemble problems faster than the existing method based on the first-order algorithm.