
Stochastic approximation method using diagonal positive-definite matrices for convex optimization with fixed point constraints

Abstract

This paper proposes a stochastic approximation method for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. The proposed method is based on the existing adaptive learning rate optimization algorithms that use certain diagonal positive-definite matrices for training deep neural networks. This paper includes convergence analyses and convergence rate analyses for the proposed method under specific assumptions. Results show that any accumulation point of the sequence generated by the method with diminishing step-sizes almost surely belongs to the solution set of a stochastic optimization problem in deep learning. Additionally, we apply the learning methods based on the existing and proposed methods to classifier ensemble problems and conduct a numerical performance comparison showing that the proposed learning methods achieve high accuracies faster than the existing learning method.

Introduction

Convex stochastic optimization problems in which the objective function is the expectation of convex functions are considered important due to their occurrence in practical applications, such as machine learning and deep learning.

The classical method for solving these problems is the stochastic approximation (SA) method [1, (5.4.1)], [2, Algorithm 8.1], [3], which is applicable when unbiased estimates of (sub)gradients of an objective function are available. Modified versions of the SA method, such as the mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3] and the accelerated SA method [6, Sect. 3.1], have been reported as useful methods for solving these problems. Meanwhile, some stochastic optimization algorithms have been proposed with the rapid development of deep learning. For example, AdaGrad [7, Figs. 1 and 2] is an algorithm based on the mirror descent SA method, and Adam [8, Algorithm 1], [2, Algorithm 8.7] and AMSGrad [9, Algorithm 2] are well known as powerful tools for solving convex stochastic optimization problems in deep neural networks. These algorithms use the inverses of diagonal positive-definite matrices at each iteration to adapt the learning rates of all model parameters. Hence, these algorithms are called adaptive learning rate optimization algorithms.

The above-mentioned methods commonly assume that metric projection onto a given constraint set is computationally possible. However, although the metric projection onto a simple convex set, such as an affine subspace, half-space, or hyperslab, can be easily computed, the projection onto a complicated set, such as the intersection of simple convex sets, the set of minimizers of a convex function, or the solution set of a monotone variational inequality, cannot be easily computed. Accordingly, it is difficult to apply the above-mentioned methods to stochastic optimization problems with complicated constraints.

In order to solve a stochastic optimization problem over a complicated constraint set, we define a computable quasinonexpansive mapping whose fixed point set coincides with the constraint set, which is possible for the above-mentioned complicated convex sets (see Sect. 3.1 and Example 4.1 for examples of computable quasinonexpansive mappings). Accordingly, the present paper deals with a convex stochastic optimization problem over the fixed point set of a computable quasinonexpansive mapping.

Since useful fixed point algorithms have already been reported [10, Chap. 5], [11, Chaps. 2–9], [12–16], we can find fixed points of quasinonexpansive mappings, which are feasible points of the convex stochastic optimization problem. By combining the SA method with an existing fixed point algorithm, we could obtain algorithms [17, Algorithms 1 and 2] for solving convex stochastic optimization problems that can be applied to classifier ensemble problems [18, 19] (Example 4.1(ii)), which arise in the field of machine learning. However, the existing algorithms converge slowly [17] due to being stochastic first-order methods. In this paper, we propose an algorithm (Algorithm 1) for solving a convex stochastic optimization problem (Problem 3.1) that performs better than the algorithms in [17, Algorithms 1 and 2]. The algorithm proposed herein is based on useful adaptive learning rate optimization algorithms, such as Adam and AMSGrad, that use certain diagonal positive-definite matrices. The first contribution of the present study is an analysis of the convergence of the proposed algorithm (Theorem 5.1). This analysis finds that, if sufficiently small constant step-sizes are used, then the proposed algorithm approximates a solution to the problem (Theorem 5.2). Moreover, for sequences of diminishing step-sizes, the convergence rates of the proposed algorithm can be specified (Theorem 5.3 and Corollary 5.1).

Algorithm 1: Stochastic approximation method for solving Problem 3.1

We compare the proposed algorithm with the existing adaptive learning rate optimization algorithms for a constrained convex stochastic optimization problem in deep learning (Example 4.1(i)). Although the existing adaptive learning rate optimization algorithms achieve low regret, they cannot solve the problem. The second contribution of the present study is to show that, unlike the existing adaptive learning rate optimization algorithms, the proposed algorithm can solve the problem (Corollaries 5.2 and 5.3) (see Sect. 5.2 for details). The third contribution is that we show that the proposed algorithm can solve classifier ensemble problems and that the learning methods based on the proposed algorithm perform better numerically than the existing learning method based on the existing algorithms in [17]. In particular, the numerical results indicate that the learning methods based on the proposed algorithm with constant step-sizes or step-sizes computed by the Armijo line search algorithm can solve classifier ensemble problems faster than the existing learning method based on the algorithms in [17]. As a result, the proposed learning methods achieve high accuracies faster than the existing learning method.

Mathematical preliminaries

Definitions and propositions

Let \(\mathbb{N}\) be the set of all positive integers. Let \(\mathbb{R}^{N}\) be an N-dimensional Euclidean space with the inner product \(\langle \cdot , \cdot \rangle \) with the associated norm \(\|\cdot \|\), and let \(\mathbb{R}_{+}^{N} := \{(x_{i})_{i=1}^{N} \in \mathbb{R}^{N} \colon x_{i} \geq 0 \ (i=1,2,\ldots ,N) \}\). Let \(X^{\top }\) denote the transpose of matrix X, let I denote the identity matrix, and let Id denote the identity mapping on \(\mathbb{R}^{N}\). Let \(\mathbb{S}^{N}\) be the set of \(N \times N\) symmetric matrices, i.e., \(\mathbb{S}^{N} = \{ X \in \mathbb{R}^{N \times N} \colon X=X^{\top }\}\). Let \(\mathbb{S}^{N}_{++}\) denote the set of symmetric positive-definite matrices, i.e., \(\mathbb{S}^{N}_{++} = \{ X \in \mathbb{S}^{N} \colon X \succ O \}\). Given \(H \in \mathbb{S}_{++}^{N}\), the H-inner product of \(\mathbb{R}^{N}\) and the H-norm can be defined for all \(x,y\in \mathbb{R}^{N}\) by \(\langle x,y \rangle _{H} := \langle x, H y \rangle \) and \(\|x \|_{H}^{2} := \langle x, Hx \rangle \). Let \(\mathsf{diag}(x_{i})\) be an \(N \times N\) diagonal matrix with diagonal components \(x_{i} \in \mathbb{R}\) (\(i=1,2,\ldots ,N\)), and let \(\mathbb{D}^{N}\) be the set of \(N \times N\) diagonal matrices, i.e., \(\mathbb{D}^{N} = \{ X \in \mathbb{R}^{N \times N} \colon X = \mathsf{diag}(x_{i}), x_{i} \in \mathbb{R}\ (i=1,2, \ldots ,N) \}\).
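For the diagonal matrices used throughout this paper, the H-inner product and the H-norm reduce to weighted sums. The following is a minimal sketch, assuming \(H = \mathsf{diag}(h_{i})\) is stored as a vector `h` with positive entries; the function names `h_inner` and `h_norm` are illustrative and do not appear in the paper.

```python
import numpy as np

def h_inner(x, y, h):
    """H-inner product <x, y>_H = <x, H y> for H = diag(h), h > 0 elementwise."""
    return float(np.dot(x, h * y))

def h_norm(x, h):
    """H-norm ||x||_H = sqrt(<x, H x>)."""
    return float(np.sqrt(h_inner(x, x, h)))

# Illustrative values (assumptions): H = diag(1, 4), x = (3, 1).
h = np.array([1.0, 4.0])
x = np.array([3.0, 1.0])
sq_norm = h_norm(x, h) ** 2   # 1*3^2 + 4*1^2
```

With `h` equal to the all-ones vector, both functions reduce to the Euclidean inner product and norm.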

Let \(\mathbb{E}[X]\) denote the expectation of random variable X. The history of the process \(\xi _{0},\xi _{1},\ldots \) up to time n is denoted by \(\xi _{[n]} = (\xi _{0},\xi _{1},\ldots ,\xi _{n})\). Let \(\mathbb{E}[X|\xi _{[n]}]\) denote the conditional expectation of X given \(\xi _{[n]} = (\xi _{0},\xi _{1},\ldots ,\xi _{n})\). Unless stated otherwise, all relations between random variables are supposed to hold almost surely.

The subdifferential [10, Definition 16.1], [20, Sect. 23] of a convex function \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) is defined for all \(x\in \mathbb{R}^{N}\) by

$$\begin{aligned} \partial f (x) := \bigl\{ u\in \mathbb{R}^{N} \colon f (y) \geq f (x) + \langle y-x,u \rangle \ \bigl(y\in \mathbb{R}^{N}\bigr) \bigr\} . \end{aligned}$$

A point \(u \in \partial f(x)\) is called a subgradient of f at \(x\in \mathbb{R}^{N}\).

Proposition 2.1

([21, Theorem 4.1.3], [10, Propositions 16.14(ii), (iii)])

Let \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) be convex. Then f is continuous and \(\partial f(x) \neq \emptyset \) for every \(x\in \mathbb{R}^{N}\). Moreover, for every \(x\in \mathbb{R}^{N}\), there exists \(\delta > 0\) such that \(\partial f(B(x;\delta ))\) is bounded, where \(B(x;\delta )\) is the closed ball with center x and radius δ.

When a mapping \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is considered under the H-norm \(\|\cdot \|_{H}\), we denote it as \(Q_{H} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\). We define \(Q := Q_{I}\). A mapping \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be quasinonexpansive [10, Definition 4.1(iii)] if

$$\begin{aligned} \bigl\Vert Q(x) - y \bigr\Vert \leq \Vert x-y \Vert \end{aligned}$$

for all \(x\in \mathbb{R}^{N}\) and all \(y\in \mathrm{Fix}(Q)\), where \(\mathrm{Fix}(Q)\) is the fixed point set of Q defined by \(\mathrm{Fix}(Q) := \{ x \in \mathbb{R}^{N} \colon x = Q(x) \}\). When a quasinonexpansive mapping has at least one fixed point, its fixed point set is closed and convex [22, Proposition 2.6]. Q is called a firmly quasinonexpansive mapping [23, Sect. 3] if \(\| Q(x) - y \|^{2} + \| (\mathrm{Id} - Q)(x) \|^{2} \leq \| x-y \|^{2}\) for all \(x\in \mathbb{R}^{N}\) and all \(y\in \mathrm{Fix}(Q)\). Q is firmly quasinonexpansive if and only if \(R:= 2Q - \mathrm{Id}\) is quasinonexpansive [10, Proposition 4.2]. This means that \((1/2)(\mathrm{Id} + R)\) is firmly quasinonexpansive when R is quasinonexpansive. Given \(H \in \mathbb{S}_{++}^{N}\), we define the subgradient projection relative to a convex function \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) by

$$\begin{aligned} Q_{f,H} (x) := \textstyle\begin{cases} {x - \frac{f(x)}{ \Vert H^{-1} G(x) \Vert _{H}^{2}} H^{-1} G(x)} &\text{if } f(x) > 0, \\ x &\text{otherwise}, \end{cases}\displaystyle \end{aligned}$$
(1)

where \(G(x)\) is any point in \(\partial f(x)\) (\(x\in \mathbb{R}^{N}\)) and \(\mathrm{lev}_{\leq 0} f := \{ x\in \mathbb{R}^{N} \colon f(x) \leq 0 \} \neq \emptyset \). The following proposition holds.

Proposition 2.2

Let \(H \in \mathbb{S}_{++}^{N}\) and let \(f\colon \mathbb{R}^{N} \to \mathbb{R}\) be convex. Then \(Q_{f,H} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) defined by (1) satisfies the following:

(i) \(Q_{f} := Q_{f,I}\) is firmly quasinonexpansive and \(\mathrm{Fix}(Q_{f}) = \mathrm{lev}_{\leq 0} f\);

(ii) \(Q_{f,H}\) is firmly quasinonexpansive under the H-norm with \(\mathrm{Fix}(Q_{f,H}) = \mathrm{Fix}(Q_{f})\).

Proof

(i) This follows from Proposition 2.3 in [22].

(ii) We first show that \(\mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f,H})\). From (1), we have that \(\mathrm{lev}_{\leq 0} f \subset \mathrm{Fix}(Q_{f,H})\). Let \(x \in \mathrm{Fix}(Q_{f,H})\) and assume that \(x \notin \mathrm{lev}_{\leq 0} f\). Then the definition of the H-inner product and \(G(x) \in \partial f(x)\) mean that, for all \(y\in \mathrm{lev}_{\leq 0} f\),

$$\begin{aligned} \bigl\langle y-x, H^{-1} G(x) \bigr\rangle _{H} = \bigl\langle y-x, G(x) \bigr\rangle \leq f(y) - f(x) \leq -f(x) < 0, \end{aligned}$$
(2)

which implies that \(H^{-1} G(x) \neq 0\). From (1) and \(x \in \mathrm{Fix}(Q_{f,H})\), we also have that

$$\frac{f(x)}{\|H^{-1} G(x)\|_{H}^{2}} H^{-1} G(x) = x - Q_{f,H}(x) = 0,$$

which, together with \(f(x) > 0\), gives \(H^{-1} G(x)=0\), which is a contradiction. Hence, we have that \(\mathrm{lev}_{\leq 0} f \supset \mathrm{Fix}(Q_{f,H})\), i.e., \(\mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f,H})\). Accordingly, (i) ensures that \(\mathrm{Fix}(Q_{f,H}) = \mathrm{lev}_{\leq 0} f = \mathrm{Fix}(Q_{f})\). For all \(x\in \mathbb{R}^{N} \backslash \mathrm{lev}_{\leq 0} f\) and all \(y\in \mathrm{lev}_{\leq 0} f\),

$$\begin{aligned} & \bigl\Vert Q_{f,H} (x) - y \bigr\Vert _{H}^{2} \\ &\quad = \Vert x - y \Vert _{H}^{2} + \frac{2 f(x)}{ \Vert H^{-1} G(x) \Vert _{H}^{2}} \bigl\langle y-x, H^{-1} G(x) \bigr\rangle _{H} + \frac{f(x)^{2}}{ \Vert H^{-1} G(x) \Vert _{H}^{2}}, \end{aligned}$$

which, together with (2), implies that \(Q_{f,H}\) is firmly quasinonexpansive under the H-norm. □
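To make definition (1) and Proposition 2.2(ii) concrete, the following is a minimal sketch of the subgradient projection for a diagonal \(H = \mathsf{diag}(h_{i})\). The function name `subgradient_projection` and the test function \(f(x) = \|x\|_{1} - 1\) with subgradient \(\mathrm{sign}(x)\) are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def subgradient_projection(x, f, subgrad, h):
    """Q_{f,H}(x) from (1) with H = diag(h), h > 0 elementwise."""
    fx = f(x)
    if fx <= 0.0:
        return x.copy()                    # x already lies in lev_{<=0} f
    hinv_g = subgrad(x) / h                # H^{-1} G(x)
    sq_h_norm = np.dot(hinv_g, h * hinv_g) # ||H^{-1} G(x)||_H^2
    return x - (fx / sq_h_norm) * hinv_g

# Illustrative convex function: f(x) = ||x||_1 - 1, whose level set is the l1 unit ball.
f = lambda z: np.sum(np.abs(z)) - 1.0
subgrad = lambda z: np.sign(z)             # one element of the subdifferential

h = np.array([1.0, 4.0])
x = np.array([2.0, -1.0])                  # f(x) = 2 > 0
y = np.array([0.5, 0.0])                   # a point of lev_{<=0} f
q = subgradient_projection(x, f, subgrad, h)

# Firm quasinonexpansivity under the H-norm:
# ||Q(x) - y||_H^2 + ||x - Q(x)||_H^2 <= ||x - y||_H^2.
sq = lambda z: float(np.dot(z, h * z))
lhs = sq(q - y) + sq(x - q)
rhs = sq(x - y)
```

For these values the inequality checked numerically is the one established at the end of the proof; the sketch tests a single pair \((x,y)\) and is not a substitute for the argument above.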

\(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be Lipschitz continuous (L-Lipschitz continuous) if there exists \(L > 0\) such that \(\| Q(x) - Q(y) \| \leq L \| x-y \|\) for all \(x,y\in \mathbb{R}^{N}\). \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is said to be nonexpansive [10, Definition 4.1(ii)] if Q is 1-Lipschitz continuous, i.e., \(\| Q(x) - Q(y) \| \leq \| x-y \|\) for all \(x,y\in \mathbb{R}^{N}\). Any nonexpansive mapping satisfies the quasinonexpansivity condition. The metric projection [10, Subchapter 4.2, Chap. 28] onto a nonempty, closed convex set C (\(\subset \mathbb{R}^{N}\)), denoted by \(P_{C}\), is defined for all \(x\in \mathbb{R}^{N}\) by \(P_{C}(x) \in C\) and \(\| x - P_{C}(x) \| = \mathrm{d}(x,C) := \inf_{y\in C} \| x-y \|\). \(P_{C}\) is firmly nonexpansive, i.e., \(\| P_{C}(x) - P_{C}(y) \|^{2} + \| (\mathrm{Id} - P_{C})(x) - ( \mathrm{Id} - P_{C})(y) \|^{2} \leq \| x-y \|^{2}\) for all \(x,y\in \mathbb{R}^{N}\), with \(\mathrm{Fix}(P_{C}) = C\) [10, Proposition 4.8, (4.8)]. The metric projection onto C under the H-norm is denoted by \(P_{C,H}\). When C is an affine subspace, half-space, or hyperslab, the projection onto C can be computed within a finite number of arithmetic operations [10, Chap. 28].
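As an illustration of projections that can be computed in finitely many arithmetic operations, the sketch below gives the standard closed-form Euclidean projections onto a half-space \(\{y \colon \langle a,y \rangle \leq b\}\) and a hyperslab \(\{y \colon l \leq \langle a,y \rangle \leq u\}\). The function names and the sample values are assumptions made for the example.

```python
import numpy as np

def project_halfspace(x, a, b):
    """P_C(x) for C = {y : <a, y> <= b}, with a != 0."""
    viol = np.dot(a, x) - b
    if viol <= 0.0:
        return x.copy()
    return x - (viol / np.dot(a, a)) * a

def project_hyperslab(x, a, lo, hi):
    """P_C(x) for C = {y : lo <= <a, y> <= hi}, with a != 0 and lo <= hi."""
    s = np.dot(a, x)
    if s > hi:
        return x - ((s - hi) / np.dot(a, a)) * a
    if s < lo:
        return x - ((s - lo) / np.dot(a, a)) * a
    return x.copy()

# Illustrative check of firm nonexpansivity for one pair of points.
a, b = np.array([1.0, 0.0]), 1.0
x, y = np.array([3.0, 4.0]), np.array([0.0, 0.0])
px, py = project_halfspace(x, a, b), project_halfspace(y, a, b)
lhs = np.sum((px - py) ** 2) + np.sum(((x - px) - (y - py)) ** 2)
rhs = np.sum((x - y) ** 2)
```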

Convex stochastic optimization problem over fixed point set

This paper considers the following problem.

Problem 3.1

Assume that

(A0) \((\mathsf{H}_{n})_{n\in \mathbb{N}}\) is a sequence in \(\mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\);

(A1) \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm and \(X := \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\) (\(\subset C\)) is nonempty, where \(C \subset \mathbb{R}^{N}\) is a nonempty, closed convex set onto which the projection can be easily computed;

(A2) \(f \colon \mathbb{R}^{N} \to \mathbb{R}\) defined for all \(x \in \mathbb{R}^{N}\) by \(f(x) := \mathbb{E}[F({x},\xi )]\) is well defined and convex, where ξ is a random vector whose probability distribution P is supported on a set \(\Xi \subset \mathbb{R}^{M}\) and \(F \colon \mathbb{R}^{N} \times \Xi \to \mathbb{R}\).

Then

$$\begin{aligned} \text{find } x^{\star }\in X^{\star }:= \Bigl\{ x^{\star }\in X \colon f \bigl(x^{\star }\bigr) = f^{\star }:= \inf_{x \in X} f(x) \Bigr\} , \end{aligned}$$

where one assumes that \(X^{\star }\) is nonempty.

Examples of \(Q_{\mathsf{H}_{n}}\) satisfying (A0) and (A1) are described in Sect. 3.1 and Example 4.1.

The following are sufficient conditions [5, (A1), (A2), (2.5)] for being able to solve Problem 3.1.

(C1) There is an independent and identically distributed sample \(\xi _{0}, \xi _{1}, \ldots \) of realizations of the random vector ξ;

(C2) There is an oracle which, for a given input point \((x, \xi ) \in \mathbb{R}^{N} \times \Xi \), returns a stochastic subgradient \(\mathsf{G}(x,\xi )\) such that \(\mathsf{g}(x) := \mathbb{E}[\mathsf{G}(x,\xi )]\) is well defined and is a subgradient of f at x, i.e., \(\mathsf{g}(x) \in \partial f (x)\);

(C3) There exists a positive number M such that, for all \(x\in C\), \(\mathbb{E}[\|\mathsf{G}(x,\xi )\|^{2}] \leq M^{2}\).

Suppose that \(F(\cdot , \xi )\) (\(\xi \in \Xi \)) is convex and consider the oracle which returns a stochastic subgradient \(\mathsf{G}(x,\xi ) \in \partial _{x} F (x,\xi )\) for given \((x,\xi ) \in \mathbb{R}^{N} \times \Xi \). Then \(f (\cdot ) = \mathbb{E}[F(\cdot ,\xi )]\) is well defined and convex, and \(\partial f (x) = \mathbb{E}[\partial _{x} F(x,\xi )]\) [25, Theorem 7.51], [5, p.1575].

Related problems and their algorithms

Here, let us consider the following convex stochastic optimization problem [5, (1.1)]:

$$\begin{aligned} \text{minimize } f(x) = \mathbb{E}\bigl[F(x,\xi )\bigr] \text{ subject to } x \in C, \end{aligned}$$
(3)

where \(C \subset \mathbb{R}^{N}\) is nonempty, bounded, closed, and convex. The classical method for problem (3) under (C1)–(C3) is the stochastic approximation (SA) method [1, (5.4.1)], [2, Algorithm 8.1], [3] defined as follows: given \(x_{0}\in \mathbb{R}^{N}\) and \((\lambda _{n})_{n\in \mathbb{N}} \subset (0,+\infty )\),

$$\begin{aligned} x_{n+1} = P_{C} \bigl( x_{n} - \lambda _{n} \mathsf{G}(x_{n}, \xi _{n}) \bigr) \quad (n\in \mathbb{N}). \end{aligned}$$
(4)

The SA method requires the metric projection onto C, and hence can be applied only to cases where C is simple in the sense that \(P_{C}\) can be efficiently computed (e.g., C is a closed ball, half-space, or hyperslab [10, Chap. 28]). When C is not simple, the SA method requires solving the following subproblem at each iteration n:

$$\begin{aligned} \text{Find } x_{n+1} \in C \text{ such that } \{ x_{n+1} \} = \operatorname*{argmin}_{y\in C} \bigl\Vert \bigl( x_{n} - \lambda _{n} \mathsf{G}(x_{n}, \xi _{n}) \bigr) - y \bigr\Vert . \end{aligned}$$
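The recursion (4) can be sketched in a few lines when \(P_{C}\) has a closed form. The example below is a toy instance under stated assumptions, none of which come from the paper: \(F(x,\xi ) = (1/2)\|x - \xi \|^{2}\), ξ is Gaussian with mean `mu`, C is the closed unit ball, and \(\lambda _{n} = 1/(n+1)\). The names `sa_method` and `project_unit_ball` are hypothetical.

```python
import numpy as np

def sa_method(x0, stoch_subgrad, project, step, sample, n_iter):
    """Iterate x_{n+1} = P_C(x_n - lambda_n * G(x_n, xi_n)) as in (4)."""
    x = np.asarray(x0, dtype=float)
    for n in range(n_iter):
        xi = sample()                          # realization xi_n, see (C1)
        x = project(x - step(n) * stoch_subgrad(x, xi))
    return x

# Toy instance (assumptions): xi ~ N(mu, 0.25 I), C = closed unit ball.
rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0])

def project_unit_ball(z):
    nrm = np.linalg.norm(z)
    return z if nrm <= 1.0 else z / nrm

x_final = sa_method(
    x0=np.zeros(2),
    stoch_subgrad=lambda x, xi: x - xi,        # gradient of F(., xi) at x
    project=project_unit_ball,
    step=lambda n: 1.0 / (n + 1),              # diminishing step-size
    sample=lambda: mu + 0.5 * rng.standard_normal(2),
    n_iter=5000,
)
```

In this toy instance \(f(x) = (1/2)\|x - \mathtt{mu}\|^{2}\) up to a constant, so the constrained minimizer is the projection of `mu` onto the unit ball, and the iterate is expected to end up close to it.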

The mirror descent SA method [4, Sects. 3 and 4], [5, Sect. 2.3] is useful for solving problem (3) and has been analyzed for the case of step-sizes that are constant or diminishing. For example, the mirror descent SA method [5, (2.32), (2.38), and (2.47)] with a constant step-size policy generates the following sequence \((\tilde{x}_{1}^{n})_{n\in \mathbb{N}}\): given \(x_{0}\in X^{o} := \{x\in \mathbb{R}^{N} \colon \partial \omega (x) \neq \emptyset \}\),

$$\begin{aligned} x_{n+1} = \operatorname*{argmin}_{z\in C} \bigl\{ \bigl\langle \gamma _{n} \mathsf{G}(x_{n},\xi _{n}), z-x_{n} \bigr\rangle + V(x_{n},z) \bigr\} , \qquad \tilde{x}_{1}^{n+1} := \sum _{t=1}^{n+1} \frac{\gamma _{t}}{\sum_{i=1}^{n+1} \gamma _{i}} x_{t}, \end{aligned}$$
(5)

where \(\omega \colon C \to \mathbb{R}\) is differentiable and convex, \(V \colon X^{o} \times C \to \mathbb{R}_{+}\) is defined for all \((x,z) \in X^{o} \times C\) by \(V(x,z) := \omega (z) - [\omega (x) + \langle \nabla \omega (x),z-x \rangle ]\), and \(\gamma _{t}\) (\(t\in \mathbb{N}\)) is a constant step-size. When \(\omega (\cdot ) = (1/2)\|\cdot \|^{2}\), \(x_{n+1}\) in (5) coincides with \(x_{n+1}\) in (4). Under certain assumptions, method (5) satisfies \(\mathbb{E} [ f(\tilde{x}_{1}^{n}) - f^{\star }] = \mathcal{O}(1/\sqrt{n})\) [5, (2.48)] (see [5, (2.57)] for the rate of convergence of the mirror descent SA method with a diminishing step-size policy).
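The coincidence of (5) and (4) for \(\omega (\cdot ) = (1/2)\|\cdot \|^{2}\) can be checked directly. The following short calculation uses only the definitions above, writes \(\mathsf{G}_{n} := \mathsf{G}(x_{n},\xi _{n})\) for brevity (a shorthand introduced here), and identifies \(\lambda _{n}\) with \(\gamma _{n}\). Since \(\nabla \omega (x) = x\),

$$\begin{aligned} V(x,z) &= \frac{1}{2} \Vert z \Vert ^{2} - \frac{1}{2} \Vert x \Vert ^{2} - \langle x, z-x \rangle = \frac{1}{2} \Vert z - x \Vert ^{2}, \\ \langle \gamma _{n} \mathsf{G}_{n}, z - x_{n} \rangle + V(x_{n},z) &= \frac{1}{2} \bigl\Vert z - (x_{n} - \gamma _{n} \mathsf{G}_{n}) \bigr\Vert ^{2} - \frac{\gamma _{n}^{2}}{2} \Vert \mathsf{G}_{n} \Vert ^{2}. \end{aligned}$$

The last term does not depend on z, so the minimizer over \(z \in C\) is the point of C closest to \(x_{n} - \gamma _{n} \mathsf{G}_{n}\), that is, \(x_{n+1} = P_{C}(x_{n} - \gamma _{n} \mathsf{G}_{n})\).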

As the field of deep learning has developed, it has produced some useful stochastic optimization algorithms, such as AdaGrad [7, Figs. 1 and 2], [2, Algorithm 8.4], RMSProp [2, Algorithm 8.5], and Adam [8, Algorithm 1], [2, Algorithm 8.7], for solving problem (3). The AdaGrad algorithm is based on the mirror descent SA method (5) (see also [7, (4)]), and the RMSProp algorithm is a variant of AdaGrad. The Adam algorithm is based on a combination of RMSProp and the momentum method [26, (9)], as follows: given \(x_{t}, m_{t-1}, v_{t-1} \in \mathbb{R}^{N}\),

$$\begin{aligned} \begin{aligned} &m_{t} := \beta _{1} m_{t-1} + (1 -\beta _{1}) \nabla _{x} F (x_{t}, \xi _{t}), \\ &v_{t} := \beta _{2} v_{t-1} + (1 -\beta _{2}) \nabla _{x} F (x_{t}, \xi _{t}) \odot \nabla _{x} F (x_{t},\xi _{t}), \\ &\hat{m}_{t} := \frac{m_{t}}{1 - \beta _{1}^{t+1}}, \qquad \hat{v}_{t} := \frac{v_{t}}{1 - \beta _{2}^{t+1}}, \\ &\mathsf{d}_{t} := - \mathsf{diag} \biggl( \frac{1}{\sqrt{\hat{v}_{t,i}} + \epsilon } \biggr) \hat{m}_{t} = - \biggl(\frac{\hat{m}_{t,i}}{\sqrt{\hat{v}_{t,i}} + \epsilon } \biggr)_{i=1}^{N}, \\ &x_{t+1} := P_{C} [x_{t} + \lambda _{t} \mathsf{d}_{t} ], \quad \text{i.e., } \{ x_{t+1} \} = \operatorname*{argmin}_{y\in C} \bigl\Vert (x_{t} + \lambda _{t} \mathsf{d}_{t}) - y \bigr\Vert , \end{aligned} \end{aligned}$$
(6)

where \(\beta _{i} > 0\) (\(i=1,2\)), \(\epsilon > 0\), \((\lambda _{n})_{n\in \mathbb{N}} \subset (0,1)\) is a diminishing step-size sequence, and \(A \odot B\) denotes the Hadamard product of matrices A and B. If we define matrix \(\mathsf{H}_{t}\) as

$$\begin{aligned} \mathsf{H}_{t} := \mathsf{diag} (\sqrt{ \hat{v}_{t,i}} + \epsilon ), \end{aligned}$$
(7)

then the Adam algorithm (6) can be expressed as

$$\begin{aligned} x_{t+1} = P_{C} \biggl[x_{t} - \lambda _{t} \mathsf{diag} \biggl( \frac{1}{\sqrt{\hat{v}_{t,i}} + \epsilon } \biggr) \hat{m}_{t} \biggr] = P_{C} \bigl[x_{t} - \lambda _{t} \mathsf{H}_{t}^{-1} \hat{m}_{t} \bigr]. \end{aligned}$$
(8)
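One iteration of (6), written in the matrix form (8), can be sketched as follows. The diagonal of \(\mathsf{H}_{t}\) from (7) is stored as a vector `h`. The function name `adam_step`, the default parameter values, and the identity default for `project` (which corresponds to \(C = \mathbb{R}^{N}\)) are assumptions made for this sketch.

```python
import numpy as np

def adam_step(x, m, v, grad, t, lam, beta1=0.9, beta2=0.999, eps=1e-8,
              project=lambda z: z):
    """One step of (6); returns the new point, the moments, and diag(H_t)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad   # Hadamard product
    m_hat = m / (1.0 - beta1 ** (t + 1))          # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** (t + 1))
    h = np.sqrt(v_hat) + eps                      # diagonal of H_t in (7)
    x_new = project(x - lam * m_hat / h)          # x_t - lam * H_t^{-1} m_hat, (8)
    return x_new, m, v, h

# Illustrative first step (t = 0) from zero moments.
x0 = np.zeros(2)
g0 = np.array([2.0, -4.0])
x1, m1, v1, h1 = adam_step(x0, np.zeros(2), np.zeros(2), g0, t=0, lam=0.1)
```

At \(t = 0\) with zero initial moments, the bias correction gives \(\hat{m}_{0}\) equal to the gradient and \(\sqrt{\hat{v}_{0,i}}\) equal to its absolute value, so each coordinate moves by roughly \(\lambda _{0}\) against the sign of the gradient.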

Unfortunately, there exists an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution [9, Theorem 2]. To guarantee convergence and preserve the practical benefits of Adam, AMSGrad [9, Algorithm 2] was proposed as follows: for \((\beta _{1,t})_{t\in \mathbb{N}} \subset (0,+\infty )\),

$$\begin{aligned} \begin{aligned} &m_{t} := \beta _{1,t} m_{t-1} + (1 -\beta _{1,t}) \nabla _{x} F (x_{t}, \xi _{t}), \\ &v_{t} := \beta _{2} v_{t-1} + (1 -\beta _{2}) \nabla _{x} F (x_{t}, \xi _{t}) \odot \nabla _{x} F (x_{t},\xi _{t}), \\ &\hat{v}_{t} := (\hat{v}_{t,i}) = \bigl(\max \{ \hat{v}_{t-1,i}, v_{t,i}\}\bigr), \\ &\mathsf{H}_{t} := \mathsf{diag} (\sqrt{\hat{v}_{t,i}} ), \\ &\mathsf{d}_{t} := - \mathsf{H}_{t}^{-1} {m}_{t}, \\ &x_{t+1} := P_{C,\mathsf{H}_{t}} [x_{t} + \lambda _{t} \mathsf{d}_{t} ], \quad \text{i.e., } \{ x_{t+1} \} = \operatorname*{argmin}_{y\in C} \bigl\Vert (x_{t} + \lambda _{t} \mathsf{d}_{t}) - y \bigr\Vert _{\mathsf{H}_{t}}. \end{aligned} \end{aligned}$$
(9)
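A step of (9) can be sketched in the special case where C is a box \(\prod_{i} [l_{i},u_{i}]\). For a diagonal positive-definite \(\mathsf{H}_{t}\), the squared \(\mathsf{H}_{t}\)-norm is a positively weighted sum of squared coordinate differences, so the projection \(P_{C,\mathsf{H}_{t}}\) onto a box separates by coordinate and reduces to clipping. The function name `amsgrad_step`, the box constraint, and the sample values are assumptions; the sketch also assumes every \(\hat{v}_{t,i}\) is positive so that \(\mathsf{H}_{t}\) is invertible.

```python
import numpy as np

def amsgrad_step(x, m, v, v_hat, grad, lam, beta1, beta2, lower, upper):
    """One step of (9) for a box C = prod_i [lower_i, upper_i]."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    v_hat = np.maximum(v_hat, v)          # componentwise max keeps v_hat nondecreasing
    h = np.sqrt(v_hat)                    # diagonal of H_t (assumed positive)
    z = x - lam * m / h                   # x_t + lam * d_t with d_t = -H_t^{-1} m_t
    x_new = np.clip(z, lower, upper)      # P_{C,H_t} for a box and diagonal H_t
    return x_new, m, v, v_hat

# Illustrative values.
x = np.array([0.5, 0.5])
zeros = np.zeros(2)
x1, m1, v1, vh1 = amsgrad_step(x, zeros, zeros, zeros, np.array([1.0, -2.0]),
                               lam=0.1, beta1=0.9, beta2=0.99,
                               lower=0.0, upper=0.55)
# A second step with a zero gradient shrinks v but leaves v_hat unchanged.
x2, m2, v2, vh2 = amsgrad_step(x1, m1, v1, vh1, zeros,
                               lam=0.1, beta1=0.9, beta2=0.99,
                               lower=0.0, upper=0.55)
```

The componentwise maximum is what distinguishes this update from (6): the diagonal entries of \(\mathsf{H}_{t}\) never decrease, even when the running average \(v_{t}\) does.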

The existing SA methods (4), (5), (6), and (9) (see also [6, 27], [2, Sect. 8.5], and [5, Sect. 2.3]) require minimizing a certain convex function over C at each iteration. Therefore, when C has a complicated form (e.g., C is expressed as the set of all minimizers of a convex function over a closed convex set, the solution set of a monotone variational inequality, or the intersection of closed convex sets), it is difficult to compute the point \(x_{n+1}\) generated by any of (4), (5), (6), and (9) at each iteration.

Meanwhile, the fixed point theory [10, 28–30] enables us to define a computable quasinonexpansive mapping of which the fixed point set is equal to the complicated set. For example, let \(\mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\)) be the level set of a convex function \(f_{i} \colon \mathbb{R}^{N} \to \mathbb{R}\), and let X be the intersection of \(\mathrm{lev}_{\leq 0} f_{i}\), i.e.,

$$\begin{aligned} X := \bigcap_{i=1}^{I} \mathrm{lev}_{\leq 0} f_{i} = \bigcap _{i=1}^{I} \bigl\{ x\in \mathbb{R}^{N} \colon f_{i}(x) \leq 0 \bigr\} \neq \emptyset . \end{aligned}$$
(10)

Let \(n\in \mathbb{N}\) be fixed arbitrarily, and let \(\mathsf{H}_{n} \in \mathbb{S}^{N}_{++}\) (see (A0)). Let \(Q_{f_{i}, \mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) (\(i=1,2,\ldots ,I\)) be the subgradient projection defined by (1) with \(f:= f_{i}\) and \(H := \mathsf{H}_{n}\). Accordingly, Proposition 2.2 implies that \(Q_{f_{i}, \mathsf{H}_{n}}\) is firmly quasinonexpansive under the \(\mathsf{H}_{n}\)-norm and \(\mathrm{Fix}(Q_{f_{i}, \mathsf{H}_{n}}) = \mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\)). Under the condition that the subgradients of \(f_{i}\) can be efficiently computed (see, e.g., [10, Chap. 16] for examples of convex functions with computable subgradients), \(Q_{f_{i}, \mathsf{H}_{n}}\) also can be computed. Here, let us define \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) as

$$\begin{aligned} Q_{\mathsf{H}_{n}} := \sum_{i=1}^{I} \omega _{i} Q_{f_{i}, \mathsf{H}_{n}}, \end{aligned}$$
(11)

where \((\omega _{i})_{i=1}^{I} \subset (0,+\infty )\) satisfies \(\sum_{i=1}^{I} \omega _{i} = 1\). Then \(Q_{\mathsf{H}_{n}}\) is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm [10, Exercise 4.11]. Moreover, we have that

$$\begin{aligned} X &= \bigcap_{i=1}^{I} \mathrm{lev}_{\leq 0} f_{i} = \bigcap _{i=1}^{I} \mathrm{Fix} (Q_{f_{i}} ) = \bigcap _{i=1}^{I} \bigcap _{n \in \mathbb{N}} \mathrm{Fix} (Q_{f_{i},\mathsf{H}_{n}} ) = \bigcap _{n\in \mathbb{N}} \mathrm{Fix} (Q_{\mathsf{H}_{n}}), \end{aligned}$$
(12)

where the second equality comes from Proposition 2.2(i) (i.e., \(\mathrm{Fix}(Q_{f_{i}}) = \mathrm{lev}_{\leq 0} f_{i}\) (\(i=1,2,\ldots ,I\))), the third equality comes from Proposition 2.2(ii) (i.e., \(\mathrm{Fix}(Q_{f_{i}}) = \mathrm{Fix}(Q_{f_{i}, \mathsf{H}_{n}})\) for all \(n\in \mathbb{N}\)), and the fourth equality comes from [10, Proposition 4.34]. Therefore, (10), (11), and (12) imply that we can define a computable mapping \(Q_{\mathsf{H}_{n}}\) satisfying (A1) of which the fixed point set is equal to the intersection of level sets. In the case where C is simple in the sense that \(P_{C} = P_{C,I}\) can be easily computed, \(I \succ O\) and \(Q := P_{C}\) obviously satisfy (A0) and (A1) with \(\mathrm{Fix}(P_{C}) =C=:X\). Accordingly, Problem 3.1 with \(Q := P_{C}\) coincides with problem (3), which implies that Problem 3.1 is a generalization of problem (3).
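The construction (10)–(12) can be sketched for two level sets. In the example below, \(f_{1}(x) = x_{1} - 1\) and \(f_{2}(x) = \|x\|^{2} - 4\), equal weights, and \(\mathsf{H}_{n} = \mathsf{diag}(1,2)\) are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def subgradient_projection(x, f, subgrad, h):
    """Q_{f,H}(x) from (1) with H = diag(h), h > 0 elementwise."""
    fx = f(x)
    if fx <= 0.0:
        return x.copy()
    hinv_g = subgrad(x) / h
    return x - (fx / np.dot(hinv_g, h * hinv_g)) * hinv_g

def averaged_mapping(x, constraints, weights, h):
    """Q_{H_n}(x) = sum_i w_i Q_{f_i,H_n}(x) as in (11)."""
    return sum(w * subgradient_projection(x, f, g, h)
               for w, (f, g) in zip(weights, constraints))

# Illustrative constraints as (f_i, subgradient of f_i) pairs.
constraints = [
    (lambda z: z[0] - 1.0, lambda z: np.array([1.0, 0.0])),
    (lambda z: np.dot(z, z) - 4.0, lambda z: 2.0 * z),
]
weights = [0.5, 0.5]
h = np.array([1.0, 2.0])

x_in = np.array([0.5, 0.5])       # satisfies both constraints, so it is a fixed point
x_out = np.array([3.0, 0.0])      # violates both constraints
q_in = averaged_mapping(x_in, constraints, weights, h)
q_out = averaged_mapping(x_out, constraints, weights, h)

# Quasinonexpansivity under the H_n-norm with respect to a point y of X.
y = np.zeros(2)
h_dist = lambda u, w: float(np.sqrt(np.dot(u - w, h * (u - w))))
```

Evaluating `averaged_mapping` requires only one function value and one subgradient per constraint, which is the sense in which \(Q_{\mathsf{H}_{n}}\) is computable even when the projection onto X is not.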

Fixed point algorithms exist for searching for a fixed point of a nonexpansive mapping [10, Chap. 5], [11, Chaps. 2–9], [12–16]. The sequence \((x_{n})_{n\in \mathbb{N}}\) is generated by the Halpern fixed point algorithm [11, Subchapter 6.5], [12, 16] as follows: for all \(n\in \mathbb{N}\),

$$\begin{aligned} x_{n+1} := \alpha _{n} x_{0} + (1- \alpha _{n}) Q(x_{n}), \end{aligned}$$
(13)

where \(x_{0}\in \mathbb{R}^{N}\), \((\alpha _{n})_{n\in \mathbb{N}} \subset (0,1)\) satisfies \(\lim_{n\to +\infty } \alpha _{n} = 0\) and \(\sum_{n=0}^{+\infty } \alpha _{n} = +\infty \), and \(Q \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is nonexpansive with \(\mathrm{Fix}(Q) \neq \emptyset \). The sequence \((x_{n})_{n\in \mathbb{N}}\) in (13) converges to the minimizer of the specific convex function \(f_{0}(x) := (1/2)\|x - x_{0}\|^{2}\) (\(x\in \mathbb{R}^{N}\)) over \(\mathrm{Fix}(Q)\) (see, e.g., [11, Theorem 6.19]). From \(\nabla f_{0}(x) = x - x_{0}\) (\(x\in \mathbb{R}^{N}\)), the Halpern algorithm (13) can be expressed as follows (see [31, 32] for algorithms optimizing a general convex function):

$$\begin{aligned} x_{n+1} = Q(x_{n}) - \alpha _{n} \bigl(Q(x_{n}) - x_{0} \bigr) = Q(x_{n}) - \alpha _{n} \nabla f_{0} \bigl(Q(x_{n}) \bigr). \end{aligned}$$
(14)
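The Halpern iteration (13) can be sketched as follows. The choice \(\alpha _{n} = 1/(n+2)\), which lies in \((0,1)\), tends to zero, and is not summable, and the choice of Q as the projection onto the half-space \(\{x \colon x_{1} \leq 1\}\) are assumptions made for this example; the name `halpern` is hypothetical.

```python
import numpy as np

def halpern(x0, Q, n_iter):
    """Iterate x_{n+1} = alpha_n x_0 + (1 - alpha_n) Q(x_n) as in (13)."""
    x = x0.copy()
    for n in range(n_iter):
        alpha = 1.0 / (n + 2)
        x = alpha * x0 + (1.0 - alpha) * Q(x)
    return x

def Q(z):
    """Projection onto {x : x_1 <= 1}, a nonexpansive mapping."""
    out = z.copy()
    out[0] = min(out[0], 1.0)
    return out

x0 = np.array([3.0, 2.0])
x_final = halpern(x0, Q, n_iter=2000)

# Equivalent form (14) for one step: Q(x) - alpha * (Q(x) - x0).
alpha = 0.25
step_13 = alpha * x0 + (1.0 - alpha) * Q(x0)
step_14 = Q(x0) - alpha * (Q(x0) - x0)
```

Here the minimizer of \(f_{0}\) over \(\mathrm{Fix}(Q)\) is the projection of \(x_{0}\) onto the half-space, namely \((1,2)\), and the iterates approach it at the rate of \(\alpha _{n}\).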

The following algorithm obtained by combining the SA method (4) with (14) for solving Problem 3.1 follows naturally from the above discussion: for all \(n\in \mathbb{N}\),

$$\begin{aligned} {x}_{n+1} := P_{C} \bigl[Q_{\alpha }({x}_{n}) - \lambda _{n} \mathsf{G}\bigl(Q_{\alpha }(x_{n}), \xi _{n}\bigr) \bigr], \end{aligned}$$
(15)

where \(Q_{\alpha }:= \alpha \mathrm{Id} + (1-\alpha )Q\) (\(\alpha \in (0,1)\)). A convergence analysis of this algorithm for different step-size rules was performed in [17]. For example, algorithm (15) with a diminishing step-size was shown to converge in probability to a solution to Problem 3.1 with \(X = \mathrm{Fix}(Q)\) [17, Theorem III.2]. The advantage of algorithm (15) is that it allows convex stochastic optimization problems with complicated constraints to be solved (see also (12)). From the fact stated in [17, Problem II.1] that the classifier ensemble problem [18, 19], which is a central issue in machine learning, can be formulated as a convex stochastic optimization problem with complicated constraints, the classifier ensemble problem can be regarded as an example of Problem 3.1. This result implies that algorithm (15) can solve the classifier ensemble problem. However, this algorithm suffers from slow convergence, as shown in [17]. Specifically, although the learning methods based on algorithm (15) have higher accuracies than the previously proposed learning methods, they have longer elapsed times. Accordingly, we should consider developing stochastic optimization techniques to accelerate algorithm (15). This paper proposes an algorithm (Algorithm 1) based on useful stochastic gradient descent algorithms, such as Adam [8, Algorithm 1] and AMSGrad [9, Algorithm 2], for solving Problem 3.1, as a replacement for the existing stochastic first-order method [17].
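For reference, the existing recursion (15) can be sketched on a toy problem. Everything specific below is an assumption for illustration and is not taken from [17]: Q is the projection onto the half-space \(\{x \colon x_{1} + x_{2} \leq 1\}\), C is the box \([-5,5]^{2}\), \(F(x,\xi ) = (1/2)\|x - \xi \|^{2}\) with Gaussian ξ of mean `mu`, \(\alpha = 1/2\), and \(\lambda _{n} = 1/(n+1)\).

```python
import numpy as np

def hybrid_sa(x0, Q, project_C, stoch_subgrad, sample, step, alpha, n_iter):
    """Iterate x_{n+1} = P_C[Q_alpha(x_n) - lambda_n G(Q_alpha(x_n), xi_n)] as in (15)."""
    x = np.asarray(x0, dtype=float)
    for n in range(n_iter):
        q = alpha * x + (1.0 - alpha) * Q(x)      # Q_alpha(x_n)
        x = project_C(q - step(n) * stoch_subgrad(q, sample()))
    return x

# Toy instance (assumptions).
rng = np.random.default_rng(0)
mu = np.array([2.0, 2.0])
a, b = np.array([1.0, 1.0]), 1.0

def Q(z):
    """Projection onto the half-space {x : <a, x> <= b}."""
    viol = np.dot(a, z) - b
    return z.copy() if viol <= 0.0 else z - (viol / np.dot(a, a)) * a

x_final = hybrid_sa(
    x0=np.zeros(2),
    Q=Q,
    project_C=lambda z: np.clip(z, -5.0, 5.0),
    stoch_subgrad=lambda x, xi: x - xi,           # gradient of F(., xi) at x
    sample=lambda: mu + 0.5 * rng.standard_normal(2),
    step=lambda n: 1.0 / (n + 1),
    alpha=0.5,
    n_iter=5000,
)
```

In this instance the minimizer of f over \(\mathrm{Fix}(Q)\) is the projection of `mu` onto the half-space, \((0.5, 0.5)\). The iteration never projects onto \(\mathrm{Fix}(Q)\) as a constraint; feasibility is approached only through repeated application of \(Q_{\alpha }\) as the step-size diminishes.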

Proposed algorithm

Before giving some examples, we first prove the following lemma listing the basic properties of Algorithm 1.

Lemma 4.1

Suppose that \(\mathsf{H}_{n} \in \mathbb{S}_{++}^{N}\) (\(n\in \mathbb{N}\)), (A1), (A2), (C1), and (C2) hold and consider the sequence \((x_{n})_{n\in \mathbb{N}}\) defined for all \(n\in \mathbb{N}\) by Algorithm 1. Then, for all \(x\in X\) and all \(n\in \mathbb{N}\),

$$\begin{aligned} &\mathbb{E} \bigl[ \Vert x_{n+1} - x \Vert _{\mathsf{H}_{n}}^{2} \bigr] \\ &\quad \leq \mathbb{E} \bigl[ \Vert x_{n} - x \Vert _{\mathsf{H}_{n}}^{2} \bigr] + 2 (1-\alpha _{n}) \lambda _{n} \bigl\{ (1 - \beta _{n}) \mathbb{E} \bigl[ f(x) - f(x_{n}) \bigr] \\ &\qquad {}+ \beta _{n} \mathbb{E} \bigl[ \langle x - x_{n},m_{n-1} \rangle \bigr] \bigr\} + (1-\alpha _{n}) \lambda _{n}^{2} \mathbb{E} \bigl[ \Vert \mathsf{d}_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] \\ &\qquad{} - \alpha _{n} \mathbb{E} \bigl[ \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] - (1 - \alpha _{n}) \mathbb{E} \bigl[ \Vert x_{n+1} - y_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr]. \end{aligned}$$

Moreover, under (C3), \(\mathbb{E}[\|m_{n}\|^{2}] \leq \tilde{M}^{2} := \max \{ \|m_{-1}\|^{2},M^{2} \}\) holds for all \(n\in \mathbb{N}\). If

(A3) \(h_{\star }:= \sup \{\max_{i=1,2,\ldots ,N} h_{n,i}^{-1/2} \colon n \in \mathbb{N} \}\) is finite, where \(\mathsf{H}_{n} := \mathsf{diag}(h_{n,i})\),

then \(\mathbb{E}[\|\mathsf{d}_{n}\|_{\mathsf{H}_{n}}^{2}] \leq h_{\star }^{2} \tilde{M}^{2}\) holds for all \(n\in \mathbb{N}\).
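The quantities \(m_{n}\), \(\mathsf{d}_{n}\), \(y_{n}\), and \(x_{n+1}\) appearing in Lemma 4.1 can be illustrated by a sketch of one iteration. The update rules below are inferred from the way these quantities are used in the proof that follows, namely \(m_{n} = \beta _{n} m_{n-1} + (1-\beta _{n}) \mathsf{G}(x_{n},\xi _{n})\), \(\mathsf{d}_{n} = -\mathsf{H}_{n}^{-1} m_{n}\), \(y_{n} = Q_{\mathsf{H}_{n}}(x_{n} + \lambda _{n} \mathsf{d}_{n})\), and \(x_{n+1} = P_{C,\mathsf{H}_{n}}[\alpha _{n} x_{n} + (1-\alpha _{n}) y_{n}]\). This is an assumption and should not be read as the authors' listing of Algorithm 1; in particular, how \(\mathsf{H}_{n}\) is generated is left to the caller. The function name and the toy values are hypothetical.

```python
import numpy as np

def algorithm1_step(x, m_prev, grad, h, Q, project_C_h, alpha, beta, lam):
    """One iteration of the recursion inferred from the proof of Lemma 4.1.

    h is the diagonal of H_n; Q(z, h) and project_C_h(z, h) stand for
    Q_{H_n} and P_{C,H_n} and are supplied by the caller.
    """
    m = beta * m_prev + (1.0 - beta) * grad        # m_n
    d = -m / h                                     # d_n = -H_n^{-1} m_n
    y = Q(x + lam * d, h)                          # y_n
    x_next = project_C_h(alpha * x + (1.0 - alpha) * y, h)
    return x_next, m, d, y

# Toy values (assumptions): Q and the projection are both the identity.
identity = lambda z, h: z
x = np.array([1.0, 1.0])
h = np.array([2.0, 4.0])
x_next, m, d, y = algorithm1_step(x, np.zeros(2), np.array([2.0, 4.0]), h,
                                  identity, identity,
                                  alpha=0.5, beta=0.5, lam=0.1)

# Identity used in the proof: <x_n - x, d_n>_{H_n} = <x - x_n, m_n> for any x.
x_ref = np.zeros(2)
lhs = float(np.dot(x - x_ref, h * d))
rhs = float(np.dot(x_ref - x, m))
```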

Proof

Let \(x\in X \subset C\) and \(n\in \mathbb{N}\) be fixed arbitrarily. The definition of \(x_{n+1}\) and the firm nonexpansivity of \(P_{C,\mathsf{H}_{n}}\) guarantee that, almost surely,

$$\begin{aligned} & \Vert x_{n+1} - x \Vert _{\mathsf{H}_{n}}^{2} \\ &\quad \leq \bigl\Vert \bigl[\alpha _{n} x_{n} + (1-\alpha _{n}) y_{n} \bigr] - x \bigr\Vert _{\mathsf{H}_{n}}^{2} - \bigl\Vert x_{n+1} - \bigl[\alpha _{n} x_{n} + (1-\alpha _{n}) y_{n} \bigr] \bigr\Vert _{ \mathsf{H}_{n}}^{2}, \end{aligned}$$

which, together with \(\| \alpha x + (1-\alpha ) y \|^{2} = \alpha \|x\|^{2} + (1-\alpha ) \|y\|^{2} - \alpha (1-\alpha )\|x-y\|^{2}\) (\(x,y\in \mathbb{R}^{N}\), \(\alpha \in \mathbb{R}\)), implies that

$$\begin{aligned} \begin{aligned} \Vert x_{n+1} - x \Vert _{\mathsf{H}_{n}}^{2} &\leq \alpha _{n} \Vert x_{n} - x \Vert _{\mathsf{H}_{n}}^{2} + (1 - \alpha _{n}) \Vert y_{n} - x \Vert _{\mathsf{H}_{n}}^{2} - \alpha _{n} \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} \\ &\quad{} - (1 -\alpha _{n}) \Vert x_{n+1} - y_{n} \Vert _{ \mathsf{H}_{n}}^{2}. \end{aligned} \end{aligned}$$
(16)

The definition of \(y_{n}\) and (A1) ensure that, almost surely,

$$\begin{aligned} \Vert y_{n} - x \Vert _{\mathsf{H}_{n}}^{2} &\leq \bigl\Vert (x_{n} -x) + \lambda _{n} \mathsf{d}_{n} \bigr\Vert _{\mathsf{H}_{n}}^{2} \\ &= \Vert x_{n} -x \Vert _{\mathsf{H}_{n}}^{2} + 2 \lambda _{n} \langle x_{n} - x, \mathsf{d}_{n} \rangle _{\mathsf{H}_{n}} + \lambda _{n}^{2} \Vert \mathsf{d}_{n} \Vert _{\mathsf{H}_{n}}^{2}. \end{aligned}$$

The definitions of \(\mathsf{d}_{n}\) and \(m_{n}\) in turn ensure that

$$\begin{aligned} \langle x_{n} - x, \mathsf{d}_{n} \rangle _{\mathsf{H}_{n}} &= \langle x - x_{n}, m_{n} \rangle \\ &= \beta _{n} \langle x - x_{n}, m_{n-1} \rangle + (1- \beta _{n}) \bigl\langle x - x_{n}, \mathsf{G}(x_{n}, \xi _{n}) \bigr\rangle . \end{aligned}$$

Hence, (16) implies that, almost surely,

$$\begin{aligned} \Vert x_{n+1} - x \Vert _{\mathsf{H}_{n}}^{2} &\leq \alpha _{n} \Vert x_{n} - x \Vert _{\mathsf{H}_{n}}^{2} + (1 - \alpha _{n}) \bigl\{ \Vert x_{n} -x \Vert _{\mathsf{H}_{n}}^{2} + 2 \lambda _{n} \langle x_{n} - x, \mathsf{d}_{n} \rangle _{\mathsf{H}_{n}} \\ &\quad{} + \lambda _{n}^{2} \Vert \mathsf{d}_{n} \Vert _{ \mathsf{H}_{n}}^{2} \bigr\} - \alpha _{n} \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} - (1 - \alpha _{n}) \Vert x_{n+1} - y_{n} \Vert _{\mathsf{H}_{n}}^{2} \\ &= \Vert x_{n} -x \Vert _{\mathsf{H}_{n}}^{2} + 2 (1 - \alpha _{n}) \lambda _{n} \bigl\{ \beta _{n} \langle x - x_{n}, m_{n-1} \rangle \\ &\quad{} + (1-\beta _{n}) \bigl\langle x - x_{n}, \mathsf{G}(x_{n}, \xi _{n}) \bigr\rangle \bigr\} + (1-\alpha _{n}) \lambda _{n}^{2} \Vert \mathsf{d}_{n} \Vert _{\mathsf{H}_{n}}^{2} \\ &\quad{} - \alpha _{n} \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} - (1 -\alpha _{n}) \Vert x_{n+1} - y_{n} \Vert _{\mathsf{H}_{n}}^{2}. \end{aligned}$$
(17)

Moreover, the condition \(x_{n} = x_{n}(\xi _{[n-1]})\) (\(n\in \mathbb{N}\)) and (C1) guarantee that

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\langle x - x_{n}, \mathsf{G}(x_{n}, \xi _{n}) \bigr\rangle \bigr] &= \mathbb{E} \bigl[ \mathbb{E} \bigl[ \bigl\langle x - x_{n}, \mathsf{G}(x_{n},\xi _{n}) \bigr\rangle | \xi _{[n-1]} \bigr] \bigr] \\ &= \mathbb{E} \bigl[ \bigl\langle x - x_{n}, \mathbb{E} \bigl[ \mathsf{G}(x_{n},\xi _{n}) | \xi _{[n-1]} \bigr] \bigr\rangle \bigr] \\ &= \mathbb{E} \bigl[ \bigl\langle x - x_{n}, \mathsf{g} (x_{n}) \bigr\rangle \bigr], \end{aligned}$$

which, together with (C2), implies that

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\langle x - x_{n}, \mathsf{G}(x_{n}, \xi _{n}) \bigr\rangle \bigr] \leq \mathbb{E} \bigl[ f(x) - f(x_{n}) \bigr]. \end{aligned}$$

Therefore, taking the expectation of (17) gives the first assertion of Lemma 4.1.

The definition of \(m_{n}\) and (C3), together with the convexity of \(\|\cdot \|^{2}\), guarantee that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ \Vert m_{n} \Vert ^{2} \bigr] &\leq \beta _{n} \mathbb{E} \bigl[ \Vert m_{n-1} \Vert ^{2} \bigr] + (1-\beta _{n}) \mathbb{E} \bigl[ \bigl\Vert \mathsf{G}(x_{n}, \xi _{n}) \bigr\Vert ^{2} \bigr] \\ &\leq \beta _{n} \mathbb{E} \bigl[ \Vert m_{n-1} \Vert ^{2} \bigr] + (1-\beta _{n}) M^{2}. \end{aligned}$$

Induction thus ensures that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ \Vert m_{n} \Vert ^{2} \bigr] \leq \tilde{M}^{2} := \max \bigl\{ \Vert m_{-1} \Vert ^{2},M^{2} \bigr\} < + \infty . \end{aligned}$$
(18)
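
The induction step behind (18) can be spelled out as follows. This only restates the recursion above, using the induction hypothesis \(\mathbb{E}[\|m_{n-1}\|^{2}] \leq \tilde{M}^{2}\) together with \(M^{2} \leq \tilde{M}^{2}\) and \(\beta _{n} \in (0,1)\); the base case \(\|m_{-1}\|^{2} \leq \tilde{M}^{2}\) holds by the definition of \(\tilde{M}\):

$$\begin{aligned} \mathbb{E} \bigl[ \Vert m_{n} \Vert ^{2} \bigr] \leq \beta _{n} \tilde{M}^{2} + (1-\beta _{n}) M^{2} \leq \beta _{n} \tilde{M}^{2} + (1-\beta _{n}) \tilde{M}^{2} = \tilde{M}^{2}. \end{aligned}$$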

Given \(n\in \mathbb{N}\), \(\mathsf{H}_{n} \succ O\) ensures that there exists a unique matrix \(\overline{\mathsf{H}}_{n} \succ O\) such that \(\mathsf{H}_{n} = \overline{\mathsf{H}}_{n}^{2}\) [33, Theorem 7.2.6]. Since \(\|x\|_{\mathsf{H}_{n}}^{2} = \| \overline{\mathsf{H}}_{n} x \|^{2}\) holds for all \(x\in \mathbb{R}^{N}\), the definition of \(\mathsf{d}_{n}\) implies that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ \Vert \mathsf{d}_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] = \mathbb{E} \bigl[ \bigl\Vert \overline{\mathsf{H}}_{n}^{-1} \mathsf{H}_{n}\mathsf{d}_{n} \bigr\Vert ^{2} \bigr] \leq \mathbb{E} \bigl[ \bigl\Vert \overline{\mathsf{H}}_{n}^{-1} \bigr\Vert ^{2} \Vert m_{n} \Vert ^{2} \bigr], \end{aligned}$$

where \(\| \overline{\mathsf{H}}_{n}^{-1} \| = \| \mathsf{diag}(h_{n,i}^{-1/2}) \| = \max_{i=1,2,\ldots ,N} h_{n,i}^{-1/2}\) (\(n\in \mathbb{N}\)). From (18) and

$$h_{\star }:= \sup \Bigl\{ \max_{i=1,2,\ldots ,N} h_{n,i}^{-1/2} \colon n \in \mathbb{N} \Bigr\} < + \infty $$

(by (A3)), we have that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ \Vert \mathsf{d}_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] \leq h_{\star }^{2} \tilde{M}^{2}. \end{aligned}$$

This completes the proof. □
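
The deterministic inequality behind the last step of the proof, \(\|\mathsf{d}_{n}\|_{\mathsf{H}_{n}}^{2} \leq \|\overline{\mathsf{H}}_{n}^{-1}\|^{2} \|m_{n}\|^{2}\) with \(\mathsf{d}_{n} = -\mathsf{H}_{n}^{-1} m_{n}\), can be checked numerically. The following is a minimal sketch; the vectors `h` and `m` are hypothetical random values, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(0.1, 5.0, size=6)   # diagonal entries h_{n,i} of H_n (hypothetical)
m = rng.normal(size=6)              # momentum vector m_n (hypothetical)

d = -m / h                          # d_n = -H_n^{-1} m_n for diagonal H_n
lhs = np.sum(h * d * d)             # ||d_n||_{H_n}^2 = sum_i h_i d_i^2 = sum_i m_i^2 / h_i
op_norm = np.max(h ** -0.5)         # ||H-bar_n^{-1}|| = max_i h_{n,i}^{-1/2}
rhs = op_norm ** 2 * np.sum(m * m)  # ||H-bar_n^{-1}||^2 ||m_n||^2

print(lhs <= rhs)
```

Taking expectations and bounding `op_norm` by \(h_{\star }\) gives the second assertion of the lemma.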

The convergence analyses of Algorithm 1 in Sect. 5 depend on the following assumption:

  1. (A4)

    [5, p.1574], [9, p.2] \(C\) (\(\supset X\)) is bounded.

Let us consider the case where \(\mathsf{H}_{n}\) and \(v_{n}\) are defined for all \(n\in \mathbb{N}\) by

$$\begin{aligned} \begin{aligned} &v_{n} := \beta v_{n-1} + (1 - \beta ) \mathsf{G}(x_{n},\xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}), \\ &\hat{v}_{n} := (\hat{v}_{n,i} ) = \bigl(\max \{ \hat{v}_{n-1,i}, v_{n,i} \} \bigr), \\ &\mathsf{H}_{n} := \mathsf{diag} ( \sqrt{\hat{v}_{n,i}} ), \end{aligned} \end{aligned}$$
(19)

where \(\beta \in (0,1)\) and \(v_{-1} = \hat{v}_{-1} = 0 \in \mathbb{R}^{N}\) (see also (9)), and discuss the relationship between (A3) and (A4). Assumption (A4) implies that \((x_{n})_{n\in \mathbb{N}} \subset C\) generated by Algorithm 1 is almost surely bounded. In the standard case of \(\mathsf{G}(x_{n},\xi _{n}) \in \partial _{x} F(x_{n}, \xi _{n})\), Proposition 2.1 and (A4) imply that \((\mathsf{G}(x_{n},\xi _{n}))_{n\in \mathbb{N}}\) is almost surely bounded, i.e., \({M}_{1} := \sup_{n\in \mathbb{N}} \| \mathsf{G}(x_{n},\xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}) \| < + \infty \). Since the triangle inequality and (19) guarantee that, almost surely, \(\|v_{n} \| \leq \beta \|v_{n-1} \| + (1 - \beta ) \|\mathsf{G}(x_{n}, \xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}) \|\), induction shows that, for all \(n\in \mathbb{N}\), almost surely, \(\|v_{n}\| \leq {M}_{1} < + \infty \). Accordingly, (19) leads to the almost sure boundedness of \((\hat{v}_{n})_{n\in \mathbb{N}}\). Hence, \(h_{\star }:= \sup \{ \max_{i=1,2,\ldots ,N} \sqrt{\hat{v}_{n,i}} \colon n\in \mathbb{N} \}< + \infty \), which implies that (A3) holds. The above discussion shows that (A4) implies (A3) when \(\mathsf{H}_{n}\) and \(v_{n}\) are as follows (see also (6) and (7)):

$$\begin{aligned} \begin{aligned} &v_{n} := \beta v_{n-1} + (1 - \beta ) \mathsf{G}(x_{n},\xi _{n}) \odot \mathsf{G}(x_{n},\xi _{n}), \\ &\hat{v}_{n} := (\hat{v}_{n,i}) = \biggl(\max \biggl\{ \frac{v_{n,i}}{1 - \beta ^{n+1}}, \hat{v}_{n-1,i} \biggr\} \biggr), \\ &\mathsf{H}_{n} := \mathsf{diag} ( \sqrt{\hat{v}_{n,i}} ). \end{aligned} \end{aligned}$$
(20)
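
The two constructions (19) and (20) differ only in whether \(v_{n}\) is rescaled by \(1 - \beta ^{n+1}\) before the elementwise maximum is taken. A minimal NumPy sketch follows. The function name `diag_h` and the flag `bias_correction` are illustrative assumptions, and the stochastic gradients are passed in as a plain list of arrays. Note that \(\mathsf{H}_{n} \succ O\) requires every entry of \(\hat{v}_{n}\) to be positive, which the sketch does not enforce.

```python
import numpy as np

def diag_h(grads, beta=0.9, bias_correction=False):
    """Yield the diagonal (h_{n,i}) of H_n for n = 0, 1, ... following (19),
    or (20) when bias_correction is True. `grads` holds G(x_n, xi_n)."""
    v = np.zeros_like(grads[0], dtype=float)       # v_{-1} = 0
    v_hat = np.zeros_like(v)                       # v-hat_{-1} = 0
    for n, g in enumerate(grads):
        v = beta * v + (1.0 - beta) * g * g        # g * g is the Hadamard product
        cand = v / (1.0 - beta ** (n + 1)) if bias_correction else v
        v_hat = np.maximum(v_hat, cand)            # elementwise max: v_hat never decreases
        yield np.sqrt(v_hat)                       # h_{n,i} = sqrt(v-hat_{n,i})
```

Because of the elementwise maximum, the yielded diagonals are nondecreasing in \(n\) for each coordinate under both rules.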

We provide some examples of Problem 3.1 with (A0)–(A4) that can be solved by Algorithm 1 under (C1)–(C3).

Example 4.1

(i) Deep learning problem [9, p.2]: At each time step t, stochastic optimization algorithms used in training deep networks pick a point \(x_{t} \in X\) with the parameters of the model to be learned, where \(X \subset \mathbb{R}^{N}\) is the simple, nonempty, bounded, closed convex feasible set of points, and then incur loss \(f_{t} (x_{t})\), where \(f_{t} \colon \mathbb{R}^{N} \to \mathbb{R}\) is a convex loss function represented as the loss of the model with the chosen parameters in the next minibatch. Accordingly, the stochastic optimization problem in deep networks can be formulated as follows:

$$\begin{aligned} \text{minimize } \sum_{t=1}^{T} f_{t}(x) \text{ subject to } x\in X = \mathrm{Fix}(P_{X})= \bigcap_{n\in \mathbb{N}} \mathrm{Fix} (P_{X, \mathsf{H}_{n}} ), \end{aligned}$$
(21)

where T is the total number of rounds in the learning process, and \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) defined by each of (19) and (20) satisfies (A0). \(Q_{\mathsf{H}_{n}} := P_{X,\mathsf{H}_{n}}\) (\(n\in \mathbb{N}\)) satisfies (A1), and \(f(\cdot ) = \mathbb{E}[f_{\xi } (\cdot )]:= (1/T) \sum_{t=1}^{T} f_{t}( \cdot )\) satisfies (A2). Setting \(C := X\) ensures (A4), which implies (A3). Algorithm 1 for solving problem (21) is as follows:

$$\begin{aligned} x_{n+1} := \alpha _{n} x_{n} + (1- \alpha _{n}) P_{X,\mathsf{H}_{n}} \bigl(x_{n} - \lambda _{n} \mathsf{H}_{n}^{-1} m_{n} \bigr). \end{aligned}$$
(22)
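
Iteration (22) can be sketched in a few lines when \(X\) is a box, since for a diagonal \(\mathsf{H}_{n}\) the projection \(P_{X,\mathsf{H}_{n}}\) onto a box separates across coordinates and reduces to clipping. The sketch below rests on several assumptions that are not in the paper: the least-squares loss \(f_{t}(x) = \frac{1}{2}(\langle z_{t}, x \rangle - l_{t})^{2}\), constant parameters, the name `run_iteration_22`, and a small shift `1e-8` that keeps \(\mathsf{H}_{n}\) positive definite. \(\mathsf{H}_{n}\) is built as in (19).

```python
import numpy as np

def run_iteration_22(Z, l, lo, hi, steps=2000, alpha=0.5,
                     beta_m=0.9, beta_v=0.999, lam=0.01, seed=0):
    """Sketch of (22) on the box X = [lo, hi]^N with H_n from (19)."""
    rng = np.random.default_rng(seed)
    N = Z.shape[1]
    x = np.clip(np.zeros(N), lo, hi)              # starting point inside X
    m = np.zeros(N)
    v = np.zeros(N)
    v_hat = np.zeros(N)
    for _ in range(steps):
        t = rng.integers(len(l))                  # sample one loss f_t
        g = (Z[t] @ x - l[t]) * Z[t]              # gradient of f_t at x_n
        m = beta_m * m + (1.0 - beta_m) * g       # m_n
        v = beta_v * v + (1.0 - beta_v) * g * g   # v_n as in (19)
        v_hat = np.maximum(v_hat, v)
        h = np.sqrt(v_hat) + 1e-8                 # assumption: shift keeps H_n > O
        y = np.clip(x - lam * m / h, lo, hi)      # P_{X,H_n}: clipping for diagonal H_n
        x = alpha * x + (1.0 - alpha) * y         # convex combination stays in X
    return x
```

Since both \(x_{n}\) and the projected point lie in \(X\), every iterate of this sketch remains feasible.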

(ii) Classifier ensemble problem [18, Sect. 2.2.2], [19, Sect. 3.2.2] (see also [17, Problem II.1]): Consider a training set \(S =\{ (z_{m}, l_{m})\}_{m=1}^{M} \subset \mathbb{R}^{N} \times \mathbb{R}\), where \(z_{m} := (z_{m}^{n})_{n=1}^{N}\) and \(z_{m}^{n}\) is the measure corresponding to the mth sample in the sample set and the nth classifier in an ensemble. The classifier ensemble problem with sparsity learning is the following:

$$\begin{aligned} \begin{aligned} &\text{minimize } f(x) = \mathbb{E} \biggl[\frac{1}{2}\bigl(\langle z, x \rangle - l\bigr)^{2} \biggr] \\ &\text{subject to } x \in X := \mathbb{R}_{+}^{N} \cap \bigl\{ x \in \mathbb{R}^{N} \colon \Vert x \Vert _{1} \leq t_{1} \bigr\} , \end{aligned} \end{aligned}$$
(23)

where \(\|\cdot \|_{1}\) denotes the \(\ell _{1}\)-norm and \(t_{1}\) is the sparsity control parameter. Suppose that \(\mathsf{H}_{n}\) is as each of (19) and (20), which satisfies (A0), and define a mapping \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) by

$$\begin{aligned} Q_{\mathsf{H}_{n}} := P_{\mathbb{R}_{+}^{N}, \mathsf{H}_{n}} P_{\{ x \in \mathbb{R}^{N} \colon \|x\|_{1} \leq t_{1} \}, \mathsf{H}_{n}}. \end{aligned}$$
(24)

Since the projections \(P_{\mathbb{R}_{+}^{N}, \mathsf{H}_{n}}\) and \(P_{\{ x \in \mathbb{R}^{N} \colon \|x\|_{1} \leq t_{1} \}, \mathsf{H}_{n}}\) can be easily computed [34, Lemma 1.1], \(Q_{\mathsf{H}_{n}}\) defined by (24) can also be computed. Moreover, \(Q_{\mathsf{H}_{n}}\) defined by (24) is nonexpansive with \(X= \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\), i.e., (A1) holds. Since \(\{ x \in \mathbb{R}^{N} \colon \|x\|_{1} \leq t_{1} \}\) is bounded, we can set a simple, bounded set C such that \(X \subset C\), i.e., (A4) holds. Moreover, f in problem (23) satisfies (A2).
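
One possible way to evaluate the mapping (24) for a diagonal \(\mathsf{H}_{n} = \mathsf{diag}(h_{n,i})\) is sketched below. It is an illustration under stated assumptions, not necessarily the procedure of [34, Lemma 1.1]. The projection onto \(\mathbb{R}_{+}^{N}\) separates across coordinates, so it does not depend on \(h\). The projection onto the \(\ell _{1}\)-ball is obtained from the optimality conditions as a soft-thresholding with per-coordinate threshold \(\tau / h_{i}\), and \(\tau \) is found here by bisection. The function names and the iteration count are hypothetical.

```python
import numpy as np

def proj_nonneg(x, h):
    # Separable problem: the H-norm projection onto R_+^N is independent of h.
    return np.maximum(x, 0.0)

def proj_l1_ball(x, h, t1, iters=100):
    """H-norm projection onto {y : ||y||_1 <= t1}, H = diag(h), h > 0, t1 > 0."""
    if np.abs(x).sum() <= t1:
        return x.copy()                              # already feasible
    shrink = lambda tau: np.sign(x) * np.maximum(np.abs(x) - tau / h, 0.0)
    lo, hi = 0.0, np.max(h * np.abs(x))              # shrink(hi) is the zero vector
    for _ in range(iters):                           # bisection on the multiplier tau
        tau = 0.5 * (lo + hi)
        if np.abs(shrink(tau)).sum() > t1:
            lo = tau
        else:
            hi = tau
    return shrink(hi)                                # hi always gives a feasible point

def Q(x, h, t1):
    # Composition as in (24): l1-ball projection first, then nonnegative orthant.
    return proj_nonneg(proj_l1_ball(x, h, t1), h)
```

Points of \(X\) are left unchanged by `Q`, in line with \(X = \mathrm{Fix}(Q_{\mathsf{H}_{n}})\).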

The classifier ensemble problem with both sparsity and diversity learning is as follows:

$$\begin{aligned} \begin{aligned} &\text{minimize } f(x) = \mathbb{E} \biggl[\frac{1}{2}\bigl(\langle z, x \rangle - l\bigr)^{2} \biggr] \\ &\text{subject to } x \in X := \bigl\{ x \in \mathbb{R}_{+}^{N} \colon \Vert x \Vert _{1} \leq t_{1} \bigr\} \cap \bigl\{ x \in \mathbb{R}^{N} \colon f_{\mathrm{div}}(x) \geq t_{2} \bigr\} , \end{aligned} \end{aligned}$$
(25)

where \(t_{2}\) is the diversity control parameter, \(f_{\mathrm{div}}(x) := \sum_{m=1}^{M} \{ \langle [z_{m}], x \rangle - \langle z_{m}, x \rangle ^{2} \}\) (\(x\in \mathbb{R}^{N}\)), and \([z_{m}] := ((z_{m}^{i})^{2})_{i=1}^{N} \in \mathbb{R}^{N}\). From the discussion regarding (10), (11), and (12), a mapping

$$\begin{aligned} Q_{\mathsf{H}_{n}} := \omega _{1} P_{\mathbb{R}_{+}^{N}, \mathsf{H}_{n}} + \omega _{2} Q_{\| \cdot \|_{1} - t_{1}, \mathsf{H}_{n}} + \omega _{3} Q_{- f_{\mathrm{div}}(\cdot ) + t_{2}, \mathsf{H}_{n}}, \end{aligned}$$
(26)

with \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) defined by each of (19) and (20), is quasinonexpansive under the \(\mathsf{H}_{n}\)-norm satisfying \(X = \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\), i.e., (A1) holds. The discussion in the previous paragraph implies that (A0), (A2), and (A4) again hold.

Algorithm 1 for solving each of problems (23) and (25) is represented as follows:

$$\begin{aligned} x_{n+1} := P_{C,\mathsf{H}_{n}} \bigl[ \alpha _{n} x_{n} + (1- \alpha _{n}) Q_{\mathsf{H}_{n}} \bigl(x_{n} - \lambda _{n} \mathsf{H}_{n}^{-1} m_{n} \bigr) \bigr]. \end{aligned}$$
(27)

In contrast to Adam (6) and AMSGrad (9) that can solve a convex stochastic optimization problem with a simple constraint (3) (see also problem (21)), algorithm (27) can be applied to a convex stochastic optimization problem with complicated constraints, such as problems (23) and (25).

(iii) Network utility maximization problem [35, (6), (7)] (see also [36, Problem II.1]): The network resource allocation problem is to determine the source rates that maximize the utility aggregated over all sources subject to the link capacity constraints and source constraints. This problem can be formulated as the following network utility maximization problem:

$$\begin{aligned} \text{maximize } \sum_{s\in \mathcal{S}} u_{s} (x_{s}) \text{ subject to } x = (x_{s})_{s\in \mathcal{S}} \in X := \bigcap_{l \in \mathcal{L}} C_{l} \cap \bigcap _{s\in \mathcal{S}} C_{s}, \end{aligned}$$
(28)

where \(x_{s}\) denotes the transmission rate of source \(s \in \mathcal{S}\), \(u_{s}\) is a concave utility function of source s, \(\mathcal{S}(l)\) denotes the set of sources that use link \(l \in \mathcal{L}\), \(C_{l}\) is the capacity constraint set of link l having capacity \(c_{l} \in \mathbb{R}_{+}\) defined by \(C_{l} := \{ x= (x_{s})_{s\in \mathcal{S}} \colon \sum_{s \in \mathcal{S}(l)} x_{s} \leq c_{l} \}\), and \(C_{s}\) is the constraint set of source s having the maximum allowed rate \(M_{s}\) defined by \(C_{s} := \{ x= (x_{s})_{s\in \mathcal{S}} \colon x_{s} \in [0,M_{s}] \}\). Since \(C_{l}\) and \(C_{s}\) are half-spaces, the projections \(P_{C_{l}, \mathsf{H}_{n}}\) and \(P_{C_{s}, \mathsf{H}_{n}}\) are easily computed (see Footnote 3), where \((\mathsf{H}_{n})_{n\in \mathbb{N}} \subset \mathbb{S}_{++}^{N} \cap \mathbb{D}^{N}\) is defined by each of (19) and (20). For example, we can define a nonexpansive mapping \(Q_{\mathsf{H}_{n}} := \prod_{l\in \mathcal{L}} P_{C_{l},\mathsf{H}_{n}} \prod_{s\in \mathcal{S}} P_{C_{s}, \mathsf{H}_{n}}\) satisfying \(X = \bigcap_{n\in \mathbb{N}} \mathrm{Fix}(Q_{\mathsf{H}_{n}})\). The boundedness of \(\bigcap_{s\in \mathcal{S}} C_{s}\) allows us to set a simple, bounded set C satisfying \(C \supset \bigcap_{s\in \mathcal{S}} C_{s} \supset X\). Algorithm (27) with \(\mathsf{G}(x_{n},\xi _{n}) \in \partial (-u_{\xi _{n}})(x_{n})\) can be applied to problem (28).
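
For a diagonal \(\mathsf{H}_{n} = \mathsf{diag}(h_{n,i})\), the projection onto a half-space \(\{ y \colon \langle a, y \rangle \leq c \}\) under the \(\mathsf{H}_{n}\)-norm has a closed form that follows from the optimality conditions: a violating point is moved along \(\mathsf{H}_{n}^{-1} a\) until the constraint is active. The sketch below illustrates this; the function name and arguments are hypothetical. For a link constraint \(C_{l}\), the vector `a` would be the 0/1 indicator of \(\mathcal{S}(l)\) and `c` the capacity \(c_{l}\).

```python
import numpy as np

def proj_halfspace(x, h, a, c):
    """H-norm projection onto {y : <a, y> <= c} with H = diag(h), h > 0, a != 0."""
    violation = a @ x - c
    if violation <= 0.0:
        return x.copy()                    # x already satisfies the constraint
    a_scaled = a / h                       # H^{-1} a for diagonal H
    return x - (violation / (a @ a_scaled)) * a_scaled
```

With \(h = (1, 3)\), \(a = (1, 1)\), \(c = 2\), and \(x = (2, 2)\), the formula gives \((0.5, 1.5)\): the coordinate with the larger weight \(h_{i}\) moves less.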

Convergence analyses and comparisons

Convergence analyses of Algorithm 1

For convergence analyses of Algorithm 1, we prove the following theorem.

Theorem 5.1

Suppose that (A0)–(A4) and (C1)–(C3) hold and that \((\alpha _{n})_{n\in \mathbb{N}}\), \((\beta _{n})_{n\in \mathbb{N}}\), \((\lambda _{n})_{n\in \mathbb{N}}\), and \((\gamma _{n})_{n\in \mathbb{N}}\) defined by \(\gamma _{n} := (1-\alpha _{n})(1-\beta _{n})\lambda _{n}\) (\(n\in \mathbb{N}\)) satisfy

$$\begin{aligned} 0 < \liminf_{n\to +\infty } \alpha _{n} \leq \limsup_{n\to + \infty } \alpha _{n} < 1, \qquad \limsup _{n\to + \infty } \beta _{n} < 1, \quad \textit{and}\quad \gamma _{n+1} \leq \gamma _{n} \quad (n\in \mathbb{N}) \end{aligned}$$
(29)

and that \(\mathsf{H}_{n} = \mathsf{diag}(h_{n,i})\) satisfies (see Footnote 4)

$$\begin{aligned} h_{n+1,i} \geq h_{n,i} \quad (n\in \mathbb{N}, i=1,2,\ldots ,N). \end{aligned}$$
(30)

Then Algorithm 1 is such that the following are satisfied for all \(n \geq 1\):

$$\begin{aligned} \mathbb{E} \bigl[ f (\tilde{x}_{n} ) - f^{\star } \bigr] &\leq \frac{D}{2 \tilde{a} \tilde{b}n \lambda _{n}} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{\tilde{M}\sqrt{DN}}{\tilde{b}n} \sum_{k=1}^{n} \beta _{k} + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 \tilde{b} n} \sum_{k=1}^{n} \lambda _{k}, \end{aligned}$$

where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\), \(\tilde{M}\) and \(h_{\star }\) are defined as in Lemma 4.1,

$$D := \max_{i=1,2,\ldots ,N} \sup \bigl\{ (x_{k+1, i} - x_{i})^{2} \colon k\in \mathbb{N} \bigr\} < + \infty ,$$

\((\alpha _{n})_{n\in \mathbb{N}} \subset [c,a] \subset (0,1)\), \((\beta _{n})_{n\in \mathbb{N}} \subset (0,b] \subset (0,1)\), \(\tilde{a} := 1 - a\), \(\tilde{b} := 1 - b\), \(\tilde{c} := 1-c\), and \(\hat{M} := \sup \{\mathbb{E}[f(x) - f(x_{n})] \colon n\in \mathbb{N} \} < + \infty \) for some \(x\in X\). If

  1. (A1)’

    \(Q_{\mathsf{H}_{n}} \colon \mathbb{R}^{N} \to \mathbb{R}^{N}\) is nonexpansive under the \(\mathsf{H}_{n}\)-norm,

then, for all \(n \geq 1\),

$$\begin{aligned} &\frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{ \mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \Biggl\{ \frac{D}{n} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{2\tilde{c} \hat{M}}{n} \sum_{k=1}^{n} (1-\beta _{k})\lambda _{k} + \frac{2 \tilde{c} \tilde{M} \sqrt{DN}}{n} \sum _{k=1}^{n} \beta _{k} \lambda _{k} \Biggr\} \\ &\qquad{} + \biggl\{ 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \tilde{c} + 2 \biggr\} \frac{h_{\star }^{2} \tilde{M}^{2}}{n} \sum_{k=1}^{n} \lambda _{k}^{2}. \end{aligned}$$
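
To see how the three terms of the first bound in Theorem 5.1 behave, its right-hand side can be evaluated for given parameter sequences. The sketch below does only that. All constants passed to it (`D`, `N`, `M_tilde`, `h_star`, `a`, `b`, and `sum_h`, a stand-in for \(\mathbb{E}[\sum_{i=1}^{N} h_{n,i}]\)) are hypothetical inputs, not values from the paper.

```python
import numpy as np

def bound_rhs(n, lam, beta, sum_h, D, N, M_tilde, h_star, a, b):
    """Right-hand side of the first bound in Theorem 5.1.
    lam[k-1] and beta[k-1] hold lambda_k and beta_k for k = 1, ..., n."""
    a_t, b_t = 1.0 - a, 1.0 - b                     # a-tilde and b-tilde
    term1 = D * sum_h / (2.0 * a_t * b_t * n * lam[n - 1])
    term2 = M_tilde * np.sqrt(D * N) / (b_t * n) * np.sum(beta[:n])
    term3 = h_star ** 2 * M_tilde ** 2 / (2.0 * b_t * n) * np.sum(lam[:n])
    return term1 + term2 + term3
```

For constant sequences \(\lambda _{k} = \lambda \) and \(\beta _{k} = \beta = b\) with `sum_h` held fixed, the first term decays like \(1/n\), while the second and third terms stay at \(\tilde{M}\sqrt{DN}\beta /\tilde{b}\) and \(h_{\star }^{2}\tilde{M}^{2}\lambda /(2\tilde{b})\), so smaller \(\beta \) and \(\lambda \) shrink the residual part of the bound.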

Proof

Let \(x\in X\) be fixed arbitrarily. Lemma 4.1 guarantees that, for all \(k\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ f(x_{k}) - f(x) \bigr] &\leq \frac{1}{2\gamma _{k}} \bigl\{ \mathbb{E} \bigl[ \Vert x_{k} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] - \mathbb{E} \bigl[ \Vert x_{k+1} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] \bigr\} \\ &\quad{} + \frac{\beta _{k}}{1 - \beta _{k}} \mathbb{E} \bigl[ \langle x - x_{k},m_{k-1} \rangle \bigr] + \frac{\lambda _{k}}{2 (1-\beta _{k})} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr]. \end{aligned}$$

Summing the above inequality ensures that, for all \(n \geq 1\),

$$\begin{aligned} \begin{aligned} &\frac{1}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ f(x_{k}) - f(x) \bigr] \\ &\quad \leq \frac{1}{2 n} \underbrace{\sum_{k=1}^{n} \frac{1}{\gamma _{k}} \bigl\{ \mathbb{E} \bigl[ \Vert x_{k} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] - \mathbb{E} \bigl[ \Vert x_{k+1} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] \bigr\} }_{\Gamma _{n}} \\ &\qquad{} + \frac{1}{n} \underbrace{ \sum_{k=1}^{n} \frac{\beta _{k}}{1-\beta _{k}} \mathbb{E} \bigl[ \langle x - x_{k},m_{k-1} \rangle \bigr]}_{B_{n}} + \frac{1}{2 \tilde{b}n} \underbrace{\sum _{k=1}^{n} \lambda _{k} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr]}_{ \Lambda _{n}}, \end{aligned} \end{aligned}$$
(31)

where (29) implies that \(b > 0\) exists such that, for all \(n\in \mathbb{N}\), \(\beta _{n} \leq b < 1\) and \(\tilde{b} := 1 -b\). The definition of \(\Gamma _{n}\) and \(\mathbb{E} [ \| x_{n+1} - x \|_{\mathsf{H}_{n}}^{2}]/\gamma _{n} \geq 0\) imply that

$$\begin{aligned} \Gamma _{n} &\leq \frac{\mathbb{E} [ \Vert x_{1} - x \Vert _{\mathsf{H}_{1}}^{2} ]}{\gamma _{1}} + \underbrace{ \sum_{k=2}^{n} \biggl\{ \frac{\mathbb{E} [ \Vert x_{k} - x \Vert _{\mathsf{H}_{k}}^{2} ]}{\gamma _{k}} - \frac{\mathbb{E} [ \Vert x_{k} - x \Vert _{\mathsf{H}_{k-1}}^{2} ]}{\gamma _{k-1}} \biggr\} }_{\tilde{\Gamma }_{n}}. \end{aligned}$$
(32)

Given \(k\in \mathbb{N}\), \(\mathsf{H}_{k} \succ O\) ensures that there exists a unique matrix \(\overline{\mathsf{H}}_{k} \succ O\) such that \(\mathsf{H}_{k} = \overline{\mathsf{H}}_{k}^{2}\) [33, Theorem 7.2.6]. Since \(\|x\|_{\mathsf{H}_{k}}^{2} = \| \overline{\mathsf{H}}_{k} x \|^{2}\) holds for all \(x\in \mathbb{R}^{N}\), we have that, for all \(k\in \mathbb{N}\),

$$\begin{aligned} \tilde{\Gamma }_{n} = \mathbb{E} \Biggl[ \sum _{k=2}^{n} \biggl\{ \frac{ \Vert \overline{\mathsf{H}}_{k} (x_{k} - x) \Vert ^{2}}{\gamma _{k}} - \frac{ \Vert \overline{\mathsf{H}}_{k-1} (x_{k} - x) \Vert ^{2}}{\gamma _{k-1}} \biggr\} \Biggr]. \end{aligned}$$
(33)

Since \(\mathsf{H}_{k}\) (\(k\in \mathbb{N}\)) is diagonal, we can express \(\mathsf{H}_{k}\) as \(\mathsf{H}_{k} = \mathsf{diag}(h_{k,i})\), where \(h_{k,i} > 0\) (\(k\in \mathbb{N}\), \(i=1,2,\ldots ,N\)). Accordingly, for all \(k\in \mathbb{N}\) and all \(x := (x_{i})_{i=1}^{N} \in \mathbb{R}^{N}\),

$$\begin{aligned} \overline{\mathsf{H}}_{k} = \mathsf{diag} \bigl(h_{k,i}^{\frac{1}{2}} \bigr) \quad \text{and}\quad \Vert \overline{\mathsf{H}}_{k} x \Vert ^{2} = \sum _{i=1}^{N} h_{k,i} x_{i}^{2}. \end{aligned}$$
(34)

Hence, (33) ensures that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \tilde{\Gamma }_{n} = \mathbb{E} \Biggl[ \sum _{k=2}^{n} \sum_{i=1}^{N} \biggl( \frac{h_{k,i}}{\gamma _{k}} - \frac{h_{k-1,i}}{\gamma _{k-1}} \biggr) (x_{k,i} - x_{i})^{2} \Biggr]. \end{aligned}$$

From \(\gamma _{k} \leq \gamma _{k-1}\) (\(k\geq 1\)) (see (29)) and (30), we have that \(h_{k,i}/\gamma _{k} - h_{k-1,i}/\gamma _{k-1} \geq 0\) (\(k \geq 1\), \(i=1,2,\ldots ,N\)). Moreover, (A4) implies that \(D := \max_{i=1,2,\ldots ,N} \sup \{ (x_{n,i} - x_{i})^{2} \colon n \in \mathbb{N} \} < + \infty \). Accordingly, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \tilde{\Gamma }_{n} \leq D \mathbb{E} \Biggl[ \sum _{k=2}^{n} \sum_{i=1}^{N} \biggl( \frac{h_{k,i}}{\gamma _{k}} - \frac{h_{k-1,i}}{\gamma _{k-1}} \biggr) \Biggr] = D \mathbb{E} \Biggl[ \sum_{i=1}^{N} \biggl( \frac{h_{n,i}}{\gamma _{n}} - \frac{h_{1,i}}{\gamma _{1}} \biggr) \Biggr]. \end{aligned}$$

Hence, (32), together with \(\mathbb{E} [\| x_{1} - x\|_{\mathsf{H}_{1}}^{2}]/\gamma _{1} \leq D \mathbb{E} [ \sum_{i=1}^{N} h_{1,i}/\gamma _{1}]\), implies that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \Gamma _{n} \leq D \mathbb{E} \Biggl[ \sum _{i=1}^{N} \frac{h_{1,i}}{\gamma _{1}} \Biggr] + D \mathbb{E} \Biggl[ \sum_{i=1}^{N} \biggl( \frac{h_{n,i}}{\gamma _{n}} - \frac{h_{1,i}}{\gamma _{1}} \biggr) \Biggr] = \frac{D}{\gamma _{n}} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr], \end{aligned}$$

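which, together with \(\gamma _{n} := (1-\alpha _{n})(1-\beta _{n})\lambda _{n}\), the existence of \(a > 0\) such that, for all \(n\in \mathbb{N}\), \(\alpha _{n} \leq a < 1\) (by (29)), \(\tilde{a} := 1 -a\), and \(\beta _{n} \leq b < 1\), so that \(\gamma _{n} \geq \tilde{a} \tilde{b} \lambda _{n}\), implies that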

$$\begin{aligned} \Gamma _{n} \leq \frac{D}{\tilde{a} \tilde{b} \lambda _{n}} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr]. \end{aligned}$$
(35)

The Cauchy–Schwarz inequality, together with \(D := \max_{i=1,2,\ldots ,N} \sup \{ (x_{n,i} - x_{i})^{2} \colon n \in \mathbb{N} \} < + \infty \) and \(\mathbb{E}[\|m_{n}\|] \leq \tilde{M} := \sqrt{\max \{ \|m_{-1}\|^{2}, M^{2} \}}\) (\(n\in \mathbb{N}\)) (by Lemma 4.1), guarantees that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \begin{aligned} B_{n} &\leq \sum _{k=1}^{n} \frac{\beta _{k}}{1 - \beta _{k}} \mathbb{E} \bigl[ \Vert x - x_{k} \Vert \Vert m_{k-1} \Vert \bigr] \leq \frac{\sqrt{DN}}{\tilde{b}} \sum_{k=1}^{n} \beta _{k} \mathbb{E} \bigl[ \Vert m_{k-1} \Vert \bigr] \\ &\leq \frac{\tilde{M}\sqrt{DN}}{\tilde{b}} \sum_{k=1}^{n} \beta _{k}. \end{aligned} \end{aligned}$$
(36)

Since \(\mathbb{E}[ \|\mathsf{d}_{n} \|_{\mathsf{H}_{n}}^{2} ] \leq h_{\star }^{2} \tilde{M}^{2}\) (\(n\in \mathbb{N}\)) holds (by Lemma 4.1), we have that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \Lambda _{n} := \sum_{k=1}^{n} \lambda _{k} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \leq h_{\star }^{2} \tilde{M}^{2} \sum_{k=1}^{n} \lambda _{k}. \end{aligned}$$
(37)

Therefore, (31), (35), (36), and (37), together with the convexity of f, imply that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ f (\tilde{x}_{n}) - f(x) \bigr] &\leq \frac{D}{2 \tilde{a} \tilde{b}n \lambda _{n}} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{\tilde{M}\sqrt{DN}}{\tilde{b}n} \sum_{k=1}^{n} \beta _{k} + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 \tilde{b} n} \sum_{k=1}^{n} \lambda _{k}. \end{aligned}$$

Lemma 4.1 ensures that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} &\tilde{a} \sum_{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k+1} - y_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \underbrace{\sum_{k=1}^{n} \bigl\{ \mathbb{E} \bigl[ \Vert x_{k} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] - \mathbb{E} \bigl[ \Vert x_{k+1} - x \Vert _{\mathsf{H}_{k}}^{2} \bigr] \bigr\} }_{X_{n}} + \sum _{k=1}^{n} (1-\alpha _{k}) \lambda _{k}^{2} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\qquad{} + 2 \sum_{k=1}^{n} (1-\alpha _{k}) \lambda _{k} \bigl\{ (1 - \beta _{k}) \mathbb{E} \bigl[ f(x) - f(x_{k}) \bigr] + \beta _{k} \mathbb{E} \bigl[ \langle x - x_{k},m_{k-1} \rangle \bigr] \bigr\} . \end{aligned}$$

A discussion similar to the one for obtaining (35) implies that

$$\begin{aligned} X_{n} \leq D \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{1,i} \Biggr] + D \mathbb{E} \Biggl[ \sum _{i=1}^{N} (h_{n,i} - h_{1,i}) \Biggr] = D \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr]. \end{aligned}$$

The continuity of f (see (A2)) and (A4) mean that \(\hat{M} := \sup \{\mathbb{E}[f(x) - f(x_{n})] \colon n\in \mathbb{N} \} < + \infty \). Accordingly, an argument similar to the one for obtaining (36) and (37) guarantees that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} &\frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k+1} - y_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \frac{D}{\tilde{a}n} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{2 \hat{M}}{\tilde{a}n} \sum_{k=1}^{n} (1-\alpha _{k}) (1- \beta _{k})\lambda _{k} + \frac{2 \tilde{M} \sqrt{DN}}{\tilde{a}n} \sum_{k=1}^{n} (1-\alpha _{k})\beta _{k} \lambda _{k} \\ &\qquad{} + \frac{h_{\star }^{2} \tilde{M}^{2}}{\tilde{a}n} \sum_{k=1}^{n} (1-\alpha _{k}) \lambda _{k}^{2}. \end{aligned}$$

From (29), there exists \(c > 0\) such that, for all \(n\in \mathbb{N}\), \(c \leq \alpha _{n}\). Setting \(\tilde{c} := 1 -c\), it follows that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \begin{aligned} &\frac{1}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k+1} - y_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \frac{D}{\tilde{a}n} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{2 \tilde{c}\hat{M}}{\tilde{a}n} \sum_{k=1}^{n} (1- \beta _{k})\lambda _{k} + \frac{2 \tilde{c} \tilde{M} \sqrt{DN}}{\tilde{a}n} \sum _{k=1}^{n} \beta _{k} \lambda _{k} \\ &\qquad{} + \frac{\tilde{c} h_{\star }^{2} \tilde{M}^{2}}{\tilde{a}n} \sum_{k=1}^{n} \lambda _{k}^{2}. \end{aligned} \end{aligned}$$
(38)

A discussion similar to the one for obtaining (38) ensures that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \begin{aligned} &\frac{1}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k+1} - x_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \frac{D}{c n} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{2 \tilde{c}\hat{M}}{cn} \sum_{k=1}^{n} (1-\beta _{k}) \lambda _{k} + \frac{2 \tilde{c} \tilde{M} \sqrt{DN}}{cn} \sum _{k=1}^{n} \beta _{k} \lambda _{k} \\ &\qquad{} + \frac{\tilde{c} h_{\star }^{2} \tilde{M}^{2}}{cn} \sum_{k=1}^{n} \lambda _{k}^{2}. \end{aligned} \end{aligned}$$
(39)

Suppose that (A1)’ holds. Then we have that, for all \(k\in \mathbb{N}\), almost surely \(\| y_{k} - Q_{\mathsf{H}_{k}}(x_{k}) \|_{\mathsf{H}_{k}} = \| Q_{ \mathsf{H}_{k}} (x_{k} + \lambda _{k} \mathsf{d}_{k}) - Q_{\mathsf{H}_{k}}(x_{k}) \|_{\mathsf{H}_{k}} \leq \lambda _{k} \|\mathsf{d}_{k}\|_{\mathsf{H}_{k}}\), which, together with \(\| x-y \|^{2} \leq 2 \|x\|^{2} + 2 \|y\|^{2}\) (\(x,y\in \mathbb{R}^{N}\)), implies that

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{\mathsf{H}_{k}}(x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] &\leq 2 \mathbb{E} \bigl[ \Vert x_{k} - y_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] + 2 \mathbb{E} \bigl[ \bigl\Vert y_{k} - Q_{\mathsf{H}_{k}}(x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\leq 2 \mathbb{E} \bigl[ \Vert x_{k} - y_{k} \Vert _{ \mathsf{H}_{k}}^{2} \bigr] + 2 \lambda _{k}^{2} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr]. \end{aligned}$$

Accordingly, (38) and (39) guarantee that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} &\frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{ \mathsf{H}_{k}}(x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \frac{4}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k} - x_{k+1} \Vert _{\mathsf{H}_{k}}^{2} \bigr] + \frac{4}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ \Vert x_{k+1} - y_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] + \frac{2}{n} \sum_{k=1}^{n} \lambda _{k}^{2} \mathbb{E} \bigl[ \Vert \mathsf{d}_{k} \Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \Biggl\{ \frac{D}{n} \mathbb{E} \Biggl[ \sum_{i=1}^{N} h_{n,i} \Biggr] + \frac{2\tilde{c} \hat{M}}{n} \sum_{k=1}^{n} (1-\beta _{k})\lambda _{k} + \frac{2 \tilde{c} \tilde{M} \sqrt{DN}}{n} \sum _{k=1}^{n} \beta _{k} \lambda _{k} \Biggr\} \\ &\qquad{} + \biggl\{ 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \tilde{c} + 2 \biggr\} \frac{h_{\star }^{2} \tilde{M}^{2}}{n} \sum_{k=1}^{n} \lambda _{k}^{2}, \end{aligned}$$

which completes the proof. □

Constant step-size rule

The following theorem indicates that sufficiently small constant step-sizes \(\beta _{n} := \beta \) and \(\lambda _{n} := \lambda \) allow a solution to the problem to be approximated.

Theorem 5.2

Suppose that the assumptions in Theorem 5.1 hold and also assume that, for all \(i=1,2,\ldots ,N\), there exists a positive number \(B_{i}\) such that (see Footnote 5)

$$\begin{aligned} \sup \bigl\{ \mathbb{E}[h_{n,i}] \colon n\in \mathbb{N} \bigr\} \leq B_{i}. \end{aligned}$$
(40)

Then Algorithm 1 with \(\alpha _{n} := \alpha \), \(\beta _{n} := \beta \), and \(\lambda _{n} := \lambda \) (\(n\in \mathbb{N}\)) satisfies that

$$\begin{aligned} &\liminf_{n \to + \infty } \mathbb{E} \bigl[ \Vert x_{n} - x_{n+1} \Vert _{ \mathsf{H}_{n}}^{2} \bigr] \leq \frac{2 \tilde{\alpha }}{\alpha } \biggl\{ \hat{M} (1 -\beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \lambda , \end{aligned}$$
(41)
$$\begin{aligned} &\liminf_{n \to + \infty } \mathbb{E} \bigl[ \Vert x_{n+1} - y_{n} \Vert _{ \mathsf{H}_{n}}^{2} \bigr] \leq 2 \biggl\{ \hat{M} (1 -\beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \lambda , \end{aligned}$$
(42)
$$\begin{aligned} &\liminf_{n \to + \infty } \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] \leq \frac{\tilde{M} \sqrt{DN}}{1-\beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda , \end{aligned}$$
(43)
$$\begin{aligned} &\mathbb{E} \bigl[ f (\tilde{x}_{n}) - f^{\star } \bigr] \leq \mathcal{O} \biggl( \frac{1}{n} \biggr) + \frac{\tilde{M}\sqrt{DN}}{1 - \beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda , \end{aligned}$$
(44)

where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) and \(\tilde{\alpha } := 1 -\alpha \). Under (A1)’, we have

$$\begin{aligned} \begin{aligned} &\frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{ \mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq \mathcal{O} \biggl( \frac{1}{n} \biggr) + \frac{4}{\alpha } \bigl\{ 2\hat{M} (1-\beta ) + 2 \tilde{M} \sqrt{DN} \beta + 2 h_{\star }^{2} \tilde{M}^{2} \lambda \bigr\} \lambda + 2 h_{\star }^{2} \tilde{M}^{2} \lambda ^{2}. \end{aligned} \end{aligned}$$
(45)

Proof

We first show that, for all \(\epsilon > 0\),

$$\begin{aligned} \begin{aligned} \liminf_{n \to + \infty } \mathbb{E} \bigl[ \Vert x_{n} - x_{n+1} \Vert _{ \mathsf{H}_{n}}^{2} \bigr] &\leq \frac{2 \tilde{\alpha }}{\alpha } \biggl\{ \hat{M} (1 -\beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \lambda \\ &\quad{} + D \epsilon + \epsilon . \end{aligned} \end{aligned}$$
(46)

If (46) does not hold, then there exists \(\epsilon _{0} > 0\) such that

$$\begin{aligned} \begin{aligned} \liminf_{n \to + \infty } \mathbb{E} \bigl[ \Vert x_{n} - x_{n+1} \Vert _{ \mathsf{H}_{n}}^{2} \bigr] &> \frac{2 \tilde{\alpha }}{\alpha } \biggl\{ \hat{M} (1 - \beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \lambda \\ &\quad{} + D \epsilon _{0} + \epsilon _{0}. \end{aligned} \end{aligned}$$
(47)

Let \(x\in X\) and \(\chi _{n} := \mathbb{E} [ \| x_{n} - x \|_{\mathsf{H}_{n}}^{2}]\) for all \(n\in \mathbb{N}\). Lemma 4.1, together with the proofs of (36) and (37), implies that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \begin{aligned} \chi _{n+1} &\leq \chi _{n} + \underbrace{\chi _{n+1} - \mathbb{E} \bigl[ \Vert x_{n+1} - x \Vert _{\mathsf{H}_{n}}^{2} \bigr]}_{ \tilde{X}_{n}} - \alpha \mathbb{E} \bigl[ \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] \\ &\quad{} + 2 \tilde{\alpha } \lambda \biggl\{ \hat{M}(1 - \beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} . \end{aligned} \end{aligned}$$
(48)

From (34) and (A4), for all \(n\in \mathbb{N}\),

$$\begin{aligned} \tilde{X}_{n} = \mathbb{E} \Biggl[ \sum_{i=1}^{N} (h_{n+1,i} - h_{n,i}) (x_{n+1,i} - x_{i})^{2} \Biggr] \leq D \mathbb{E} \Biggl[ \sum_{i=1}^{N} (h_{n+1,i} - h_{n,i}) \Biggr]. \end{aligned}$$

Accordingly, (30) and (40) ensure that there exists \(n_{0} \in \mathbb{N}\) such that, for all \(n \geq n_{0}\),

$$\begin{aligned} \tilde{X}_{n} \leq D \alpha \epsilon _{0}. \end{aligned}$$
(49)

Hence, (48) implies that, for all \(n \geq n_{0}\),

$$\begin{aligned} \chi _{n+1} &\leq \chi _{n} + D \alpha \epsilon _{0} - \alpha \mathbb{E} \bigl[ \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}}^{2} \bigr] \\ &\quad{} + 2 \tilde{\alpha } \lambda \biggl\{ \hat{M}(1 - \beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} . \end{aligned}$$

From (47), there exists \(n_{1} \in \mathbb{N}\) such that, for all \(n \geq n_{1}\),

$$\begin{aligned} \mathbb{E} \bigl[ \Vert x_{n} - x_{n+1} \Vert _{\mathsf{H}_{n}}^{2} \bigr] &> \frac{2 \tilde{\alpha }}{\alpha } \biggl\{ \hat{M} (1 - \beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \lambda + D \epsilon _{0} + \frac{\epsilon _{0}}{2}. \end{aligned}$$

Therefore, for all \(n \geq n_{2} := \max \{n_{0}, n_{1}\}\),

$$\begin{aligned} \chi _{n+1} &\leq \chi _{n} + D \alpha \epsilon _{0} - 2 \tilde{\alpha } \lambda \biggl\{ \hat{M} (1 -\beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} - D \alpha \epsilon _{0} - \frac{\alpha \epsilon _{0}}{2} \\ &\quad{} + 2 \tilde{\alpha } \lambda \biggl\{ \hat{M}(1 - \beta ) + \tilde{M} \sqrt{DN} \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2} \lambda \biggr\} \\ &= \chi _{n} - \frac{\alpha \epsilon _{0}}{2} \\ &\leq \chi _{n_{2}} - \frac{\alpha \epsilon _{0}}{2} (n + 1 - n_{2}), \end{aligned}$$

which is a contradiction since \(\chi _{n+1} \geq 0\) while the right-hand side of the above inequality approaches minus infinity as \(n\) increases. Hence, (46) holds for all \(\epsilon > 0\), which implies that (41) holds. A discussion similar to the one for showing (46) leads to (42). We next show that, for all \(\epsilon > 0\),

$$\begin{aligned} \liminf_{n \to + \infty } \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] \leq \frac{\tilde{M} \sqrt{DN}}{1-\beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda + \frac{D \alpha \epsilon }{2 \tilde{\alpha } (1-\beta )\lambda } + \epsilon . \end{aligned}$$
(50)

If (50) does not hold for all \(\epsilon > 0\), then there exist \(\epsilon _{0} > 0\) and \(n_{3} \in \mathbb{N}\) such that, for all \(n \geq n_{3}\),

$$\begin{aligned} \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] > \frac{\tilde{M} \sqrt{DN}}{1-\beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda + \frac{D \alpha \epsilon _{0}}{2 \tilde{\alpha } (1-\beta )\lambda } + \frac{\epsilon _{0}}{2}. \end{aligned}$$

Lemma 4.1, together with (48) and (49), ensures that, for all \(n\geq n_{0}\),

$$\begin{aligned} \chi _{n+1} &\leq \chi _{n} + D \alpha \epsilon _{0} - 2 \tilde{\alpha } (1-\beta ) \lambda \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] + \bigl\{ 2 \tilde{M} \sqrt{DN} \beta + h_{\star }^{2} \tilde{M}^{2} \lambda \bigr\} \tilde{ \alpha } \lambda . \end{aligned}$$

Accordingly, for all \(n \geq n_{4} := \max \{n_{0}, n_{3}\}\),

$$\begin{aligned} &\chi _{n+1} \\ &\quad \leq \chi _{n} + D \alpha \epsilon _{0} - 2 \tilde{\alpha } (1-\beta ) \lambda \biggl\{ \frac{\tilde{M} \sqrt{DN}}{1-\beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda + \frac{D \alpha \epsilon _{0}}{2 \tilde{\alpha } (1-\beta )\lambda } + \frac{\epsilon _{0}}{2} \biggr\} \\ &\qquad{} + \bigl\{ 2 \tilde{M} \sqrt{DN} \beta + h_{\star }^{2} \tilde{M}^{2} \lambda \bigr\} \tilde{\alpha } \lambda \\ &\quad = \chi _{n} - \tilde{\alpha } (1-\beta ) \lambda \epsilon _{0} \\ &\quad \leq \chi _{n_{4}} - \tilde{\alpha } (1-\beta ) \lambda \epsilon _{0} (n+1 - n_{4}), \end{aligned}$$

which is a contradiction. Since (50) holds for all \(\epsilon > 0\), we have (43). Conditions (44) and (45) follow from Theorem 5.1, which completes the proof. □

Diminishing step-size rule

Lemma 4.1 and Theorem 5.1 give us the following theorem as a convergence analysis of Algorithm 1 with a diminishing step-size.

Theorem 5.3

Suppose that the assumptions in Theorem 5.1 and (40) hold. Let \((\beta _{n})_{n\in \mathbb{N}}\) and \((\lambda _{n})_{n\in \mathbb{N}}\) satisfy the following:

$$\begin{aligned} \lim_{n\to +\infty } \beta _{n} = 0, \qquad \sum_{n=0}^{+\infty } \lambda _{n} = + \infty , \qquad \sum_{n=0}^{+\infty } \lambda _{n}^{2} < + \infty , \quad \textit{and}\quad \sum _{n=0}^{+ \infty } \beta _{n} \lambda _{n} < + \infty . \end{aligned}$$
(51)

Then Algorithm 1 satisfies that

$$\begin{aligned} &\liminf_{n\to + \infty } \mathbb{E} \bigl[ \Vert x_{n} - x_{n+1} \Vert _{\mathsf{H}_{n}} \bigr] = 0, \qquad \liminf _{n\to + \infty } \mathbb{E} \bigl[ \Vert x_{n+1} - y_{n} \Vert _{ \mathsf{H}_{n}} \bigr] = 0, \end{aligned}$$
(52)
$$\begin{aligned} &\liminf_{n\to + \infty } \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] \leq 0. \end{aligned}$$
(53)

Moreover, if (A1)’ holds, then we have

$$\begin{aligned} \liminf_{n\to + \infty } \mathbb{E} \bigl[ \bigl\Vert x_{n} - Q_{ \mathsf{H}_{n}} (x_{n}) \bigr\Vert _{\mathsf{H}_{n}} \bigr] = 0. \end{aligned}$$

Let \((\beta _{n})_{n\in \mathbb{N}}\) and \((\lambda _{n})_{n\in \mathbb{N}}\) satisfy the following:

$$\begin{aligned} \lim_{n\to + \infty } \frac{1}{n\lambda _{n}} = 0, \qquad \lim _{n \to + \infty } \frac{1}{n} \sum_{k=1}^{n} \lambda _{k} = 0, \quad \textit{and}\quad \lim_{n \to + \infty } \frac{1}{n} \sum_{k=1}^{n} \beta _{k} = 0. \end{aligned}$$
(54)

Then the sequence \((\tilde{x}_{n})_{n\in \mathbb{N}}\) defined by \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) satisfies

$$\begin{aligned} \limsup_{n\to +\infty } \mathbb{E} \bigl[f(\tilde{x}_{n}) - f^{\star } \bigr] \leq 0 \end{aligned}$$

with

$$\begin{aligned} \mathbb{E} \bigl[ f (\tilde{x}_{n}) - f^{\star } \bigr] &\leq \frac{D \sum_{i=1}^{N} B_{i} }{2 \tilde{a} \tilde{b}n \lambda _{n}} + \frac{\tilde{M}\sqrt{DN}}{\tilde{b}n} \sum_{k=1}^{n} \beta _{k} + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 \tilde{b} n} \sum_{k=1}^{n} \lambda _{k}. \end{aligned}$$

Moreover, if (A1)’ holds, then we have

$$\begin{aligned} \lim_{n \to + \infty } \frac{1}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{\mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] = 0 \end{aligned}$$

with

$$\begin{aligned} &\frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{ \mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] \\ &\quad \leq 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \Biggl\{ \frac{D \sum_{i=1}^{N} B_{i}}{n} + \frac{2\tilde{c} \hat{M}}{n} \sum_{k=1}^{n} (1-\beta _{k})\lambda _{k} + \frac{2 \tilde{c} \tilde{M} \sqrt{DN}}{n} \sum _{k=1}^{n} \beta _{k} \lambda _{k} \Biggr\} \\ &\qquad{} + \biggl\{ 4 \biggl( \frac{1}{\tilde{a}} + \frac{1}{c} \biggr) \tilde{c} + 2 \biggr\} \frac{h_{\star }^{2} \tilde{M}^{2}}{n} \sum_{k=1}^{n} \lambda _{k}^{2}. \end{aligned}$$

Proof

We first show (52). Lemma 4.1, together with (36), (37), and (48), implies that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \begin{aligned} \left.\textstyle\begin{array}{l} \alpha _{n} \mathbb{E} [ \Vert x_{n+1} - x_{n} \Vert _{ \mathsf{H}_{n}}^{2} ] \\ (1-\alpha _{n}) \mathbb{E} [ \Vert x_{n+1} - y_{n} \Vert _{ \mathsf{H}_{n}}^{2} ] \end{array}\displaystyle \right\} &\leq \chi _{n}(x) - \chi _{n+1}(x) + D \mathbb{E} \Biggl[ \sum _{i=1}^{N} (h_{n+1,i} - h_{n,i} ) \Biggr] \\ &\quad{} +2 \hat{M} \lambda _{n} + 2 \tilde{M} \sqrt{DN} \beta _{n} \lambda _{n} + h_{\star }^{2} \tilde{M}^{2} \lambda _{n}^{2}, \end{aligned} \end{aligned}$$
(55)

where \(\chi _{n}(x) := \mathbb{E}[\| x_{n} - x\|_{\mathsf{H}_{n}}^{2}]\) for all \(x\in X\) and all \(n\in \mathbb{N}\). Consider (Case 1): For all \(x\in X\), there exists \(m_{0} \in \mathbb{N}\) such that, for all \(n\in \mathbb{N}\), \(n \geq m_{0}\) implies \(\chi _{n+1}(x) \leq \chi _{n}(x)\). This case guarantees the existence of \(\lim_{n\to + \infty } \chi _{n}(x)\) for all \(x\in X\). From (30) and (40), we have that \(\lim_{n\to + \infty } \mathbb{E} [ \sum_{i=1}^{N} (h_{n+1,i} - h_{n,i})] = 0\). Moreover, (51) ensures that \(\lim_{n\to + \infty } \beta _{n} = \lim_{n\to + \infty } \lambda _{n} = 0\). Accordingly, (55) and \(0< \liminf_{n\to + \infty } \alpha _{n} \leq \limsup_{n\to + \infty } \alpha _{n} < 1\) (by (29)) imply that

$$\begin{aligned} \lim_{n\to + \infty } \mathbb{E} \bigl[ \Vert x_{n+1} - x_{n} \Vert _{\mathsf{H}_{n}} \bigr] = 0 \quad \text{and}\quad \lim_{n\to + \infty } \mathbb{E} \bigl[ \Vert x_{n+1} - y_{n} \Vert _{ \mathsf{H}_{n}} \bigr] = 0. \end{aligned}$$
(56)

Consider (Case 2): There exists \(x_{0} \in X\) such that, for all \(m \in \mathbb{N}\), there exists \(n\in \mathbb{N}\) such that \(n \geq m\) and \(\chi _{n+1}(x_{0}) > \chi _{n}(x_{0})\). In this case, there exists \((x_{n_{i}})_{i\in \mathbb{N}} \subset (x_{n})_{n\in \mathbb{N}}\) such that, for all \(i\in \mathbb{N}\), \(\chi _{n_{i} +1}(x_{0}) > \chi _{n_{i}}(x_{0})\). From (55), we have that, for all \(i\in \mathbb{N}\),

$$\begin{aligned} \left.\textstyle\begin{array}{l} \alpha _{n_{i}} \mathbb{E} [ \Vert x_{n_{i} +1} - x_{n_{i}} \Vert _{\mathsf{H}_{n_{i}}}^{2} ] \\ (1-\alpha _{n_{i}}) \mathbb{E} [ \Vert x_{n_{i} +1} - y_{n_{i}} \Vert _{\mathsf{H}_{n_{i}}}^{2} ] \end{array}\displaystyle \right\} &< D \mathbb{E} \Biggl[ \sum _{j=1}^{N} (h_{n_{i} +1,j} - h_{n_{i},j} ) \Biggr] \\ &\quad{} +2 \hat{M} \lambda _{n_{i}} + 2 \tilde{M} \sqrt{DN} \beta _{n_{i}} \lambda _{n_{i}} + h_{\star }^{2} \tilde{M}^{2} \lambda _{n_{i}}^{2}. \end{aligned}$$

A discussion similar to the one for showing (56) guarantees that

$$\begin{aligned} \lim_{i\to + \infty } \mathbb{E} \bigl[ \Vert x_{n_{i} +1} - x_{n_{i}} \Vert _{\mathsf{H}_{n_{i}}} \bigr] = 0\quad \text{and}\quad \lim_{i \to + \infty } \mathbb{E} \bigl[ \Vert x_{n_{i} +1} - y_{n_{i}} \Vert _{\mathsf{H}_{n_{i}}} \bigr] = 0. \end{aligned}$$
(57)

Therefore, we have (52). If (A1)’ holds, then Lemma 4.1 implies that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} \mathbb{E} \bigl[ \bigl\Vert y_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \bigr\Vert _{\mathsf{H}_{n}} \bigr] \leq h_{\star } \tilde{M} \lambda _{n}, \end{aligned}$$

which implies that \(\lim_{n\to + \infty } \mathbb{E} [ \| y_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\). In (Case 1), (56) and the triangle inequality mean that \(\lim_{n\to + \infty } \mathbb{E} [ \| x_{n} - y_{n} \|_{\mathsf{H}_{n}} ] = 0\). Accordingly, the triangle inequality and \(\lim_{n\to + \infty } \mathbb{E} [ \| y_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\) imply that \(\lim_{n\to + \infty } \mathbb{E} [ \| x_{n} - Q_{\mathsf{H}_{n}} (x_{n}) \|_{\mathsf{H}_{n}} ] = 0\). In (Case 2), (57) and the triangle inequality mean that \(\lim_{i\to + \infty } \mathbb{E} [ \| x_{n_{i}} - y_{n_{i}} \|_{ \mathsf{H}_{n_{i}}} ] = 0\). Accordingly, the triangle inequality and \(\lim_{i\to + \infty } \mathbb{E} [ \| y_{n_{i}} - Q_{\mathsf{H}_{n_{i}}} (x_{n_{i}}) \|_{\mathsf{H}_{n_{i}}} ] = 0\) imply that \(\lim_{i\to + \infty } \mathbb{E} [ \| x_{n_{i}} - Q_{\mathsf{H}_{n_{i}}} (x_{n_{i}}) \|_{\mathsf{H}_{n_{i}}} ] = 0\). Thus, we have that

$$\begin{aligned} \liminf_{n\to + \infty } \mathbb{E} \bigl[ \bigl\Vert x_{n} - Q_{ \mathsf{H}_{n}} (x_{n}) \bigr\Vert _{\mathsf{H}_{n}} \bigr] = 0. \end{aligned}$$

Next, we show (53). Lemma 4.1, together with (36) and (37), ensures that, for all \(x^{\star }\in X^{\star }\) and all \(k \in \mathbb{N}\),

$$\begin{aligned} &2 (1-\alpha _{k}) (1-\beta _{k}) \lambda _{k} \mathbb{E} \bigl[ f(x_{k}) - f^{\star } \bigr] \\ &\quad \leq \chi _{k}^{\star }- \chi _{k+1}^{\star }+ D \mathbb{E} \Biggl[ \sum_{i=1}^{N} (h_{k+1,i} - h_{k,i} ) \Biggr] + 2 \tilde{M} \sqrt{DN} \beta _{k} \lambda _{k} + h_{\star }^{2} \tilde{M}^{2} \lambda _{k}^{2}, \end{aligned}$$

where \(\chi _{n}^{\star }:= \chi _{n}(x^{\star })\) for all \(x^{\star }\in X^{\star }\) and all \(n\in \mathbb{N}\). Summing the above inequality from \(k=0\) to \(k = n\) gives that, for all \(n\in \mathbb{N}\),

$$\begin{aligned} &2 \sum_{k=0}^{n} (1-\alpha _{k}) (1-\beta _{k}) \lambda _{k} \mathbb{E} \bigl[ f(x_{k}) - f^{\star } \bigr] \\ &\quad \leq \chi _{0}^{\star }+ D \mathbb{E} \Biggl[ \sum _{i=1}^{N} h_{n+1,i} \Biggr] + 2 \tilde{M} \sqrt{DN} \sum_{k=0}^{n} \beta _{k} \lambda _{k} + h_{\star }^{2} \tilde{M}^{2} \sum_{k=0}^{n} \lambda _{k}^{2}, \end{aligned}$$

which, together with (40) and (51), implies that

$$\begin{aligned} \sum_{k=0}^{+\infty } (1-\alpha _{k}) (1-\beta _{k}) \lambda _{k} \mathbb{E} \bigl[ f(x_{k}) - f^{\star } \bigr] < + \infty . \end{aligned}$$

If (53) does not hold, then there exist \(\zeta > 0\) and \(m_{1} \in \mathbb{N}\) such that, for all \(k \geq m_{1}\), \(\mathbb{E} [ f(x_{k}) - f^{\star } ] \geq \zeta \). Hence, we have that

$$\begin{aligned} + \infty = \zeta \sum_{k=0}^{+\infty } (1-\alpha _{k}) (1-\beta _{k}) \lambda _{k} \leq \sum _{k=0}^{+\infty } (1-\alpha _{k}) (1- \beta _{k}) \lambda _{k} \mathbb{E} \bigl[ f(x_{k}) - f^{\star } \bigr] < + \infty , \end{aligned}$$

where the first equation comes from \(\limsup_{n \to + \infty } \alpha _{n} < 1\), \(\sum_{n=0}^{+\infty } \lambda _{n} = + \infty \), and \(\sum_{n=0}^{+\infty } \beta _{n} \lambda _{n} < + \infty \) (by (29) and (51)). Since we have a contradiction, (53) holds. Theorem 5.1, together with (40) and (54), ensures that

$$\limsup_{n\to +\infty } \mathbb{E}\bigl[f(\tilde{x}_{n}) - f^{\star }\bigr] \leq 0 \quad \text{and}\quad \lim_{n \to + \infty } \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\bigl[ \bigl\| x_{k} - Q_{ \mathsf{H}_{k}} (x_{k}) \bigr\| _{\mathsf{H}_{k}}^{2}\bigr] = 0 $$

with the convergence rate in Theorem 5.3. □
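The divergence used in the proof of (53) can be spelled out as follows. This is only a sketch of the omitted step: \(\bar{\alpha }\) and \(m\) are auxiliary symbols introduced for this remark, and we assume \(\beta _{k} \in [0,1]\) and \(\lambda _{k} \geq 0\) for all \(k\). Since \(\limsup_{n \to + \infty } \alpha _{n} < 1\), there exist \(\bar{\alpha } \in (0,1)\) and \(m \in \mathbb{N}\) such that \(\alpha _{k} \leq \bar{\alpha }\) for all \(k \geq m\). Hence, for all \(n \geq m\),

$$\begin{aligned} \sum_{k=m}^{n} (1-\alpha _{k}) (1-\beta _{k}) \lambda _{k} \geq (1-\bar{\alpha }) \Biggl( \sum_{k=m}^{n} \lambda _{k} - \sum_{k=m}^{n} \beta _{k} \lambda _{k} \Biggr), \end{aligned}$$

and the right-hand side tends to \(+\infty \) as \(n \to + \infty \), because \(\sum_{k} \lambda _{k} = + \infty \) while \(\sum_{k} \beta _{k} \lambda _{k} < + \infty \) by (51).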

Theorem 5.3 leads to the following corollary.

Corollary 5.1

Suppose that the assumptions in Theorem 5.3 and (A1)’ hold, and consider Algorithm 1 with \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1]\)) and \((\beta _{n})_{n\in \mathbb{N}}\) such that \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \). Under \(\eta \in (1/2,1]\), we have that

$$\begin{aligned} \liminf_{n \to + \infty } \mathbb{E} \bigl[ f(x_{n}) - f^{\star } \bigr] \leq 0, \qquad \liminf_{n \to + \infty } \mathbb{E} \bigl[ \bigl\Vert x_{n} - Q_{\mathsf{H}_{n}}(x_{n}) \bigr\Vert _{\mathsf{H}_{n}} \bigr] = 0. \end{aligned}$$

Under \(\eta \in [1/2,1)\), we have that

$$\begin{aligned} \limsup_{n \to + \infty } \mathbb{E} \bigl[ f(\tilde{x}_{n}) - f^{\star } \bigr] \leq 0, \qquad \lim_{n \to + \infty } \frac{1}{n} \sum_{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{\mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] = 0 \end{aligned}$$

with the rate of convergence

$$\begin{aligned} \mathbb{E} \bigl[ f (\tilde{x}_{n}) - f^{\star } \bigr] \leq \mathcal{O} \biggl( \frac{1}{n^{1-\eta }} \biggr), \qquad \frac{1}{n} \sum _{k=1}^{n} \mathbb{E} \bigl[ \bigl\Vert x_{k} - Q_{ \mathsf{H}_{k}} (x_{k}) \bigr\Vert _{\mathsf{H}_{k}}^{2} \bigr] = \mathcal{O} \biggl( \frac{1}{n^{\eta }} \biggr). \end{aligned}$$

Proof

The step-size \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in (1/2,1]\)) and \((\beta _{n})_{n\in \mathbb{N}}\) such that \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \) satisfy (51). Accordingly, Theorem 5.3 with (A1)’ implies that \(\liminf_{n \to + \infty } \mathbb{E}[ f(x_{n}) - f^{\star }] \leq 0\), and \(\liminf_{n \to + \infty } \mathbb{E}[ \|x_{n} - Q_{\mathsf{H}_{n}}(x_{n}) \|_{\mathsf{H}_{n}}] = 0\). The step-size \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1)\)) satisfies

$$\begin{aligned} \lim_{n \to + \infty } \frac{1}{n\lambda _{n}} = \lim_{n \to + \infty } \frac{1}{n^{1-\eta }} = 0. \end{aligned}$$

Moreover, we have that

$$\begin{aligned} \frac{1}{n} \sum_{k=1}^{n} \lambda _{k}^{2} \leq \frac{1}{n} \sum _{k=1}^{n} \lambda _{k} \leq \frac{1}{n} \biggl\{ 1 + \int _{1}^{n} \frac{\mathrm{d}t}{t^{\eta }} \biggr\} = \frac{1}{n} \biggl\{ \frac{n^{1-\eta }}{1-\eta } - \frac{\eta }{1-\eta } \biggr\} \leq \frac{1}{1-\eta } \frac{1}{n^{\eta }}. \end{aligned}$$
(58)

Hence, \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \lambda _{k} = \lim_{n \to +\infty } (1/n) \sum_{k=1}^{n} \lambda _{k}^{2} = 0\). The condition \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \) implies that \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \beta _{k} = 0\) and \(\lim_{n\to +\infty } (1/n) \sum_{k=1}^{n} \beta _{k} \lambda _{k} = 0\). Hence, (54) is satisfied. Accordingly, from Theorem 5.3 with (A1)’ and (58), we have the convergence rate of Algorithm 1 in Corollary 5.1. □
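As a numerical sanity check of the chain of inequalities in (58), the following Python sketch evaluates the averaged step-sizes for \(\lambda _{k} = 1/k^{\eta }\) and compares them with the bound \(1/((1-\eta ) n^{\eta })\). The function names are illustrative and do not come from the paper; the check covers only a few sample values of \(\eta \in [1/2,1)\) and \(n\), so it illustrates (58) rather than proving it.

```python
def averaged_step_sizes(n, eta):
    """Return ((1/n) * sum_k lambda_k^2, (1/n) * sum_k lambda_k) for lambda_k = 1 / k^eta."""
    lam = [1.0 / k ** eta for k in range(1, n + 1)]
    mean_sq = sum(l * l for l in lam) / n
    mean = sum(lam) / n
    return mean_sq, mean


def bound_58(n, eta):
    """Right-hand side of (58): 1 / ((1 - eta) * n^eta), for eta in [1/2, 1)."""
    return 1.0 / ((1.0 - eta) * n ** eta)


for eta in (0.5, 0.75, 0.9):
    for n in (1, 10, 1000):
        mean_sq, mean = averaged_step_sizes(n, eta)
        # Chain of inequalities in (58): mean of squares <= mean <= bound.
        assert mean_sq <= mean <= bound_58(n, eta)
```

For \(\eta = 1\) the bound is undefined (division by zero), which matches the restriction \(\eta \in [1/2,1)\) in the second part of Corollary 5.1.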

Comparisons of Algorithm 1 with the existing adaptive learning rate optimization algorithms

The main objective of the existing adaptive learning rate optimization algorithms is to minimize \(\sum_{t=1}^{T} f_{t} (x)\) subject to \(x\in X\), where T is the total number of rounds in the learning process, \(f_{t} \colon \mathbb{R}^{N} \to \mathbb{R}\) (\(t=1,2,\ldots ,T\)) is a differentiable, convex loss function, and \(X \subset \mathbb{R}^{N}\) is bounded, closed, and convex (see also problem (21) in Example 4.1(i)). We would like to achieve low regret on the sequence \((f_{t}(x_{t}))_{t=1}^{T}\), measured as

$$\begin{aligned} R(T) := \sum_{t=1}^{T} f_{t} (x_{t}) - \min_{x\in X} \sum _{t=1}^{T} f_{t} (x) =\sum _{t=1}^{T} f_{t} (x_{t}) - \sum _{t=1}^{T} f_{t} \bigl(x^{*}\bigr), \end{aligned}$$

where \(x^{*} \in X\) is a minimizer of \(\sum_{t=1}^{T} f_{t} (x)\) over X, and \((x_{t})_{t=1}^{T} \subset X\) is the sequence generated by a learning algorithm. Although Theorem 4.1 in [8] indicates that Adam [8, Algorithm 1], [2, Algorithm 8.7] (algorithm (6)) is such that there exists a positive real number D such that \(R(T)/T \leq D/\sqrt{T}\), the proof of Theorem 4.1 in [8] is incomplete [9, Theorem 1]. AMSGrad [9, Algorithm 2] (algorithm (9)) is such that the following result holds [9, Theorem 4, Corollary 1]: Suppose that \(\beta _{1,t} := \beta _{1} \lambda ^{t-1}\) (\(\beta _{1}, \lambda \in (0,1)\)), \(\gamma := \beta _{1}/\sqrt{\beta _{2}} < 1\), and \(\lambda _{t} := \alpha /\sqrt{t}\) (\(\alpha > 0\)). Then there exist positive real numbers \(\hat{D}_{i}\) (\(i=1,2,3\)) such that

$$\begin{aligned} \frac{R(T)}{T} &= \frac{1}{T} \sum_{t=1}^{T} f_{t} (x_{t}) - \frac{1}{T} \sum _{t=1}^{T} f_{t} \bigl(x^{*} \bigr) \\ &\leq \frac{\hat{D}_{1} N}{\alpha \tilde{\beta }_{1} \sqrt{T}} + \frac{\beta _{1} \hat{D}_{2}}{2 \tilde{\beta }_{1} (1-\lambda )^{2} T} + \frac{\alpha \sqrt{1 + \ln T}}{\tilde{\beta }_{1}^{2} (1-\gamma ) \sqrt{1-\beta _{2}}T} \sum _{i=1}^{N} \Vert g_{1:T,i} \Vert , \end{aligned}$$

where \(\tilde{\beta }_{1} := 1 -\beta _{1}\), \(g_{t} := \nabla _{x} F(x_{t},\xi _{t})\) (see Note 6), and \(\|g_{1:T,i} \| := \sqrt{\sum_{t=1}^{T} g_{t,i}^{2}} \leq \hat{D}_{3} \sqrt{T}\). Hence, with AMSGrad, there exists a positive real number \(\hat{D}\) such that

$$\begin{aligned} \frac{R(T)}{T} = \frac{1}{T} \sum _{t=1}^{T} f_{t} (x_{t}) - \frac{1}{T} \sum_{t=1}^{T} f_{t} \bigl(x^{*}\bigr) \leq \hat{D} \sqrt{ \frac{1+\ln T}{T}}. \end{aligned}$$
(59)
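To make the regret \(R(T)\) concrete, the following Python sketch computes \(R(T)/T\) on a toy problem. It is not Adam or AMSGrad: it uses plain projected online gradient descent with \(\lambda _{t} = \alpha /\sqrt{t}\), and the losses \(f_{t}(x) = \frac{1}{2}\|x - c_{t}\|^{2}\), the box \(X = [-1,1]^{N}\), and all function names are assumptions chosen only for illustration.

```python
import math
import random


def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box X = [lo, hi]^N (coordinate-wise clipping)."""
    return [min(max(v, lo), hi) for v in x]


def loss(x, c):
    """Toy convex loss f_t(x) = 0.5 * ||x - c_t||^2."""
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def average_regret(T, N=3, alpha=1.0, seed=0):
    """Run projected online gradient descent and return R(T) / T."""
    rng = random.Random(seed)
    centers = [[rng.uniform(-2.0, 2.0) for _ in range(N)] for _ in range(T)]
    x = [0.0] * N
    online_loss = 0.0
    for t, c in enumerate(centers, start=1):
        online_loss += loss(x, c)                   # f_t(x_t), paid before the update
        grad = [xi - ci for xi, ci in zip(x, c)]    # gradient of f_t at x_t
        step = alpha / math.sqrt(t)                 # lambda_t = alpha / sqrt(t)
        x = project_box([xi - step * gi for xi, gi in zip(x, grad)])
    # For this separable quadratic, the best fixed point x* is the clipped mean of the c_t.
    mean_c = [sum(c[i] for c in centers) / T for i in range(N)]
    x_star = project_box(mean_c)
    best_loss = sum(loss(x_star, c) for c in centers)
    return (online_loss - best_loss) / T
```

Calling `average_regret(T)` for increasing `T` shows how the averaged regret of this simple method behaves; the quantity depends on the horizon `T`, which is the point made about (59) in the discussion of AMSGrad.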

We apply Algorithm 1 with \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1)\)) (see also algorithm (22)) to Problem 3.1 for the special case where \(f(\cdot ) = \mathbb{E}[f_{\xi }(\cdot )] := (1/T) \sum_{t=1}^{T} f_{t}( \cdot )\), \(Q_{\mathsf{H}_{n}} := P_{X,\mathsf{H}_{n}}\) (\(n\in \mathbb{N}\)), \(\mathsf{H}_{n}\) is defined by either (19) or (20), and \(C =X\) (see also problem (21)). Then Theorem 5.2 has the following corollary.

Corollary 5.2

Consider problem (21) and suppose that the assumptions in Theorem 5.1 hold. Then algorithm (22) satisfies that

$$\begin{aligned} &\liminf_{n\to + \infty } \mathbb{E} \Biggl[ \frac{1}{T} \sum _{t=1}^{T} f_{t} (x_{n}) - \frac{1}{T} \sum_{t=1}^{T} f_{t} \bigl(x^{*}\bigr) \Biggr] \leq \frac{\tilde{M}\sqrt{DN}}{1 - \beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda , \\ &\limsup_{n\to + \infty } \mathbb{E} \Biggl[ \frac{1}{T} \sum _{t=1}^{T} f_{t} ( \tilde{x}_{n} ) - \frac{1}{T} \sum_{t=1}^{T} f_{t} \bigl(x^{*}\bigr) \Biggr] \leq \frac{\tilde{M}\sqrt{DN}}{1 - \beta } \beta + \frac{h_{\star }^{2} \tilde{M}^{2}}{2 (1-\beta )} \lambda , \end{aligned}$$

where \(\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k}\) and \((x_{n})_{n\in \mathbb{N}} \subset X\) is the sequence in algorithm (22).

In contrast to Adam and AMSGrad with diminishing step-sizes, Corollary 5.2 indicates that algorithm (22) with constant step-sizes may approximate a solution of problem (21).

Corollary 5.1 implies the following corollary.

Corollary 5.3

Suppose that the assumptions in Corollary 5.1 hold, \(\lambda _{n} := 1/n^{\eta }\) (\(\eta \in [1/2,1]\)), and \((\beta _{n})_{n\in \mathbb{N}}\) is such that \(\sum_{n=1}^{+\infty } \beta _{n} < + \infty \). Under \(\eta \in (1/2,1]\), algorithm (22) satisfies that

$$\begin{aligned} \liminf_{n \to + \infty } \mathbb{E} \Biggl[ \sum _{t=1}^{T} f_{t} (x_{n}) - \sum _{t=1}^{T} f_{t} \bigl(x^{*}\bigr) \Biggr] = 0. \end{aligned}$$

Moreover, under \(\eta \in [1/2,1)\), any accumulation point of \((\tilde{x}_{n} := (1/n) \sum_{k=1}^{n} x_{k})_{n\in \mathbb{N}}\) almost surely belongs to the solution set of problem (21), and algorithm (22) achieves the following rate of convergence:

$$\begin{aligned} \mathbb{E} \Biggl[ \sum_{t=1}^{T} f_{t} (\tilde{x}_{n} ) - \sum_{t=1}^{T} f_{t} \bigl(x^{*} \bigr) \Biggr] = \mathcal{O} \biggl( \frac{1}{n^{1-\eta }} \biggr). \end{aligned}$$

Proof

For problem (21), Corollary 5.1 implies that \(0 \leq \liminf_{n\to + \infty } \mathbb{E}[f(x_{n}) - f^{\star }] \leq 0\) and \(0 \leq \limsup_{n\to + \infty } \mathbb{E}[f(\tilde{x}_{n}) - f^{\star }] \leq 0\), where \(f := (1/T) \sum_{t=1}^{T} f_{t}\). The second inequality guarantees that \(\lim_{n\to + \infty } \mathbb{E}[f(\tilde{x}_{n}) - f^{\star }] = 0\). Let \(\hat{x} \in X\) be an arbitrary accumulation point of \((\tilde{x}_{n})_{n\in \mathbb{N}} \subset X\). Since there exists \((\tilde{x}_{n_{i}})_{i\in \mathbb{N}} \subset (\tilde{x}_{n})_{n\in \mathbb{N}}\) such that \((\tilde{x}_{n_{i}})_{i\in \mathbb{N}}\) converges almost surely to \(\hat{x} \in X\), the continuity of \(f\) ensures that \(0 = \lim_{i\to + \infty } \mathbb{E}[f(\tilde{x}_{n_{i}}) - f^{\star }] = \mathbb{E}[f(\hat{x}) - f^{\star }]\), i.e., \(\hat{x} \in X^{\star }\). The rate of convergence of \((\tilde{x}_{n})_{n\in \mathbb{N}}\) is obtained from Corollary 5.1. □

It is not guaranteed that \(x_{T}\) defined by AMSGrad with \(\lambda _{t} := \alpha /\sqrt{t}\) optimizes \(\sum_{t=1}^{T} f_{t}\) over \(X\) since (59) depends on a given parameter \(T\), i.e.,

$$\begin{aligned} \frac{R(T)}{T} \leq \mathcal{O} \biggl( \sqrt{\frac{1+\ln T}{T}} \biggr). \end{aligned}$$

Meanwhile, Corollary 5.3 implies that any accumulation point of \((\tilde{x}_{n})_{n\in \mathbb{N}}\) defined by algorithm (22) with \(\lambda _{n} := 1/\sqrt{n}\) almost surely belongs to the set of minimizers of \(\sum_{t=1}^{T} f_{t}\) over X and \((\tilde{x}_{n})_{n\in \mathbb{N}}\) achieves an \(\mathcal{O}(1/\sqrt{n})\) convergence rate, i.e.,

$$\begin{aligned} \mathbb{E} \Biggl[ \sum_{t=1}^{T} f_{t} (\tilde{x}_{n} ) - \sum_{t=1}^{T} f_{t} \bigl(x^{*} \bigr) \Biggr] = \mathcal{O} \biggl( \frac{1}{\sqrt{n}} \biggr). \end{aligned}$$

Numerical comparisons

In this section, we consider the classifier ensemble problem [18, Sect. 2.2.2], [19, Sect. 3.2.2], [17, Problem II.1] (see problems (23) and (25) in Example 4.1 (ii)) and compare the performances of the learning methods based on the following algorithms, all of which used \(\beta =0.99\) [9, Sect. 5] and \(\alpha _{n} = 1/2\) (\(n\in \mathbb{N}\)).

SG: Stochastic gradient algorithm (15) with \(\lambda _{n} \in [10^{-3}/(n+1), 1/(n+1)]\) computed by the Armijo line search algorithm [17, Algorithms 2 and 3, LS].

C1: Algorithm 1 with (19) and \(\beta _{n} = \lambda _{n} = 10^{-1}\).

C2: Algorithm 1 with (19) and \(\beta _{n} = \lambda _{n} = 10^{-3}\).

C3: Algorithm 1 with (20) and \(\beta _{n} = \lambda _{n} = 10^{-1}\).

C4: Algorithm 1 with (20) and \(\beta _{n} = \lambda _{n} = 10^{-3}\).

D1: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-1}/\sqrt{n+1}\).

D2: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-3}/\sqrt{n+1}\).

D3: Algorithm 1 with (19), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} \in [10^{-3}/\sqrt{n+1}, 1/\sqrt{n+1}]\) computed by the Armijo line search algorithm.

D4: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-1}/\sqrt{n+1}\).

D5: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} = 10^{-3}/\sqrt{n+1}\).

D6: Algorithm 1 with (20), \(\beta _{n} = 0.9/2^{n}\), and \(\lambda _{n} \in [10^{-3}/\sqrt{n+1}, 1/\sqrt{n+1}]\) computed by the Armijo line search algorithm.
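The fixed (non-line-search) parameter rules in the list above can be written down as small Python helpers. This is a sketch only: the function names are hypothetical, and it assumes that the iteration index \(n\) starts at 0; the Armijo line search used in SG, D3, and D6 is not reproduced here.

```python
import math


def beta_diminishing(n):
    """beta_n = 0.9 / 2^n, as listed for D1-D6."""
    return 0.9 / 2 ** n


def lambda_diminishing(n, scale):
    """lambda_n = scale / sqrt(n + 1); scale is 1e-1 for D1 and D4, 1e-3 for D2 and D5."""
    return scale / math.sqrt(n + 1)


def constant_rule(value):
    """Constant rule beta_n = lambda_n = value (1e-1 for C1 and C3, 1e-3 for C2 and C4)."""
    return lambda n: value


# Example: the first four step-sizes of D1 (assumed index origin n = 0).
d1_lambdas = [lambda_diminishing(n, 1e-1) for n in range(4)]
d1_betas = [beta_diminishing(n) for n in range(4)]
```

Separating the schedule from the update makes it easy to swap a constant rule for a diminishing one while leaving the rest of an implementation unchanged.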

The step-size \(\beta _{n} := 0.9/2^{n}\) used in D1–D6 was based on [9, Sect. 5]. The numerical results in [17] showed that the learning method based on SG performed better than the existing methods in [19, (18)]. Therefore, we compare the performance of the learning method based on SG with that of the learning methods based on C1–D6. See Corollary 1 in [17], Theorems 5.2 and 5.3, and Corollary 5.1 for convergence analyses of the above algorithms for solving problems (23) and (25).

The experiments were run on a Mac Pro (Late 2013) with a 3.5 GHz 6-core Intel Xeon E5 CPU, 32 GB of 1866 MHz DDR3 memory, and the macOS Catalina version 10.15.1 operating system. The algorithms used in the experiments were written in Python 3.7.5 with the NumPy 1.17.4 package. The experiments used the datasets from LIBSVM [37] and the UCI Machine Learning Repository [38], for which information is shown in Table 1. In these experiments, stratified 10-fold cross-validation for the datasets was performed. For this validation, the StratifiedKFold class in the scikit-learn 0.21.3 package was used. Ensembles of support vector classifiers were constructed by the BaggingClassifier class in the scikit-learn 0.21.3 package. The number of base estimators was set to the default value of the scikit-learn package. For learning multiclass classification tasks with the classifiers used in the experiments, the one-vs-the-rest multiclass classification strategy implemented as the OneVsRestClassifier class in the scikit-learn 0.21.3 package was used. The stopping condition for the algorithms used in the experiments was \(n=100\).
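The evaluation pipeline described above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' code: the Iris data is a placeholder rather than one of the datasets in Table 1, the shuffling and random seeds are our choices, the order in which the one-vs-the-rest wrapper and the bagging ensemble are nested is assumed, and the learning methods (SG, C1–D6) that are trained on top of the ensemble are not included.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data (assumption): not one of the datasets used in the paper.
X, y = load_iris(return_X_y=True)

# Bagging ensemble of support vector classifiers with the default number of
# base estimators, wrapped in a one-vs-the-rest multiclass strategy.
model = OneVsRestClassifier(BaggingClassifier(SVC(), random_state=0))

# Stratified 10-fold cross-validation; shuffle and seed are assumptions.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

mean_accuracy = sum(accuracies) / len(accuracies)
```

The base estimator is passed positionally to `BaggingClassifier` because the keyword for that argument differs between scikit-learn releases.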

Table 1 Datasets used for classification

Let us consider problem (23) and compare the performances of the sparsity learning methods based on the algorithms with \(Q_{\mathsf{H}_{n}}\) defined by (24). Although we can consider problem (25) and compare the performances of the sparsity and diversity learning methods based on the algorithms with \(Q_{\mathsf{H}_{n}}\) defined by (26), we omit the details due to lack of space (see Note 7).

Tables 2 and 3 show that the accuracy of the learning method based on SG was almost the same as that of the learning methods based on C1, C2, C3, C4, D3, D4, and D6. These tables also show that the elapsed times for the proposed learning methods were shorter than the elapsed times for the learning method based on SG.

Table 2 Classification accuracies (%) and elapsed times (s) for the sparsity learning methods based on SG, C1, C2, C3, and C4 applied to the datasets in Table 1
Table 3 Classification accuracies (%) and elapsed times (s) for the sparsity learning methods based on D1, D2, D3, D4, D5, and D6 applied to the datasets in Table 1

The average accuracies and elapsed times of the existing learning method (SG) were compared to the average accuracies and elapsed times of the proposed learning methods (C1–D6) by using an analysis of variance (ANOVA) test and Tukey–Kramer’s honestly significant difference (HSD) test. The scipy.stats.f_oneway method in the SciPy library was used as the implementation of the ANOVA test, and the statsmodels.stats.multicomp.pairwise_tukeyhsd method in the StatsModels package was used as the implementation of Tukey–Kramer’s HSD test. Recall that the ANOVA test examines whether the hypothesis that the given groups have the same population mean is rejected, whereas Tukey–Kramer’s HSD test can be used to find specifically which pair has a significant difference in groups. The significance level was set at 5% (0.05) for the ANOVA and Tukey–Kramer’s HSD tests. The p-value computed by the ANOVA test for the accuracies was about \(4.09 \times 10^{-19}\) (<0.05). Table 4 indicates that the adjusted p-value between each of the learning methods based on C1, C2, C3, C4, D3, D4, and D6 and the existing learning method based on SG was greater than 0.05. This implies that the existing and proposed methods based on C1, C2, C3, C4, D3, D4, and D6 had almost the same performances in the sense of accuracy. The p-value computed by the ANOVA test for the elapsed time was about \(2.67 \times 10^{-29}\) (<0.05). Table 5 indicates that there is a significant difference in the sense of the elapsed time between each of the proposed methods and the existing method based on SG. Therefore, the proposed methods ran significantly faster than the existing method based on SG.
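To recall what the ANOVA test measures, the following Python sketch computes the one-way ANOVA \(F\) statistic directly from its definition (between-group mean square divided by within-group mean square). It is an illustration only: the paper used scipy.stats.f_oneway, which also returns the p-value, whereas this hypothetical helper returns only the statistic, and the sample values are made up and are not results from Tables 2 and 3.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples, one list per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up samples for three groups (not data from the paper).
f_stat = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

A large \(F\) value indicates that the group means differ by more than the within-group scatter would explain; Tukey–Kramer's HSD test is then needed to identify which pairs differ.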

Table 4 Multiple comparison for accuracies for the sparsity learning methods applied to the datasets in Table 1 using Tukey–Kramer’s HSD test at the 5% significance level (“meandiffs” indicates the pairwise mean differences between Groups 1 and 2, “p-adj” indicates the adjusted p-value, and “Lower” (resp. “Upper”) indicates the lower (resp. upper) value of the confidence interval for the pairwise mean differences)
Table 5 Multiple comparison for elapsed time for the sparsity learning methods applied to the datasets in Table 1 using Tukey–Kramer’s HSD test at the 5% significance level (“meandiffs” indicates the pairwise mean differences between Groups 1 and 2, “p-adj” indicates the adjusted p-value, and “Lower” (resp. “Upper”) indicates the lower (resp. upper) value of the confidence interval for the pairwise mean differences)

Conclusion

In this paper, we proposed a stochastic approximation method based on adaptive learning rate optimization algorithms for solving a convex stochastic optimization problem over the fixed point set of a quasinonexpansive mapping. We also presented convergence analyses of the proposed method with constant and diminishing step-sizes. The analyses confirm that any accumulation point of the sequence generated by the proposed method almost surely belongs to the solution set of the stochastic optimization problem in deep learning. We also compared the proposed algorithm with the existing adaptive learning rate optimization algorithms and showed that the proposed algorithm achieved an \(\mathcal{O}(1/\sqrt{n})\) convergence rate which was not achieved for the existing adaptive learning rate optimization algorithms. Numerical results for the classifier ensemble problems demonstrated that the proposed learning methods achieve high accuracies faster than the existing learning method based on the first-order algorithm. In particular, the proposed methods with constant step-sizes or Armijo line search step-sizes solve the classifier ensemble problems faster than the existing method based on the first-order algorithm.

Availability of data and materials

Not applicable.

Notes

  1. See (6) and (9) for the definitions of Adam and AMSGrad.

  2. See [23, Lemma 3.1], [22, Proposition 2.3], [24, Subchapter 4.3] for the definition and properties of the subgradient projection when \(H= I\).

  3. The projection \(P_{C,\mathsf{H}_{n}}\) onto a half-space \(C := \{ x\in \mathbb{R}^{N} \colon \langle a ,x \rangle \leq b \} = \mathrm{Fix}(P_{C}) = \mathrm{Fix}(P_{C,\mathsf{H}_{n}})\) under the \(\mathsf{H}_{n}\)-norm, where \(a\neq 0\) and \(b\in \mathbb{R}\), can be defined for all \(x\in \mathbb{R}^{N}\) by \(P_{C,\mathsf{H}_{n}}(x) := x + [(b - \langle a ,x \rangle _{ \mathsf{H}_{n}})/\|a\|_{\mathsf{H}_{n}}^{2}] a\) (\(x\notin C\)) or \(P_{C,\mathsf{H}_{n}}(x) := x\) (\(x\in C\)).

  4. Condition (30) is satisfied when \(\mathsf{H}_{n}\) is defined by either (19) or (20).

  5. Condition (40) is satisfied when \(\mathsf{H}_{n}\) is defined by either (19) or (20).

  6. Since AMSGrad is applied to constrained convex optimization, in general, \(\lim_{T \to + \infty } \|g_{1:T,i}\| \neq 0\) and \(\|g_{1:T,i} \| \leq \hat{D}_{3} \sqrt{T}\) hold [8, Corollary 4.2].

  7. We checked that the sparsity and diversity learning methods based on C1, C2, C3, C4, D3, D4, and D6 with \(Q_{\mathsf{H}_{n}}\) defined by (26) perform better than the learning method based on SG, as seen in the results (Tables 2, 3, 4, and 5) for ensemble learning with sparsity.
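As a numerical illustration of the half-space projection in Note 3, the following is a minimal sketch under stated assumptions, not the implementation used in the experiments. It assumes that \(\mathsf{H}_{n}\) is diagonal with positive entries `h` and that the half-space is given in its Euclidean form \(\langle a, x\rangle \leq b\). Under that reading, the normal vector expressed in the \(\mathsf{H}_{n}\)-inner product is \(\mathsf{H}_{n}^{-1} a\), so a point outside \(C\) is moved along \(\mathsf{H}_{n}^{-1} a\) with step \((\langle a, x\rangle - b)/\langle a, \mathsf{H}_{n}^{-1} a\rangle\). The function name `project_halfspace_diag` and the sample data are hypothetical.

```python
from typing import List, Sequence


def project_halfspace_diag(
    x: Sequence[float], a: Sequence[float], b: float, h: Sequence[float]
) -> List[float]:
    """Project x onto C = {y : <a, y> <= b} in the norm ||v||_H^2 = sum(h_i * v_i^2).

    Assumption: H = diag(h) with h_i > 0 (a diagonal positive-definite matrix).
    """
    if any(hi <= 0 for hi in h):
        raise ValueError("diagonal entries of H must be positive")
    if all(ai == 0 for ai in a):
        raise ValueError("a must be nonzero")

    # Amount by which x violates the constraint <a, x> <= b.
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0:
        # x already lies in C, so it is a fixed point of the projection.
        return list(x)

    # H^{-1} a: the normal direction of the half-space in the H-geometry.
    direction = [ai / hi for ai, hi in zip(a, h)]
    # Step length: violation divided by <a, H^{-1} a>.
    scale = violation / sum(ai * di for ai, di in zip(a, direction))
    return [xi - scale * di for xi, di in zip(x, direction)]


# Hypothetical data: a point outside C = {y : y_1 + y_2 <= 2}.
x = [3.0, 1.0]
a = [1.0, 1.0]
b = 2.0
y_weighted = project_halfspace_diag(x, a, b, h=[1.0, 4.0])
y_euclid = project_halfspace_diag(x, a, b, h=[1.0, 1.0])
print(y_weighted, y_euclid)
```

With `h = [1.0, 1.0]` the sketch reduces to the usual Euclidean projection \(P_{C}\). With unequal entries, the coordinate carrying the larger weight moves less, while the projected point still lies on the boundary \(\langle a, y\rangle = b\).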

References

  1. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge (2008)
  2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
  3. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
  4. Nedić, A., Lee, S.: On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM J. Optim. 24, 84–107 (2014)
  5. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
  6. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. SIAM J. Optim. 22, 1469–1492 (2012)
  7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
  8. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp. 1–15 (2015)
  9. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: Proceedings of the International Conference on Learning Representations, pp. 1–23 (2018)
  10. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
  11. Berinde, V.: Iterative Approximation of Fixed Points. Springer, Berlin (2007)
  12. Halpern, B.: Fixed points of nonexpanding maps. Bull. Am. Math. Soc. 73, 957–961 (1967)
  13. Krasnosel’skiĭ, M.A.: Two remarks on the method of successive approximations. Usp. Mat. Nauk 10, 123–127 (1955)
  14. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4, 506–510 (1953)
  15. Nakajo, K., Takahashi, W.: Strong convergence theorems for nonexpansive mappings and nonexpansive semigroups. J. Math. Anal. Appl. 279, 372–379 (2003)
  16. Wittmann, R.: Approximation of fixed points of nonexpansive mappings. Arch. Math. 58, 486–491 (1992)
  17. Iiduka, H.: Stochastic fixed point optimization algorithm for classifier ensemble. IEEE Trans. Cybern. 50, 4370–4380 (2020)
  18. Yin, X.C., Huang, K., Hao, H.W., Iqbal, K., Wang, Z.B.: A novel classifier ensemble method with sparsity and diversity. Neurocomputing 134, 214–221 (2014)
  19. Yin, X.C., Huang, K., Yang, C., Hao, H.W.: Convex ensemble learning with sparsity and diversity. Inf. Fusion 20, 49–58 (2014)
  20. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
  21. Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, New York (2000)
  22. Bauschke, H.H., Combettes, P.L.: A weak-to-strong convergence principle for Fejér-monotone methods in Hilbert space. Math. Oper. Res. 26, 248–264 (2001)
  23. Bauschke, H.H., Chen, J.: A projection method for approximating fixed points of quasi nonexpansive mappings without the usual demiclosedness condition. J. Nonlinear Convex Anal. 15, 129–135 (2014)
  24. Vasin, V.V., Ageev, A.L.: Ill-Posed Problems with a Priori Information. V.S.P. Intl. Science, Utrecht (1995)
  25. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory, 2nd edn. MOS-SIAM Series on Optimization. SIAM, Philadelphia (2014)
  26. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
  27. Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: shrinking procedures and optimal algorithms. SIAM J. Optim. 23, 2061–2089 (2013)
  28. Goebel, K., Kirk, W.A.: Topics in Metric Fixed Point Theory. Cambridge Studies in Advanced Mathematics. Cambridge University Press, New York (1990)
  29. Goebel, K., Reich, S.: Uniform Convexity, Hyperbolic Geometry, and Nonexpansive Mappings. Dekker, New York (1984)
  30. Takahashi, W.: Nonlinear Functional Analysis. Yokohama Publishers, Yokohama (2000)
  31. Yamada, I.: The hybrid steepest descent method for the variational inequality problem over the intersection of fixed point sets of nonexpansive mappings. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms for Feasibility and Optimization and Their Applications, pp. 473–504. Elsevier, New York (2001)
  32. Yamada, I., Ogura, N.: Hybrid steepest descent method for variational inequality problem over the fixed point set of certain quasi-nonexpansive mappings. Numer. Funct. Anal. Optim. 25, 619–655 (2004)
  33. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
  34. Wanka, G., Wilfer, O.: Formulae of epigraphical projection for solving minimax location problems. Pac. J. Optim. 16, 289–313 (2020)
  35. Nedić, A., Ozdaglar, A.: Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Optim. 19, 1757–1780 (2009)
  36. Iiduka, H.: Distributed optimization for network resource allocation with nonsmooth utility functions. IEEE Trans. Control Netw. Syst. 6, 1354–1365 (2019)
  37. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27 (2011)
  38. Dua, D., Graff, C.: UCI Machine learning repository. School Inf. Comput. Sci., Univ. California at Irvine, Irvine, CA, USA (2019)

Acknowledgements

The author would like to thank Professor Heinz Bauschke, Professor Yunier Bello-Cruz, Professor Radu Ioan Bot, Professor Robert Csetnek, and Professor Alexander Zaslavski for giving him a chance to submit his paper to this special issue. The author is sincerely grateful to Editor-in-Chief Yunier Bello-Cruz and the two anonymous reviewers for helping him improve the original manuscript. The author thanks Hiroyuki Sakai for his input on the numerical examples.

Funding

This work was supported by the Japan Society for the Promotion of Science (JSPS KAKENHI Grant Number JP18K11184).

Author information

Contributions

HI developed the mathematical methods. HI discussed the results and contributed to the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hideaki Iiduka.

Ethics declarations

Competing interests

The author declares that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Iiduka, H. Stochastic approximation method using diagonal positive-definite matrices for convex optimization with fixed point constraints. Fixed Point Theory Algorithms Sci Eng 2021, 10 (2021). https://doi.org/10.1186/s13663-021-00695-3

MSC

  • 65K05
  • 65K15
  • 90C15

Keywords

  • Adaptive learning rate optimization algorithms
  • Convex stochastic optimization
  • Fixed point
  • Quasinonexpansive mapping
  • Stochastic fixed point optimization algorithm
  • Stochastic subgradient