
Random block-coordinate methods for inconsistent convex optimisation problems

Abstract

We develop a novel randomised block-coordinate primal-dual algorithm for a class of non-smooth ill-posed convex programs. The method lies midway between the celebrated Chambolle–Pock primal-dual algorithm and Tseng’s accelerated proximal gradient method. We establish global convergence of the last iterate as well as optimal \(O(1/k)\) and \(O(1/k^{2})\) complexity rates in the convex and strongly convex case, respectively, where k is the iteration count. Motivated by the increased complexity in the control of distribution-level electric-power systems, we test the performance of our method on a second-order cone relaxation of an AC-OPF problem. Distributed control is achieved via the distributed locational marginal prices (DLMPs), which are obtained as dual variables in our optimisation framework.

1 Introduction

In this paper we study non-smooth convex composite optimisation problems of the form

$$ \min_{\boldsymbol{x}\in \mathcal{X}}\bigl\{ \Psi ( \boldsymbol{x}):=\Phi (\boldsymbol{x})+R(\boldsymbol{x}) \bigr\} , $$
(P)

where \(\mathcal{X}:=\arg \min \{\frac{1}{2} \Vert \boldsymbol{A}\boldsymbol{z}-\boldsymbol{b} \Vert ^{2} \vert \boldsymbol{z}\in \mathbb{R}^{m} \}\), and the functions \(\Phi :\mathbb{R}^{m}\to \mathbb{R}\) and \(R:\mathbb{R}^{m}\to (-\infty ,\infty ]\) are convex and additively decomposable of the form \(\Phi (\boldsymbol{x}):=\sum_{i=1}^{d}\phi _{i}(\boldsymbol{x}_{i})\) and \(R(\boldsymbol{x}):=\sum_{i=1}^{d}r_{i}(\boldsymbol{x}_{i})\). We assume that each function \(\phi _{i}:\mathbb{R}^{m_{i}}\to \mathbb{R}\) is smooth, whereas \(r_{i}(\cdot )\) is only required to be proper, convex and lower semi-continuous. We typically think of the smooth component \(\phi _{i}(\cdot )\) as a convex loss function, whereas \(r_{i}(\cdot )\) can take over the role of a regularising or penalty function. In particular, \(r_{i}(\cdot )\) can represent the indicator function of a closed, convex set \(\mathcal{K}_{i}\subset \mathbb{R}^{m_{i}}\), representing individual membership constraints of the decision variable \(\boldsymbol{x}_{i}\in \mathbb{R}^{m_{i}}\). Thus, problem (P) can include certain separable block constraints in addition to the non-separable constraints embodied in the set \(\mathcal{X}\). This setting gains relevance in distributed optimisation problems with joint coupling constraints, such as multi-agent optimisation problems [1]. Other examples include convex penalty functions, like the \(\ell _{1}\)-norm on \(\mathbb{R}^{m_{i}}\). The matrix \(\boldsymbol{A}=[\boldsymbol{A}_{1};\ldots ;\boldsymbol{A}_{d}]\) is decomposed into matrices \(\boldsymbol{A}_{i}\in \mathbb{R}^{q\times m_{i}}\). Accordingly, we use the notation \(\boldsymbol{x}=[\boldsymbol{x}_{1};\ldots ;\boldsymbol{x}_{d}]\) with each \(\boldsymbol{x}_{i}\in \mathbb{R}^{m_{i}}\) to represent the blocks of coordinates of x, and \(m=m_{1}+\cdots +m_{d}\). The joint restriction \(\boldsymbol{x}\in \mathcal{X}\) states that a feasible decision variable solves a linear least-squares problem; equivalently, \(\mathcal{X}\) is the solution set of the normal equations, \(\mathcal{X}=\{\boldsymbol{x}\in \mathbb{R}^{m}\vert \boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{x}=\boldsymbol{A}^{\top}\boldsymbol{b}\}\). When b is in the range of A, we can solve the linear system \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}\) exactly, and the optimisation problem reduces to the linearly constrained convex, non-smooth minimisation problem

$$ \min_{\boldsymbol{x}}\bigl\{ \Psi (\boldsymbol{x})=\Phi (\boldsymbol{x})+R(\boldsymbol{x})\bigr\} \quad \text{s.t.: } \boldsymbol{A} \boldsymbol{x}=\boldsymbol{b}. $$
(1)

We call this the consistent case. Problems of type (1) are very general and include all generic conic optimisation problems, such as linear programming, second-order cone optimisation and semi-definite programming. In particular, after partitioning the matrix A into suitably defined blocks, problem (1) serves as a canonical formulation of various distributed optimisation problems, as the next examples illustrate.

Example 1

(Consensus)

Consider the problem

$$ \min_{\boldsymbol{x}\in \mathbb{R}^{p}}\sum_{i=1}^{d}r_{i}( \boldsymbol{x}), $$

where \(r_{1},\ldots ,r_{d}\) are closed, convex functions on \(\mathbb{R}^{p}\). This problem is equivalent to

$$ \min_{\boldsymbol{x}_{1},\ldots ,\boldsymbol{x}_{d}}\sum_{i=1}^{d}r_{i}( \boldsymbol{x}_{i}) \quad \text{s.t.: }\boldsymbol{x}_{1}= \boldsymbol{x}_{2}=\cdots =\boldsymbol{x}_{d}. $$

This can be written as a linearly constrained optimisation problem in which the matrix A corresponds to the Laplacian of a connected undirected graph and the linear constraint is \(\boldsymbol{A}\boldsymbol{x}=0\).
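To make the constraint concrete, here is a minimal sketch (an illustration assuming the path graph on d nodes; numpy is the only dependency) that builds the block Laplacian and checks that \(\boldsymbol{A}\boldsymbol{x}=0\) holds exactly when all blocks agree:

```python
import numpy as np

d, p = 4, 3                           # d agents, each holding a block x_i in R^p
# Laplacian of the path graph 1-2-...-d (any connected undirected graph works)
L = np.zeros((d, d))
for i in range(d - 1):
    L[i, i] += 1.0
    L[i + 1, i + 1] += 1.0
    L[i, i + 1] -= 1.0
    L[i + 1, i] -= 1.0

A = np.kron(L, np.eye(p))             # acts blockwise on x = [x_1; ...; x_d]

u = np.random.default_rng(0).standard_normal(p)
x_consensus = np.tile(u, d)           # x_1 = ... = x_d = u
x_generic = np.random.default_rng(1).standard_normal(d * p)

print(np.allclose(A @ x_consensus, 0))  # True: consensus vectors lie in the kernel of A
print(np.allclose(A @ x_generic, 0))    # False (almost surely)
```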

Example 2

(Distributed model fitting)

Problem (P) can also cover composite minimisation problems that canonically arise in machine learning. Consider the problem

$$ \min_{\boldsymbol{u}\in \mathbb{R}^{p}}\ell (\boldsymbol{K}\boldsymbol{u}- \boldsymbol{b})+\rho ( \boldsymbol{u}), $$

where \(\ell (\cdot )\) is a convex and smooth statistical loss function of the form

$$ \ell (\boldsymbol{K}\boldsymbol{u}-\boldsymbol{b})=\sum _{i=1}^{q}\ell _{i}\bigl( \boldsymbol{K}_{i}^{\top} \boldsymbol{u}-b_{i} \bigr), $$

where \(\ell _{i}:\mathbb{R}\to \mathbb{R}\) is the loss for the ith training example, \(\boldsymbol{K}_{i}\in \mathbb{R}^{p}\) is the feature vector for example i, and \(b_{i}\) is the output or response for example i. Define the variable \(\boldsymbol{x}=(\boldsymbol{u};\boldsymbol{v})\in \mathbb{R}^{p+q}=\mathbb{R}^{m}\) and the linear operator \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{K}\boldsymbol{u}-\boldsymbol{v}\). Define the functions \(\phi _{i}:\mathbb{R}\to \mathbb{R}\) and \(r_{i}:\mathbb{R}\to (-\infty ,\infty ]\) by

$$ \phi _{i}(\boldsymbol{x}_{i}):=\begin{cases} \ell _{i-p}(v_{i-p}) & \text{if }p+1\leq i\leq m, \\ 0 & \text{if }1\leq i\leq p, \end{cases}\qquad r_{i}(\boldsymbol{x}_{i}):=\begin{cases} \rho _{i}(u_{i}) & \text{if }1\leq i\leq p, \\ 0 & \text{if }p+1\leq i\leq m. \end{cases} $$

Setting \(\Phi (\boldsymbol{x})=\sum_{i=1}^{m}\phi _{i}(\boldsymbol{x}_{i})\) and \(R(\boldsymbol{x})=\sum_{i=1}^{m}r_{i}(\boldsymbol{x}_{i})\) yields the convex optimisation problem

$$ \min_{\boldsymbol{x}}\Phi (\boldsymbol{x})+R(\boldsymbol{x})\quad \text{s.t.: }\boldsymbol{A}\boldsymbol{x}= \boldsymbol{b}. $$
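A quick numerical sanity check of this reformulation (a sketch under the illustrative assumptions \(\ell _{i}(t)=t^{2}/2\) and \(\rho (\boldsymbol{u})=\lambda \Vert \boldsymbol{u} \Vert _{1}\); these choices are ours and not prescribed by the paper):

```python
import numpy as np

q, p, lam = 8, 5, 0.1
rng = np.random.default_rng(0)
K = rng.standard_normal((q, p))
b = rng.standard_normal(q)
u = rng.standard_normal(p)

# original composite objective: l(Ku - b) + rho(u) with squared loss and l1 penalty
orig = 0.5 * np.sum((K @ u - b) ** 2) + lam * np.sum(np.abs(u))

# reformulation: x = (u; v) with A x = K u - v; the constraint A x = b forces v = K u - b
v = K @ u - b
x = np.concatenate([u, v])
A = np.hstack([K, -np.eye(q)])
assert np.allclose(A @ x, b)                     # x is feasible for A x = b

Phi = 0.5 * np.sum(v ** 2)                       # sum of the smooth blocks phi_i
R = lam * np.sum(np.abs(u))                      # sum of the non-smooth blocks r_i
print(np.isclose(orig, Phi + R))                 # True: the objectives coincide
```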

Problem (P) is more general than the examples just presented, since we do not insist on b belonging to the range of A. Hence, the optimisation problems of interest to us are non-smooth convex problems with inconsistent linear restrictions under which the exact condition \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}\) is replaced by the approximate condition \(\boldsymbol{A}\boldsymbol{x}\approx \boldsymbol{b}\). This assumption gains relevance in inverse problems and PDE-constrained optimisation, where problems of the form (P) appear in the method of residuals (Morozov regularisation) [2]. Another instance where ill-posed linear restrictions appear is studied in an application to power systems in Sect. 5 of this work. Motivated by all these different optimisation problems, this paper derives a unified random block-coordinate method that solves problem (P) under very general assumptions on the problem data.
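Whether a given instance is consistent can be checked numerically from the optimal residual \(h^{*}=\min_{\boldsymbol{x}}\frac{1}{2}\Vert \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\Vert ^{2}\): it vanishes exactly when b lies in the range of A. A minimal sketch (numpy only, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))          # tall matrix: range(A) is a strict subspace

x_feas = rng.standard_normal(3)
b_consistent = A @ x_feas                               # b in range(A)
b_inconsistent = b_consistent + rng.standard_normal(6)  # generic perturbation

for b in (b_consistent, b_inconsistent):
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # any minimiser of h
    h_star = 0.5 * np.linalg.norm(A @ x_ls - b) ** 2
    print(f"h* = {h_star:.3e}  ->  {'consistent' if h_star < 1e-10 else 'inconsistent'}")
```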

1.1 Related methods

Our algorithms are closely related to randomised coordinate descent methods, primal-dual coordinate update methods and accelerated primal-dual methods. In this subsection, we briefly review these classes of methods and discuss their relation to our work.

Linear constrained minimisation

From a theoretical standpoint, one could formulate problem (P) as a linearly constrained optimisation problem of the type (1), with linear restriction \(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}=\boldsymbol{A}^{\top}\boldsymbol{b}\) (the normal equations). Hence, one could approach problem (P) via primal-dual techniques directly. While we will show that our main algorithmic scheme (Algorithm 1) is indeed equivalent to a suitably defined primal-dual method, this connection is not always a recommended solution strategy in practice. First, a primal-dual implementation has to deal with the symmetric matrix \(\boldsymbol{A}^{\top}\boldsymbol{A}\), whose dimension is \(m\times m\); if q is much smaller than m, such methods would therefore have to process far more data than approaches working with A directly. Secondly, if A is a sparse matrix, then \(\boldsymbol{A}^{\top}\boldsymbol{A}\) is usually no longer sparse, which leads to heavy use of numerical linear algebra techniques. Finally, it is often not known a priori whether the linear system is de facto consistent with the given inputs. This is in particular the case in the application motivating the development of the numerical scheme presented in this paper, which is concerned with the distributed optimisation of an electrical distribution system within an AC optimal power-flow framework [3, 4]. A basic stability desideratum on a numerical solution method (acting as a decentralised coordination mechanism in our application) is therefore that overall system convergence is guaranteed even if the linear constraints cannot be satisfied with equality. Our method achieves exactly this. Section 5 illustrates the performance of our method on a 15-bus AC-OPF problem taken from [5].

Algorithm 1: Distributed accelerated proximal gradient algorithm (pseudocode displayed as a figure in the original).

Primal penalty methods

An alternative and popular approach to solving (P) is the penalty method. It consists of solving a sequence of unconstrained optimisation problems \(\min_{\boldsymbol{x}} \Psi (\boldsymbol{x})+\frac{\sigma _{k}}{2} \Vert \boldsymbol{A}\boldsymbol{x}-\boldsymbol{b} \Vert ^{2} \), where \((\sigma _{k})\) is a positive and increasing sequence of penalty parameters. Intuitively, the larger \(\sigma _{k}\) is chosen, the more importance the optimisation gives to the constraints. Since penalty methods are entirely primal and do not use duality arguments, they are in principle able to handle the inconsistent case as well. However, their implementation usually involves two loops: an inner loop solving the minimisation problem for a fixed parameter \(\sigma _{k}\) to some desired accuracy, followed by an outer loop describing how the penalty parameter is updated. Viewed from this perspective, our algorithm performs these operations in a single-loop fashion.
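For contrast with the single-loop scheme developed below, here is a minimal two-loop quadratic-penalty sketch; the illustrative assumptions (Φ ≡ 0, \(R=\lambda \Vert \cdot \Vert _{1}\), an ISTA inner loop and a doubling penalty parameter) are ours and not part of the paper:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def penalty_method(A, b, lam=0.1, outer_iters=10, inner_iters=200):
    """Two-loop quadratic-penalty sketch for min lam*||x||_1 + (sigma_k/2)*||Ax - b||^2
    with an ISTA inner loop and a geometrically increasing penalty parameter sigma_k."""
    x = np.zeros(A.shape[1])
    sigma = 1.0
    for _ in range(outer_iters):                           # outer loop: update the penalty
        step = 1.0 / (sigma * np.linalg.norm(A, 2) ** 2)   # 1 / Lipschitz constant
        for _ in range(inner_iters):                       # inner loop: proximal gradient
            grad = sigma * (A.T @ (A @ x - b))
            x = soft_threshold(x - step * grad, step * lam)
        sigma *= 2.0
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 8))
    b = rng.standard_normal(5)
    x = penalty_method(A, b)
    print(np.linalg.norm(A @ x - b))   # residual shrinks as sigma_k grows
```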

Randomised coordinate descent methods

In the absence of the linear constraints, our algorithm specialises to randomised coordinate descent (RCD), which was first proposed in [6] and later generalised to the non-smooth case in [7, 8]. It was shown that RCD features a sublinear convergence rate of \(O(1/k)\), k being the iteration counter. Acceleration to an \(O(1/k^{2})\) complexity, and even linear rates for strongly convex problems, has been obtained. Extensions to parallel computations, important for large-scale optimisation problems, were first proposed in [9].

Primal-dual coordinate update methods

To cope with linear constraints, a very popular approach is the alternating direction method of multipliers (ADMM). Originally, ADMM [10, 11] was proposed for two-block structured problems with separable objective. The convergence and complexity analysis of this method is well documented in the literature [12]; see [13] for a survey. Direct extensions of ADMM to multi-block settings such as (P) are not straightforward and may even fail to converge [14]. Very recently, [15] proposed a randomised primal-dual coordinate (RPDC) update method, whose asynchronous parallel version was then studied in [16]. It was shown that RPDC converges with rate \(O(1/k)\) under the convexity assumption. Improved complexity statements for multi-block ADMM can be found in [17].

Accelerated primal-dual methods

It is possible to accelerate the rate of convergence from \(O(1/k)\) to \(O(1/k^{2})\) for gradient-type methods. The first acceleration result was obtained by [18] for smooth, unconstrained problems. The technique has since been generalised to accelerated gradient-type methods for possibly non-smooth convex programs [19, 20]. Primal-dual methods for solving linearly constrained problems can also be accelerated by similar techniques. Under the convexity assumption, the augmented Lagrangian method (ALM) is accelerated from \(O(1/k)\) to \(O(1/k^{2})\) in [21].

1.2 Contribution

Methodological contributions

We propose a block-coordinate implementation of the method developed by [22] for linearly constrained optimisation, lying midway between the celebrated primal-dual hybrid gradient algorithm [23] and Tseng’s accelerated proximal gradient method [24]. Specifically, our proposed method is a distributed interpretation of the primal-dual algorithm of [23] that operates on randomly selected coordinate blocks. A parallel between the Chambolle–Pock method [23] and the accelerated proximal gradient method of [24] was already drawn in [22, 25]. Reducing the primal-dual algorithm to an implementation of Tseng’s method enabled [25] to derive new convergence results based on primal arguments, thus departing from strong duality requirements and the ergodic rates typically issued for primal-dual methods. Our developments extend the coordinate-descent implementation proposed in [26] for the basic algorithm of [22] to a block-coordinate descent setting featuring a Nesterov-type acceleration. In particular, in the strongly convex case, we derive a new step-size policy that achieves an accelerated rate of \(O(k^{-2})\). Aside from the recent preprint [27], we are not aware of a similar algorithm achieving the accelerated convergence rate \(O(k^{-2})\) in a fully distributed computational setting. Thus, our main contributions in this paper can be summarised as follows:

  1. (i)

    In the convex case, provided problem (P) possesses a saddle point (defined in Sect. 2.6), our main result is Theorem 6, which establishes an \(O(1/k)\) iteration complexity in terms of the objective function gap, together with convergence of the last iterate.

  2. (ii)

    In the strongly convex case with uniform sampling of coordinate blocks, our main result is Theorem 8, which proves an accelerated \(O(k^{-2})\) convergence rate in the primal objective function values.

We remark that RCD methods have been shown to exhibit linear convergence rates in the strongly convex regime [6]. Such fast convergence is, however, not to be expected in the presence of linear constraints: strong convexity of the primal objective ensures smoothness of the Lagrangian dual function, but not its strong concavity. Hence, in general, we do not expect to see linear convergence rates by only assuming strong convexity in the primal. However, we note that [28] obtains linear convergence rates in the consistent regime if there is one block variable that is independent of all others in the objective (but coupled in the linear constraint) and the corresponding component function is smooth.

Also related to this paper is the very recent work [29]. It considers a larger class of convex optimisation problems with linear constraints and designs a new randomised primal-dual algorithm with single-block activation in each iteration, obtaining complexity results similar to those reported in the present work. However, our method is able to solve inconsistent convex optimisation problems and allows for general sampling techniques.

1.3 Organisation of this paper

This paper is organised as follows. Section 2 describes our block-coordinate descent framework. Section 3 explains in detail our algorithmic approach. Section 4 contains all details of the asymptotic convergence and finite-time complexity statements. Section 5 describes a challenging application of our algorithm to a distributed optimisation formulation of an AC-OPF problem; there we formulate the distribution-grid model based on the second-order cone relaxation of [30, 31]. Preliminary numerical results are reported to show the applicability of our method, using the 15-bus network studied in [5] as a concrete example.

2 Preliminaries

2.1 Notation

We work in the space \(\mathbb{R}^{p}\) of column vectors. For \(\boldsymbol{x},\boldsymbol{u}\in \mathbb{R}^{p}\), denote the standard Euclidean inner product \(\langle{\boldsymbol{x}},{\boldsymbol{u}}\rangle =\boldsymbol{x}^{\top}\boldsymbol{u}\) and the Euclidean norm \(\Vert \boldsymbol{x} \Vert =\langle{\boldsymbol{x}},{\boldsymbol{x}}\rangle ^{1/2}\). We let \(\mathbb{S}^{p}_{+}:=\{\boldsymbol{B}\in \mathbb{R}^{p\times p}\vert \boldsymbol{B}^{ \top}=\boldsymbol{B},\boldsymbol{B}\succeq 0\}\) denote the cone of symmetric positive semi-definite matrices, and \(\mathbb{S}^{p}_{++}\) the set of symmetric positive-definite matrices. Given \(\Lambda \in \mathbb{S}^{p}_{+}\), we let \(\|\boldsymbol{x}\|_{\Lambda}:=\langle{\Lambda \boldsymbol{x}},{\boldsymbol{x}}\rangle ^{1/2}\) for \(\boldsymbol{x}\in \mathbb{R}^{p}\). The identity matrix of dimension p is denoted by \(\boldsymbol{I}_{p}\). Whenever the dimension is clear from the context, we will just write I. We denote by \(\lambda _{\max}(\boldsymbol{A})\) the largest eigenvalue of a square \(p\times p\) matrix A. We denote by \(\Gamma _{0}(\mathbb{R}^{p})\) the set of proper convex, lower semi-continuous functions \(f:\mathbb{R}^{p}\to (-\infty ,+\infty ]\). For such a function \(f\in \Gamma _{0}(\mathbb{R}^{p})\), the effective domain is defined as \(\mathrm{dom}(f):=\{\boldsymbol{x}\vert f(\boldsymbol{x})<\infty \}\). For \(d\in \mathbb{N}\), we set \([d]:=\{1,\ldots ,d\}\). For a vector \(\nu \in \mathbb{R}^{p}_{++}\), we let \(\nu ^{-1}\) denote the vector of reciprocal values. For \(\Gamma \in \mathbb{S}^{m}_{++}\), define the weighted proximal operator \(\operatorname{\mathtt{prox}}^{\Gamma}_{r}(\boldsymbol{x}):=\arg \min_{\boldsymbol{u}\in \mathbb{R}^{m}}\{r( \boldsymbol{u})+\frac{1}{2} \Vert \boldsymbol{u}-\boldsymbol{x} \Vert _{\Gamma}^{2}\}\). If \(\Gamma :=\operatorname{\mathtt{blkdiag}}(\Gamma _{1},\ldots ,\Gamma _{d})\), then the proximal operator decomposes accordingly as \(\operatorname{\mathtt{prox}}^{\Gamma}_{r}(\boldsymbol{x})= (\operatorname{\mathtt{prox}}^{\Gamma _{1}}_{r_{1}}( \boldsymbol{x}_{1}),\ldots ,\operatorname{\mathtt{prox}}_{r_{d}}^{\Gamma _{d}}(\boldsymbol{x}_{d}) ) \). If \(f:\mathbb{R}^{m}\to \mathbb{R}\) is differentiable, we denote the gradient of f at \(\boldsymbol{x}\in \mathbb{R}^{m}\) by \(\nabla f(\boldsymbol{x})\in \mathbb{R}^{m}\).
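To make the weighted proximal operator concrete, the following sketch evaluates \(\operatorname{\mathtt{prox}}^{\Gamma}_{r}\) for the illustrative choice \(r(\boldsymbol{u})=\lambda \Vert \boldsymbol{u} \Vert _{1}\) and diagonal Γ (an assumption of this illustration, not a restriction of the paper), for which the minimisation decouples into coordinate-wise soft-thresholding with thresholds \(\lambda /\Gamma _{ii}\):

```python
import numpy as np

def prox_weighted_l1(x, gamma, lam):
    """prox^Gamma_r(x) for r(u) = lam*||u||_1 and Gamma = diag(gamma):
    argmin_u lam*||u||_1 + 0.5*||u - x||_Gamma^2, solved coordinate-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam / gamma, 0.0)

# quick check of one coordinate against a brute-force grid search
x, gamma, lam = np.array([0.7]), np.array([2.0]), 1.0
u = np.linspace(-2.0, 2.0, 400001)
obj = lam * np.abs(u) + 0.5 * gamma[0] * (u - x[0]) ** 2
print(prox_weighted_l1(x, gamma, lam)[0], u[np.argmin(obj)])   # both approximately 0.2
```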

2.2 Block structure

We first describe the block setup that has become standard in the analysis of block-coordinate methods [8, 9, 32, 33]. The block structure of (P) is given by a decomposition of \(\mathbb{R}^{m}\) into d subspaces \(\mathbb{R}^{m_{i}},1\leq i\leq d\), so that \(\mathbb{R}^{m}=\mathbb{R}^{m_{1}}\times \cdots \times \mathbb{R}^{m_{d}}\). Let \(\boldsymbol{U}=[\boldsymbol{U}_{1},\ldots ,\boldsymbol{U}_{d}]\) be the \(m\times m\) identity matrix, decomposed into column submatrices \(\boldsymbol{U}_{i}\in \mathbb{R}^{m\times m_{i}}\). For \(\boldsymbol{x}\in \mathbb{R}^{m}\), let \(\boldsymbol{x}_{i}=\boldsymbol{U}^{\top}_{i}\boldsymbol{x}\) be the block of coordinates corresponding to the columns of \(\boldsymbol{U}_{i}\). Any vector \(\boldsymbol{s}\in \mathbb{R}^{m}\) can be written as \(\boldsymbol{s}=\sum_{i=1}^{d}\boldsymbol{U}_{i}\boldsymbol{s}_{i}\). For \(\emptyset \neq I\subseteq [d]\), we write

$$ \boldsymbol{s}_{[I]}=\sum_{i\in I} \boldsymbol{U}_{i}\boldsymbol{s}_{i}. $$

We denote the \(\ell ^{2}\)-norm on \(\mathbb{R}^{m_{i}}\) by \(\Vert \cdot \Vert _{i}\). If \(\boldsymbol{Q}=\operatorname{\mathtt{blkdiag}}[\boldsymbol{Q}_{1};\ldots ;\boldsymbol{Q}_{d}]\) is a block-diagonal matrix with \(\boldsymbol{Q}_{i}\in \mathbb{S}^{m_{i}}_{++}\), we define a weighted norm on \(\mathbb{R}^{m}\) by

$$ \Vert \boldsymbol{s} \Vert ^{2}_{\boldsymbol{Q}}=\sum _{i=1}^{d} \Vert \boldsymbol{s}_{i} \Vert ^{2}_{\boldsymbol{Q}_{i}} \quad \forall \boldsymbol{s}\in \mathbb{R}^{m}. $$

2.3 Smoothness of Φ

We assume throughout the paper that \(\Phi :\mathbb{R}^{m}\to \mathbb{R}\) is convex and possesses a Lipschitz continuous partial gradient. Specifically, we assume that for each block \(i\in [d]\) there exists a matrix \(\Lambda _{i}\in \mathbb{S}^{m_{i}}_{++}\) so that

$$ 0 \leq \phi _{i}(\boldsymbol{x}_{i}+ \boldsymbol{t}_{i})-\phi _{i}(\boldsymbol{x}_{i})- \bigl\langle { \nabla \phi _{i}(\boldsymbol{x}_{i})},{ \boldsymbol{t}_{i}}\bigr\rangle \leq \frac{1}{2} \Vert \boldsymbol{t}_{i} \Vert ^{2}_{\Lambda _{i}}. $$
(2)

A typical situation is that \(\Lambda _{i}=L_{i}\boldsymbol{I}_{m_{i}}\) for a scalar \(L_{i}>0\), so that the gradient \(\nabla \phi _{i}\) is Lipschitz continuous with modulus \(L_{i}\). Allowing for matrix-valued parameters increases generality and takes into account that norms on the factors \(\mathbb{R}^{m_{i}}\) might differ from block to block. Collecting all the matrices \(\Lambda _{i}\) in one block-diagonal matrix \(\Lambda :=\operatorname{\mathtt{blkdiag}}[\Lambda _{1};\ldots ;\Lambda _{d}]\), it follows that

$$ \Phi \bigl(\boldsymbol{x}'\bigr)-\Phi (\boldsymbol{x})-\bigl\langle {\nabla \Phi (\boldsymbol{x})},{\boldsymbol{x}'- \boldsymbol{x}}\bigr\rangle \leq \frac{1}{2} \bigl\Vert \boldsymbol{x}'-\boldsymbol{x} \bigr\Vert ^{2}_{\Lambda}. $$
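As a standard illustration of (2) (not specific to this paper), consider a least-squares block:

$$ \phi _{i}(\boldsymbol{x}_{i})=\tfrac{1}{2} \Vert \boldsymbol{C}_{i}\boldsymbol{x}_{i}-\boldsymbol{d}_{i} \Vert ^{2} \quad \Longrightarrow \quad \phi _{i}(\boldsymbol{x}_{i}+\boldsymbol{t}_{i})-\phi _{i}(\boldsymbol{x}_{i})-\bigl\langle {\nabla \phi _{i}(\boldsymbol{x}_{i})},{\boldsymbol{t}_{i}}\bigr\rangle =\tfrac{1}{2} \Vert \boldsymbol{t}_{i} \Vert ^{2}_{\boldsymbol{C}_{i}^{\top}\boldsymbol{C}_{i}}, $$

so that (2) holds with \(\Lambda _{i}=\boldsymbol{C}_{i}^{\top}\boldsymbol{C}_{i}\) (assuming \(\boldsymbol{C}_{i}\) has full column rank, so that \(\Lambda _{i}\succ 0\)), while the cruder scalar choice \(\Lambda _{i}=\lambda _{\max}(\boldsymbol{C}_{i}^{\top}\boldsymbol{C}_{i})\boldsymbol{I}_{m_{i}}\) is also admissible.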

2.4 Properties of R

We assume that \(R:\mathbb{R}^{m}\to (-\infty ,\infty ]\) is block separable \(R(\boldsymbol{x})=\sum_{i=1}^{d}r_{i}(\boldsymbol{x}_{i})\), where the functions \(r_{i}:\mathbb{R}^{m_{i}}\to (-\infty ,\infty ]\) are \(\mu _{i}\)-strongly convex and closed, with \(\mu _{i}\geq 0\). Calling \(\Upsilon _{i}=\mu _{i}\boldsymbol{I}_{m_{i}}\), this gives

$$ r_{i}\bigl(\boldsymbol{x}'_{i}\bigr)\geq r_{i}(\boldsymbol{x}_{i})+\bigl\langle {\xi _{i}},{\boldsymbol{x}'_{i}- \boldsymbol{x}_{i}}\bigr\rangle +\frac{1}{2} \bigl\Vert \boldsymbol{x}'_{i}-\boldsymbol{x}_{i} \bigr\Vert ^{2}_{ \Upsilon _{i}}\quad \forall \boldsymbol{x}'_{i} \in \mathbb{R}^{m_{i}}, \forall \xi _{i} \in \partial r_{i}( \boldsymbol{x}_{i}). $$
(3)

Typical examples for the regulariser \(r_{i}\) are indicator functions of closed convex sets, i.e. \(r_{i}(\boldsymbol{x}_{i})=\delta _{\mathcal{K}_{i}}(\boldsymbol{x}_{i})\), for \(\mathcal{K}_{i}\subset \mathbb{R}^{m_{i}}\) convex and closed, or structure-imposing regularisers like the \(L_{p}\)-penalty \(r_{i}(\boldsymbol{x}_{i})= \Vert \boldsymbol{x}_{i} \Vert ^{p}_{L_{p}(\mathbb{R}^{m_{i}})}\) (\(p \geq 1\)), prominent in distributed estimation of high-dimensional signals and neural networks. We let \(\Upsilon =\operatorname{\mathtt{blkdiag}}[\Upsilon _{1};\ldots ;\Upsilon _{d}]\) be the \(m\times m\) matrix collecting all strong-convexity parameters of the individual blocks.

2.5 Quadratic penalty function

Define the function

$$ h(\boldsymbol{x}):=\frac{1}{2} \Vert \boldsymbol{A} \boldsymbol{x}-b \Vert ^{2}. $$
(4)

Let \(h^{*}=\min_{\boldsymbol{x}\in \mathbb{R}^{m}} h(\boldsymbol{x})\), so that \(\mathcal{X}=\arg \min_{\boldsymbol{x}\in \mathbb{R}^{m}}h(\boldsymbol{x})\). Clearly, \(h^{\ast}\geq 0\), with equality if and only if the linear system \(\boldsymbol{A}\boldsymbol{x}=\boldsymbol{b}\) is consistent.

Since h is quadratic, we have for all \(\boldsymbol{u},\boldsymbol{w},\boldsymbol{x}\in \mathbb{R}^{m}\)

$$ h(\boldsymbol{u})-h(\boldsymbol{x})=\bigl\langle {\nabla h( \boldsymbol{w})},{\boldsymbol{u}-\boldsymbol{x}}\bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{A}(\boldsymbol{u}-\boldsymbol{w}) \bigr\Vert ^{2}-\frac{1}{2} \bigl\Vert \boldsymbol{A}( \boldsymbol{x}-\boldsymbol{w}) \bigr\Vert ^{2}. $$
(5)

In particular, if \(\boldsymbol{x}^{*}\in \mathcal{X}\), the above implies for \(\boldsymbol{x}=\boldsymbol{w}=\boldsymbol{x}^{*}\)

$$ h(\boldsymbol{x})-h\bigl(\boldsymbol{x}^{*}\bigr)= \frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}- \boldsymbol{x}^{*}\bigr) \bigr\Vert ^{2}. $$

We also have

$$\begin{aligned} h(\boldsymbol{x}+\boldsymbol{U}_{i}\boldsymbol{t}_{i})&=h( \boldsymbol{x})+\bigl\langle {\nabla h(\boldsymbol{x})},{ \boldsymbol{U}_{i} \boldsymbol{t}_{i}}\bigr\rangle +\frac{1}{2} \Vert \boldsymbol{A}_{i}\boldsymbol{t}_{i} \Vert ^{2} \\ &= h(\boldsymbol{x})+\bigl\langle {\boldsymbol{U}_{i}^{\top } \nabla h(\boldsymbol{x})},{\boldsymbol{t}_{i}} \bigr\rangle + \frac{1}{2} \Vert \boldsymbol{t}_{i} \Vert ^{2}_{\boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{i}} \\ &\leq h(\boldsymbol{x})+\bigl\langle {\boldsymbol{U}_{i}^{\top } \nabla h(\boldsymbol{x})},{\boldsymbol{t}_{i}} \bigr\rangle + \frac{\lambda _{i}}{2} \Vert \boldsymbol{t}_{i} \Vert ^{2}_{i}, \end{aligned}$$

where \(\lambda _{i}\equiv \lambda _{\max}(\boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{i}) = \Vert \boldsymbol{A}_{i} \Vert _{2}^{2}\), the squared spectral norm of the matrix \(\boldsymbol{A}_{i}\), and Lemma 9 (proven in Appendix A.1) immediately implies that for all \(t\in [0,1]\)

$$ h\bigl(t\boldsymbol{x}+(1-t)\boldsymbol{x}' \bigr)=th(\boldsymbol{x})+(1-t)h\bigl(\boldsymbol{x}'\bigr)- \frac{t(1-t)}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}- \boldsymbol{x}'\bigr) \bigr\Vert ^{2}. $$
(6)

2.6 On saddle points

The optimisation problem can be equivalently expressed as the linearly constrained optimisation problem

$$ \min_{\boldsymbol{x}=(\boldsymbol{x}_{1},\ldots ,\boldsymbol{x}_{d})}\Psi (\boldsymbol{x})=\Phi ( \boldsymbol{x})+R( \boldsymbol{x})\quad \text{s.t.: }\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{x}=\boldsymbol{A}^{ \top}b. $$
(7)

The Lagrangian associated with this non-smooth, convex optimisation problem is

$$ \mathcal{L}(\boldsymbol{x},\boldsymbol{y})=\Psi (\boldsymbol{x})+\bigl\langle { \boldsymbol{y}},{\boldsymbol{A}^{ \top }b-\boldsymbol{A}^{\top } \boldsymbol{A}\boldsymbol{x}}\bigr\rangle . $$

Definition 1

A pair \((\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\) is called a saddle-point if

$$ 0\in \partial \Psi \bigl(\boldsymbol{x}^{\ast}\bigr)- \boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{y}^{\ast}, \qquad \boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}^{\ast}- \boldsymbol{A}^{\top}b=0. $$
(8)

For convex programs, the conditions (8) are sufficient for \(\boldsymbol{x}^{\ast}\) to be a solution of (P). They are also necessary if a constraint qualification condition holds (e.g. the Slater condition, stating that there exists x in the interior of the domain of Ψ such that \(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}=\boldsymbol{A}^{\top}b\)).

2.7 Random sampling

We next introduce our random-sampling strategy. Our approach is very general, and allows for virtually all existing sampling strategies considered in the literature. We refer the reader to [34, 35] for an in-depth systematic overview on this topic.

Let \((\Omega ,\mathcal{F},\mathbb{P})\) be a probability space. By a sampling we mean a random set-valued mapping with values in \(2^{[d]}\). We will call the random variable \(\mathcal{I}:\Omega \to 2^{[d]}\) a random sampling. \(\mathcal{I}(\omega )\) defines the selection of blocks employed in a single iteration of our method. The set \(\mathcal{I}(\Omega )=\Sigma \) is the set of all possible realisations of the random selection mechanism. Let \(\{\iota _{k}\}_{k\in \mathbb{N}}\) represent the stochastic process on \(2^{[d]}\) in which each random variable \(\iota _{k}\) is an i.i.d. copy of \(\mathcal{I}\). We refer to such a sampling as an i.i.d. sampling. We assume that the sampling \(\mathcal{I}\) is proper: there exists a vector \(\boldsymbol{\pi}=[\pi _{1},\ldots ,\pi _{d}]\in \mathbb{R}^{d}\) with

$$ \pi _{i}=\mathbb{P}(i\in \mathcal{I})\in (0,1)\quad \forall i\in [d]. $$

With the sampling \(\mathcal{I}\), we associate the matrix \(\boldsymbol{\Pi}\in \mathbb{R}^{d\times d}\) defined as

$$ (\boldsymbol{\Pi})_{ij}:=\mathbb{P}\bigl(\{i,j\}\subseteq \mathcal{I}\bigr)\quad \forall i\neq j, \quad \text{and}\quad (\boldsymbol{ \Pi})_{ii}=\pi _{i}. $$

We note that \(\boldsymbol{\Pi}\succ 0\) [35, Thm 3.1]. We further define the weighting matrix as

$$ \boldsymbol{P}:=\operatorname{\mathtt{blkdiag}}\biggl[\frac{1}{\pi _{1}}\boldsymbol{I}_{m_{1}}, \ldots , \frac{1}{\pi _{d}}\boldsymbol{I}_{m_{d}}\biggr]. $$
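The quantities π, Π and P are easy to estimate empirically. The following sketch (numpy only) does so by Monte Carlo for an s-nice sampling, i.e. a uniformly random subset of fixed size s, and compares the estimates with the standard closed-form values \(\pi _{i}=s/d\) and \((\boldsymbol{\Pi})_{ij}=s(s-1)/(d(d-1))\) for \(i\neq j\):

```python
import numpy as np
from itertools import combinations

d, s, trials = 5, 2, 100_000       # d blocks, subsets of fixed size s ("s-nice" sampling)
rng = np.random.default_rng(0)

pi_hat = np.zeros(d)
Pi_hat = np.zeros((d, d))
for _ in range(trials):
    I = rng.choice(d, size=s, replace=False)    # one realisation of the sampling
    pi_hat[I] += 1
    for i, j in combinations(I, 2):
        Pi_hat[i, j] += 1
        Pi_hat[j, i] += 1
pi_hat /= trials
Pi_hat /= trials
np.fill_diagonal(Pi_hat, pi_hat)                # (Pi)_{ii} = pi_i by definition

print(pi_hat)                                     # ~ s/d = 0.4 in every entry
print(Pi_hat[0, 1], s * (s - 1) / (d * (d - 1)))  # ~ 0.1: matches the closed form

# weighting matrix P = blkdiag(pi_1^{-1} I, ..., pi_d^{-1} I); scalar blocks here
P = np.diag(1.0 / pi_hat)
```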

We emphasise that the random-sampling model we adopt here is capable of capturing many stationary randomised activation mechanisms. To illustrate this, consider the following activation mechanisms:

  • Single-coordinate activation: at each iteration, exactly one coordinate block is activated. This means that \(\mathcal{I}(\omega )\) is always a singleton, so that Σ can be identified with \([d]\). In this case, we necessarily have \(\sum_{i=1}^{d}\pi _{i}=1\).

  • Uniform Sampling: for all \(i,j\in [d]\) it holds that \(\mathbb{P}(i\in \mathcal{I})=\mathbb{P}(j\in \mathcal{I})\). This implies \(\pi _{i}=\frac{\mathbb{E}[\lvert \mathcal{I}\rvert ]}{d}\) for all \(i\in [d]\). A special case of a uniform sampling is the popular class of m-nice samplings, in which Σ is the set of all subsets of \([d]\) containing exactly m elements and each such subset is drawn with equal probability. Clearly, in this case one has \(\pi _{i}=\frac{m}{d}\) for all \(i\in [d]\).

  • Full Sampling: \(\Sigma =\{[d]\}\), i.e. the whole set \([d]\) is selected in every iteration, which means that all coordinates are updated in parallel.

3 Parallel block-coordinate algorithm

Our random block-coordinate algorithm for solving (P) recursively updates three sequences \(\{(\boldsymbol{z}^{k},\boldsymbol{w}^{k},\boldsymbol{x}^{k})\}_{k\geq 0}\). Let \((\sigma _{k})_{k\geq 0}\) be a given sequence of positive numbers. At each iteration \(k\geq 0\), we define a weight sequence \((S_{k})_{k\geq 0}\) recursively by

$$ S_{0}=1,\qquad S_{k}=S_{k-1}+\sigma _{k},\qquad \theta _{k}= \frac{\sigma _{k}}{S_{k}}. $$

Consider the extrapolated point

$$ \boldsymbol{z}^{k}=(1-\theta _{k}) \boldsymbol{w}^{k}+\theta _{k}\boldsymbol{x}^{k}= \boldsymbol{w}^{k}+ \theta _{k}\bigl( \boldsymbol{x}^{k}-\boldsymbol{w}^{k}\bigr), $$
(9)

together with the sequence

$$ \boldsymbol{w}^{k+1}=\boldsymbol{z}^{k}+ \theta _{k}\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr). $$
(10)

This reads in coordinates \(\boldsymbol{w}_{i}^{k+1}=\boldsymbol{z}_{i}^{k}+\frac{\theta _{k}}{\pi _{i}}(\boldsymbol{x}_{i}^{k+1}- \boldsymbol{x}_{i}^{k})\) for all \(i\in [d]\). To evaluate \(\boldsymbol{w}^{k+1}\), we need the primal update \(\boldsymbol{x}^{k+1}\), which is obtained by a weighted forward–backward step involving the first-order signal \(g^{k}_{i}=\nabla \phi _{i}(\boldsymbol{x}^{k}_{i})+S_{k}\nabla _{i}h(\boldsymbol{z}^{k}) \). Specifically, given a block-specific scaling matrix \(\boldsymbol{B}_{i}\in \mathbb{S}^{m_{i}}_{++}\), we define \(\boldsymbol{Q}_{i}^{k}:=\frac{1}{\pi _{i}\tau _{k}}\boldsymbol{B}_{i}\), and the weighted forward–backward operator \(\operatorname{\mathsf{T}}^{k}_{i}(\boldsymbol{x}^{k})=\operatorname{\mathtt{prox}}_{r_{i}}^{\boldsymbol{Q}^{k}_{i}}(\boldsymbol{x}^{k}_{i}-( \boldsymbol{Q}^{k}_{i})^{-1}g^{k}_{i})\), which reads explicitly as

$$ \operatorname{\mathsf{T}}^{k}_{i}\bigl( \boldsymbol{x}^{k}\bigr):=\arg \min_{\boldsymbol{u}_{i}\in \mathbb{R}^{m_{i}}} \biggl\{ r_{i}(\boldsymbol{u}_{i})+\bigl\langle {g^{k}_{i}},{ \boldsymbol{u}_{i}-\boldsymbol{x}^{k}_{i}} \bigr\rangle +\frac{1}{2} \bigl\Vert \boldsymbol{u}_{i}- \boldsymbol{x}^{k}_{i} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}_{i}} \biggr\} . $$
(11)

We will choose the matrices \(\boldsymbol{B}_{i}\) later so as to adapt to the strong convexity present in the problem data. Putting all these tools together yields Algorithm 1.
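The following minimal sketch (numpy only) instantiates the recursion (9)–(11) for scalar blocks with \(\phi _{i}\equiv 0\), \(r_{i}=\lambda \vert \cdot \vert \), \(\boldsymbol{B}_{i}=1\), a single-coordinate sampling and constant \(\sigma _{k}\equiv \sigma \), \(\tau _{k}\equiv \tau \); the step-size constants below are heuristic and do not implement the conditions derived in Sect. 4:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def algorithm1_sketch(A, b, lam=0.1, sigma=0.5, tau=None, iters=5000, seed=0):
    """Sketch of Algorithm 1: scalar blocks, phi_i = 0, r_i = lam*|.|, B_i = 1,
    single-coordinate sampling (pi_i = 1/d), constant sigma_k and tau_k."""
    rng = np.random.default_rng(seed)
    q, d = A.shape
    pi = np.full(d, 1.0 / d)
    if tau is None:
        # heuristic constant step size; the actual requirements are (23)-(24) in Sect. 4
        tau = 1.0 / (sigma * d * np.linalg.norm(A, 2) ** 2)
    x = np.zeros(d)
    w = x.copy()
    S_prev = 1.0                                    # S_0 = 1
    for _ in range(iters):
        S = S_prev + sigma                          # S_k = S_{k-1} + sigma_k
        theta = sigma / S                           # theta_k = sigma_k / S_k
        z = (1.0 - theta) * w + theta * x           # extrapolated point (9)
        i = rng.integers(d)                         # active block iota_k = {i}
        g_i = S * (A[:, i] @ (A @ z - b))           # g_i^k = S_k * grad_i h(z^k)  (phi_i = 0)
        Q_i = 1.0 / (pi[i] * tau)                   # Q_i^k = B_i / (pi_i * tau_k)
        x_hat_i = soft_threshold(x[i] - g_i / Q_i, lam / Q_i)   # prox step (11)
        w = z.copy()
        w[i] += (theta / pi[i]) * (x_hat_i - x[i])  # update (10) on the active block
        x[i] = x_hat_i                              # block-coordinate update of x^{k+1}
        S_prev = S
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((10, 20))
    b = rng.standard_normal(10)
    x = algorithm1_sketch(A, b)
    print(0.5 * np.linalg.norm(A @ x - b) ** 2)     # feasibility residual h(x^k)
```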

3.1 Relation to primal-dual methods

Our method is very similar to the recent block-coordinate primal-dual update of [28], which focuses on the consistent case and uniform samplings [6, 9, 32, 35]. We generalise this to arbitrary samplings and show that the sequences produced by Algorithm 1 are equivalent to a primal-dual process in the spirit of [26], formulated as Algorithm 2.

Algorithm 2: Primal-dual block-coordinate descent algorithm (pseudocode displayed as a figure in the original).

Let \(\{(\boldsymbol{x}^{k},\boldsymbol{w}^{k},\boldsymbol{z}^{k})\}_{k\geq 0}\) denote the sequences generated by running Algorithm 1. Let \(S_{k}\) be the cumulative step-size process \(S_{k}=1+\sum_{t=1}^{k}\sigma _{t}\). Introduce the sequence \(\boldsymbol{y}^{k}:=S_{k}(\boldsymbol{A}\boldsymbol{z}^{k}-\boldsymbol{b})\), so that

$$ S_{k}\nabla _{i}h\bigl(\boldsymbol{z}^{k} \bigr)=\boldsymbol{A}_{i}^{\top}S_{k}\bigl( \boldsymbol{A}\boldsymbol{z}^{k}- \boldsymbol{b}\bigr)= \boldsymbol{A}_{i}^{\top}\boldsymbol{y}^{k}. $$

In terms of this new dual variable \(\boldsymbol{y}^{k}\), we can reorganise the primal update so that

$$\begin{aligned} \hat{\boldsymbol{x}}^{k+1}_{i}&=\arg \min _{\boldsymbol{x}_{i}\in \mathbb{R}^{m_{i}}} \biggl\{ r_{i}(\boldsymbol{x}_{i})+ \bigl\langle {\nabla \phi _{i}\bigl(\boldsymbol{x}^{k}_{i} \bigr)+S_{k} \nabla _{i}h\bigl(\boldsymbol{z}^{k} \bigr)},{\boldsymbol{x}_{i}-\boldsymbol{x}^{k}_{i}} \bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}_{i}- \boldsymbol{x}^{k}_{i} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}_{i}} \biggr\} \\ &=\arg \min_{\boldsymbol{x}_{i}\in \mathbb{R}^{m_{i}}} \biggl\{ r_{i}( \boldsymbol{x}_{i})+ \bigl\langle {\nabla \phi _{i}\bigl( \boldsymbol{x}^{k}_{i}\bigr)+\boldsymbol{A}_{i}^{\top } \boldsymbol{y}^{k}},{ \boldsymbol{x}_{i}- \boldsymbol{x}^{k}_{i}}\bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}_{i}- \boldsymbol{x}^{k}_{i} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}_{i}} \biggr\} \\ &=\operatorname{\mathtt{prox}}^{\boldsymbol{Q}_{i}^{k}}_{r_{i}} \bigl(\boldsymbol{x}^{k}_{i}- \bigl(\boldsymbol{Q}^{k}_{i}\bigr)^{-1} \bigl( \nabla \phi _{i}\bigl(\boldsymbol{x}^{k}_{i} \bigr)+\boldsymbol{A}^{\top}_{i}\boldsymbol{y}^{k} \bigr) \bigr). \end{aligned}$$

The next iterate \(\boldsymbol{x}^{k+1}\) is obtained by the block-coordinate update rule (A.2). This gives line 3 of Algorithm 2.

Next, observe that

$$ \begin{aligned} S_{k}\bigl( \boldsymbol{A}\boldsymbol{w}^{k+1}-\boldsymbol{b}\bigr)&=S_{k} \bigl(\boldsymbol{A}\bigl(\boldsymbol{z}^{k}+ \theta _{k}\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr)\bigr)-\boldsymbol{b}\bigr) \\ &=\boldsymbol{y}^{k}+\theta _{k}S_{k} \boldsymbol{A}\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr) \\ &=\boldsymbol{y}^{k}+\sigma _{k}\boldsymbol{A} \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr). \end{aligned} $$
(12)

Hence, after introducing the residual \(\boldsymbol{u}^{k}=\boldsymbol{A}\boldsymbol{x}^{k}-\boldsymbol{b}\), satisfying

$$ \boldsymbol{u}^{k+1}-\boldsymbol{u}^{k}= \boldsymbol{A}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)=\sum_{i\in \iota _{k}}\boldsymbol{A}_{i} \bigl(\hat{\boldsymbol{x}}^{k+1}_{i}-\boldsymbol{x}^{k}_{i} \bigr), $$
(13)

we obtain line 4 of Algorithm 2, as well as

$$\begin{aligned} \boldsymbol{y}^{k+1}&=S_{k+1} \bigl(\boldsymbol{A}\bigl( \boldsymbol{w}^{k+1}+\theta _{k+1}\bigl( \boldsymbol{x}^{k+1}- \boldsymbol{w}^{k+1}\bigr)\bigr)- \boldsymbol{b} \bigr) \\ &=S_{k+1}\bigl(\boldsymbol{A}\boldsymbol{w}^{k+1}- \boldsymbol{b}\bigr)+\sigma _{k+1}\boldsymbol{A}\bigl( \boldsymbol{x}^{k+1}- \boldsymbol{w}^{k+1}\bigr) \\ &=S_{k}\bigl(\boldsymbol{A}\boldsymbol{w}^{k+1}- \boldsymbol{b}\bigr)+\sigma _{k+1}\bigl(\boldsymbol{A} \boldsymbol{x}^{k+1}- \boldsymbol{b}\bigr) \\ &=\boldsymbol{y}^{k}+\sigma _{k}\boldsymbol{A} \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)+\sigma _{k+1} \boldsymbol{u}^{k+1}, \end{aligned}$$

where we have used in the last step the identity (12). Thereby, we obtain line 5 of Algorithm 2. This completes the verification that the sequence \(\{(\boldsymbol{x}^{k},\boldsymbol{w}^{k},\boldsymbol{z}^{k})\}_{k\geq 0}\) generated by Algorithm 1 is equivalent to the just-constructed sequence \(\{(\boldsymbol{x}^{k},\boldsymbol{u}^{k},\boldsymbol{y}^{k})\}_{k\geq 0}\) corresponding to the iterates of Algorithm 2.
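Collecting lines 3–5 of Algorithm 2 as reconstructed above into a single iteration map gives the following sketch (same illustrative assumptions as before: scalar blocks, \(\phi _{i}\equiv 0\), \(r_{i}=\lambda \vert \cdot \vert \), \(\boldsymbol{B}_{i}=1\)):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def algorithm2_step(x, u, y, A, active, pi, sigma_k, sigma_k1, tau_k, lam):
    """One iteration of the primal-dual form (lines 3-5): prox update on the active
    blocks, residual update u^{k+1}, dual update y^{k+1}; scalar blocks, phi_i = 0."""
    x_new = x.copy()
    delta = np.zeros_like(x)
    for i in active:
        Q_i = 1.0 / (pi[i] * tau_k)                 # Q_i^k = B_i / (pi_i * tau_k)
        g_i = A[:, i] @ y                           # A_i^T y^k  (phi_i = 0)
        x_new[i] = soft_threshold(x[i] - g_i / Q_i, lam / Q_i)
        delta[i] = x_new[i] - x[i]
    u_new = u + A @ delta                           # u^{k+1} = u^k + sum_i A_i (x_hat_i - x_i)
    y_new = y + sigma_k * (A @ (delta / pi)) + sigma_k1 * u_new   # dual update, line 5
    return x_new, u_new, y_new
```

Assuming the initialisation \(\boldsymbol{w}^{0}=\boldsymbol{x}^{0}\), one starts the recursion with \(\boldsymbol{u}^{0}=\boldsymbol{y}^{0}=\boldsymbol{A}\boldsymbol{x}^{0}-\boldsymbol{b}\) (since \(S_{0}=1\) and \(\boldsymbol{z}^{0}=\boldsymbol{x}^{0}\)).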

Remark 1

The implementation of Algorithm 2 is fully parallelisable, involving a computational architecture with d agents and a single central coordinator. The agents manage the coordinate blocks \(\boldsymbol{x}_{i}\) in a fully decentralised way, using information about the centrally updated dual variable \(\boldsymbol{y}^{k}\) only. A practical implementation of the computational scheme is as follows:

  1. 1.

    Given the current data \((\boldsymbol{x}^{k},\boldsymbol{u}^{k},\boldsymbol{y}^{k})\) the coordinator realises a sampling \(\iota _{k}\).

  2. 2.

    All agents in \(\iota _{k}\) receive the order to update their control variables in parallel, given their current position \(\boldsymbol{x}_{i}^{k}\), the data matrix \(\boldsymbol{A}_{i}\) and the penalty of resource utilisation \(\boldsymbol{y}^{k}\).

  3. 3.

    Once all active agents have executed their computation, they report the vector \(\boldsymbol{A}_{i}(\boldsymbol{x}^{k+1}_{i}-\boldsymbol{x}^{k}_{i})\) to the central coordinator.

  4. 4.

    The central coordinator updates the dual variable \(\boldsymbol{y}^{k}\) by executing the dual update in line 5 of Algorithm 2.

Distributed primal-dual methods such as the one described have received enormous attention in the control and machine-learning communities; see e.g. [28, 33, 36–40].

Remark 2

Consider the special case with \(\pi _{i}=1/d\) and \(d=1\), as well as \(\sigma _{k}\equiv \sigma \). Then, Algorithm 2 coincides with the primal-dual method of Chambolle–Pock [23]. In fact, in this case it follows that \(\boldsymbol{u}^{k}=\boldsymbol{A}\boldsymbol{x}^{k}-\boldsymbol{b}\), and \(\boldsymbol{y}^{k+1}=\boldsymbol{y}^{k}+\sigma (\boldsymbol{A}(2\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k})-b)\).
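Indeed (a one-line check of the remark), with \(\boldsymbol{P}=\boldsymbol{I}\) and \(\sigma _{k}=\sigma _{k+1}=\sigma \), line 5 of Algorithm 2 together with \(\boldsymbol{u}^{k+1}=\boldsymbol{A}\boldsymbol{x}^{k+1}-\boldsymbol{b}\) gives

$$ \boldsymbol{y}^{k+1}=\boldsymbol{y}^{k}+\sigma \boldsymbol{A}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)+\sigma \bigl(\boldsymbol{A}\boldsymbol{x}^{k+1}-\boldsymbol{b}\bigr)=\boldsymbol{y}^{k}+\sigma \bigl(\boldsymbol{A}\bigl(2\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)-\boldsymbol{b}\bigr), $$

which is exactly the dual extrapolation step of the Chambolle–Pock method.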

4 Convergence analysis

This section is concerned with the convergence properties of Algorithm 1. We start with a basic descent property of the primal forward–backward step. This will involve the identification of a Lyapunov function to obtain energy-decay estimates in a variable metric. Building on this result, we investigate two important scenarios in isolation: First, we consider the merely convex case, which is obtained when \(\Upsilon =0\). If \(\Upsilon \succ 0\), then accelerated rates in the primal iterates can be obtained. This, however, requires the derivation of a suitable step-size policy that exploits strong convexity for boosting the performance of the method. Understanding the mechanics of this step-size regime involves a delicate analysis of the thus-constructed step-size policy, which is relegated to Appendix B.

4.1 Lyapunov function and key estimates

We start the analysis with a small extension of “Property 1” stated in [24] for Bregman proximal gradient algorithms.

Lemma 1

For all \(k\geq 0\) and \(\boldsymbol{x}\in \mathbb{R}^{m}\), define

$$ \zeta ^{k}(\boldsymbol{x})=\Phi \bigl( \boldsymbol{x}^{k}\bigr)+\bigl\langle {\nabla \Phi \bigl( \boldsymbol{x}^{k}\bigr)},{ \boldsymbol{x}-\boldsymbol{x}^{k}} \bigr\rangle +S_{k}\bigl\langle {\nabla h\bigl(\boldsymbol{z}^{k} \bigr)},{\boldsymbol{x}- \boldsymbol{z}^{k}}\bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}. $$
(14)

Then, for all \(\boldsymbol{x}\in \mathbb{R}^{m}\) it holds true that

$$ R\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)+\zeta ^{k}\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)\leq R( \boldsymbol{x})+ \zeta ^{k}(\boldsymbol{x})-\frac{1}{2} \bigl\Vert \boldsymbol{x}-\hat{\boldsymbol{x}}^{k+1} \bigr\Vert ^{2}_{ \boldsymbol{Q}^{k}+\Upsilon}. $$
(15)

Proof

Collect the forward–backward operator in coordinates \(\operatorname{\mathsf{T}}^{k}(\boldsymbol{x}):=[\operatorname{\mathsf{T}}^{k}_{1}(\boldsymbol{x}_{1});\ldots ;\operatorname{\mathsf{T}}^{k}_{d}( \boldsymbol{x}_{d})]\), and set \(\hat{\boldsymbol{x}}^{k+1}=\operatorname{\mathsf{T}}^{k}(\boldsymbol{x}^{k})\). From the definition of the forward–backward operator (11), we see that

$$ 0\in \partial r_{i}\bigl(\hat{\boldsymbol{x}}^{k+1}_{i} \bigr)+\nabla _{i}\zeta ^{k}\bigl( \hat{ \boldsymbol{x}}^{k+1}\bigr)\quad \forall i\in [d]. $$

Hence, for all \(\boldsymbol{x}\in \mathbb{R}^{m}\) and \(i\in [d]\), we obtain from Eq. (3)

$$ r_{i}(\boldsymbol{x}_{i})\geq r_{i}\bigl( \hat{\boldsymbol{x}}^{k+1}_{i}\bigr)+\bigl\langle {-\nabla _{i} \zeta ^{k}\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)},{\boldsymbol{x}_{i}-\hat{\boldsymbol{x}}^{k+1}_{i}} \bigr\rangle +\frac{1}{2} \bigl\Vert \boldsymbol{x}_{i}- \hat{\boldsymbol{x}}^{k+1}_{i} \bigr\Vert ^{2}_{ \Upsilon _{i}}. $$

Summing over all blocks \(i\in [d]\), it follows that

$$ R(\boldsymbol{x})\geq R\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)+\bigl\langle {-\nabla \zeta ^{k}\bigl( \hat{\boldsymbol{x}}^{k+1} \bigr)},{\boldsymbol{x}-\hat{\boldsymbol{x}}^{k+1}}\bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}-\hat{\boldsymbol{x}}^{k+1} \bigr\Vert ^{2}_{\Upsilon}. $$

Furthermore, it is easy to see that \(\boldsymbol{x}\mapsto \zeta ^{k}(\boldsymbol{x})\) is 1-strongly convex in the norm \(\Vert \cdot \Vert _{\boldsymbol{Q}^{k}}\). Hence,

$$ \zeta ^{k}(\boldsymbol{x})\geq \zeta ^{k}\bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)+\bigl\langle {\nabla \zeta ^{k} \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)},{\boldsymbol{x}-\hat{ \boldsymbol{x}}^{k+1}}\bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}-\hat{\boldsymbol{x}}^{k+1} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}. $$

Adding these two inequalities, and rearranging, gives the claimed result. □

Lemma 11, together with Lemma 12, shows that

$$\begin{aligned} \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{w}^{k} \bigr\Vert ^{2}_{\boldsymbol{A}^{\top} \boldsymbol{A}}&= \frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \\ &\stackrel{\text{(A.4)}}{=}\frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \bigr] \\ &\quad {}-\frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}(\boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \bigr] \\ &=\frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \bigr]- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{ \top}\boldsymbol{A}}. \end{aligned}$$

On applying Eqs. (A.5), (A.6), (A.9) and (A.10), as well as the identity \(\theta _{k}=\frac{\sigma _{k}}{S_{k}}\), the above becomes

$$ \begin{aligned} \frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{w}^{k} \bigr) \bigr\Vert ^{2}& \overset{\text{(A.9)}}{=} \frac{S_{k}}{\sigma _{k}}h\bigl(\boldsymbol{w}^{k}\bigr)+ \frac{S_{k}}{S_{k-1}}\mathbb{E}_{k}\bigl[h\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k} \bigr)\bigr)\bigr] \\ &\quad {}-\frac{S^{2}_{k}}{\sigma _{k}S_{k-1}}\mathbb{E}_{k}\bigl[h\bigl( \boldsymbol{w}^{k+1}\bigr)\bigr]- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{ \top}\boldsymbol{A}} \\ &\overset{\text{(A.10)},\text{(A.6)}}{=} \frac{S_{k}}{\sigma _{k}}h\bigl( \boldsymbol{w}^{k}\bigr)+\frac{S_{k}}{2S_{k-1}} \mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}(\boldsymbol{P}\boldsymbol{E}_{k}- \boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr) \bigr\Vert ^{2}\bigr] \\ &\quad {}-\frac{S^{2}_{k}}{\sigma _{k}S_{k-1}}\mathbb{E}_{k}\bigl[h\bigl( \boldsymbol{w}^{k+1}\bigr)\bigr]- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{ \top}\boldsymbol{A}}+\frac{S_{k}}{S_{k-1}}h\bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr) \\ & \overset{\text{(A.7)}}{=}\frac{S_{k}}{\sigma _{k}}h\bigl( \boldsymbol{w}^{k}\bigr)+ \frac{\sigma _{k}}{2S_{k-1}} \bigl\Vert \hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \Xi -\boldsymbol{A}^{\top}\boldsymbol{A}} \\ &\quad {}- \frac{S^{2}_{k}}{\sigma _{k}S_{k-1}} \mathbb{E}_{k} \bigl[h\bigl(\boldsymbol{w}^{k+1}\bigr)\bigr]+\frac{S_{k}}{S_{k-1}}h\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr). \end{aligned} $$
(16)

Finally, we need a characterisation of the sequence \((\boldsymbol{w}^{k})_{k\geq 0}\), which is quite standard in the analysis of randomised block-coordinate methods [32, 34, 35]. We therefore skip the straightforward proof.

Lemma 2

Let \((\boldsymbol{x}^{k},\boldsymbol{w}^{k})_{k\geq 0}\) be the iterates of Algorithm 1. Then, for all \(k\geq 1\), we have

$$ \boldsymbol{w}^{k}_{i}=\sum _{t=0}^{k}\gamma _{i}^{k,t} \boldsymbol{x}^{t}_{i}, $$
(17)

where for each \(i\in [d]\), the coefficients \((\gamma _{i}^{k,t})_{t=0}^{k}\) are defined recursively by setting \(\gamma _{i}^{0,0}=1\), \(\gamma _{i}^{1,0}=1-\frac{\theta _{0}}{\pi _{i}}\), \(\gamma _{i}^{1,1}=\frac{\theta _{0}}{\pi _{i}}\), and for all \(k\geq 1\)

$$ \gamma ^{k+1,t}_{i}:=\textstyle\begin{cases} (1-\theta _{k})\gamma _{i}^{k,t} & \textit{if }t=0,1,\ldots ,k-1, \\ (1-\theta _{k})\gamma _{i}^{k,k}+\theta _{k}(1-\pi _{i}^{-1}) & \textit{if }t=k, \\ \theta _{k}/\pi _{i} & \textit{if }t=k+1. \end{cases} $$
(18)

Moreover, for all \(k\geq 0\), the following identity holds

$$ \gamma _{i}^{k+1,k}+\gamma ^{k+1,k+1}_{i}=\theta _{k}+(1-\theta _{k}) \gamma _{i}^{k,k}. $$
(19)

Moreover, if \(\theta _{0}\in (0,\min_{i\in [d]}\pi _{i}]\) and \((\theta _{k})_{k\geq 0}\) is a decreasing sequence, then for all \(k\geq 0\) and \(i\in [d]\), the coefficients \((\gamma _{i}^{k,t})_{t=0}^{k}\) are all positive and add up to 1.
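The recursion (18) and the identity (19) can be verified numerically; a small sketch (the decreasing sequence \(\theta _{k}=\theta _{0}/(k+1)\) with \(\theta _{0}=\min_{i}\pi _{i}\) is our illustrative choice, not prescribed by the lemma):

```python
import numpy as np

d, K = 3, 30
pi = np.array([0.5, 0.3, 0.2])
theta = 0.2 / (1.0 + np.arange(K + 1))       # decreasing, theta_0 = min_i pi_i

gamma = np.zeros((d, K + 2))
gamma[:, 0], gamma[:, 1] = 1 - theta[0] / pi, theta[0] / pi   # gamma^{1,0}, gamma^{1,1}
for k in range(1, K + 1):
    new = np.zeros_like(gamma)
    new[:, :k] = (1 - theta[k]) * gamma[:, :k]                         # t = 0,...,k-1
    new[:, k] = (1 - theta[k]) * gamma[:, k] + theta[k] * (1 - 1 / pi) # t = k
    new[:, k + 1] = theta[k] / pi                                      # t = k+1
    # identity (19): gamma^{k+1,k} + gamma^{k+1,k+1} = theta_k + (1 - theta_k)*gamma^{k,k}
    assert np.allclose(new[:, k] + new[:, k + 1],
                       theta[k] + (1 - theta[k]) * gamma[:, k])
    gamma = new

print(gamma.sum(axis=1))          # each row sums to 1
print((gamma >= -1e-12).all())    # and all coefficients are nonnegative
```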

Let \(\hat{\boldsymbol{x}}^{k+1}=\operatorname{\mathsf{T}}^{k}(\boldsymbol{x}^{k})\). Using (2), we see that

$$\begin{aligned} \Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)&=\Phi \bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)+R\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr) \\ &\leq R\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)+\Phi \bigl( \boldsymbol{x}^{k}\bigr)+\bigl\langle {\nabla \Phi \bigl( \boldsymbol{x}^{k}\bigr)},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}}\bigr\rangle +\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Lambda}. \end{aligned}$$

Using (14), we can continue with the estimate

$$\begin{aligned} \Phi \bigl(\boldsymbol{x}^{k}\bigr)+\bigl\langle {\nabla \Phi \bigl( \boldsymbol{x}^{k}\bigr)},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}}\bigr\rangle =\zeta ^{k}\bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)-\frac{1}{2} \bigl\Vert \hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}-S_{k}\bigl\langle { \nabla h\bigl( \boldsymbol{z}^{k}\bigr)},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{z}^{k}}\bigr\rangle . \end{aligned}$$

Consequently, for \(\boldsymbol{x}^{\ast}\in \mathcal{X}=\arg \min_{\boldsymbol{x}}h(\boldsymbol{x})\) as a reference point, the following bounds are obtained:

$$\begin{aligned} \Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)&\leq R\bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)+\zeta ^{k}\bigl( \hat{ \boldsymbol{x}}^{k+1}\bigr)-S_{k}\bigl\langle {\nabla h \bigl(\boldsymbol{z}^{k}\bigr)},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{z}^{k}}\bigr\rangle -\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \boldsymbol{Q}^{k}-\Lambda} \\ &\overset{\text{(15)}}{\leq }R\bigl(\boldsymbol{x}^{*} \bigr)+\zeta ^{k}\bigl( \boldsymbol{x}^{*}\bigr)- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{ \boldsymbol{Q}^{k}+\Upsilon}-S_{k} \bigl\langle {\nabla h\bigl(\boldsymbol{z}^{k}\bigr)},{\hat{ \boldsymbol{x}}^{k+1}- \boldsymbol{z}^{k}}\bigr\rangle \\ &\quad {}-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda} \\ &=\Psi \bigl(\boldsymbol{x}^{*}\bigr)-\Phi \bigl( \boldsymbol{x}^{*}\bigr)+\zeta ^{k}\bigl( \boldsymbol{x}^{*}\bigr)-S_{k} \bigl\langle {\nabla h \bigl(\boldsymbol{z}^{k}\bigr)},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{z}^{k}}\bigr\rangle \\ &\quad {}-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+\Upsilon}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda} \\ &\overset{\text{(14)}}{=}\Psi \bigl(\boldsymbol{x}^{*} \bigr)-\Phi \bigl(\boldsymbol{x}^{*}\bigr)+ \Phi \bigl( \boldsymbol{x}^{k}\bigr)+\bigl\langle {\nabla \Phi \bigl( \boldsymbol{x}^{k}\bigr)},{\boldsymbol{x}^{*}- \boldsymbol{x}^{k}}\bigr\rangle +S_{k}\bigl\langle { \nabla h\bigl(\boldsymbol{z}^{k}\bigr)},{\boldsymbol{x}^{*}- \hat{\boldsymbol{x}}^{k+1}}\bigr\rangle \\ &\quad {}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+ \Upsilon}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda} \\ &\overset{\text{(2)}}{\leq}\Psi \bigl(\boldsymbol{x}^{*} \bigr)+S_{k}\bigl\langle { \nabla h\bigl(\boldsymbol{z}^{k} \bigr)},{\boldsymbol{x}^{*}-\hat{\boldsymbol{x}}^{k+1}} \bigr\rangle + \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+\Upsilon} \\ &\quad {}-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda}. \end{aligned}$$

Via Eq. (A.8), we obtain

$$ S_{k}\bigl\langle {\nabla h\bigl(\boldsymbol{z}^{k} \bigr)},{\boldsymbol{x}^{*}-\hat{\boldsymbol{x}}^{k+1}} \bigr\rangle =S_{k-1}\bigl\langle {\nabla h\bigl(\boldsymbol{w}^{k} \bigr)},{\boldsymbol{x}^{*}- \hat{\boldsymbol{x}}^{k+1}} \bigr\rangle +\sigma _{k}\bigl\langle {\nabla h\bigl( \boldsymbol{x}^{k}\bigr)},{ \boldsymbol{x}^{*}-\hat{ \boldsymbol{x}}^{k+1}}\bigr\rangle . $$

Next, we apply identity (5) to each inner product separately, in order to conclude that

$$\begin{aligned}& \bigl\langle {\nabla h\bigl(\boldsymbol{w}^{k}\bigr)},{ \boldsymbol{x}^{*}-\hat{\boldsymbol{x}}^{k+1}}\bigr\rangle =h\bigl( \boldsymbol{x}^{\ast}\bigr)-h\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)-\frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl( \boldsymbol{x}^{*}-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2}+\frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2}, \\& \bigl\langle {\nabla h\bigl(\boldsymbol{x}^{k}\bigr)},{ \boldsymbol{x}^{*}-\hat{\boldsymbol{x}}^{k+1}}\bigr\rangle =h\bigl( \boldsymbol{x}^{\ast}\bigr)-h\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)-\frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl( \boldsymbol{x}^{*}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2}+\frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2}. \end{aligned}$$

Combined with the previous display, this shows for \(\boldsymbol{x}^{\ast}\in \arg \min_{\boldsymbol{x}\in \mathbb{R}^{m}}h(\boldsymbol{x})\),

$$\begin{aligned} \Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)-\Psi \bigl( \boldsymbol{x}^{*}\bigr)&\leq -S_{k}\bigl[h\bigl( \hat{ \boldsymbol{x}}^{k+1}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr]-\frac{S_{k-1}}{2} \bigl\Vert \boldsymbol{A}\bigl( \boldsymbol{x}^{*}-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \\ &\quad {}+\frac{S_{k-1}}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2}-\frac{\sigma _{k}}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{*}- \boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2}+ \frac{\sigma _{k}}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \\ &\quad {}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+ \Upsilon}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda} \\ &=- \bigl(S_{k}h\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)+S_{k-1}h\bigl(\boldsymbol{w}^{k}\bigr)+\sigma _{k}h\bigl( \boldsymbol{x}^{k}\bigr)-2S_{k}h \bigl(\boldsymbol{x}^{*}\bigr) \bigr) \\ &\quad {}+\frac{S_{k-1}}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}} \\ &\quad {}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+ \Upsilon}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda -\sigma _{k}\boldsymbol{A}^{\top}\boldsymbol{A}}. \end{aligned}$$

Finally, applying identity (16), one sees that

$$\begin{aligned} &{-} \bigl(S_{k}h\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)+S_{k-1}h\bigl(\boldsymbol{w}^{k}\bigr)+\sigma _{k}h\bigl( \boldsymbol{x}^{k}\bigr)-2S_{k}h \bigl(\boldsymbol{x}^{*}\bigr) \bigr)+\frac{S_{k-1}}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \\ &\quad =- \bigl(S_{k}h\bigl(\hat{\boldsymbol{x}}^{k+1} \bigr)+S_{k-1}h\bigl(\boldsymbol{w}^{k}\bigr)+\sigma _{k}h\bigl( \boldsymbol{x}^{k}\bigr)-2S_{k}h \bigl(\boldsymbol{x}^{*}\bigr) \bigr) \\ &\quad \quad {}+S_{k-1} \biggl(\frac{S_{k}}{\sigma _{k}}h\bigl(\boldsymbol{w}^{k} \bigr)+ \frac{\sigma _{k}}{2S_{k-1}} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \Xi -\boldsymbol{A}^{\top}\boldsymbol{A}} \\ &\quad \quad {}- \frac{S^{2}_{k}}{\sigma _{k}S_{k-1}} \mathbb{E}_{k}\bigl[h\bigl( \boldsymbol{w}^{k+1}\bigr)\bigr]+\frac{S_{k}}{S_{k-1}}h\bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr) \biggr) \\ &\quad =\frac{S^{2}_{k-1}}{\sigma _{k}}h\bigl(\boldsymbol{w}^{k}\bigr)- \frac{S^{2}_{k}}{\sigma _{k}}\mathbb{E}_{k}\bigl[h\bigl(\boldsymbol{w}^{k+1} \bigr)\bigr]+2S_{k}h\bigl( \boldsymbol{x}^{*}\bigr)- \sigma _{k}h\bigl(\boldsymbol{x}^{k}\bigr)+ \frac{\sigma _{k}}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{\top}\boldsymbol{A}} \\ &\quad =\frac{S^{2}_{k-1}}{\sigma _{k}}\bigl(h\bigl(\boldsymbol{w}^{k}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr)- \frac{S^{2}_{k}}{\sigma _{k}} \mathbb{E}_{k}\bigl[h\bigl(\boldsymbol{w}^{k+1}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr]- \sigma _{k}\bigl(h \bigl(\boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr) \\ &\quad \quad {}+\frac{\sigma _{k}}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{\top}\boldsymbol{A}}, \end{aligned}$$

where the last step uses the identity \(2S_{k}=\frac{S^{2}_{k}}{\sigma _{k}}-\frac{S^{2}_{k-1}}{\sigma _{k}}+ \sigma _{k}\), which follows from \(S_{k}=S_{k-1}+\sigma _{k}\) because \(S_{k}^{2}-S_{k-1}^{2}=\sigma _{k}(S_{k}+S_{k-1})\). Inserting this expression into the penultimate display, we arrive at

$$ \begin{aligned} \Psi \bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)-\Psi \bigl(\boldsymbol{x}^{*} \bigr)&\leq \frac{S^{2}_{k-1}}{\sigma _{k}}\bigl(h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr)- \frac{S^{2}_{k}}{\sigma _{k}} \mathbb{E}_{k}\bigl[h\bigl(\boldsymbol{w}^{k+1}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr] \\ &\quad {}- \sigma _{k}\bigl(h \bigl(\boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr) \\ &\quad {}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+ \Upsilon} \\ &\quad {}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda -\sigma _{k}\Xi}. \end{aligned} $$
(20)

Define the vector-valued function \(\vec{\psi}(\boldsymbol{x}):=[\phi _{1}(\boldsymbol{x}_{1})+r_{1}(\boldsymbol{x}_{1}),\ldots , \phi _{d}(\boldsymbol{x}_{d})+r_{d}(\boldsymbol{x}_{d})]^{\top}\in \mathbb{R}^{d}\). Let

$$ \hat{\Psi}_{k}:=\sum_{t=0}^{k} \sum_{i=1}^{d}\gamma _{i}^{k,t} \psi _{i}\bigl( \boldsymbol{x}^{t}_{i} \bigr)=\sum_{t=0}^{k}\gamma ^{k,t}\cdot \vec{\psi}\bigl(\boldsymbol{x}^{t}\bigr), $$

where \(\gamma ^{k,t}\cdot \vec{\psi}(\boldsymbol{x}^{t}):=\sum_{i=1}^{d}\gamma _{i}^{k,t} \psi _{i}(\boldsymbol{x}^{t}_{i})\). Thanks to Lemma 2, we have \(\hat{\Psi}_{k}\geq \Psi (\boldsymbol{w}^{k})\) for all \(k\geq 0\).

Lemma 3

Let \(\boldsymbol{M}:=\operatorname{\mathtt{blkdiag}}[\boldsymbol{M}_{1};\ldots ;\boldsymbol{M}_{d}]\) with \(\boldsymbol{M}_{i}\in \mathbb{S}^{m_{i}}_{++}\). We have

$$\begin{aligned}& \mathbb{E}_{k}\bigl[ \bigl\Vert \boldsymbol{x}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{P}\boldsymbol{M}} \bigr] = \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{M}}+ \bigl\Vert \boldsymbol{x}^{k}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{(\boldsymbol{P}-\boldsymbol{I})\boldsymbol{M}} , \end{aligned}$$
(21)
$$\begin{aligned}& \mathbb{E}_{k}[\hat{\Psi}_{k+1}] =(1-\theta _{k})\hat{\Psi}_{k}+ \theta _{k}\Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr). \end{aligned}$$
(22)

Proof

Equation (21) can be easily seen by observing

$$\begin{aligned} \mathbb{E}_{k}\bigl[ \bigl\Vert \boldsymbol{x}^{k+1}_{i}- \boldsymbol{x}^{*}_{i} \bigr\Vert ^{2}_{ \frac{1}{\pi _{i}}\boldsymbol{M}_{i}} \bigr]&=\pi _{i} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}_{i}- \boldsymbol{x}^{*}_{i} \bigr\Vert ^{2}_{\frac{1}{\pi _{i}} \boldsymbol{M}_{i}}+(1- \pi _{i}) \bigl\Vert \boldsymbol{x}^{k}_{i}- \boldsymbol{x}^{\ast}_{i} \bigr\Vert ^{2}_{ \frac{1}{\pi _{i}}\boldsymbol{M}_{i}} \\ &= \bigl\Vert \hat{\boldsymbol{x}}^{k+1}_{i}- \boldsymbol{x}^{*}_{i} \bigr\Vert ^{2}_{\boldsymbol{M}_{i}}+ \bigl(\pi ^{-1}_{i}-1\bigr) \bigl\Vert \boldsymbol{x}^{k}_{i}-\boldsymbol{x}^{*}_{i} \bigr\Vert ^{2}_{\boldsymbol{M}_{i}}. \end{aligned}$$

Summing over all \(i\in \{1,\ldots ,d\}\) gives the result. To prove (22), we first observe

$$ \mathbb{E}_{k}\bigl[\psi _{i}\bigl( \boldsymbol{x}^{k+1}_{i}\bigr)\bigr]=\pi _{i} \psi _{i}\bigl( \hat{\boldsymbol{x}}^{k+1}_{i} \bigr)+(1-\pi _{i})\psi _{i}\bigl(\boldsymbol{x}^{k}_{i} \bigr), $$

so that \(\mathbb{E}_{k}[\vec{\psi}(\boldsymbol{x}^{k+1})]=\boldsymbol{P}^{-1}\vec{\psi}( \hat{\boldsymbol{x}}^{k+1})+(\boldsymbol{I}-\boldsymbol{P}^{-1})\vec{\psi}(\boldsymbol{x}^{k})\). It follows that

$$\begin{aligned} \hat{\Psi}_{k+1}&=\sum_{t=0}^{k-1} \gamma ^{k+1,t}\cdot \vec{\psi}\bigl( \boldsymbol{x}^{t} \bigr)+\gamma ^{k+1,k}\cdot \vec{\psi}\bigl(\boldsymbol{x}^{k} \bigr)+\gamma ^{k+1,k+1} \cdot \vec{\psi}\bigl(\boldsymbol{x}^{k+1} \bigr) \\ &=\sum_{t=0}^{k-1}(1-\theta _{k})\gamma ^{k,t}\cdot \vec{\psi}\bigl( \boldsymbol{x}^{t}\bigr)+ \bigl((1-\theta _{k})\gamma ^{k,k}+\theta _{k}\bigl(1-\pi ^{-1}\bigr) \bigr)\cdot \vec{\psi}\bigl(\boldsymbol{x}^{k}\bigr)+\theta _{k}\pi ^{-1}\cdot \vec{\psi}\bigl(\boldsymbol{x}^{k+1} \bigr) \\ &=(1-\theta _{k})\hat{\Psi}_{k}+\theta _{k} \bigl(\bigl(1-\pi ^{-1}\bigr)\cdot \vec{\psi}\bigl( \boldsymbol{x}^{k}\bigr)+\pi ^{-1}\cdot \vec{\psi}\bigl( \boldsymbol{x}^{k+1}\bigr) \bigr). \end{aligned}$$

Taking conditional expectations on both sides, it follows that

$$\begin{aligned} \begin{aligned} \mathbb{E}_{k}[\hat{\Psi}_{k+1}]&=(1-\theta _{k})\hat{\Psi}_{k}+ \theta _{k} \bigl( \bigl(1-\pi ^{-1}\bigr)\cdot \vec{\psi}\bigl(\boldsymbol{x}^{k} \bigr)+\Psi \bigl( \hat{\boldsymbol{x}}^{k+1}\bigr)+\bigl(\pi ^{-1}-1\bigr)\cdot \vec{\psi}\bigl(\boldsymbol{x}^{k} \bigr) \bigr) \\ &=(1-\theta _{k})\hat{\Psi}_{k}+\theta _{k}\Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr). \end{aligned} \end{aligned}$$

 □
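As a quick sanity check of identity (21) (not part of the formal analysis), the following self-contained Python snippet draws independent blockwise Bernoulli updates \(\boldsymbol{x}^{k+1}_{i}=\hat{\boldsymbol{x}}^{k+1}_{i}\) with probability \(\pi _{i}\) and \(\boldsymbol{x}^{k+1}_{i}=\boldsymbol{x}^{k}_{i}\) otherwise (one admissible sampling with the prescribed marginals), and compares a Monte Carlo estimate of the left-hand side of (21) with its right-hand side; all block sizes, probabilities and matrices are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: d blocks with sizes m_i, activation probabilities pi_i,
# positive definite block matrices M_i and arbitrary points x^k, xhat^{k+1}, x^*.
sizes = [2, 3, 4]
pis = np.array([0.3, 0.5, 0.8])
Ms = []
for m in sizes:
    G = rng.standard_normal((m, m))
    Ms.append(G @ G.T + m * np.eye(m))      # M_i in S^{m_i}_{++}
xk    = [rng.standard_normal(m) for m in sizes]
xhat  = [rng.standard_normal(m) for m in sizes]
xstar = [rng.standard_normal(m) for m in sizes]

def sq_norm(v, M):
    return float(v @ M @ v)

# Right-hand side of (21): ||xhat - x*||_M^2 + ||x^k - x*||_{(P - I)M}^2,
# where (P - I) M has blocks (1/pi_i - 1) M_i.
rhs = sum(sq_norm(xhat[i] - xstar[i], Ms[i])
          + (1.0 / pis[i] - 1.0) * sq_norm(xk[i] - xstar[i], Ms[i])
          for i in range(len(sizes)))

# Monte Carlo estimate of E_k[ ||x^{k+1} - x*||_{P M}^2 ], where P M has
# blocks (1/pi_i) M_i and block i is replaced by xhat_i with probability pi_i.
n_samples = 50_000
acc = 0.0
for _ in range(n_samples):
    val = 0.0
    for i in range(len(sizes)):
        xi = xhat[i] if rng.random() < pis[i] else xk[i]
        val += (1.0 / pis[i]) * sq_norm(xi - xstar[i], Ms[i])
    acc += val
lhs = acc / n_samples

print(f"Monte Carlo LHS = {lhs:.4f},  RHS of (21) = {rhs:.4f}")
```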

The next result characterises a suitable Lyapunov function in an adapted variable metric. We set \(\boldsymbol{B}:=\operatorname{\mathtt{blkdiag}}[\boldsymbol{B}_{1};\ldots ;\boldsymbol{B}_{d}]\).

Lemma 4

Assume that \(\theta _{0}\in (0,\min_{i\in [d]}\pi _{i}]\), and that the sequences \((\sigma _{k})_{k\geq 0}\), \((\tau _{k})_{k\geq 0}\) satisfy the matrix inequalities

$$\begin{aligned} &\boldsymbol{P}\boldsymbol{B}\succeq \tau _{k}( \Lambda +\sigma _{k}\Xi ), \end{aligned}$$
(23)
$$\begin{aligned} &\boldsymbol{P}^{2}\boldsymbol{B}+\tau _{k} \boldsymbol{P}\Upsilon \succeq \frac{\tau _{k}\sigma _{k+1}}{\tau _{k+1}\sigma _{k}}\bigl(\boldsymbol{P}^{2} \boldsymbol{B}-\tau _{k+1}(\boldsymbol{I}-\boldsymbol{P})\Upsilon \bigr). \end{aligned}$$
(24)

Define the matrix-valued sequence \((\boldsymbol{W}_{k})_{k\geq 0}\subset \mathbb{S}^{m}_{++}\) by \(\boldsymbol{W}_{k}=\frac{\sigma _{k}}{\tau _{k}}\boldsymbol{P}^{2}\boldsymbol{B}+\sigma _{k}( \boldsymbol{P}-\boldsymbol{I})\Upsilon \), and introduce the functions

$$\begin{aligned} &F_{k}:=\hat{\Psi}_{k}+S_{k-1}\bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*}\bigr) \bigr) \quad \textit{and} \end{aligned}$$
(25)
$$\begin{aligned} &V_{k}(\boldsymbol{x}):=\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}-\boldsymbol{x} \bigr\Vert ^{2}_{\boldsymbol{W}_{k}}+S_{k-1} \bigl(F_{k}- \Psi (\boldsymbol{x})\bigr). \end{aligned}$$
(26)

We have for all \(\boldsymbol{x}^{*}\in \mathcal{X}\),

$$ \mathbb{E}_{k} \bigl[V_{k+1}\bigl( \boldsymbol{x}^{*}\bigr) \bigr]\leq V_{k}\bigl( \boldsymbol{x}^{*}\bigr)- \sigma ^{2}_{k} \bigl(h\bigl(\boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr)-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \frac{\sigma _{k}}{\tau _{k}}(\boldsymbol{P}\boldsymbol{B}-\tau _{k}(\sigma _{k}\Xi + \Lambda ))}. $$
(27)

Proof

Using identity (21) with \(\boldsymbol{M}=\boldsymbol{Q}^{k}+\Upsilon \), we obtain

$$\begin{aligned} \mathbb{E}_{k} \biggl[\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{ \boldsymbol{P}\boldsymbol{Q}^{k}+\boldsymbol{P}\Upsilon} \biggr]=\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+\Upsilon}+ \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{(\boldsymbol{P}-\boldsymbol{I})(\boldsymbol{Q}^{k}+ \Upsilon )}. \end{aligned}$$

Furthermore, Eq. (22) yields

$$\begin{aligned} \frac{S_{k}}{\sigma _{k}}\mathbb{E}_{k}\bigl[\hat{\Psi}_{k+1}- \Psi \bigl(\boldsymbol{x}^{*}\bigr)\bigr]= \frac{S_{k-1}}{\sigma _{k}} \bigl(\hat{\Psi}_{k}-\Psi \bigl(\boldsymbol{x}^{*}\bigr) \bigr)+ \bigl(\Psi \bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)-\Psi \bigl( \boldsymbol{x}^{*}\bigr) \bigr) \end{aligned}$$

and (20) shows that

$$\begin{aligned} \frac{S^{2}_{k}}{\sigma _{k}} \mathbb{E}_{k}\bigl[h\bigl( \boldsymbol{w}^{k+1}\bigr)-h\bigl(\boldsymbol{x}^{*}\bigr) \bigr]& \leq \frac{S^{2}_{k-1}}{\sigma _{k}}\bigl[h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr]-\bigl[ \Psi \bigl(\hat{ \boldsymbol{x}}^{k+1}\bigr)-\Psi \bigl(\boldsymbol{x}^{*} \bigr)\bigr] \\ &\quad {}-\sigma _{k}\bigl[h\bigl(\boldsymbol{x}^{k} \bigr)-h\bigl( \boldsymbol{x}^{*}\bigr)\bigr] \\ &\quad {}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}+ \Upsilon} \\ &\quad {}- \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}- \Lambda -\sigma _{k}\Xi}. \end{aligned}$$

Adding these three expressions together, we obtain the recursion

$$ \begin{aligned} & \mathbb{E}_{k} \biggl[\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k+1}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{P}\boldsymbol{Q}^{k}+\boldsymbol{P}\Upsilon}+ \frac{S_{k}}{\sigma _{k}} \bigl(\hat{\Psi}_{k+1}-\Psi \bigl( \boldsymbol{x}^{*}\bigr)+S_{k}\bigl(h\bigl( \boldsymbol{w}^{k+1}\bigr)-h\bigl(\boldsymbol{x}^{*}\bigr) \bigr) \bigr) \biggr] \\ &\quad \leq \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{P}\boldsymbol{Q}^{k}+( \boldsymbol{P}-\boldsymbol{I})\Upsilon}+ \frac{S_{k-1}}{\sigma _{k}} \bigl(\hat{\Psi}_{k}- \Psi \bigl( \boldsymbol{x}^{*}\bigr)+S_{k-1}\bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*}\bigr) \bigr) \bigr) \\ &\quad \quad {}-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{Q}^{k}-( \Lambda +\sigma _{k}\Xi )}- \sigma _{k}\bigl(h\bigl(\boldsymbol{x}^{k}\bigr)-h\bigl( \boldsymbol{x}^{*}\bigr)\bigr). \end{aligned} $$
(28)

Recall that \(\boldsymbol{Q}^{k}=\frac{1}{\tau _{k}}\boldsymbol{P}\boldsymbol{B}\). Condition (23) guarantees that \(\boldsymbol{Q}^{k}\succeq \Lambda +\sigma _{k}\Xi \). Multiplying both sides of (28) by \(\sigma _{k}\), we obtain

$$\begin{aligned} & \mathbb{E}_{k} \biggl[\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{ \frac{\sigma _{k}}{\tau _{k}}\boldsymbol{P}^{2}\boldsymbol{B}+\sigma _{k}\boldsymbol{P} \Upsilon}+S_{k} \bigl(\hat{ \Psi}_{k+1}-\Psi \bigl(\boldsymbol{x}^{*} \bigr)+S_{k}\bigl(h\bigl( \boldsymbol{w}^{k+1}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr) \bigr) \biggr] \\ &\quad \leq \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{ \frac{\sigma _{k}}{\tau _{k}}\boldsymbol{P}^{2}\boldsymbol{B}+\sigma _{k}(\boldsymbol{P}- \boldsymbol{I})\Upsilon}+S_{k-1} \bigl(\hat{\Psi}_{k}-\Psi \bigl(\boldsymbol{x}^{*} \bigr)+S_{k-1}\bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr) \bigr) \\ &\quad \quad {}-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \frac{\sigma _{k}}{\tau _{k}}\boldsymbol{P}\boldsymbol{B}-\sigma _{k}(\Lambda + \sigma _{k}\Xi )}- \sigma ^{2}_{k}\bigl(h\bigl(\boldsymbol{x}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr). \end{aligned}$$

Under condition (24), it follows that

$$ \bigl\Vert \boldsymbol{x}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{W}_{k+1}}\leq \bigl\Vert \boldsymbol{x}^{k+1}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{\frac{\sigma _{k}}{\tau _{k}} \boldsymbol{P}^{2}\boldsymbol{B}+\sigma _{k}\boldsymbol{P}\Upsilon}. $$

Exploiting this relation in the penultimate display, using definitions (25) and (26), one readily obtains

$$ \mathbb{E}_{k} \bigl[V_{k+1}\bigl(\boldsymbol{x}^{*} \bigr) \bigr]\leq V_{k}\bigl(\boldsymbol{x}^{*}\bigr)- \sigma ^{2}_{k}\bigl(h\bigl(\boldsymbol{x}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr)-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{ \frac{\sigma _{k}}{\tau _{k}}(\boldsymbol{P}\boldsymbol{B}-\tau _{k}(\sigma _{k}\Xi + \Lambda ))}. $$

 □

4.2 The convex case

Suppose that \(\Upsilon =0\), so that \(\boldsymbol{W}_{k}=\frac{\sigma _{k}}{\tau _{k}}\boldsymbol{P}^{2}\boldsymbol{B}\). In this case, the matrix inequality (24) is satisfied for any sequences \((\sigma _{k})_{k\geq 0}\), \((\tau _{k})_{k\geq 0}\) satisfying \(\frac{\sigma _{k}}{\tau _{k}}\geq \frac{\sigma _{k+1}}{\tau _{k+1}}\). In particular, this can be realised with the policy \(\sigma _{k}=\sigma \tau _{k}\) for all \(k\geq 0\), for some constant \(\sigma \in (0,\min_{i\in [d]}\pi _{i}]\). This implies that \(\theta _{k}=\frac{\sigma}{1+k\sigma}\) for all \(k\geq 0\). This specification satisfies all conditions needed for Lemma 2 to hold. We further set \(\tau _{k}\equiv 1\). It only remains to check that matrix inequality (23) is satisfied. In fact, this condition imposes a restriction on σ, namely

$$ \boldsymbol{P}\boldsymbol{B}\succeq \sigma \Xi +\Lambda . $$
(29)

We show later in this section that this matrix inequality is achievable. In this regime, we will show that we obtain an \(O(1/k)\) iteration complexity estimate of the averaged sequence \((\boldsymbol{w}^{k})_{k}\), in terms of the objective function value. This extends the results reported in [22, 25, 26] to general random block-coordinate activation schemes. The following Assumptions shall be in place throughout this section:

Assumption 1

Problem (P) admits a saddle point, i.e. a pair \((\boldsymbol{x}^{\ast},\boldsymbol{y}^{\ast})\in \mathbb{R}^{m}\times \mathbb{R}^{m}\) satisfying (8).

Assumption 2

The solution set of problem (P) \(\mathcal{X}^{\ast}\) is nonempty.

Thanks to Eq. (17), we know that \(\boldsymbol{w}^{k}\in \mathtt{Conv}(\boldsymbol{x}^{0},\ldots ,\boldsymbol{x}^{k})\subseteq \mathtt{dom}(r)\). Therefore,

$$\begin{aligned} R\bigl(\boldsymbol{w}^{k}\bigr)-R\bigl(\boldsymbol{x}^{\ast} \bigr)&\geq \bigl\langle {-\nabla \phi \bigl(\boldsymbol{x}^{*}\bigr)- \boldsymbol{A}^{\top }\boldsymbol{A}\boldsymbol{y}^{*}},{ \boldsymbol{w}^{k}-\boldsymbol{x}^{*}}\bigr\rangle \\ &\geq \phi \bigl(\boldsymbol{x}^{*}\bigr)-\phi \bigl( \boldsymbol{w}^{k}\bigr)+\bigl\langle {\boldsymbol{A} \boldsymbol{y}^{*}},{ \boldsymbol{A}\bigl(\boldsymbol{x}^{*}- \boldsymbol{w}^{k}\bigr)}\bigr\rangle . \end{aligned}$$

Therefore,

$$ \Psi \bigl(\boldsymbol{w}^{k}\bigr)-\Psi \bigl( \boldsymbol{x}^{*}\bigr)\geq - \bigl\Vert \boldsymbol{A} \boldsymbol{y}^{*} \bigr\Vert \cdot \bigl\Vert \boldsymbol{A} \bigl(\boldsymbol{w}^{k}-\boldsymbol{x}^{*}\bigr) \bigr\Vert =-\delta \sqrt{h\bigl(\boldsymbol{w}^{k}\bigr)-h\bigl( \boldsymbol{x}^{*}\bigr)}, $$
(30)

where \(\delta :=\sqrt{2} \Vert \boldsymbol{A}\boldsymbol{y}^{*} \Vert \); here we used that \(\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{x}^{*}-\boldsymbol{b})=0\), so that \(\Vert \boldsymbol{A}(\boldsymbol{w}^{k}-\boldsymbol{x}^{*}) \Vert ^{2}=2(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*}))\). We assume that \(\tau _{k}\equiv 1\) and \(\sigma _{k}=\sigma >0\). The energy functions (25) and (26) take the form \(F_{k}=\hat{\Psi}_{k}+(1+(k-1)\sigma )(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*}))\) and \(V_{k}(\boldsymbol{x}):=\frac{1}{2} \Vert \boldsymbol{x}^{k}-\boldsymbol{x} \Vert ^{2}_{\sigma \boldsymbol{P}^{2} \boldsymbol{B}}+(1+(k-1)\sigma )(F_{k}-\Psi (\boldsymbol{x}))\).

Lemma 5

Suppose that Assumption 1 applies. Then, the process \((S_{k-1}(F_{k}-\Psi (\boldsymbol{x}^{*})) )_{k\geq 0}\) is almost surely bounded.

Proof

Note that

$$\begin{aligned} S_{k-1}\bigl(F_{k}-\Psi \bigl(\boldsymbol{x}^{*} \bigr)\bigr)&=S_{k-1}\bigl(\hat{\Psi}_{k}-\Psi \bigl( \boldsymbol{x}^{*}\bigr)\bigr)+S^{2}_{k-1} \bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr) \\ &\geq S_{k-1}\bigl(\Psi \bigl(\boldsymbol{w}^{k}\bigr)- \Psi \bigl(\boldsymbol{x}^{*}\bigr)\bigr)+S^{2}_{k-1} \bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr) \\ &\overset{\text{(30)}}{\geq} -\delta S_{k-1}\sqrt{h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl( \boldsymbol{x}^{*} \bigr)}+S^{2}_{k-1}\bigl(h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr) \\ &=\frac{S^{2}_{k-1}}{2}\bigl(h\bigl(\boldsymbol{w}^{k}\bigr)-h \bigl(\boldsymbol{x}^{*}\bigr)\bigr) \\ &\quad {}+ \biggl( \frac{S^{2}_{k-1}}{2} \bigl(h\bigl(\boldsymbol{w}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr)-\delta S_{k-1} \sqrt{h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)} \biggr). \end{aligned}$$

The convex function \(t\mapsto \frac{S^{2}}{2}t^{2}-\delta S t\) attains its global minimum value \(-\frac{\delta ^{2}}{2}\) at \(t=\frac{\delta}{S}\). Hence,

$$ S_{k-1}\bigl(F_{k}-\Psi \bigl( \boldsymbol{x}^{*}\bigr)\bigr)\geq \frac{S^{2}_{k-1}}{2}\bigl(h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl( \boldsymbol{x}^{*} \bigr)\bigr)-\frac{\delta ^{2}}{2}\geq -\frac{\delta ^{2}}{2}. $$
(31)

The last inequality uses the fact that \(\boldsymbol{x}^{\ast}\in \mathcal{X}\), so that \(h(\boldsymbol{w}^{k})\geq h(\boldsymbol{x}^{\ast})\). This completes the proof. □

We are now in the position to give the proof of the first main result of this paper.

Theorem 6

Suppose that Assumptions 1 and 2 apply. Consider the sequence \((\boldsymbol{z}^{k},\boldsymbol{w}^{k},\boldsymbol{x}^{k})_{k\geq 0}\) generated by Algorithm 1 with \(\sigma _{k}\equiv \sigma \) and \(\tau _{k}\equiv 1\) for all \(k\geq 1\). Then, the following hold:

  1. (i)

    \(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*})=O(1/k^{2})\) a.s.

  2. (ii)

    \((\boldsymbol{x}^{k})_{k}\) and \((\boldsymbol{w}^{k})_{k}\) converge a.s. to a random variable with values in \(\mathcal{X}^{\ast}\).

  3. (iii)

    \(\mathopen{\lvert }\Psi (\boldsymbol{w}^{k})-\Psi (\boldsymbol{x}^{*})\mathclose{\rvert }=O(1/k)\) a.s.

Proof

(i) From Lemma 5, it follows that the process \(\mathcal{E}_{k}:=V_{k}(\boldsymbol{x}^{*})+\frac{\delta ^{2}}{2}\) is non-negative. Equation (27) shows that \(\mathcal{E}_{k}\) satisfies the recursion

$$ \mathbb{E}_{k}[\mathcal{E}_{k+1}]\leq \mathcal{E}_{k}-\sigma ^{2}\bigl(h\bigl( \boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*}\bigr) \bigr)-\frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\sigma (\boldsymbol{P}\boldsymbol{B}-( \sigma \Xi +\Lambda ))}. $$
(32)

Lemma 10 implies that \((\mathcal{E}_{k})_{k\geq 0}\) converges a.s. to a limiting random variable \(\mathcal{E}_{\infty}\in [0,\infty )\) and

$$ \sum_{k=0}^{\infty} \biggl[\sigma ^{2}\bigl(h\bigl(\boldsymbol{x}^{k}\bigr)-h\bigl( \boldsymbol{x}^{*}\bigr)\bigr)+ \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\sigma (\boldsymbol{P} \boldsymbol{B}-(\sigma \Xi +\Lambda ))} \biggr]< \infty \quad \text{a.s.} $$

Since \(h(\boldsymbol{x}^{k})-h(\boldsymbol{x}^{*})\geq 0\), it follows that \(\lim_{k\to \infty}\frac{1}{2} \Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \Vert ^{2}_{ \sigma (\boldsymbol{P}\boldsymbol{B}-(\sigma \Xi +\Lambda ))}=0\); in particular, the sequence \((\frac{1}{2} \Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \Vert ^{2}_{\sigma ( \boldsymbol{P}\boldsymbol{B}-(\sigma \Xi +\Lambda ))} )_{k\geq 0}\) is bounded. By the definition of the energy function \(V_{k}(\boldsymbol{x}^{*})\), the sequence \((\boldsymbol{x}^{k})_{k\geq 0}\) is bounded, and combining the a.s. boundedness of \((\mathcal{E}_{k})_{k}\) with (31) yields \(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*})\leq \frac{2\mathcal{E}_{k}}{S^{2}_{k-1}}\). Hence, \(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*})=O(1/k^{2})\) a.s. Moreover, it is easy to see that \(\lim_{k\to \infty}\sum_{t=0}^{k}\gamma _{i}^{k,t}=1\) and \(\lim_{k\to \infty}\gamma _{i}^{k,t}=0\) for all \(i\in [d]\). Hence, by Lemma 2 and the Silverman–Toeplitz theorem, whenever \((\boldsymbol{x}^{k})_{k}\) converges, the sequence \((\boldsymbol{w}^{k})_{k}\) converges to the same limit a.s.

By definition of the energy function \(V_{k}\), the sequence \((\boldsymbol{x}^{k})_{k\geq 0}\) is a.s. bounded. We show next that all accumulation points are contained in the set of solutions \(\mathcal{X}^{\ast}\) of (P).

Let \(\Omega _{1}\) be a set of probability 1 on which the subsequence \((\boldsymbol{x}^{k_{j}}(\omega ))_{j\in \mathbb{N}}\) converges to a limiting random variable \(\bar{\boldsymbol{x}}(\omega )\) for all \(\omega \in \Omega _{1}\). It follows from \(\lim_{j\to \infty} \Vert \hat{\boldsymbol{x}}^{k_{j}}(\omega )-\boldsymbol{x}^{k_{j}}(\omega ) \Vert =0\) on \(\Omega _{1}\) that \((\hat{\boldsymbol{x}}^{k_{j}})_{j\in \mathbb{N}}\) converges to \(\bar{\boldsymbol{x}}\) as well on \(\Omega _{1}\). Furthermore,

$$\begin{aligned} h\bigl(\boldsymbol{z}^{k}\bigr)-h\bigl(\boldsymbol{x}^{\ast} \bigr)&=\theta _{k}\bigl[h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{ \ast}\bigr)\bigr]+(1-\theta _{k})\bigl[h\bigl(\boldsymbol{x}^{k}\bigr)-h\bigl( \boldsymbol{x}^{\ast}\bigr)\bigr] \\ &\quad {}-\frac{\theta _{k}(1-\theta _{k})}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \\ &=O\bigl(k^{-2}\bigr) \end{aligned}$$

and for all \(\boldsymbol{x}^{\ast}\in \mathcal{X}\),

$$\begin{aligned} \bigl\langle {\nabla h\bigl(\boldsymbol{z}^{k}\bigr)},{\hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{\ast }} \bigr\rangle &= \bigl\langle {\boldsymbol{A}^{\top }\boldsymbol{A}\bigl( \boldsymbol{z}^{k}-\boldsymbol{x}^{\ast }\bigr)},{ \hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{\ast }}\bigr\rangle \\ &\leq \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{z}^{k}- \boldsymbol{x}^{\ast}\bigr) \bigr\Vert \cdot \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{\ast} \bigr) \bigr\Vert \\ &\leq 2\sqrt{h\bigl(\boldsymbol{z}^{k}\bigr)-h\bigl( \boldsymbol{x}^{\ast}\bigr)}\cdot \sqrt{h\bigl( \hat{ \boldsymbol{x}}^{k+1}\bigr)-h\bigl(\boldsymbol{x}^{\ast} \bigr)}=o\bigl(k^{-1}\bigr). \end{aligned}$$

Hence, \(\lim_{k\to \infty}S_{k}\langle{\nabla h(\boldsymbol{z}^{k})},{\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{\ast }}\rangle =0\) for all \(\boldsymbol{x}^{\ast}\in \mathcal{X}\) a.s. Using Lemma 1 at the point \(\boldsymbol{x}^{\ast}\in \mathcal{X}^{\ast}\subset \mathcal{X}\) shows that

$$\begin{aligned} &R\bigl(\hat{\boldsymbol{x}}^{k_{j+1}}\bigr)+\bigl\langle {\nabla \Phi \bigl(\boldsymbol{x}^{k_{j}}\bigr)},{ \hat{\boldsymbol{x}}^{k_{j+1}}- \boldsymbol{x}^{k_{j}}}\bigr\rangle +S_{k}\bigl\langle { \nabla h\bigl( \boldsymbol{z}^{k_{j}}\bigr)},{\hat{\boldsymbol{x}}^{k_{j+1}}- \boldsymbol{z}^{k_{j}}}\bigr\rangle + \frac{1}{2} \bigl\Vert \hat{\boldsymbol{x}}^{k_{j+1}}-\boldsymbol{x}^{k_{j}} \bigr\Vert ^{2}_{\boldsymbol{Q}} \\ &\quad \leq R\bigl(\boldsymbol{x}^{\ast}\bigr)+\bigl\langle {\nabla \Phi \bigl(\boldsymbol{x}^{k_{j}}\bigr)},{\boldsymbol{x}^{ \ast }- \boldsymbol{x}^{k_{j}}}\bigr\rangle +S_{k}\bigl\langle { \nabla h\bigl(\boldsymbol{z}^{k_{j}}\bigr)},{ \boldsymbol{x}^{\ast }- \boldsymbol{z}^{k_{j}}}\bigr\rangle \\ &\quad \quad {}+\frac{1}{2} \bigl\Vert \boldsymbol{x}^{\ast}-\boldsymbol{x}^{k_{j}} \bigr\Vert ^{2}_{\boldsymbol{Q}}-\frac{1}{2} \bigl\Vert \boldsymbol{x}^{\ast}-\hat{\boldsymbol{x}}^{k_{j+1}} \bigr\Vert ^{2}_{\boldsymbol{Q}}. \end{aligned}$$

Letting \(j\to \infty \) and using the lower semi-continuity of the function \(R(\cdot )\), one arrives at the inequality

$$ R(\bar{\boldsymbol{x}})\leq R\bigl(\boldsymbol{x}^{\ast}\bigr)+\bigl\langle {\nabla \Phi ( \bar{\boldsymbol{x}})},{\boldsymbol{x}^{\ast }- \bar{\boldsymbol{x}}}\bigr\rangle . $$

Since \(h(\boldsymbol{x}^{k_{j}})-h(\boldsymbol{x}^{\ast})\to 0\), the limit point \(\bar{\boldsymbol{x}}\) is an element of \(\mathcal{X}\). The convexity of Φ in turn yields \(\Phi (\boldsymbol{x}^{\ast})\geq \Phi (\bar{\boldsymbol{x}})+\langle{\nabla \Phi ( \bar{\boldsymbol{x}})},{\boldsymbol{x}^{\ast }-\bar{\boldsymbol{x}}}\rangle \). We conclude that \(\Psi (\bar{\boldsymbol{x}})\leq \Psi (\boldsymbol{x}^{\ast})\) and \(\bar{\boldsymbol{x}}\in \mathcal{X}\). Therefore, \(\bar{\boldsymbol{x}}\in \mathcal{X}^{\ast}\).

(ii) We next show that cluster points are unique and thereby demonstrate almost sure convergence of the last iterate. We argue by contradiction. Suppose there are converging subsequences \((\boldsymbol{x}^{k})_{k\in \mathcal{K}_{1}}\) and \((\boldsymbol{x}^{k})_{k\in \mathcal{K}_{2}}\) with limit points \(\boldsymbol{x}'\) and \(\boldsymbol{x}''\), respectively. Hence, \(\Psi (\boldsymbol{x}')=\Psi (\boldsymbol{x}'')\equiv \Psi ^{\ast}\). Following the above argument, we see that \(\boldsymbol{x}',\boldsymbol{x}''\in \mathcal{X}^{\ast}\) a.s. The point \(\boldsymbol{x}^{\ast}\) chosen in the previous argument was arbitrary, so we can replace it by \(\boldsymbol{x}'\). Let us simplify notation by setting \(a_{k}:=S_{k-1}(F_{k}-\Psi ^{\ast})\). Since \(V_{k}(\boldsymbol{x}')\) converges almost surely, we see that

$$\begin{aligned} \lim_{k\to \infty}V_{k}\bigl(\boldsymbol{x}' \bigr)&=\lim_{k\to \infty ,k\in \mathcal{K}_{1}}V_{k}\bigl( \boldsymbol{x}'\bigr)=\lim_{k\to \infty ,k\in \mathcal{K}_{1}} \biggl( \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}' \bigr\Vert ^{2}_{\boldsymbol{W}}+a_{k} \biggr) \\ &=\lim_{k\to \infty ,k\in \mathcal{K}_{1}}a_{k}. \end{aligned}$$

Similarly,

$$\begin{aligned} \lim_{k\to \infty}V_{k}\bigl(\boldsymbol{x}' \bigr)&=\lim_{k\to \infty ,k\in \mathcal{K}_{2}}V_{k}\bigl( \boldsymbol{x}'\bigr)=\lim_{k\to \infty ,k\in \mathcal{K}_{2}} \biggl( \frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}' \bigr\Vert ^{2}_{\boldsymbol{W}}+a_{k} \biggr) \\ &= \frac{1}{2} \bigl\Vert \boldsymbol{x}''- \boldsymbol{x}' \bigr\Vert ^{2}_{\boldsymbol{W}}+\lim _{k\to \infty ,k\in \mathcal{K}_{2}}a_{k}. \end{aligned}$$

We conclude that

$$ \lim_{k\to \infty ,k\in \mathcal{K}_{1}}a_{k}= \frac{1}{2} \bigl\Vert \boldsymbol{x}''-\boldsymbol{x}' \bigr\Vert ^{2}_{\boldsymbol{W}}+\lim_{k\to \infty ,k\in \mathcal{K}_{2}}a_{k}. $$

Repeating the same analysis for \(\boldsymbol{x}''\) instead of \(\boldsymbol{x}'\), we obtain

$$ \lim_{k\to \infty ,k\in \mathcal{K}_{2}}a_{k}= \frac{1}{2} \bigl\Vert \boldsymbol{x}''-\boldsymbol{x}' \bigr\Vert ^{2}_{\boldsymbol{W}}+\lim_{k\to \infty ,k\in \mathcal{K}_{1}}a_{k}. $$

Combining these two equalities, we see that \(\boldsymbol{x}'=\boldsymbol{x}''\). Therefore, the whole sequence \((\boldsymbol{x}^{k})_{k}\) converges pointwise almost surely to a solution. Lemma 2 yields the same assertion for \((\boldsymbol{w}^{k})_{k}\).

(iii) Taking expectations on both sides of (27) and iterating the thus-obtained expression gives

$$ \mathbb{E} \biggl[\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{W}}+S_{k-1} \bigl( \hat{\Psi}_{k}-\Psi \bigl(\boldsymbol{x}^{*} \bigr)\bigr)+S^{2}_{k-1}\bigl(h\bigl(\boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{*}\bigr)\bigr) \biggr]\leq \frac{1}{2} \bigl\Vert \boldsymbol{x}^{0}- \boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{W}}. $$

From here, we deduce that

$$ S_{k-1}\mathbb{E}\bigl[\hat{\Psi}_{k}-\Psi \bigl( \boldsymbol{x}^{*}\bigr)\bigr]\leq \frac{1}{2} \bigl\Vert \boldsymbol{x}^{0}-\boldsymbol{x}^{*} \bigr\Vert ^{2}_{\boldsymbol{W}}, $$

which gives \(\hat{\Psi}_{k}-\Psi (\boldsymbol{x}^{*})\leq O(1/k)\) a.s. Since \(\hat{\Psi}_{k}\geq \Psi (\boldsymbol{w}^{k})\), it follows \(\Psi (\boldsymbol{w}^{k})-\Psi (\boldsymbol{x}^{*})\leq O(1/k)\) a.s. To obtain a lower bound, it suffices to apply Eq. (30) and use the results from (i) to obtain

$$ \Psi \bigl(\boldsymbol{w}^{k}\bigr)-\Psi \bigl( \boldsymbol{x}^{\ast}\bigr)\geq -\delta \sqrt{h\bigl( \boldsymbol{w}^{k}\bigr)-h\bigl( \boldsymbol{x}^{\ast} \bigr)}=O(1/k)\quad \text{a.s.} $$

 □

We conclude this section with a concrete example showing that the matrix inequality (29) can be satisfied.

Example 3

Assume that \(\boldsymbol{\Xi}=\operatorname{\mathtt{blkdiag}}(\pi _{1}^{-1}\boldsymbol{A}_{1}^{\top}\boldsymbol{A}_{1},\ldots , \pi _{d}^{-1}\boldsymbol{A}_{d}^{\top}\boldsymbol{A}_{d})\). In this case Eq. (29) decomposes into the block-specific conditions \(\pi ^{-1}_{i}\boldsymbol{B}_{i}\succeq \Lambda _{i}+\frac{\sigma}{\pi _{i}} \boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{i}\). Let us assume further that \(\Lambda _{i}=L_{i}\boldsymbol{I}_{m_{i}}\) for a scalar \(L_{i}>0\). If we choose \(\boldsymbol{B}_{i}=\pi _{i}(\lambda _{i}+L_{i})\boldsymbol{I}_{m_{i}}\), where \(\lambda _{i}= \Vert \boldsymbol{A}_{i} \Vert _{2}^{2}\) is the squared spectral norm, then a sufficient condition for the matrix inequality is \(\lambda _{i}+L_{i}\geq L_{i}+\frac{\sigma}{\pi _{i}}\lambda _{i}\) for all \(i\in [d]\). This inequality is satisfied by choosing \(\sigma =\min_{i\in [d]} \pi _{i}\), which is the largest value compatible with Lemma 2 for a given set of activation probabilities.
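The following short Python sketch (illustration only; the problem data are randomly generated) verifies the block-wise inequality of Example 3 numerically: with \(\boldsymbol{B}_{i}=\pi _{i}(\lambda _{i}+L_{i})\boldsymbol{I}_{m_{i}}\), \(\lambda _{i}=\Vert \boldsymbol{A}_{i}\Vert _{2}^{2}\) and \(\sigma =\min_{i}\pi _{i}\), every block of \(\boldsymbol{P}\boldsymbol{B}-\sigma \Xi -\Lambda \) is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: d blocks, random A_i, arbitrary activation probabilities pi_i
# and smoothness constants L_i (all values are made up for this sketch).
sizes = [3, 4, 5]           # block dimensions m_i
q = 6                       # number of rows of A
pis = np.array([0.2, 0.5, 0.9])
Ls = np.array([1.0, 2.5, 0.7])
As = [rng.standard_normal((q, m)) for m in sizes]

sigma = pis.min()           # sigma = min_i pi_i, as in Example 3

# Example 3 choices: Xi = blkdiag(pi_i^{-1} A_i^T A_i), Lambda_i = L_i I,
# B_i = pi_i (lambda_i + L_i) I with lambda_i = ||A_i||_2^2, and P_i = pi_i^{-1} I.
min_eigs = []
for A_i, pi_i, L_i in zip(As, pis, Ls):
    lam_i = np.linalg.norm(A_i, 2) ** 2           # squared spectral norm
    block = (lam_i + L_i) * np.eye(A_i.shape[1])  # block of P B
    block -= L_i * np.eye(A_i.shape[1])           # minus Lambda_i
    block -= (sigma / pi_i) * A_i.T @ A_i         # minus sigma * Xi_i
    min_eigs.append(np.linalg.eigvalsh(block).min())

print("smallest eigenvalue of P B - sigma*Xi - Lambda per block:", np.round(min_eigs, 6))
print("matrix inequality (29) holds:", all(e >= -1e-10 for e in min_eigs))
```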

4.3 The strongly convex case

In this section we study the performance of Algorithm 1 in the strongly convex regime.

Assumption 3

\(\Upsilon \succ 0\).

The main challenge we need to overcome is to design step-size sequences that obey the matrix inequalities (23) and (24). Let us focus on (23) first and show how it can be restated in a more symmetric way. To that end, define the matrix \(\boldsymbol{B}=\boldsymbol{P}^{-2}\Upsilon \) and \(\boldsymbol{M}^{1/2}:=\boldsymbol{P}^{-1/2}\Upsilon ^{1/2}\), so that \(\boldsymbol{M}=\boldsymbol{P}^{-1}\Upsilon \). In terms of this new parameter matrix, we see that

$$ \boldsymbol{P}^{-1}\Upsilon \succeq \tau _{k}\Lambda +\sigma _{k}\tau _{k} \boldsymbol{\Xi}\quad \iff \quad \frac{1}{\tau _{k}}\boldsymbol{I}\succeq \boldsymbol{M}^{-1/2}\Lambda \boldsymbol{M}^{-1/2}+\sigma _{k}\boldsymbol{M}^{-1/2}\boldsymbol{\Xi} \boldsymbol{M}^{-1/2}. $$
(33)

It is easy to see that Ξ is symmetric. Furthermore, for \(\boldsymbol{u}\in \mathbb{R}^{m}\setminus \{0\}\), we see that \(\boldsymbol{u}^{\top}\boldsymbol{\Xi}\boldsymbol{u}=\mathbb{E}[ \Vert \boldsymbol{A}\boldsymbol{t} \Vert ^{2}]\geq 0\), where \(\boldsymbol{t}:=\boldsymbol{P}\boldsymbol{E}_{k}\boldsymbol{u}\) is a random vector in \(\mathbb{R}^{m}\) with mean u. It follows that \(\boldsymbol{\Xi}\in \mathbb{S}^{m}_{+}\). This suggests relating the step-size parameters to the spectra of the involved matrices. The rest of this section assumes the following parametric model on the coupling between \(\sigma _{k}\) and \(\tau _{k}\).

Assumption 4

The sequence \((\tau _{k})_{k}\) is non-increasing and positive, and the sequences \((\sigma _{k})_{k}\) and \((\tau _{k})_{k}\) are related by the coupling equation

$$ \sigma _{k}\tau _{k}=\alpha -\beta \tau _{k}\quad \forall k\geq 0, $$
(34)

where α and β are positive numbers.

We note that this coupling relation automatically means that \(\sigma _{k}=\alpha /\tau _{k}-\beta \) is non-decreasing. A sufficient condition to satisfy Eq. (33) is

$$ \frac{1}{\tau _{k}}\geq \lambda _{\max}\bigl(\boldsymbol{M}^{-1/2} \Lambda \boldsymbol{M}^{-1/2}\bigr)+ \sigma _{k}\lambda _{\max}\bigl(\boldsymbol{M}^{-1/2}\Xi \boldsymbol{M}^{-1/2} \bigr). $$

Using the model (34), this gives

$$ \frac{1}{\tau _{k}}\bigl(1-\alpha \lambda _{\max}\bigl( \boldsymbol{M}^{-1/2}\boldsymbol{\Xi} \boldsymbol{M}^{-1/2} \bigr)\bigr) \geq \lambda _{\max}\bigl(\boldsymbol{M}^{-1/2} \Lambda \boldsymbol{M}^{-1/2}\bigr)- \beta \lambda _{\max} \bigl(\boldsymbol{M}^{-1/2}\boldsymbol{\Xi}\boldsymbol{M}^{-1/2} \bigr). $$

Choosing

$$ \alpha :=\lambda _{\max}\bigl(\boldsymbol{M}^{-1/2} \boldsymbol{\Xi}\boldsymbol{M}^{-1/2}\bigr)^{-1}, $$
(35)

we obtain

$$ \beta \geq \frac{\lambda _{\max}(\boldsymbol{M}^{-1/2}\Lambda \boldsymbol{M}^{-1/2})}{\lambda _{\max}(\boldsymbol{M}^{-1/2}\boldsymbol{\Xi}\boldsymbol{M}^{-1/2})}= \alpha \lambda _{\max}\bigl( \boldsymbol{M}^{-1/2}\Lambda \boldsymbol{M}^{-1/2}\bigr). $$

We make the choice

$$ \beta =\alpha \lambda _{\max}\bigl( \boldsymbol{M}^{-1/2}(\Lambda +\Upsilon )\boldsymbol{M}^{-1/2} \bigr), $$
(36)

so that the coupling equation (34) gives rise to a step-size process satisfying matrix inequality (23).

Remark 3

To obtain some intuition behind the derived conditions for α and β, consider the independent sampling case under which the operator Ξ factorises to \(\boldsymbol{\Xi}=\operatorname{\mathtt{blkdiag}}(\frac{1}{\pi _{1}}\boldsymbol{A}_{1}^{\top}\boldsymbol{A}_{1}, \ldots ,\frac{1}{\pi _{d}}\boldsymbol{A}_{d}^{\top}\boldsymbol{A}_{d})\). Further, suppose \(\Lambda =\operatorname{\mathtt{blkdiag}}[L_{1}\boldsymbol{I}_{m_{1}};\ldots ;L_{d}\boldsymbol{I}_{m_{d}}]\). Then, the above choice says that

$$\begin{aligned} &\alpha = \biggl(\max_{i\in [d]}\frac{1}{\mu _{i}\pi _{i}^{2}} \Vert \boldsymbol{A}_{i} \Vert _{2}^{2} \biggr)^{-1} \quad \text{and} \end{aligned}$$
(37)
$$\begin{aligned} &\kappa :=\frac{\beta}{\alpha}=\max_{i\in [d]} \frac{L_{i}+\mu _{i}}{\pi _{i}\mu _{i}}. \end{aligned}$$
(38)

Clearly, κ is related to the condition number of problem (P), and satisfies \(\kappa \geq \frac{1}{\pi _{i}}\) for all \(i\in [d]\). We point out that the explicit expression for κ does not rely on the independent sampling assumption. Only the quantification of the parameter α exploits this special structure.
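For concreteness, the parameters of Remark 3 can be evaluated as in the following Python sketch (the block data are made up, and we use the identification \(\Upsilon _{i}=\mu _{i}\boldsymbol{I}_{m_{i}}\) of the strong-convexity moduli that is implicit in (37)–(38)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up block data: matrices A_i, strong-convexity moduli mu_i, smoothness
# constants L_i and activation probabilities pi_i (independent sampling).
sizes = [3, 4, 5]
q = 6
As  = [rng.standard_normal((q, m)) for m in sizes]
mus = np.array([0.5, 1.0, 2.0])
Ls  = np.array([1.0, 2.5, 0.7])
pis = np.array([0.2, 0.5, 0.9])

# Eq. (37): alpha = ( max_i ||A_i||_2^2 / (mu_i pi_i^2) )^{-1}
alpha = 1.0 / max(np.linalg.norm(A, 2) ** 2 / (mu * pi ** 2)
                  for A, mu, pi in zip(As, mus, pis))

# Eq. (38): kappa = beta / alpha = max_i (L_i + mu_i) / (pi_i mu_i)
kappa = max((L + mu) / (pi * mu) for L, mu, pi in zip(Ls, mus, pis))
beta = alpha * kappa

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, kappa = {kappa:.4f}")
print("kappa >= 1/pi_i for all i:", all(kappa >= 1.0 / pi for pi in pis))
```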

Let us fix these identified values for α and β, and continue our derivation with matrix inequality (24). Let us choose \(\boldsymbol{B}=\boldsymbol{P}^{-2}\Upsilon =\operatorname{\mathtt{blkdiag}}[\pi _{1}^{2}\mu _{1}\boldsymbol{I}_{m_{1}}; \ldots ;\pi _{d}^{2}\mu _{d}\boldsymbol{I}_{m_{d}}]\). Using the block structure, and the specification of the sequence \((\sigma _{k})_{k}\), it can be verified that (24) reduces to the scalar-valued inequality

$$ \tau ^{2}_{k+1} \bigl[(\alpha -\beta \tau _{k}) (\pi _{i}+\tau _{k})+ \beta (1- \pi _{i})\tau ^{2}_{k} \bigr]-\tau _{k+1}\tau ^{2}_{k}\bigl[ \alpha (1-\pi _{i})-\beta \pi _{i}\bigr]-\tau ^{2}_{k} \alpha \pi _{i}\geq 0. $$
(39)

This defines a quadratic inequality in \(\tau _{k+1}\) of the form \(c_{i,1}(\tau _{k})\tau ^{2}_{k+1}+c_{i,2}(\tau _{k})\tau _{k+1}-c_{i,3}( \tau _{k})\geq 0\), with coefficients

$$\begin{aligned} &c_{i,1}(t):=(\alpha -\beta t) (\pi _{i}+t)+\beta (1- \pi _{i})t^{2}, \\ &c_{i,2}(t):=t^{2}\bigl[\beta \pi _{i}- \alpha (1-\pi _{i})\bigr], \\ &c_{i,3}(t):=t^{2}\alpha \pi _{i}. \end{aligned}$$

This suggests using the recursive step-size policy

$$ \tau _{k+1}=\max_{i\in [d]} \frac{\frac{\tau ^{2}_{k}}{2}(1/\pi _{i}-\kappa -1)+\tau _{k}\sqrt{ (1+(1/\pi _{i}-\kappa )\frac{\tau _{k}}{2} )^{2}- (\frac{2-\pi _{i}}{\pi _{i}}+2\kappa )\frac{\tau ^{2}_{k}}{4}}}{1+\tau _{k}(1/\pi _{i}-\kappa )-\kappa \tau ^{2}_{k}}. $$
(40)

We study the qualitative properties of the resulting sequence \((\tau _{k})_{k}\) in detail in Appendix B, under the following assumptions:

Assumption 5

For all \(i\in [d]\) we have \(\Lambda _{i}=L_{i}\boldsymbol{I}_{m_{i}}\).

Assumption 6

The sampling is uniform: \(\pi _{i}\equiv \pi \in (0,1)\) for all \(i\in [d]\).

Uniform sampling is a very common sampling scheme in the literature. It contains as special cases the single-coordinate activation scheme (i.e. \(\pi _{i}=\frac{1}{d}\)) and the frequently employed m-nice sampling (cf. Sect. 2.7). Under this assumption we prove that, if \(\tau _{0}\) is chosen sufficiently small, the sequence \((\tau _{k})_{k}\) exhibits the same qualitative behaviour as a Nesterov accelerated method [41, 42]. This is summarised in Lemma 7, whose proof is provided in Appendix B.

Lemma 7

Let Assumptions 4, 5 and 6 hold, and consider the step-size sequences \((\sigma _{k})_{k}\), \((\tau _{k})_{k}\) constructed recursively via Eq. (40), with initial condition \(\tau _{0}\in (0,1/\kappa )\). Then, \((\tau _{k})_{k}\) is monotonically decreasing and satisfies

$$ \tau _{k}\geq \frac{2\tau _{0}}{(1+\kappa -1/\pi )\tau _{0}k+2} \quad \forall k\geq 0. $$

In particular, \(\sigma _{k}=O(k)\) and \(S_{k}=O(k^{2})\).
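As an illustration of Lemma 7 (with made-up values of π and κ, and uniform sampling as in Assumption 6), the following Python sketch iterates the recursion (40) and checks both the monotone decrease of \((\tau _{k})_{k}\) and the lower bound of Lemma 7; it is a sanity check, not part of the proof in Appendix B.

```python
import numpy as np

def tau_next(tau, pi, kappa):
    """One step of the recursion (40) under uniform sampling (pi_i = pi for all i)."""
    a = 1.0 / pi - kappa
    num = 0.5 * tau**2 * (a - 1.0) \
        + tau * np.sqrt((1.0 + 0.5 * a * tau)**2
                        - ((2.0 - pi) / pi + 2.0 * kappa) * tau**2 / 4.0)
    den = 1.0 + tau * a - kappa * tau**2
    return num / den

# Made-up parameters for the illustration.
pi, kappa = 0.5, 4.0          # kappa >= 1/pi, cf. Remark 3
tau0 = 0.2                    # tau0 in (0, 1/kappa)

taus = [tau0]
for _ in range(200):
    taus.append(tau_next(taus[-1], pi, kappa))
taus = np.array(taus)

# Lower bound of Lemma 7: tau_k >= 2 tau0 / ((1 + kappa - 1/pi) tau0 k + 2).
k = np.arange(len(taus))
lower = 2.0 * tau0 / ((1.0 + kappa - 1.0 / pi) * tau0 * k + 2.0)

print("monotonically decreasing:", bool(np.all(np.diff(taus) <= 1e-12)))
print("lower bound of Lemma 7 satisfied:", bool(np.all(taus >= lower - 1e-12)))
print("k * tau_k at k = 200 (O(1/k) behaviour):", 200 * taus[200])
```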

We therefore can prove accelerated rates for our scheme in the strongly convex case.

Theorem 8

Suppose that Assumptions 1–6 apply. Consider the sequence \((\boldsymbol{z}^{k},\boldsymbol{w}^{k},\boldsymbol{x}^{k})_{k\geq 0}\) generated by Algorithm 1 with the step-size policy constructed via Eqs. (34), (35) and (36). Then, the following hold:

  1. (i)

    We have \(h(\boldsymbol{x}^{k})-h(\boldsymbol{x}^{*})=o(k^{-2})\) and \(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{*})=O(k^{-4})\) a.s.

  2. (ii)

    \((\boldsymbol{x}^{k})_{k}\) and \((\boldsymbol{w}^{k})_{k}\) converge a.s. to the solution \(\mathcal{X}^{\ast}=\{\bar{\boldsymbol{x}}\}\).

  3. (iii)

    \(\mathopen{\lvert }\Psi (\boldsymbol{w}^{k})-\Psi (\boldsymbol{x}^{*})\mathclose{\rvert }=O(k^{-2})\) a.s.

Proof

(i) Since Assumption 1 guarantees the existence of a Lagrange multiplier, Lemma 5 still applies and shows that \(V_{k}(\boldsymbol{x}^{\ast})+\frac{\delta ^{2}}{2}\geq 0\) for all \(k\geq 0\). Moreover, the recursion (32) applies, which in the present case reads

$$ \mathbb{E}_{k}[\mathcal{E}_{k+1}]\leq \mathcal{E}_{k}-\sigma ^{2}_{k}\bigl(h\bigl( \boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr)-\frac{\sigma _{k}}{2\tau _{k}} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{P}^{-1}\Upsilon -\tau _{k}( \sigma _{k}\Xi +\Lambda )}. $$

Since \(\boldsymbol{P}^{-1}\Upsilon -\tau _{k}(\sigma _{k}\Xi +\Lambda )\succeq 0\), we can upper bound the right-hand side of the above display as

$$ \mathbb{E}_{k}[\mathcal{E}_{k+1}]\leq \mathcal{E}_{k}-\sigma ^{2}_{k}\bigl(h\bigl( \boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr)-\frac{\sigma _{k}}{2\tau _{k}} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{P}^{-1}\Upsilon}. $$
(41)

The supermartingale convergence theorem (Lemma 10) implies that

$$ \lim_{k\to \infty}\sigma ^{2}_{k}\bigl(h \bigl(\boldsymbol{x}^{k}\bigr)-h\bigl(\boldsymbol{x}^{*} \bigr)\bigr)=0 \quad \text{and}\quad \lim_{k\to \infty} \frac{\sigma _{k}}{2\tau _{k}} \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{P}^{-1}\Upsilon}=0. $$

Since \(\sigma _{k}=O(k)\), it follows that \(h(\boldsymbol{x}^{k})-h(\boldsymbol{x}^{\ast})=o(k^{-2})\). Furthermore, the sequence \((\boldsymbol{x}^{k})_{k}\) is bounded and Eq. (31) yields \(h(\boldsymbol{w}^{k})-h(\boldsymbol{x}^{\ast})= O(k^{-4})\), since \(S_{k}=O(k^{2})\) as shown in Appendix B. Moreover, \((\boldsymbol{w}^{k})_{k}\) and \((\boldsymbol{x}^{k})_{k}\) share the same limit.

(ii) We can repeat the arguments of Theorem 6 to conclude that all accumulation points of \((\hat{\boldsymbol{x}}^{k})_{k}\) and \((\boldsymbol{x}^{k})_{k}\) lie in \(\mathcal{X}^{\ast}\) a.s. However, since the problem is strongly convex, we must have \(\mathcal{X}^{\ast}=\{\bar{\boldsymbol{x}}\}\) for some \(\bar{\boldsymbol{x}}\in \mathcal{X}\). Hence, the sequences actually converge and the common limit point is \(\bar{\boldsymbol{x}}\) a.s.

(iii) We next show convergence rates in terms of the objective function gap. Iterating Eq. (41), it follows that \(\mathbb{E}[V_{k}(\boldsymbol{x}^{\ast})]\leq \frac{1}{2} \Vert \boldsymbol{x}^{0}-\boldsymbol{x}^{\ast} \Vert ^{2}_{\boldsymbol{W}_{0}}\). This is equivalent to

$$ \mathbb{E} \biggl[\frac{1}{2} \bigl\Vert \boldsymbol{x}^{k}- \boldsymbol{x}^{\ast} \bigr\Vert ^{2}_{ \boldsymbol{W}_{k}}+S_{k-1} \bigl(\hat{\Psi}_{k}-\Psi \bigl(\boldsymbol{x}^{\ast}\bigr) \bigr)+S^{2}_{k-1}\bigl(h\bigl( \boldsymbol{w}^{k} \bigr)-h\bigl(\boldsymbol{x}^{\ast}\bigr)\bigr) \biggr]\leq \frac{1}{2} \bigl\Vert \boldsymbol{x}^{0}- \boldsymbol{x}^{\ast} \bigr\Vert ^{2}_{\boldsymbol{W}_{0}}. $$

In particular,

$$ \mathbb{E}\bigl[\hat{\Psi}_{k}-\Psi \bigl(\boldsymbol{x}^{\ast} \bigr)\bigr]\leq \frac{1}{2S_{k-1}} \bigl\Vert \boldsymbol{x}^{0}- \boldsymbol{x}^{\ast} \bigr\Vert ^{2}_{\boldsymbol{W}_{0}}=O \bigl(k^{-2}\bigr). $$

Furthermore, \(\hat{\Psi}_{k}\geq \Psi (\boldsymbol{w}^{k})\), so that via Eq. (30) we obtain

$$\begin{aligned} \mathbb{E}\bigl[\Psi \bigl(\boldsymbol{w}^{k}\bigr)-\Psi \bigl( \boldsymbol{x}^{\ast}\bigr)\bigr]\leq O\bigl(k^{-2}\bigr) \quad \text{and}\quad \mathbb{E}\bigl[\Psi \bigl(\boldsymbol{w}^{k} \bigr)-\Psi \bigl(\boldsymbol{x}^{\ast}\bigr)\bigr]\geq - \delta \mathbb{E} \bigl[\sqrt{h\bigl(\boldsymbol{w}^{k}\bigr)-h\bigl( \boldsymbol{x}^{\ast}\bigr)} \bigr]. \end{aligned}$$

Combining these bounds, it follows that \(\mathbb{E}[\Psi (\boldsymbol{w}^{k})-\Psi (\boldsymbol{x}^{\ast})]=O(k^{-2})\). □

5 Application to power systems

In this section we apply Algorithm 1 to the distributed coordination of an energy grid in an AC-optimal power flow formulation. Specifically, we illustrate how our block-coordinate descent method provides an efficient way to replicate the transmission-level locational marginal prices at the distribution level (see Note 1).

5.1 Power-flow model

Consider a radial network with N buses \(\mathcal{N}=\{1,\dots ,N\}\), excluding the slack node 0. Hence, the distribution network assumes the structure of a tree graph, and we identify each transmission line with the node it connects to its parent; that is, line n joins bus n and its parent \(n_{-}\). The network is optimised over a time window \(\mathcal{T}=\{1,\dots ,T\}\). We use \(\boldsymbol{p}_{n}=(p_{n,1},\dots ,p_{n,T})\) and \(\boldsymbol{q}_{n}=(q_{n,1},\dots ,q_{n,T})\) to denote active and reactive power net consumption at bus n at each time point \(t\in \mathcal{T}\). We let \(\operatorname{\mathsf{i}}=\sqrt{-1}\) and write a complex number \(z\in \mathbb{C}\) as \(\text{Re}(z)+\operatorname{\mathsf{i}}\text{Im}(z)\). Thus, \(p_{n,t}<0\) means that there is production of energy at bus n at time t. At the slack node \(n=0\), we assume that power is only generated and not consumed, i.e. \(p_{0,t}\leq 0\). At node n, we denote by \(f_{n}=(f_{n,t})_{t\in \mathcal{T}}\) and \(g_{n}=(g_{n,t})_{t\in \mathcal{T}}\) the real and reactive power flows on line n, so that \(f_{n}+\operatorname{\mathsf{i}}g_{n}\) is the complex power injected into node n. By Ohm’s law, we have \(g_{n}=\text{Im}(V_{n}\bar{I}_{n})\) and \(f_{n}=\text{Re}(V_{n}\bar{I}_{n})\), where \(V_{n}\) denotes the complex voltage at bus n and \(I_{n}\) the complex current on line n. We let \(\ell _{n}=(\ell _{n,t})_{t\in \mathcal{T}}=\mathopen{\lvert }I_{n}\mathclose{\rvert }^{2}\) be the squared current magnitude, and \(R_{n}\) and \(X_{n}\) denote the series resistance and reactance of line n, respectively. Hence, \(z_{n}=R_{n}+\operatorname{\mathsf{i}}X_{n}\) is the series impedance, and \(y_{n}=z_{n}^{-1}\) the corresponding series admittance.

The load flow equations using the BFM [44, 45], without explicit consideration of transmission tap ratios, are as follows:

$$ \boldsymbol{f}_{n}-\sum_{m:m_{-}=n}(\boldsymbol{f}_{m}-R_{m}\boldsymbol{\ell }_{m})+\boldsymbol{p}_{n}+G_{n}\boldsymbol{v}_{n}=0, \quad [\boldsymbol{y}^{p}_{n}], $$
(42a)
$$ \boldsymbol{g}_{n}-\sum_{m:m_{-}=n}(\boldsymbol{g}_{m}-X_{m}\boldsymbol{\ell }_{m})+\boldsymbol{q}_{n}-B_{n}\boldsymbol{v}_{n}=0, \quad [\boldsymbol{y}^{q}_{n}], $$
(42b)
$$ \boldsymbol{v}_{n}-2(R_{n}\boldsymbol{f}_{n}+X_{n}\boldsymbol{g}_{n})+\bigl(R_{n}^{2}+X_{n}^{2}\bigr)\boldsymbol{\ell }_{n}=\boldsymbol{v}_{n_{-}}, $$
(42c)
$$ f_{n,t}^{2}+g_{n,t}^{2}\leq v_{n,t}\ell _{n,t},\quad \forall t\in \mathcal{T}, $$
(42d)
$$ f_{n,t}^{2}+g_{n,t}^{2}\leq S_{n}^{2},\quad \forall t\in \mathcal{T}, $$
(42e)
$$ (f_{n,t}-R_{n}\ell _{n,t})^{2}+(g_{n,t}-X_{n}\ell _{n,t})^{2}\leq S_{n}^{2},\quad \forall t\in \mathcal{T}, $$
(42f)
$$ \underline{V}_{n}\leq v_{n,t}\leq \overline{V}_{n},\quad \forall t\in \mathcal{T}, $$
(42g)

where

  • \(\boldsymbol{v}_{n}=(v_{n,1},\dots ,v_{n,T})\) and \(\boldsymbol{v}_{n_{-}}\) are the squared voltage magnitudes at buses n and \(n_{-}\),

  • \(\boldsymbol{\ell }_{n}\) is the squared current magnitude on branch \((n,n_{-})\),

  • \(\boldsymbol{f}_{n}\) and \(\boldsymbol{g}_{n}\) are the active and the reactive parts of the power flow over line \((n,n_{-})\),

  • \(R_{n}\) and \(X_{n}\) are the resistance and the reactance of branch \((n,n_{-})\),

  • \(G_{n}\) and \(B_{n}\) are the line conductance and susceptance at n.

Equations (42a) and (42b) are the active and reactive flow-conservation equations, (42c) expresses Ohm’s law for the branch \((n,n_{-})\), and (42d) is an SOCP relaxation of the power-flow identity \(f_{n,t}^{2} +g_{n,t}^{2} = v_{n,t} \ell _{n,t}\). There exist sufficient conditions under which the optimisation problem subject to (42a)–(42g) remains exact; see [4, 30, 46]. Equations (42e) and (42f) are limitations on the squared power-flow magnitude on \((n,n_{-})\), and (42g) gives lower and upper bounds on the voltage at n. Dual variables are attached to the coupling flow-conservation constraints; these are the DLMPs corresponding to active and reactive power.

For later reference, we point out that the network flow constraints (42a) and (42b) can be compactly summarised as \(\boldsymbol{A}_{0}\boldsymbol{x}_{0}+\sum_{a}\boldsymbol{A}_{a}\boldsymbol{x}_{a}=\boldsymbol{b}\) for suitably defined matrices \(\boldsymbol{A}_{0}\), \(\boldsymbol{A}_{a}\) and the right-hand side vector b. However, for our computational scheme to work, we do not need to assume that the linear constraint holds exactly. Instead, we assume that the linear relation \(\boldsymbol{A}_{0}\boldsymbol{x}_{0}+\sum_{a}\boldsymbol{A}_{a}\boldsymbol{x}_{a}=\boldsymbol{b}\) holds inexactly. Specifically, motivated by data-driven approaches for power-flow models [47], we assume that there exists a random variable ξ with bounded support such that

$$ \boldsymbol{A}_{0}\boldsymbol{x}_{0}+\sum _{a}\boldsymbol{A}_{a}\boldsymbol{x}_{a}= \boldsymbol{b}+\xi ,\quad \text{and}\quad \frac{1}{2} \biggl\Vert \boldsymbol{A}_{0}\boldsymbol{x}_{0}+\sum _{a}\boldsymbol{A}_{a}\boldsymbol{x}_{a}- \boldsymbol{b} \biggr\Vert ^{2} \leq \delta . $$

In such inexact regimes, conventional deterministic OPF solvers are not applicable, whereas Algorithm 3 is designed to handle such scenarios.

Algorithm 3 Privacy-preserving DLMP solver (PPDLMP)

5.2 Load aggregators

The set of buses \(\mathcal{N}\) is partitioned into a collection \(( \mathcal{N}_{a})_{a\in \mathcal{A}}\) of subsets, such that each node subset \(\mathcal{N}_{a}\) is managed by a Load Aggregator (LA) \(a\in \mathcal{A}\). Each LA controls the flexible net power consumption \(p_{n,t}\) and generation at each node \(n\in \mathcal{N}_{a}\), given at time t by

$$ p_{n,t} = p^{\operatorname{c}}_{n,t} -p^{\operatorname{p}}_{n,t} ,\qquad q_{n,t} = q^{\operatorname{c}}_{n,t} -q^{\operatorname{p}}_{n,t} , $$
(43a)

for all \(n\in \mathcal{N}\) and \(t\in \mathcal{T}\). \(p^{\operatorname{c}}_{n,t}\geq 0\) is the consumption part and \(p^{\operatorname{p}}_{n,t}\geq 0\) is the production part of the power profile. Power consumption and production at the nodes are made flexible by the presence of deferrable loads (electric vehicles, water heaters) and Distributed Energy Resources (DERs). The consumption at each node \(n\in \mathcal{N}\) must satisfy a global energy demand \(E_{n}\) over the full time window,

$$ \sum_{t\in \mathcal{T}}p^{\operatorname{c}}_{n,t} \geq E_{n} , \quad \forall n\in \mathcal{N}. $$
(43b)

Consumption and production are also constrained by power bounds and active to reactive power ratio:

$$ \underline{P}_{n,t}\leq p^{\operatorname{c}}_{n,t}\leq \overline{P}_{n,t},\quad \forall n\in \mathcal{N}, t\in \mathcal{T}, $$
(43c)
$$ q^{\operatorname{c}}_{n,t}=\tau ^{\operatorname{c}}_{n}p^{\operatorname{c}}_{n,t},\quad \forall n\in \mathcal{N}, t\in \mathcal{T}, $$
(43d)
$$ 0\leq p^{\operatorname{p}}_{n,t}\leq \overline{P^{\operatorname{p}}}_{n,t},\quad \forall n\in \mathcal{N}, t\in \mathcal{T}, $$
(43e)
$$ \underline{\rho}^{\operatorname{p}}_{n,t}p^{\operatorname{p}}_{n,t}\leq q^{\operatorname{p}}_{n,t}\leq \overline{\rho}^{\operatorname{p}}_{n,t}p^{\operatorname{p}}_{n,t},\quad \forall n\in \mathcal{N}, t\in \mathcal{T}. $$
(43f)

Constraints (43a)–(43f) define the feasible set \(\mathcal{X}_{a}\) of LA decisions, containing vectors \(\boldsymbol{x}_{a}=(\boldsymbol{p}_{n},\boldsymbol{q}_{n})_{n\in \mathcal{N}_{a}}\). Both consumption and production must be scheduled by the LA, taking into account the current spot-market prices and other specific local factors characterising the private objectives of the LA. Formally, there is a convex cost function \(\phi _{a}(\boldsymbol{x}_{a})\) that the LA would like to unilaterally minimise, subject to private feasibility \(\boldsymbol{x}_{a}\in \mathcal{X}_{a}\).
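To make the structure of \(\mathcal{X}_{a}\) concrete, the following minimal Python sketch (with toy data and our own variable names, based on the reconstruction of (43b)–(43f) above) checks whether a candidate consumption/production profile of a single node is feasible:

```python
import numpy as np

def feasible_node(p_c, p_p, q_c, q_p,
                  P_lo, P_hi, Pp_hi, E, tau_c, rho_lo, rho_hi, tol=1e-9):
    """Check (43b)-(43f) for one node over the time window (arrays indexed by t).

    Illustrative sketch of the constraint structure only; the parameter names
    are ours, not those of the paper's implementation.
    """
    ok_energy = np.sum(p_c) >= E - tol                                      # (43b)
    ok_pc     = np.all(P_lo - tol <= p_c) and np.all(p_c <= P_hi + tol)     # (43c)
    ok_qc     = np.allclose(q_c, tau_c * p_c, atol=tol)                     # (43d)
    ok_pp     = np.all(-tol <= p_p) and np.all(p_p <= Pp_hi + tol)          # (43e)
    ok_qp     = (np.all(rho_lo * p_p - tol <= q_p)
                 and np.all(q_p <= rho_hi * p_p + tol))                     # (43f)
    return all([ok_energy, ok_pc, ok_qc, ok_pp, ok_qp])

# Toy data for T = {0, 1}: a purely consuming node with a small energy demand.
p_c = np.array([0.3, 0.2]); q_c = 0.4 * p_c          # tau_c = 0.4
p_p = np.zeros(2);          q_p = np.zeros(2)
print(feasible_node(p_c, p_p, q_c, q_p,
                    P_lo=np.zeros(2), P_hi=np.array([0.5, 0.5]),
                    Pp_hi=np.zeros(2), E=0.4, tau_c=0.4,
                    rho_lo=0.0, rho_hi=0.0))          # -> True
```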

5.3 The distribution-system operator

In order to guarantee stability of the distribution network, the Distribution System Operator (DSO) takes the individual aggregators’ decisions into account and adjusts the power flows so that the flow-conservation constraints (42a) and (42b), together with the SOCP constraints (42c)–(42g), are satisfied. Let \(\boldsymbol{x}_{0}=(\boldsymbol{p}_{0},\boldsymbol{q}_{0},\boldsymbol{f},\boldsymbol{g},\boldsymbol{v},\boldsymbol{\ell })\) denote the vector of the variables controlled by the DSO, and define the DSO’s feasible set \(\mathcal{X}_{0}= \{ \boldsymbol{x}_{0}\vert \text{(42c)--(42g)} \text{ hold for }n\in \mathcal{N}\} \). Then, the set of DSO decision variables inducing a physically meaningful network flow for a given tuple of LA decisions \(\boldsymbol{x}_{\mathcal{A}}\) is described as \(\mathcal{F}(\boldsymbol{x}_{\mathcal{A}})=\{\boldsymbol{x}_{0}\in \mathcal{X}_{0} \vert \text{(42a)--(42b)} \text{ hold for }\boldsymbol{x}_{\mathcal{A}}\} \). Denoting the DSO cost function \(\phi _{0}(\boldsymbol{x}_{0})\), we arrive at the DSO’s decision problem

$$ \Psi (\boldsymbol{x}_{\mathcal{A}})=\min \bigl\{ \phi _{0}( \boldsymbol{x}_{0})\vert \boldsymbol{x}_{0} \in \mathcal{F}(\boldsymbol{x}_{\mathcal{A}})\bigr\} . $$

This represents the smallest costs to the DSO, given the profile of flexible net consumption and generation at each affiliated node \(n\in \mathcal{N}_{a}\).

5.4 A privacy-preserving DLMP solver

The privacy-preserving DLMP solver (PPDLMP), described in Algorithm 3, asks the DSO to adjust DLMPs based on the prevailing plans reported by the LAs. Once the price update is completed, a single LA is selected at random to adapt the power profile within the subnetwork it manages. The local update of the LA results in the bid vector \(w^{k}\), which is fed into the DSO's final computational step to perform the dispatch.

From a practical point of view, it is important to point out that, while executing PPDLMP, the bus-specific data (such as cost functions and power profiles) remain private information. This applies equally to the DSO and the LAs. Coordination of the system-wide behaviour is achieved by exchanging only the dual variables describing the DLMPs and the bids expressed by the LAs.

5.5 Convergence of PPDLMP

We deduce the convergence of PPDLMP from the analysis of Algorithm 1. In order to recover the OPF problem, we identify each function \(\phi _{i}\) with a cost function of the DSO or of an LA, and each \(r_{i}\) with the indicator function of the corresponding feasible set \(\mathcal{X}_{0}\) or \(\mathcal{X}_{a}\). The sampling technology is an example of a uniform sampling of pairs involving the DSO and one aggregator. Specifically, the set of realisations of the sampling is \(\Sigma = \{\{0,a\}:a\in \{1,\ldots ,|\mathcal{A}|\}\equiv \{1, \ldots ,p\} \}\). Each pair is realised with the uniform probability \((\Pi )_{0,a}=\frac{1}{p}\). The marginal distribution of the sampling technology is thus given by \(\pi _{0}=1\) and \(\pi _{a}=1/p\) for \(a\in \{1,\ldots ,p\}\). The optimisation problem is characterised by \(d=p+1\) blocks and weighting matrix \(\boldsymbol{P}=\operatorname{\mathtt{blkdiag}}[\boldsymbol{I}_{m_{0}}; p\boldsymbol{I}_{m_{1}};\ldots ; p\boldsymbol{I}_{m_{p}}]\), where \(m_{0}\) is the dimension of the feasible set of the DSO, and \(m_{a}\) is the dimension of the feasible set of aggregator \(a\in \mathcal{A}\). Now, define \(R(\boldsymbol{x})=r_{0}(\boldsymbol{x}_{0})+\sum_{a\in \mathcal{A}}r_{a}(\boldsymbol{x}_{a})\), where \(r_{0}(\boldsymbol{x}_{0})=\delta _{\mathcal{X}_{0}}(\boldsymbol{x}_{0})\) and \(r_{a}(\boldsymbol{x}_{a})=\delta _{\mathcal{X}_{a}}(\boldsymbol{x}_{a})\), in which \(\delta _{C}\) is the indicator function of a set C. We now show that Algorithm 3 is a special instantiation of Algorithm 2, which in turn we know to be equivalent to Algorithm 1.
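A minimal Python sketch of this sampling scheme (our own illustration) draws the pair \(\{0,a\}\) uniformly and confirms empirically that the marginal activation probabilities are \(\pi _{0}=1\) and \(\pi _{a}=1/p\):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4                                   # number of load aggregators (made up)
n_draws = 100_000

counts = np.zeros(p + 1)                # block 0 = DSO, blocks 1..p = aggregators
for _ in range(n_draws):
    a = rng.integers(1, p + 1)          # pick one aggregator uniformly: Pi_{0,a} = 1/p
    for i in (0, a):                    # the realised block set is {0, a}
        counts[i] += 1

print("empirical marginals:", np.round(counts / n_draws, 3))
print("theoretical marginals: pi_0 = 1, pi_a =", 1.0 / p)
```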

If the load aggregator a is chosen at step k, Line 5 in Algorithm 2 becomes

$$\begin{aligned} \boldsymbol{y}^{k+1}&=\boldsymbol{y}^{k} + \sigma \boldsymbol{A}\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr) + \sigma \boldsymbol{u}^{k+1} \\ &=\boldsymbol{y}^{k} + \sigma \boldsymbol{A}_{0} \bigl(\boldsymbol{x}_{0}^{k+1} -\boldsymbol{x}^{k}_{0} \bigr)+ \sigma p\boldsymbol{A}_{a}\bigl(\boldsymbol{x}^{k+1}_{a}- \boldsymbol{x}^{k}_{a}\bigr)+ \sigma \boldsymbol{u}^{k+1} \\ &=\boldsymbol{y}^{k}+\sigma \bigl[\boldsymbol{A}_{0} \bigl(2\boldsymbol{x}^{k+1}_{0}-\boldsymbol{x}^{k}_{0} \bigr)-b_{0}\bigr]+ \sigma p\boldsymbol{A}_{a}\bigl( \boldsymbol{x}^{k+1}_{a}-\boldsymbol{x}^{k}_{a} \bigr)+\boldsymbol{v}^{k+1}, \end{aligned}$$

where we have used the identity \(\boldsymbol{u}^{k+1}=\boldsymbol{A}\boldsymbol{x}^{k}-\boldsymbol{b}\) and define \(\boldsymbol{v}^{k}=\sigma \boldsymbol{u}^{k}-\sigma (\boldsymbol{A}_{0}\boldsymbol{x}^{k}_{0}-b_{0})= \sigma \sum_{a\in \mathcal{A}}(\boldsymbol{A}_{a}\boldsymbol{x}^{k}_{a}-b_{a})\). Exploiting Line 4 in Algorithm 2, we find that \(\boldsymbol{v}^{k}\) can be computed locally and inductively by choosing the initial condition \(\boldsymbol{v}^{0}=\sigma \sum_{a\in \mathcal{A}}(\boldsymbol{A}_{a}\boldsymbol{x}^{0}_{a}-b_{a})\) initially, and performing sequential updates

$$ \boldsymbol{v}^{k+1}=\boldsymbol{v}^{k}+\sigma \boldsymbol{A}_{a}\bigl(\boldsymbol{x}^{k+1}_{a}- \boldsymbol{x}^{k}_{a}\bigr)= \boldsymbol{v}^{k}+ \sigma \boldsymbol{s}^{k}, $$

where \(\boldsymbol{s}^{k}=\boldsymbol{A}_{a}(\boldsymbol{x}^{k+1}_{a}-\boldsymbol{x}^{k}_{a})\). This shows that

$$ \boldsymbol{y}^{k+1}=\boldsymbol{y}^{k}+\sigma \bigl[ \boldsymbol{A}_{0}\bigl(2\boldsymbol{x}^{k+1}_{0}- \boldsymbol{x}^{k}_{0}\bigr)-b_{0}\bigr]+ \sigma (p+1)\boldsymbol{s}^{k}+\boldsymbol{v}^{k}, $$

which agrees with Lines 4, 7 and 8 of Algorithm 3.
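For illustration, the bookkeeping in the last three displays can be transcribed directly into code. The sketch below uses made-up data and placeholder primal updates (the variable names are ours, not the paper's implementation) and performs one round of the updates of \(\boldsymbol{s}^{k}\), \(\boldsymbol{v}^{k}\) and the price vector \(\boldsymbol{y}^{k}\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up dimensions and data: one DSO block and p aggregator blocks.
q, m0, ma, p, sigma = 5, 6, 3, 4, 0.1
A0  = rng.standard_normal((q, m0)); b0 = rng.standard_normal(q)
Aa  = [rng.standard_normal((q, ma)) for _ in range(p)]
ba  = [rng.standard_normal(q) for _ in range(p)]
x0k = rng.standard_normal(m0); xak = [rng.standard_normal(ma) for _ in range(p)]
yk  = np.zeros(q)

# Initialisation v^0 = sigma * sum_a (A_a x_a^0 - b_a), as in the text.
v = sigma * sum(A @ x - b for A, x, b in zip(Aa, xak, ba))

# One iteration: the DSO updates x_0, a randomly drawn aggregator a updates x_a.
a = rng.integers(p)
x0_new = x0k + 0.01 * rng.standard_normal(m0)        # placeholder primal updates
xa_new = xak[a] + 0.01 * rng.standard_normal(ma)

s = Aa[a] @ (xa_new - xak[a])                        # s^k = A_a (x_a^{k+1} - x_a^k)
y_new = yk + sigma * (A0 @ (2 * x0_new - x0k) - b0) + sigma * (p + 1) * s + v
v = v + sigma * s                                    # v^{k+1} = v^k + sigma * s^k

print("updated price vector y^{k+1}:", np.round(y_new, 3))
```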

5.6 Numerical results

We apply Algorithm 3 to a realistic 15-bus network example based on the instance proposed in [5], over a time horizon \(\mathcal{T} = \{0, 1\}\). In contrast to [5], we consider variable, flexible active and reactive loads rather than fixed ones. The underlying parameter values are specified in Table 1: the line parameters \(R_{n}\), \(X_{n}\), \(S_{n}\), \(B_{n}\), \(V_{n}\), \(n\in \mathcal{N}\), are taken from [5], while the parameters \((\underline{\boldsymbol{P}}_{n},\overline{\boldsymbol{P}}_{n},E_{n},\tau ^{\mathrm{c}}_{n})_{n}\) describing the flexible loads are generated based on the fixed-load values of [5]. As in [5], bus 11 is the only bus to offer renewable production, with \(\overline{\boldsymbol{P}^{\mathrm{p}}}_{11} :=[0.438,0.201]\) and \(\underline{\rho ^{\mathrm{p}}}= \overline{\rho ^{\mathrm{p}}}= 0\) (the renewable production is fully active). The bounds \((\underline{V}_{n},\overline{V}_{n})\) are set to 0.81 and 1.21 for each \(n\in \mathcal{N}\), while \(V_{0}=1.0\).

Table 1 Parameters for the 15-bus network based on [5]

We consider a zero cost function for each LA (\(\phi _{a}=0\) for each \(a\in \mathcal{A}\)), meaning that LAs are indifferent between consumption profiles as long as their feasibility constraints are satisfied. This is a reasonable assumption in practice for certain types of consumption flexibilities (electric vehicles, batteries). We consider the DSO objective \(\phi (\boldsymbol{x})= \phi _{0}(\boldsymbol{x}_{0})= \sum_{t\in \mathcal{T}}c_{t}( p^{\mathrm{p}}_{0t}) + k^{\text{loss}} \sum_{n,t} R_{n} \ell _{nt}\), with loss-penalisation factor \(k^{\text{loss}}=0.001\) and generation costs \(c_{0}:p \mapsto 2p + p^{2} \) and \(c_{1} : p \mapsto p \), giving an expensive time period and a cheap one, which can be interpreted as peak and off-peak periods.
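For completeness, the DSO objective used in the experiments can be written as the following small Python function (our own sketch; the array layout of the variables is an assumption):

```python
import numpy as np

K_LOSS = 0.001                         # loss-penalisation factor k^loss

def c(t, power):
    """Root-node generation cost: c_0(p) = 2p + p^2 (peak), c_1(p) = p (off-peak)."""
    return 2.0 * power + power**2 if t == 0 else power

def dso_objective(p0_prod, R, ell):
    """phi_0(x_0) = sum_t c_t(p^p_{0,t}) + k^loss * sum_{n,t} R_n * ell_{n,t}.

    p0_prod : array of shape (T,)   -- production at the slack bus per period
    R       : array of shape (N,)   -- line resistances
    ell     : array of shape (N, T) -- squared current magnitudes
    """
    gen_cost  = sum(c(t, p) for t, p in enumerate(p0_prod))
    loss_cost = K_LOSS * float(np.sum(R[:, None] * ell))
    return gen_cost + loss_cost

# Toy numbers only (not the Table 1 data).
print(dso_objective(np.array([0.8, 0.5]),
                    R=np.full(15, 0.01),
                    ell=np.full((15, 2), 0.2)))
```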

The solution obtained by Algorithm 3 after 2000 iterations is illustrated in Fig. 1, which displays the directions of the active flows as well as the distribution locational marginal price (DLMP) values.

Figure 1 Directions of active flows f and DLMPs \((y^{p},y^{q} )\) at the solution given by Algorithm 3. Saturated lines are dashed

The solutions show that the active (and reactive) DLMPs obtained for each time period are close to the DLMPs at the root node \((\boldsymbol{y}^{\mathrm{p}}_{0}, \boldsymbol{y}^{\mathrm{q}}_{0})\), with the following exceptions:

  • For the branch composed of nodes 8, 7, 9, 10, 11, the active DLMPs are close to 0.0 due to the presence of renewable production (at zero cost) at node 11 and of a negative load at node 7, which together fully compensate for the demand on this branch. Since line \((3,8)\) is saturated, no energy can be exported further.

  • Active DLMPs on the branch composed of nodes \((12,13,14)\) at \(t=1\) are much larger than on other nodes: this is explained by the congestion of line \((0,12)\).

  • The DLMP for node 7 and \(t=0\) is strictly negative: the (negative) consumption for this node is at its upper bound \(p_{7,0}=\overline{P}_{7,0}=-0.173\). The negative DLMP suggests that the system will be better off if less power is injected by node 7.

Convergence of Algorithm 3 for the 15-bus network is shown in Figs. 2 and 3. Figure 2 displays the convergence of the last iterate with respect to various criteria:

  1. (i)

    convergence of \(\phi (\boldsymbol{x}^{k})\) to the optimal cost \(\phi (\boldsymbol{x}^{\ast })\), where we computed \(\boldsymbol{x}^{\ast}\) using the CvxOpt Python library;

  2. (ii)

    convergence to zero of the primal residuals \(h(\boldsymbol{x}^{k})\) and \(\Vert A\boldsymbol{x}^{k}-b \Vert \);

  3. (iii)

    convergence to zero of the stopping criterion based on the KKT residual \(R^{\text{KKT}}(\boldsymbol{x}^{k},y^{k})=\text{dist}_{\infty}(0,(\partial _{x},- \partial _{y})L(\boldsymbol{x}^{k},y^{k}))\), where \(L(\boldsymbol{x},y)=\Phi (\boldsymbol{x})+\langle{y},{A\boldsymbol{x}-b}\rangle \) denotes the Lagrangian of (P) in the feasible case (see the sketch after this list);

  4. (iv)

    convergence to zero of the primal infeasibility along the averaged trajectory \(\boldsymbol{w}^{k}\), displayed in Fig. 3, as predicted by the theory.
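The stopping criterion in (iii) can be evaluated as in the following sketch, which implements \(R^{\text{KKT}}\) for the smooth Lagrangian \(L(\boldsymbol{x},y)=\Phi (\boldsymbol{x})+\langle y,A\boldsymbol{x}-b\rangle \) stated above (the set constraints encoded in R are not included here, and all names are ours):

```python
import numpy as np

def kkt_residual(x, y, grad_Phi, A, b):
    """R^KKT(x, y) = dist_inf(0, (d_x L, -d_y L)) for L(x, y) = Phi(x) + <y, Ax - b>.

    With this smooth Lagrangian, d_x L = grad Phi(x) + A^T y and d_y L = Ax - b,
    so the residual is the larger of the two infinity norms.
    """
    stat = np.linalg.norm(grad_Phi(x) + A.T @ y, ord=np.inf)   # stationarity
    feas = np.linalg.norm(A @ x - b, ord=np.inf)               # primal feasibility
    return max(stat, feas)

# Toy example: Phi(x) = 0.5 ||x||^2, so grad Phi(x) = x.
rng = np.random.default_rng(5)
A = rng.standard_normal((3, 5)); b = rng.standard_normal(3)
x = rng.standard_normal(5);      y = rng.standard_normal(3)
print(kkt_residual(x, y, lambda z: z, A, b))
```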

Figure 2 Convergence of DLMPs \(y^{k}\) and the objective function value \(\Phi (\boldsymbol{x}^{k})\)

Figure 3 Convergence of the average iterate \(\boldsymbol{w}^{k}\)

6 Conclusion

In this paper we developed a novel random block-coordinate descent method for ill-posed convex non-smooth optimisation problems. Our scheme gives optimal iteration complexity guarantees in terms of the last iterate of the sequence generated by the algorithm. As such, we directly generalise results from [26] to a fully distributed optimisation setting. Motivated by the need to achieve a distributed optimisation of the power system, we follow a data-driven approach leading to a potentially inconsistent AC-OPF formulation. We show that our algorithmic scheme is immediately applicable to this challenging and important setting and achieves distributed control of the power grid via distributed locational marginal prices, which act as price signals to independent agents (i.e. aggregators). Future work will consist in extending the method to non-convex optimisation problems, so that the exact formulation of the power-flow constraints can be handled. We also plan to extend this work to settings where the electric network is exposed to stochastic uncertainty.

Availability of data and materials

Not applicable.

Notes

  1. A preliminary version of this application is studied in our conference paper [43]. We refer to this paper for background and further motivation.

References

  1. Facchinei, F., Kanzow, C.: Generalized Nash equilibrium problems. 4OR 5, 173–210 (2007)


  2. Grasmair, M., Haltmeier, M., Scherzer, O.: The residual method for regularizing ill-posed problems. Appl. Math. Comput. 218(6), 2693–2710 (2011). https://doi.org/10.1016/j.amc.2011.08.009


  3. Wood, A.J., Wollenberg, B.F., Sheblé, G.B.: Power Generation, Operation, and Control. Wiley, New York (2013)


  4. Bienstock, D.: Electrical Transmission System Cascades and Vulnerability: An Operations Research Viewpoint. MOS-SIAM Series on Optimization (2015)


  5. Papavasiliou, A.: Analysis of distribution locational marginal prices. IEEE Trans. Smart Grid 9(5), 4872–4882 (2018). https://doi.org/10.1109/TSG.2017.2673860


  6. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)


  7. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1), 615–642 (2015). https://doi.org/10.1007/s10107-014-0800-2

  8. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1), 1–38 (2014). https://doi.org/10.1007/s10107-012-0614-z

  9. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)

  10. Glowinski, R., Marroco, A.: Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Rev. Fr. Autom. Inform. Rech. Opér., Anal. Numér. 9(R2), 41–76 (1975)

  11. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976). https://doi.org/10.1016/0898-1221(76)90003-1

  12. Han, D., Sun, D., Zhang, L.: Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Math. Oper. Res. 43(2), 622–637 (2017). https://doi.org/10.1287/moor.2017.0875

  13. Eckstein, J., Yao, W.: Understanding the convergence of the alternating direction method of multipliers: theoretical and computational perspectives. Pac. J. Optim. 11(4), 619–644 (2015)

  14. Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math. Program. 155(1), 57–79 (2016). https://doi.org/10.1007/s10107-014-0826-5

  15. Gao, X., Xu, Y.-Y., Zhang, S.-Z.: Randomized primal–dual proximal block coordinate updates. J. Oper. Res. Soc. China 7(2), 205–250 (2019)

  16. Gao, X., Zhang, S.-Z.: First-order algorithms for convex optimization with nonseparable objective and coupled constraints. J. Oper. Res. Soc. China 5(2), 131–159 (2017). https://doi.org/10.1007/s40305-016-0131-5

  17. Deng, W., Lai, M.-J., Peng, Z., Yin, W.: Parallel multi-block ADMM with \(o(1 / k)\) convergence. J. Sci. Comput. 71(2), 712–736 (2016). https://doi.org/10.1007/s10915-016-0318-2

  18. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^{2})\). Sov. Math. Dokl. 27(2), 372–376 (1983)

  19. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542

  20. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013). https://doi.org/10.1007/s10107-012-0629-5

  21. Kang, M., Kang, M., Jung, M.: Inexact accelerated augmented Lagrangian methods. Comput. Optim. Appl. 62(2), 373–404 (2015). https://doi.org/10.1007/s10589-015-9742-8

  22. Malitsky, Y.: The primal-dual hybrid gradient method reduces to a primal method for linearly constrained optimization problems (2019)

  23. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011). https://doi.org/10.1007/s10851-010-0251-1

  24. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization (2008)

  25. Malitsky, Y.: Chambolle-Pock and Tseng’s methods: relationship and extension to the bilevel optimization (2017)

  26. Luke, D.R., Malitsky, Y.: Block-coordinate primal-dual method for nonsmooth minimization over linear constraints. In: Giselsson, P., Rantzer, A. (eds.) Large-Scale and Distributed Optimization, pp. 121–147. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-97478-1_6

  27. Tran-Dinh, Q., Liu, D.: Faster randomized primal-dual algorithms for nonsmooth composite convex minimization. arXiv preprint (2020) arXiv:2003.01322

  28. Xu, Y., Zhang, S.: Accelerated primal-dual proximal block coordinate updating methods for constrained convex optimization. Comput. Optim. Appl. 70(1), 91–128 (2018). https://doi.org/10.1007/s10589-017-9972-z

  29. Tran-Dinh, Q., Liu, D.: A new randomized primal-dual algorithm for convex optimization with fast last iterate convergence rates. Optim. Methods Softw. 38(1), 184–217 (2023). https://doi.org/10.1080/10556788.2022.2119233

  30. Farivar, M., Low, S.H.: Branch flow model: relaxations and convexification—part I. IEEE Trans. Power Syst. 28(3), 2554–2564 (2013). https://doi.org/10.1109/TPWRS.2013.2255317

  31. Peng, Q., Low, S.H.: Distributed optimal power flow algorithm for radial networks, I: balanced single phase case. IEEE Trans. Smart Grid 9(1), 111–121 (2018). https://doi.org/10.1109/TSG.2016.2546305

  32. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015). https://doi.org/10.1137/130949993

  33. Ryu, E.K., Yin, W.: Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press, Cambridge (2022). https://doi.org/10.1017/9781009160865

  34. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling I: algorithms and complexity. Optim. Methods Softw. 31(5), 829–857 (2016). https://doi.org/10.1080/10556788.2016.1190360

  35. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016). https://doi.org/10.1080/10556788.2016.1190361

  36. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  37. Necoara, I., Clipici, D.: Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. J. Process Control 23(3), 243–253 (2013)

  38. Bianchi, P., Hachem, W., Iutzeler, F.: A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Trans. Autom. Control 61(10), 2947–2957 (2015)

  39. Fercoq, O., Bianchi, P.: A coordinate-descent primal-dual algorithm with large step size and possibly nonseparable functions. SIAM J. Optim. 29(1), 100–134 (2019). https://doi.org/10.1137/18M1168480

  40. Latafat, P., Freris, N.M., Patrinos, P.: A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Trans. Autom. Control 64(10), 4050–4065 (2019)

  41. Nesterov, Y.: Lectures on Convex Optimization. Springer Optimization and Its Applications, vol. 137. Springer, Berlin (2018)

  42. d’Aspremont, A., Scieur, D., Taylor, A.: Acceleration methods. Found. Trends Optim. 5(1–2), 1–245 (2021)

  43. Bilenne, O., Jacquot, P., Oudjane, N., Staudigl, M., Wan, C.: A privacy-preserving distributed computational approach for distributed locational marginal prices. In: 61st IEEE Conference on Decision and Control (2022)

  44. Baran, M.E., Wu, F.F.: Optimal capacitor placement on radial distribution systems. IEEE Trans. Power Deliv. 4(1), 725–734 (1989). https://doi.org/10.1109/61.19265

  45. Molzahn, D.K., Hiskens, I.A.: A survey of relaxations and approximations of the power flow equations. Found. Trends Electr. Energy Syst. 4(1–2), 1–221 (2019). https://doi.org/10.1561/3100000012

  46. Gan, L., Li, N., Topcu, U., Low, S.H.: Exact convex relaxation of optimal power flow in radial networks. IEEE Trans. Autom. Control 60(1), 72–87 (2015). https://doi.org/10.1109/TAC.2014.2332712

  47. Mezghani, I., Misra, S., Deka, D.: Stochastic ac optimal power flow: a data-driven approach. Electr. Power Syst. Res. 189, 106567 (2020)

  48. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2016)

  49. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Rustagi, J.S. (ed.) Optimizing Methods in Statistics, pp. 233–257. Academic Press, San Diego (1971). https://doi.org/10.1016/B978-0-12-604550-5.50015-8. https://www.sciencedirect.com/science/article/pii/B9780126045505500158

Funding

Open Access funding enabled and organized by Projekt DEAL. This research benefited from the support of the FMJH Program Gaspard Monge for optimisation and operations research and their interactions with data science.

Author information

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Mathias Staudigl.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Technical appendix

1.1 A.1 General facts

The following relations will be useful in the analysis.

Lemma 9

([48], Corollary 2.14)

For all \(\boldsymbol{x},\boldsymbol{y}\in \mathbb{R}^{d}\) and \(t\in [0,1]\), one has

$$ \bigl\Vert t\boldsymbol{x}+(1-t)\boldsymbol{y} \bigr\Vert ^{2}=t \Vert \boldsymbol{x} \Vert ^{2}+(1-t) \Vert \boldsymbol{y} \Vert ^{2}-t(1-t) \Vert \boldsymbol{x}-\boldsymbol{y} \Vert ^{2}. $$

Let \((\Omega ,\mathcal{F},(\mathcal{F}_{k})_{k\geq 0},\mathbb{P})\) be a filtered probability space satisfying the usual conditions. We need the celebrated Robbins–Siegmund Lemma for the convergence analysis [49].

Lemma 10

For every \(k\geq 0\), let \(\theta _{k}\), \(u_{k}\), \(\zeta _{k}\) and \(t_{k}\) be non-negative \(\mathcal{F}_{k}\)-measurable random variables such that \((\zeta _{k})_{k\geq 0}\) and \((t_{k})_{k\geq 0}\) are summable and for all \(k\geq 0\),

$$ \mathbb{E}[\theta _{k+1}\vert \mathcal{F}_{k}]\leq (1+t_{k})\theta _{k}+ \zeta _{k}-u_{k} \quad \mathbb{P}\textit{-a.s.} $$
(A.1)

Then, \((\theta _{k})_{k\geq 0}\) converges and \((u_{k})_{k\geq 0}\) is summable \(\mathbb{P}\)-a.s.
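As a toy illustration of Lemma 10, the sketch below runs a deterministic instance of recursion (A.1), so that the conditional expectation is trivial and (A.1) holds with equality; all coefficients are illustrative, not taken from the analysis.

    import numpy as np

    K = 2000
    t = 1.0 / np.arange(1, K + 1) ** 2        # summable perturbations (t_k)
    zeta = 1.0 / np.arange(1, K + 1) ** 2     # summable noise (zeta_k)
    theta, u_sum = 5.0, 0.0
    for k in range(K):
        u = 0.3 * theta                       # non-negative u_k <= (1 + t_k) theta_k + zeta_k
        theta = (1.0 + t[k]) * theta + zeta[k] - u   # (A.1) with equality
        u_sum += u
    print(theta, u_sum)   # theta_k converges (here to ~0) and sum_k u_k remains finite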

1.2 A.2 Connections among the iterates produced by Algorithms 1 and 2

To obtain a compact representation of the primal updates, let us introduce the iid Bernoulli process \(\epsilon _{i}^{k}:\Omega \to \{0,1\}\) by

$$ \epsilon _{i}^{k}(\omega )=\textstyle\begin{cases} 1 & \text{if }i\in \iota _{k}(\omega ), \\ 0 & \text{else} \end{cases} $$

and the random matrix \(\boldsymbol{E}_{k}=\operatorname{\mathtt{blkdiag}}[\epsilon _{1}^{k}\boldsymbol{I}_{m_{1}},\ldots , \epsilon ^{k}_{d}\boldsymbol{I}_{m_{d}}]\in \{0,1\}^{m\times m}\). This matrix corresponds to the identity for those blocks that are activated in round k, and zero for those not activated. Therefore, we call it the activation matrix of the algorithm. By definition of the sampling process, we have \(\mathbb{E}(\epsilon ^{k}_{i})=\mathbb{P}(i\in \iota _{k})=\pi _{i}\) for all \(i\in [d]\) and \(k\geq 0\), so that \(\mathbb{E}[\boldsymbol{E}_{k}]=\boldsymbol{P}^{-1}\).

Let \(\epsilon ^{k}:=(\epsilon _{1}^{k},\ldots ,\epsilon _{d}^{k})\in \{0,1 \}^{d}\), and let \(\mathcal{F}_{k}:=\sigma (\boldsymbol{x}^{0},\epsilon ^{0},\ldots ,\epsilon ^{k-1})\) denote the history of the process up to step k. We denote by \(\mathbb{E}_{k}[\cdot ]:=\mathbb{E}[\cdot \vert \mathcal{F}_{k}]\) the conditional expectation at stage k. Writing \(\hat{\boldsymbol{x}}^{k+1}=\operatorname{\mathsf{T}}^{k}(\boldsymbol{x}^{k})\), we obtain the compact representation of the primal update as

$$ \boldsymbol{x}^{k+1}=\boldsymbol{x}^{k}+ \boldsymbol{E}_{k}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}\bigr). $$
(A.2)

Furthermore, \(\mathbb{E}_{k}(\boldsymbol{x}^{k+1})=\boldsymbol{P}^{-1}\hat{\boldsymbol{x}}^{k+1}+(\boldsymbol{I}- \boldsymbol{P}^{-1})\boldsymbol{x}^{k}\), so that

$$ \mathbb{E}_{k}\bigl[\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)\bigr]= \hat{\boldsymbol{x}}^{k+1}. $$
(A.3)
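The identities (A.2)–(A.3) are easy to verify numerically. In the sketch below the block sizes, the probabilities \(\pi _{i}\) and the full update \(\hat{\boldsymbol{x}}^{k+1}\) are arbitrary placeholders, and the blocks are activated independently with marginals \(\pi _{i}\) (one admissible sampling); a Monte Carlo average then recovers \(\mathbb{E}[\boldsymbol{E}_{k}]=\boldsymbol{P}^{-1}\) and identity (A.3).

    import numpy as np

    rng = np.random.default_rng(0)
    m_blocks = [2, 3, 1]                       # illustrative block sizes m_1, ..., m_d
    pi = np.array([0.5, 0.3, 0.8])             # illustrative activation probabilities pi_i
    m = sum(m_blocks)
    P = np.diag(np.concatenate([np.full(mi, 1.0 / p) for mi, p in zip(m_blocks, pi)]))

    x_k = rng.standard_normal(m)               # current iterate x^k
    x_hat = rng.standard_normal(m)             # stand-in for the full update T^k(x^k)

    def sample_E():
        # one admissible sampling: independent block activation with marginals pi_i
        eps = rng.random(len(pi)) < pi
        return np.diag(np.concatenate([np.full(mi, float(e)) for mi, e in zip(m_blocks, eps)]))

    N = 100_000
    acc_E = np.zeros((m, m))
    acc_deb = np.zeros(m)
    for _ in range(N):
        E = sample_E()
        x_next = x_k + E @ (x_hat - x_k)       # realised block update (A.2)
        acc_E += E
        acc_deb += x_k + P @ (x_next - x_k)    # debiased iterate appearing in (A.3)

    print(np.abs(acc_E / N - np.linalg.inv(P)).max())   # -> approx 0: E[E_k] = P^{-1}
    print(np.abs(acc_deb / N - x_hat).max())            # -> approx 0: identity (A.3)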

Lemma 11

For all \(k\geq 0\), we have

$$ \begin{aligned} \frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{w}^{k} \bigr) \bigr\Vert ^{2}&= \frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \bigr] \\ &\quad {}-\frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}(\boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \bigr]. \end{aligned} $$
(A.4)

Proof

Note that for any random variable \(X:\Omega \to \mathbb{R}^{q}\), it holds true that \(\Vert \mathbb{E}[X] \Vert ^{2}=\mathbb{E} [ \Vert X \Vert ^{2} ]- \mathbb{E} [ \Vert X-\mathbb{E}X \Vert ^{2} ]\). Set \(X:=\boldsymbol{A}(\boldsymbol{x}^{k}+\boldsymbol{P}(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k})-\boldsymbol{w}^{k})\). Then, by Eq. (A.3), we see that \(\mathbb{E}_{k}[X]=\boldsymbol{A}(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{w}^{k})\). Furthermore,

$$ X-\mathbb{E}_{k}[X]=\boldsymbol{A} \bigl(( \boldsymbol{I}-\boldsymbol{P})\boldsymbol{x}^{k}+\boldsymbol{P} \boldsymbol{x}^{k+1}-\hat{\boldsymbol{x}}^{k+1} \bigr)= \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)-\hat{ \boldsymbol{x}}^{k+1}\bigr). $$
(A.5)

Using (A.2), we obtain

$$ \boldsymbol{P}\boldsymbol{x}^{k+1}-\hat{\boldsymbol{x}}^{k+1}=( \boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \hat{ \boldsymbol{x}}^{k+1}+\boldsymbol{P}(\boldsymbol{I}- \boldsymbol{E}_{k})\boldsymbol{x}^{k}, $$

which implies that

$$ \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)-\hat{\boldsymbol{x}}^{k+1}\bigr)=X- \mathbb{E}_{k}[X]= \boldsymbol{A}(\boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr). $$
(A.6)

Collecting all the terms above, one obtains

$$\begin{aligned} \frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \bigr]&= \frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2} \\ &\quad {}+\frac{1}{2}\mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}(\boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \bigr]. \end{aligned}$$

 □

The next Lemma provides a compact expression for the correction term in Eq. (A.4).

Lemma 12

Let \(\boldsymbol{\Xi}=\mathbb{E}[\boldsymbol{E}_{k}\boldsymbol{P}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{P}\boldsymbol{E}_{k}]\), which is a symmetric time-independent matrix, thanks to iid sampling. We have

$$ \mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}( \boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{ \boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \bigr]= \bigl\Vert \hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr\Vert ^{2}_{\Xi -\boldsymbol{A}^{\top} \boldsymbol{A}}. $$
(A.7)

Proof

By definition of the sampling procedure, we have \(\mathbb{E}_{k}[\boldsymbol{P}\boldsymbol{E}_{k}]=\boldsymbol{I}_{m}\). Some simple algebra then shows that

$$\begin{aligned} \mathbb{E}_{k} \bigl[ \bigl\Vert \boldsymbol{A}(\boldsymbol{P} \boldsymbol{E}_{k}-\boldsymbol{I}) \bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}\bigr) \bigr\Vert ^{2} \bigr]&= \mathbb{E}_{k} \bigl[\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}\bigr)^{\top} \boldsymbol{M}_{k} \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr], \end{aligned}$$

where \(\boldsymbol{M}_{k}:=\boldsymbol{E}_{k}\boldsymbol{P}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{P}\boldsymbol{E}_{k}- \boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{P}\boldsymbol{E}_{k}-\boldsymbol{E}_{k}\boldsymbol{P}\boldsymbol{A}^{\top} \boldsymbol{A}+\boldsymbol{A}^{\top}\boldsymbol{A}\) is a symmetric matrix, satisfying \(\mathbb{E}_{k}\boldsymbol{M}_{k}=\boldsymbol{\Xi}-\boldsymbol{A}^{\top}\boldsymbol{A}\). Hence,

$$\begin{aligned} \mathbb{E}_{k} \bigl[\bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k}\bigr)^{\top}\boldsymbol{M}_{k} \bigl( \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigr]&=\mathbb{E}_{k} \bigl[\operatorname{Tr}\bigl(\boldsymbol{M}_{k} \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigl(\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{x}^{k} \bigr)^{\top} \bigr) \bigr] \\ &=\operatorname{Tr}\bigl(\mathbb{E}_{k}[\boldsymbol{M}_{k}] \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigl( \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr)^{\top} \bigr) \\ &=\operatorname{Tr}\bigl(\bigl(\boldsymbol{\Xi}-\boldsymbol{A}^{\top}\boldsymbol{A}\bigr) \bigl(\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k}\bigr) \bigl( \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr)^{\top} \bigr) \\ &= \bigl\Vert \hat{\boldsymbol{x}}^{k+1}-\boldsymbol{x}^{k} \bigr\Vert ^{2}_{\boldsymbol{\Xi}-\boldsymbol{A}^{\top}\boldsymbol{A}}. \end{aligned}$$

 □

Remark 4

The matrix Ξ can be given a simple expression in terms of the probability matrix Π. A direct computation shows that Ξ is a block-symmetric matrix with \(d^{2}\) blocks \(\boldsymbol{\Xi}[i,j],1\leq i,j\leq d\), each block having the form

$$ \boldsymbol{\Xi}[i,j]=\frac{(\boldsymbol{\Pi})_{ij}}{\pi _{i}\pi _{j}}\boldsymbol{A}^{\top}_{i} \boldsymbol{A}_{j}\quad \text{if }i\neq j,\quad \text{and}\quad \boldsymbol{\Xi}[i,i]= \frac{1}{\pi _{i}} \boldsymbol{A}^{\top}_{i}\boldsymbol{A}_{i} \quad \forall i\in [d]. $$

In special instances this matrix can be simplified even further; a numerical sanity check of the resulting expressions is sketched after the following cases:

  1.

    If \(\mathbb{P}(\mathopen{\lvert }\mathcal{I}\mathclose{\rvert }=1)=1\) the random sampling is a single-coordinate activation mechanism. In this case, the associated matrix Π is diagonal with entries \(\pi _{i}\). It follows that \(\boldsymbol{\Xi}[i,i]=\frac{1}{\pi _{i}}\boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{i}\) for all \(i\in [d]\) and \(\boldsymbol{\Xi}[i,j]=0\) for \(i\neq j\).

  2.

    We say that the matrix A has orthogonal design if \(\boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{j}\) is the zero matrix for \(i\neq j\). In this case, too, the matrix Ξ is block diagonal, with the same diagonal blocks \(\boldsymbol{\Xi}[i,i]=\frac{1}{\pi _{i}}\boldsymbol{A}_{i}^{\top}\boldsymbol{A}_{i}\) as above.
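The following sketch provides the numerical sanity check announced above: under the single-coordinate activation of case 1 (so that \(\sum_{i}\pi _{i}=1\)), it compares the closed-form blocks of Ξ with a Monte Carlo estimate of \(\mathbb{E}[\boldsymbol{E}_{k}\boldsymbol{P}\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{P}\boldsymbol{E}_{k}]\) and checks identity (A.7) for a random direction; all sizes and probabilities are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    m_blocks = [2, 2, 3]                              # illustrative block sizes
    pi = np.array([0.2, 0.3, 0.5])                    # single-coordinate sampling: sum(pi) = 1
    q, m = 4, sum(m_blocks)
    A = rng.standard_normal((q, m))
    starts = np.cumsum([0] + m_blocks)
    P = np.diag(np.concatenate([np.full(mi, 1.0 / p) for mi, p in zip(m_blocks, pi)]))

    # closed-form Xi for single-coordinate activation: blkdiag((1/pi_i) A_i^T A_i)
    Xi = np.zeros((m, m))
    for i, (s, e) in enumerate(zip(starts[:-1], starts[1:])):
        Xi[s:e, s:e] = A[:, s:e].T @ A[:, s:e] / pi[i]

    # Monte Carlo estimates of E[E_k P A^T A P E_k] and of the quadratic form in (A.7)
    v = rng.standard_normal(m)
    N = 100_000
    acc_Xi, acc_quad = np.zeros((m, m)), 0.0
    for _ in range(N):
        i = rng.choice(len(pi), p=pi)                 # activate exactly one block
        E = np.zeros((m, m))
        E[starts[i]:starts[i + 1], starts[i]:starts[i + 1]] = np.eye(m_blocks[i])
        acc_Xi += E @ P @ A.T @ A @ P @ E
        acc_quad += np.linalg.norm(A @ (P @ E - np.eye(m)) @ v) ** 2

    print(np.abs(acc_Xi / N - Xi).max())              # -> approx 0 (up to Monte Carlo error)
    exact = v @ (Xi - A.T @ A) @ v
    print(abs(acc_quad / N - exact) / abs(exact))     # -> small relative error: identity (A.7)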

1.3 A.3 Properties of the penalty function h

We collect some important identities involving the penalty function h in this subsection. From the definition of the iterate \(\boldsymbol{z}^{k}\), we see that

$$ \begin{aligned} \nabla h\bigl(\boldsymbol{z}^{k} \bigr)&=\boldsymbol{A}^{\top}\bigl(\boldsymbol{A}\boldsymbol{z}^{k}- \boldsymbol{b}\bigr)=\frac{1}{S_{k}} \bigl(S_{k-1} \boldsymbol{A}^{\top}\bigl(\boldsymbol{A}\boldsymbol{w}^{k}- \boldsymbol{b}\bigr)+\sigma _{k}\boldsymbol{A}^{\top}\bigl( \boldsymbol{A}\boldsymbol{x}^{k}-\boldsymbol{b}\bigr) \bigr) \\ &=(1-\theta _{k})\nabla h\bigl(\boldsymbol{w}^{k} \bigr)+\theta _{k}\nabla h\bigl(\boldsymbol{x}^{k} \bigr). \end{aligned} $$
(A.8)

Furthermore, we note that the definition of the iterate \(\boldsymbol{w}^{k+1}\) implies that

$$ \begin{aligned} h\bigl(\boldsymbol{w}^{k+1} \bigr)&=h \bigl((1-\theta _{k})\boldsymbol{w}^{k}+ \theta _{k}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)\bigr) \bigr) \\ &\overset{\text{(6)}}{=}(1-\theta _{k})h\bigl( \boldsymbol{w}^{k}\bigr)+ \theta _{k}h\bigl( \boldsymbol{x}^{k}+\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr)\bigr) \\ &\quad {}-\frac{\theta _{k}(1-\theta _{k})}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)-\boldsymbol{w}^{k}\bigr) \bigr\Vert ^{2}. \end{aligned} $$
(A.9)
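Both identities are straightforward to check numerically. The sketch below verifies (A.8) and the convexity identity underlying (A.9) (Lemma 9 applied to h) for random data, assuming, as encoded in the first equality of (A.8), that \(\boldsymbol{z}^{k}=(1-\theta _{k})\boldsymbol{w}^{k}+\theta _{k}\boldsymbol{x}^{k}\) with \(\theta _{k}=\sigma _{k}/S_{k}\) and \(S_{k}=S_{k-1}+\sigma _{k}\); all numerical values are placeholders.

    import numpy as np

    rng = np.random.default_rng(3)
    q, m = 5, 7
    A, b = rng.standard_normal((q, m)), rng.standard_normal(q)

    def h(z):
        return 0.5 * np.linalg.norm(A @ z - b) ** 2

    def grad_h(z):
        return A.T @ (A @ z - b)

    S_prev, sigma = 3.0, 2.0                      # placeholder values of S_{k-1} and sigma_k
    theta = sigma / (S_prev + sigma)              # theta_k = sigma_k / S_k
    w, x = rng.standard_normal(m), rng.standard_normal(m)
    z = (1 - theta) * w + theta * x               # z^k as a convex combination

    print(np.allclose(grad_h(z), (1 - theta) * grad_h(w) + theta * grad_h(x)))   # (A.8)
    rhs = ((1 - theta) * h(w) + theta * h(x)
           - 0.5 * theta * (1 - theta) * np.linalg.norm(A @ (x - w)) ** 2)
    print(np.allclose(h(z), rhs))                                                # identity behind (A.9)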

Lemma 13

We have

$$ \mathbb{E}_{k}\bigl[h\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)\bigr)\bigr]=h\bigl( \hat{\boldsymbol{x}}^{k+1}\bigr)+ \frac{1}{2}\mathbb{E}_{k}\bigl[ \bigl\Vert \boldsymbol{A} \bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr)-\hat{\boldsymbol{x}}^{k+1}\bigr) \bigr\Vert ^{2}\bigr]. $$
(A.10)

Proof

The Pythagorean identity gives

$$\begin{aligned} h\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl(\boldsymbol{x}^{k+1}- \boldsymbol{x}^{k}\bigr)\bigr)&=\frac{1}{2} \bigl\Vert \bigl(\boldsymbol{A}\hat{\boldsymbol{x}}^{k+1}-\boldsymbol{b}\bigr)+ \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)-\hat{ \boldsymbol{x}}^{k+1}\bigr) \bigr\Vert ^{2} \\ &=h\bigl(\hat{\boldsymbol{x}}^{k+1}\bigr)+\frac{1}{2} \bigl\Vert \boldsymbol{A}\bigl(\boldsymbol{x}^{k}+\boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k}\bigr)-\hat{ \boldsymbol{x}}^{k+1}\bigr) \bigr\Vert ^{2} \\ &\quad {}+\bigl\langle {\boldsymbol{A}\hat{\boldsymbol{x}}^{k+1}- \boldsymbol{b}},{\boldsymbol{A}\bigl(\boldsymbol{x}^{k}+ \boldsymbol{P}\bigl( \boldsymbol{x}^{k+1}-\boldsymbol{x}^{k} \bigr)-\hat{\boldsymbol{x}}^{k+1}\bigr)}\bigr\rangle . \end{aligned}$$

Taking conditional expectations on both sides and noting that, by Eq. (A.3), \(\mathbb{E}_{k}[\boldsymbol{A}(\boldsymbol{x}^{k}+\boldsymbol{P}(\boldsymbol{x}^{k+1}-\boldsymbol{x}^{k})-\hat{\boldsymbol{x}}^{k+1})]=0\), so that the inner-product term vanishes in conditional expectation, establishes the claimed identity. □

Appendix B: Step-size policy for the accelerated algorithm

Assume that the parameter-coupling equation (34) holds, where

$$ \alpha =\lambda _{\max}\bigl(\boldsymbol{M}^{-1/2}\Xi \boldsymbol{M}^{-1/2}\bigr)^{-1}\quad \text{and} \quad \beta = \alpha \lambda _{\max}\bigl(\boldsymbol{M}^{-1/2}(\Lambda + \Upsilon )\boldsymbol{M}^{-1/2}\bigr). $$

Set \(\kappa =\frac{\beta}{\alpha}\). With the choice \(\boldsymbol{B}=\boldsymbol{P}^{-2}\Upsilon \), the matrix inequality (24) reads as

$$ \bigl[(\alpha -\beta \tau _{k}) (\boldsymbol{I}+\tau _{k}\boldsymbol{P})-\beta \tau ^{2}_{k}( \boldsymbol{I}-\boldsymbol{P})\bigr]\tau _{k+1}^{2}+\tau _{k}^{2}\bigl[\beta \boldsymbol{I}+\alpha ( \boldsymbol{I}-\boldsymbol{P})\bigr]\tau _{k+1}-\alpha \tau _{k}^{2}\boldsymbol{I}\succeq 0. $$

Exploiting the block structure of the involved matrices, we can equivalently write this condition as a system of quadratic inequalities

$$ \tau ^{2}_{k+1} \bigl[(\alpha -\beta \tau _{k}) (\pi _{i}+\tau _{k})+ \beta (1- \pi _{i})\tau ^{2}_{k} \bigr]-\tau _{k+1}\tau ^{2}_{k}\bigl[ \alpha (1-\pi _{i})-\beta \pi _{i}\bigr]-\tau ^{2}_{k} \alpha \pi _{i}\geq 0, $$
(B.1)

holding for all \(i\in [d]\) simultaneously. This defines a quadratic inequality in \(\tau _{k+1}\) of the form \(c_{i,1}(\tau _{k})\tau ^{2}_{k+1}+c_{i,2}(\tau _{k})\tau _{k+1}-c_{i,3}( \tau _{k})\geq 0\), with coefficients

$$\begin{aligned} &c_{i,1}(y):=(\alpha -\beta y) (\pi _{i}+y)+\beta (1- \pi _{i})y^{2}= \alpha \pi _{i} \bigl[1- \kappa y^{2}-y(\kappa -1/\pi _{i}) \bigr], \\ &c_{i,2}(y):=y^{2}\bigl[\beta \pi _{i}- \alpha (1-\pi _{i})\bigr]=\alpha \pi _{i}y^{2}( \kappa -1/\pi _{i}+1), \\ &c_{i,3}(y):=y^{2}\alpha \pi _{i}. \end{aligned}$$

The structure of the coefficients shows that we can eliminate the parameters α and \(\pi _{i}\), and just continue with the coefficients

$$\begin{aligned} a_{i,1}(y):=1-\kappa y^{2}-y(\kappa -1/\pi _{i}), \qquad a_{i,2}(y):=y^{2}( \kappa -1/\pi _{i}+1), \qquad a_{i,3}(y):=y^{2}. \end{aligned}$$

Define the function \(F:\mathbb{R}^{2}\to \mathbb{R}\) by

$$ F(x,y):=\max_{i\in [d]}\bigl\{ a_{i,1}(y)x^{2}+a_{i,2}(y)x-a_{i,3}(y) \bigr\} \equiv \max_{i\in [d]}F_{i}(x,y). $$

This yields an implicit relation for the step-size policy (40), namely \(F(\tau _{k+1},\tau _{k})=0\), from which we take the positive solution, if it exists. To analyse this implicit relation, we proceed as follows.

Lemma 14

The equation \(F(x,0)=0\) has the unique solution \(x=0\). Furthermore, the equation \(F(x,\alpha /\beta )=0\) has the unique positive solution \(x=\frac{\alpha}{\beta}=\frac{1}{\kappa}\).

Proof

This follows by direct computation. □

By definition we have \(\kappa =\max_{i\in [d]}\frac{L_{i}+\mu _{i}}{\mu _{i}\pi _{i}}\geq \frac{1}{\pi _{i}}>1\) for all \(i\in [d]\) as well. Consequently,

$$ \delta _{i}:=\kappa -\frac{1}{\pi _{i}}\geq 0\quad \forall i\in [d] $$

and we obtain the concise representation \(a_{i,1}(y)=1-\kappa y^{2}-y\delta _{i}\), as well as \(a_{i,2}(y)=y^{2}(\delta _{i}+1)\). It is clear that \(a_{i,1}(y)>0\) for \(0< y<\frac{-\delta _{i}+\sqrt{\delta _{i}^{2}+4\kappa}}{2\kappa}\). Since \(\pi _{i}\in (0,1)\), it can be readily verified that restricting \(y\in (0,1/\kappa )\) is a sufficient condition for \(a_{i,1}(y)>0\). Hence, if \(\tau _{k}\in (0,1/\kappa )\), then the characterisation of the step size reduces to finding the root of the polynomial equation

$$ \tau ^{2}_{k+1}+\frac{a_{2}(\tau _{k})}{a_{1}(\tau _{k})}\tau _{k+1}- \frac{a_{3}(\tau _{k})}{a_{1}(\tau _{k})}=0. $$

Since \(\frac{a_{2}(y)}{a_{1}(y)}=y^{2} \frac{\delta +1}{1-\kappa y^{2}-y\delta}\), and \(\frac{a_{3}(y)}{a_{1}(y)}=y^{2}\frac{1}{1-\kappa y^{2}-\delta y}\), we set

$$ b_{k}:=\frac{\delta +1}{1-\tau _{k}^{2}\kappa -\delta \tau _{k}}\quad \text{provided that }\tau _{k}\in (0,1/\kappa ), $$

so that our polynomial becomes

$$ \tau ^{2}_{k+1}+\tau ^{2}_{k}b_{k} \tau _{k+1}-\tau ^{2}_{k} \frac{b_{k}}{\delta +1}=0. $$
(B.2)

This has the unique positive solution

$$ \tau _{k+1}= \frac{\tau _{k} [-b_{k}\tau _{k}+\sqrt{b_{k}^{2}\tau ^{2}_{k}+4b_{k}/(\delta +1)} ]}{2}= \frac{2}{\delta +1} \frac{1}{1+\sqrt{1+\frac{4}{b_{k}\tau ^{2}_{k}(1+\delta )}}}. $$
(B.3)

Setting \(d_{k}:=1-\delta \tau _{k}-\kappa \tau ^{2}_{k}\) and \(s_{k}:=(1+\delta )\tau _{k}\), this recursion can be equivalently stated as

$$ s_{k+1}=\frac{2s_{k}}{s_{k}+\sqrt{s^{2}_{k}+4d_{k}}}. $$

Hence, we note that

$$\begin{aligned} s_{k+1}= \frac{2}{1+\sqrt{1+\frac{4d_{k}}{\tau ^{2}_{k}(1+\delta )^{2}}}}= \frac{2}{1+\sqrt{1+\frac{4d_{k}}{s^{2}_{k}}}}= \frac{2s_{k}}{s_{k}+\sqrt{s^{2}_{k}+4d_{k}}}. \end{aligned}$$

Hence, \(\frac{s_{k+1}}{s_{k}}=\frac{2}{s_{k}+\sqrt{s^{2}_{k}+4d_{k}}}\leq 1\), since \(\tau _{k}\in (0,1/\kappa )\) implies \(d_{k}\geq 1-s_{k}\) and therefore \(s_{k}+\sqrt{s^{2}_{k}+4d_{k}}\geq s_{k}+\lvert s_{k}-2\rvert \geq 2\). It follows by induction that \(s_{k}\leq s_{0}=(1+\delta )\tau _{0}\) and that \(\tau _{k}\in (0,1/\kappa )\) for all \(k\geq 0\). We deduce that the sequence \((\tau _{k})_{k}\) is monotonically decreasing and non-negative. Therefore, \(\lim_{k\to \infty}\tau _{k}=\tau _{\infty}\geq 0\) exists. We argue that \(\tau _{\infty}=0\). Suppose that \(\tau _{\infty}>0\); then \(s_{\infty}=\lim_{k\to \infty}s_{k}=(1+\delta )\tau _{\infty}>0\) and \(\lim_{k\to \infty}(s_{k+1}-s_{k})=0\). However, then

$$ s_{\infty}= \frac{2s_{\infty}}{s_{\infty}+\sqrt{s^{2}_{\infty}+4d_{\infty}}}< s_{ \infty}, $$

a contradiction. We conclude that \(\tau _{\infty}=0\). It follows that \(\lim_{k\to \infty}b_{k}=1+\delta \) and \(\lim_{k\to \infty}d_{k}=1\). Moreover, \(d_{k}\uparrow 1\). Hence,

$$ s_{k+1}=\frac{2s_{k}}{s_{k}+\sqrt{s^{2}_{k}+4d_{k}}}\geq \frac{2s_{k}}{s_{k}+\sqrt{s^{2}_{k}+4}}= \frac{2}{1+\sqrt{1+4/s_{k}^{2}}}, $$

which is equivalent to \(\frac{1-s_{k+1}}{s^{2}_{k+1}}\leq \frac{1}{s^{2}_{k}}\). It follows from [24] that \(s_{k}\geq \frac{2s_{0}}{ks_{0}+2}\) for all \(k\geq 0\), which implies that

$$ \tau _{k}\geq \frac{2\tau _{0}}{(1+\delta )\tau _{0}k+2}, \quad \text{provided }\tau _{0}\in (0,1/\kappa ). $$

This implies via our parameter coupling \(\sigma _{k}=\alpha /\tau _{k}-\beta \) for \(k\geq 1\), that

$$\begin{aligned} S_{k}&=1+\sum_{i=1}^{k} \frac{\alpha}{\tau _{i}}-k\beta \leq 1+ \frac{\alpha (1+\delta )}{2}\sum _{i=1}^{k}i+k \biggl( \frac{\alpha}{\tau _{0}}-\beta \biggr) \\ &=1+\frac{\alpha (1+\delta )}{4}k(k+1)+k\sigma _{0}. \end{aligned}$$

Hence, \(S_{k}=O(k^{2})\).
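To make the recursion concrete, the following sketch iterates (B.3) for illustrative parameter values (α, κ, δ and \(\tau _{0}\) are placeholders, not quantities from the experiments), checks the lower bound on \(\tau _{k}\) along the way, and confirms the (at most) quadratic growth of \(S_{k}\).

    import numpy as np

    alpha, kappa, delta = 1.0, 10.0, 4.0     # illustrative: kappa = beta/alpha, delta = kappa - 1/pi
    beta = kappa * alpha
    tau = tau0 = 0.05                        # any tau_0 in (0, 1/kappa)
    S = 1.0
    K = 100_000
    for k in range(1, K + 1):
        d = 1.0 - delta * tau - kappa * tau ** 2                    # d_{k-1} > 0 since tau < 1/kappa
        b = (delta + 1.0) / d                                       # b_{k-1}
        tau = 2.0 / ((delta + 1.0)
                     * (1.0 + np.sqrt(1.0 + 4.0 / (b * tau ** 2 * (1.0 + delta)))))   # (B.3)
        S += alpha / tau - beta                                     # S_k = 1 + sum_i (alpha/tau_i - beta)
        assert tau >= 2.0 * tau0 / ((1.0 + delta) * tau0 * k + 2.0)  # lower bound on tau_k
    print(K * tau)     # k * tau_k stays bounded: tau_k = Theta(1/k)
    print(S / K ** 2)  # settles near a positive constant: S_k = O(k^2)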

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Staudigl, M., Jacquot, P. Random block-coordinate methods for inconsistent convex optimisation problems. Fixed Point Theory Algorithms Sci Eng 2023, 14 (2023). https://doi.org/10.1186/s13663-023-00751-0
