Routine: PZGBTRF()  File: SRC\pzgbtrf.f

 
 
# lines: 1110
  # code: 1110
  # comment: 0
  # blank:0
# Variables:30
# Callers:1
# Callings:0
# Words:37
# Keywords:18
 

 

..
     .. Local Scalars ..
     ..
     .. Local Arrays ..
     ..
     .. External Subroutines ..
     ..
     .. External Functions ..
     ..
     .. Intrinsic Functions ..
     ..
     .. Executable Statements ..
     Test the input parameters
     Convert descriptor into standard form for easy access to
        parameters, check that grid is of right shape.
     Get values out of descriptor for use in code.
     Get grid parameters
     Argument checking that is specific to Divide & Conquer routine
     Check auxiliary storage size
        put minimum value of laf into AF( 1 )
     Check worksize
        put minimum value of work into work( 1 )
     Pack params and positions into arrays for global consistency check
     Want to find errors with MIN( ), so if no error, set it to a big
     number. If there already is an error, multiply by the the
     descriptor multiplier.
     Check consistency across processors
     Prepare output: set info = 0 if no error, and divide by DESCMULT
     if error is not in a descriptor entry.
     Quick return if possible
     Adjust addressing into matrix space to properly get into
        the beginning part of the relevant data
     Form a new BLACS grid (the "standard form" grid) with only procs
        holding part of the matrix, of size 1xNP where NP is adjusted,
        starting at csrc=0, with JA modified to reflect dropped procs.
     First processor to hold part of the matrix:
     Calculate new JA one while dropping off unused processors.
     Save and compute new value of NP
     Call utility routine that forms "standard-form" grid
     Use new context from standard grid as context.
     Get information about new grid.
     Drop out processors that do not have part of the matrix.
     ********************************
     Values reused throughout routine
     User-input value of partition size
     Number of columns in each processor
     Offset in columns to beginning of main partition in each proc
     Offset in elements
     Size of main (or odd) partition in each processor
     Zero out space for fillin
     Begin main code
*******************************************************************
     PHASE 1: Local computation phase.
*******************************************************************
     Transfer triangle B_i of local matrix to next processor
     for fillin. Overlap the send with the factorization of A_i.
     The last processor does not need to send anything.
     BIPTR = location of triangle B_i in memory
     Factor main partition P_i A_i = L_i U_i on each processor
     LBWL, LBWU: lower and upper bandwidth of local solver
     Note that for MYCOL > 0 one has lower triangular blocks!
     LM is the number of rows which is usually NB except for
     MYCOL = 0 where it is BWU less and MYCOL=NPCOL-1 where it
     is NR+BWU where NR is the number of columns on the last processor
     Finally APTR is the pointer to the first element of A. As LAPACK
     has a slightly different matrix format than Scalapack the pointer
     has to be adjusted on processor MYCOL=0.
     Update the last BW columns of A_i (code modified from ZGBTRS)
     Only the eliminations of unknowns > LN-BW have an effect on
     the last BW columns. Loop over them...
        Element (L,LN+1) is swapped with element (J,LN+1) etc
        Furthermore, the elements in the same row are LDB=LLDA-1 apart
        The complicated formulas are to cope with the banded
          data format:
              LPTR is the pointer to the beginning of the
                 coefficients of L
      Compute spike fill-in, L_i F_i = P_i B_{i-1}
      Receive triangle B_{i-1} from previous processor
      Permutation and forward elimination (triang. solve)
*******************************************************************
     PHASE 2: Formation and factorization of Reduced System.
*******************************************************************
     Define the initial dimensions of the diagonal blocks
     The offdiagonal blocks (for MYCOL > 0) are of size BM by BW
     Pointer to first element of block bidiagonal matrix in AF
     Leading dimension of block bidiagonal system
     Copy from A and AF into block bidiagonal matrix (tail of AF)
     DBPTR = Pointer to diagonal blocks in A
     Zero out any junk entries that were copied
        ODPTR = Pointer to offdiagonal blocks in A
        In this case the loop over the levels will not be
        performed.
     Loop over levels ... only occurs if npcol > 1
     The two integers NPACT (nu. of active processors) and NPSTR
     (stride between active processors) are used to control the
     loop.
       Begin loop over levels
         Test if processor is active
   Send/Receive blocks
            This node will potentially do more work later
                  Last processor skips to next level
               BM1 = M for 1st block on proc pair, BM2 2nd block
                     Copy diagonal block to align whole system
               This node stops work after this stage -- an extra copy
               is required to make the odd and even frontal matrices
               look identical
                  Copy diagonal block to align whole system
            LU factorization with partial pivoting
   Backsolve left side
               Use partial factors to update remainder
   Backsolve right side
               Use partial factors to update remainder
     Test if processor is active in next round
                  Reset BM
                  Local copying in the block bidiagonal area
                  Zero out space that held original copy
               Factor the final 2 by 2 block matrix
        Last processor in an odd-sized NPACT skips to here
     End loop over levels
     If error was found in Phase 1, processors jump here.
     Free BLACS space used to hold standard-form grid.
     If this processor did not hold part of the grid it
        jumps here.
     Restore saved input parameters
     Output worksize

 
Display dynamic version Find AutoScroll Reload FontSize: - + Hide Comments Hide Blanks Frame FullScreen MailPrint

 
001        SUBROUTINE PZGBTRF( N , BWL , BWU , A , JA , DESCA , IPIV , AF , LAF ,
002       $WORK , LWORK , INFO )
003  
004  *     -- ScaLAPACK routine(version 1.7) --
005  *     University of Tennessee , Knoxville , Oak Ridge National Laboratory ,
006  *     and University of California , Berkeley.
007  *     November 15 , 1997
008  
009  *     .. Scalar Arguments ..
010        INTEGER BWL , BWU , INFO , JA , LAF , LWORK , N
011  *     ..
012  *     .. Array Arguments ..
013        INTEGER DESCA( * ) , IPIV( * )
014        COMPLEX*16 A( * ) , AF( * ) , WORK( * )
015  *     ..
016  
017  *     Purpose
018  *     === ====
019  
020  *     PZGBTRF computes a LU factorization
021  *     of an N - by - N complex banded
022  *     distributed matrix
023  *     with bandwidth BWL , BWU : A(1 : N , JA : JA + N - 1).
024  *     Reordering is used to increase parallelism in the factorization.
025  *     This reordering results in factors that are DIFFERENT from those
026  *     produced by equivalent sequential codes. These factors cannot
027  *     be used directly by users ; however , they can be used in
028  *     subsequent calls to PZGBTRS to solve linear systems.
029  
030  *     The factorization has the form
031  
032  *     P A(1 : N , JA : JA + N - 1) Q = L U
033  
034  *     where U is a banded upper triangular matrix and L is banded
035  *     lower triangular , and P and Q are permutation matrices.
036  *     The matrix Q represents reordering of columns
037  *     for parallelism's sake , while P represents
038  *     reordering of rows for numerical stability using
039  *     classic partial pivoting.
040  
041  *     === ==================================================================
042  
043  *     Arguments
044  *     === ======
045  
046  *     N(global input) INTEGER
047  *     The number of rows and columns to be operated on , i.e. the
048  *     order of the distributed submatrix A(1 : N , JA : JA + N - 1). N >= 0.
049  
050  *     BWL(global input) INTEGER
051  *     Number of subdiagonals. 0 <= BWL <= N - 1
052  
053  *     BWU(global input) INTEGER
054  *     Number of superdiagonals. 0 <= BWU <= N - 1
055  
056  *     A(local input / local output) COMPLEX*16 pointer into
057  *     local memory to an array with first dimension
058  *     LLD_A >=(2*bwl + 2*bwu + 1)(stored in DESCA).
059  *     On entry , this array contains the local pieces of the
060  *     N - by - N unsymmetric banded distributed matrix
061  *     A(1 : N , JA : JA + N - 1) to be factored.
062  *     This local portion is stored in the packed banded format
063  *     used in LAPACK. Please see the Notes below and the
064  *     ScaLAPACK manual for more detail on the format of
065  *     distributed matrices.
066  *     On exit , this array contains information containing details
067  *     of the factorization.
068  *     Note that permutations are performed on the matrix , so that
069  *     the factors returned are different from those returned
070  *     by LAPACK.
071  
072  *     JA(global input) INTEGER
073  *     The index in the global array A that points to the start of
074  *     the matrix to be operated on(which may be either all of A
075  *     or a submatrix of A).
076  
077  *     DESCA(global and local input) INTEGER array of dimension DLEN.
078  *     if 1D type(DTYPE_A = 501) , DLEN >= 7 ;
079  *     if 2D type(DTYPE_A = 1) , DLEN >= 9 .
080  *     The array descriptor for the distributed matrix A.
081  *     Contains information of mapping of A to memory. Please
082  *     see NOTES below for full description and options.
083  
084  *     IPIV(local output) INTEGER array , dimension >= DESCA( NB ).
085  *     Pivot indices for local factorizations.
086  *     Users *should not* alter the contents between
087  *     factorization and solve.
088  
089  *     AF(local output) COMPLEX*16 array , dimension LAF.
090  *     Auxiliary Fillin Space.
091  *     Fillin is created during the factorization routine
092  *     PZGBTRF and this is stored in AF. If a linear system
093  *     is to be solved using PZGBTRS after the factorization
094  *     routine , AF *must not be altered* after the factorization.
095  
096  *     LAF(local input) INTEGER
097  *     Size of user - input Auxiliary Fillin space AF. Must be >=
098  *     (NB + bwu)*(bwl + bwu) + 6*(bwl + bwu)*(bwl + 2*bwu)
099  *     If LAF is not large enough , an error code will be returned
100  *     and the minimum acceptable size will be returned in AF( 1 )
101  
102  *     WORK(local workspace / local output)
103  *     COMPLEX*16 temporary workspace. This space may
104  *     be overwritten in between calls to routines. WORK must be
105  *     the size given in LWORK.
106  *     On exit , WORK( 1 ) contains the minimal LWORK.
107  
108  *     LWORK(local input or global input) INTEGER
109  *     Size of user - input workspace WORK.
110  *     If LWORK is too small , the minimal acceptable size will be
111  *     returned in WORK(1) and an error code is returned. LWORK >=
112  *     1
113  
114  *     INFO(global output) INTEGER
115  *     = 0 : successful exit
116  *     < 0 : If the i - th argument is an array and the j - entry had
117  *     an illegal value , then INFO = - (i*100 + j) , if the i - th
118  *     argument is a scalar and had an illegal value , then
119  *     INFO = - i.
120  *     > 0 : If INFO = K <= NPROCS , the submatrix stored on processor
121  *     INFO and factored locally was not
122  *     nonsingular , and
123  *     the factorization was not completed.
124  *     If INFO = K > NPROCS , the submatrix stored on processor
125  *     INFO - NPROCS representing interactions with other
126  *     processors was not
127  *     nonsingular ,
128  *     and the factorization was not completed.
129  
130  *     === ==================================================================
131  
132  *     Restrictions
133  *     === =========
134  
135  *     The following are restrictions on the input parameters. Some of these
136  *     are temporary and will be removed in future releases , while others
137  *     may reflect fundamental technical limitations.
138  
139  *     Non - cyclic restriction : VERY IMPORTANT !
140  *     P*NB >= mod(JA - 1 , NB) + N.
141  *     The mapping for matrices must be blocked , reflecting the nature
142  *     of the divide and conquer algorithm as a task - parallel algorithm.
143  *     This formula in words is : no processor may have more than one
144  *     chunk of the matrix.
145  
146  *     Blocksize cannot be too small :
147  *     If the matrix spans more than one processor , the following
148  *     restriction on NB , the size of each block on each processor ,
149  *     must hold :
150  *     NB >=(BWL + BWU) + 1
151  *     The bulk of parallel computation is done on the matrix of size
152  *     O(NB) on each processor. If this is too small , divide and conquer
153  *     is a poor choice of algorithm.
154  
155  *     Submatrix reference :
156  *     JA = IB
157  *     Alignment restriction that prevents unnecessary communication.
158  
159  *     === ==================================================================
160  
161  *     Notes
162  *     === ==
163  
164  *     If the factorization routine and the solve routine are to be called
165  *     separately(to solve various sets of righthand sides using the same
166  *     coefficient matrix) , the auxiliary space AF *must not be altered*
167  *     between calls to the factorization routine and the solve routine.
168  
169  *     The best algorithm for solving banded and tridiagonal linear systems
170  *     depends on a variety of parameters , especially the bandwidth.
171  *     Currently , only algorithms designed for the case N / P >> bw are
172  *     implemented. These go by many names , including Divide and Conquer ,
173  *     Partitioning , domain decomposition - type , etc.
174  
175  *     Algorithm description : Divide and Conquer
176  
177  *     The Divide and Conqer algorithm assumes the matrix is narrowly
178  *     banded compared with the number of equations. In this situation ,
179  *     it is best to distribute the input matrix A one - dimensionally ,
180  *     with columns atomic and rows divided amongst the processes.
181  *     The basic algorithm divides the banded matrix up into
182  *     P pieces with one stored on each processor ,
183  *     and then proceeds in 2 phases for the factorization or 3 for the
184  *     solution of a linear system.
185  *     1) Local Phase :
186  *     The individual pieces are factored independently and in
187  *     parallel. These factors are applied to the matrix creating
188  *     fillin , which is stored in a non - inspectable way in auxiliary
189  *     space AF. Mathematically , this is equivalent to reordering
190  *     the matrix A as P A P^T and then factoring the principal
191  *     leading submatrix of size equal to the sum of the sizes of
192  *     the matrices factored on each processor. The factors of
193  *     these submatrices overwrite the corresponding parts of A
194  *     in memory.
195  *     2) Reduced System Phase :
196  *     A small(max(bwl , bwu)* (P - 1)) system is formed representing
197  *     interaction of the larger blocks , and is stored(as are its
198  *     factors) in the space AF. A parallel Block Cyclic Reduction
199  *     algorithm is used. For a linear system , a parallel front solve
200  *     followed by an analagous backsolve , both using the structure
201  *     of the factored matrix , are performed.
202  *     3) Backsubsitution Phase :
203  *     For a linear system , a local backsubstitution is performed on
204  *     each processor in parallel.
205  
206  *     Descriptors
207  *     === ========
208  
209  *     Descriptors now have *types* and differ from ScaLAPACK 1.0.
210  
211  *     Note : banded codes can use either the old two dimensional
212  *     or new one - dimensional descriptors , though the processor grid in
213  *     both cases *must be one - dimensional*. We describe both types below.
214  
215  *     Each global data object is described by an associated description
216  *     vector. This vector stores the information required to establish
217  *     the mapping between an object element and its corresponding process
218  *     and memory location.
219  
220  *     Let A be a generic term for any 2D block cyclicly distributed array.
221  *     Such a global array has an associated description vector DESCA.
222  *     In the following comments , the character _ should be read as
223  *     "of the global array".
224  
225  *     NOTATION STORED IN EXPLANATION
226  *     --- ------------ -------------- --------------------------------------
227  *     DTYPE_A(global) DESCA( DTYPE_ )The descriptor type. In this case ,
228  *     DTYPE_A = 1.
229  *     CTXT_A(global) DESCA( CTXT_ ) The BLACS context handle , indicating
230  *     the BLACS process grid A is distribu -
231  *     ted over. The context itself is glo -
232  *     bal , but the handle(the integer
233  *     value) may vary.
234  *     M_A(global) DESCA( M_ ) The number of rows in the global
235  *     array A.
236  *     N_A(global) DESCA( N_ ) The number of columns in the global
237  *     array A.
238  *     MB_A(global) DESCA( MB_ ) The blocking factor used to distribute
239  *     the rows of the array.
240  *     NB_A(global) DESCA( NB_ ) The blocking factor used to distribute
241  *     the columns of the array.
242  *     RSRC_A(global) DESCA( RSRC_ ) The process row over which the first
243  *     row of the array A is distributed.
244  *     CSRC_A(global) DESCA( CSRC_ ) The process column over which the
245  *     first column of the array A is
246  *     distributed.
247  *     LLD_A(local) DESCA( LLD_ ) The leading dimension of the local
248  *     array. LLD_A >= MAX(1 , LOCr(M_A)).
249  
250  *     Let K be the number of rows or columns of a distributed matrix ,
251  *     and assume that its process grid has dimension p x q.
252  *     LOCr( K ) denotes the number of elements of K that a process
253  *     would receive if K were distributed over the p processes of its
254  *     process column.
255  *     Similarly , LOCc( K ) denotes the number of elements of K that a
256  *     process would receive if K were distributed over the q processes of
257  *     its process row.
258  *     The values of LOCr() and LOCc() may be determined via a call to the
259  *     ScaLAPACK tool function , NUMROC :
260  *     LOCr( M ) = NUMROC( M , MB_A , MYROW , RSRC_A , NPROW ) ,
261  *     LOCc( N ) = NUMROC( N , NB_A , MYCOL , CSRC_A , NPCOL ).
262  *     An upper bound for these quantities may be computed by :
263  *     LOCr( M ) <= ceil( ceil(M / MB_A) / NPROW )*MB_A
264  *     LOCc( N ) <= ceil( ceil(N / NB_A) / NPCOL )*NB_A
265  
266  *     One - dimensional descriptors :
267  
268  *     One - dimensional descriptors are a new addition to ScaLAPACK since
269  *     version 1.0. They simplify and shorten the descriptor for 1D
270  *     arrays.
271  
272  *     Since ScaLAPACK supports two - dimensional arrays as the fundamental
273  *     object , we allow 1D arrays to be distributed either over the
274  *     first dimension of the array(as if the grid were P - by - 1) or the
275  *     2nd dimension(as if the grid were 1 - by - P). This choice is
276  *     indicated by the descriptor type(501 or 502)
277  *     as described below.
278  
279  *     IMPORTANT NOTE : the actual BLACS grid represented by the
280  *     CTXT entry in the descriptor may be *either* P - by - 1 or 1 - by - P
281  *     irrespective of which one - dimensional descriptor type
282  *     (501 or 502) is input.
283  *     This routine will interpret the grid properly either way.
284  *     ScaLAPACK routines *do not support intercontext operations* so that
285  *     the grid passed to a single ScaLAPACK routine *must be the same*
286  *     for all array descriptors passed to that routine.
287  
288  *     NOTE : In all cases where 1D descriptors are used , 2D descriptors
289  *     may also be used , since a one - dimensional array is a special case
290  *     of a two - dimensional array with one dimension of size unity.
291  *     The two - dimensional array used in this case *must* be of the
292  *     proper orientation :
293  *     If the appropriate one - dimensional descriptor is DTYPEA = 501
294  *     (1 by P type) , then the two dimensional descriptor must
295  *     have a CTXT value that refers to a 1 by P BLACS grid ;
296  *     If the appropriate one - dimensional descriptor is DTYPEA = 502
297  *     (P by 1 type) , then the two dimensional descriptor must
298  *     have a CTXT value that refers to a P by 1 BLACS grid.
299  
300  *     Summary of allowed descriptors , types , and BLACS grids :
301  *     DTYPE 501 502 1 1
302  *     BLACS grid 1xP or Px1 1xP or Px1 1xP Px1
303  *     --- --------------------------------------------------
304  *     A               OK NO OK NO
305  *     B               NO OK NO OK
306  
307  *     Let A be a generic term for any 1D block cyclicly distributed array.
308  *     Such a global array has an associated description vector DESCA.
309  *     In the following comments , the character _ should be read as
310  *     "of the global array".
311  
312  *     NOTATION STORED IN EXPLANATION
313  *     --- ------------ ---------- ------------------------------------------
314  *     DTYPE_A(global) DESCA( 1 ) The descriptor type. For 1D grids ,
315  *     TYPE_A = 501 : 1 - by - P grid.
316  *     TYPE_A = 502 : P - by - 1 grid.
317  *     CTXT_A(global) DESCA( 2 ) The BLACS context handle , indicating
318  *     the BLACS process grid A is distribu -
319  *     ted over. The context itself is glo -
320  *     bal , but the handle(the integer
321  *     value) may vary.
322  *     N_A(global) DESCA( 3 ) The size of the array dimension being
323  *     distributed.
324  *     NB_A(global) DESCA( 4 ) The blocking factor used to distribute
325  *     the distributed dimension of the array.
326  *     SRC_A(global) DESCA( 5 ) The process row or column over which the
327  *     first row or column of the array
328  *     is distributed.
329  *     LLD_A(local) DESCA( 6 ) The leading dimension of the local array
330  *     storing the local blocks of the distri -
331  *     buted array A. Minimum value of LLD_A
332  *     depends on TYPE_A.
333  *     TYPE_A = 501 : LLD_A >=
334  *     size of undistributed dimension , 1.
335  *     TYPE_A = 502 : LLD_A >= NB_A , 1.
336  *     Reserved DESCA( 7 ) Reserved for future use.
337  
338  *     === ==================================================================
339  
340  *     Implemented for ScaLAPACK by :
341  *     Andrew J. Cleary , Livermore National Lab and University of Tenn. ,
342  *     and Markus Hegland , Australian Natonal University. Feb. , 1997.
343  *     Based on code written by : Peter Arbenz , ETH Zurich , 1996.
344  
345  *     === ==================================================================
346  
347  *     .. Parameters ..
348        DOUBLE PRECISION ONE , ZERO
349        PARAMETER( ONE = 1.0D + 0 )
350        PARAMETER( ZERO = 0.0D + 0 )
351        COMPLEX*16 CONE , CZERO
352        PARAMETER( CONE =( 1.0D + 0 , 0.0D + 0 ) )
353        PARAMETER( CZERO =( 0.0D + 0 , 0.0D + 0 ) )
354        INTEGER INT_ONE
355        PARAMETER( INT_ONE = 1 )
356        INTEGER DESCMULT , BIGNUM
357        PARAMETER( DESCMULT = 100 , BIGNUM = DESCMULT*DESCMULT )
358        INTEGER BLOCK_CYCLIC_2D , CSRC_ , CTXT_ , DLEN_ , DTYPE_ ,
359       $LLD_ , MB_ , M_ , NB_ , N_ , RSRC_
360        PARAMETER( BLOCK_CYCLIC_2D = 1 , DLEN_ = 9 , DTYPE_ = 1 ,
361       $CTXT_ = 2 , M_ = 3 , N_ = 4 , MB_ = 5 , NB_ = 6 ,
362       $RSRC_ = 7 , CSRC_ = 8 , LLD_ = 9 )
363        WORK( 1 ) = WORK_SIZE_MIN
364  
365  *     Make INFO consistent across processors
366  
367        CALL IGAMX2D( ICTXT , 'A' , ' ' , 1 , 1 , INFO , 1 , INFO , INFO ,
368       $- 1 , 0 , 0 )
369  
370        IF( MYCOL.EQ.0 ) THEN
371            CALL IGEBS2D( ICTXT , 'A' , ' ' , 1 , 1 , INFO , 1 )
372        ELSE
373            CALL IGEBR2D( ICTXT , 'A' , ' ' , 1 , 1 , INFO , 1 , 0 , 0 )
374        ENDIF
375  
376        RETURN
377  
378  *     End of PZGBTRF
379  
380        END
381