Pro2 Monitoring

ProTop can monitor Pro2 SQL replication when the Pro2Queues data collector is added to ptInitDC. Refer to the list of alertable metrics for Pro2 Queues to see which fields can be monitored and alerted on.

NOTICE: If you have previously used the "Pro2Activity" data collector for alerting you should change to use "Pro2Queues". The "Pro2Activity" data collector will be removed in a future release.

Configuration steps:

  1. Edit etc/pt3agent.[*].cfg
  2. Add “Pro2Queues” to ptInitDC
  3. Restart the agent by removing tmp/pt3agent.[*].flg. The dbmonitor will restart it shortly.
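
Step 3 can also be scripted. The following is a minimal sketch, not part of ProTop: the function name remove_agent_flags and its protop_dir argument are hypothetical, and it assumes the flag files sit in tmp/ under the ProTop install directory as described above. Note that it removes the flag for every agent, so every agent would be restarted.

```python
from pathlib import Path


def remove_agent_flags(protop_dir):
    """Delete tmp/pt3agent.*.flg under protop_dir and return the removed names.

    Hypothetical helper: per the steps above, dbmonitor is expected to
    restart each agent whose flag file has been removed.
    """
    removed = []
    for flag in sorted(Path(protop_dir, "tmp").glob("pt3agent.*.flg")):
        flag.unlink()
        removed.append(flag.name)
    return removed
```

Calling remove_agent_flags(".") from the ProTop directory returns the names of the flag files it deleted, or an empty list if there were none.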

Verification:

To verify that “Pro2Queues” is in use, look for three lines in the agent log. All three can be checked at once with this command on Unix (use FINDSTR on Windows):

grep -i pro2q log/pt3agent.friendlyName.log | more

2021/06/19 00:01:56.998-04:00 friendlyName 60 DBId,Dashboard,Configuration,TableActivity,IndexActivity,LatchActivity,ResourceWaits,StorageAreas,RemoteServerActivity,ReplAgent,Blocked,ActiveTRX,UserIOActivity,df,OSInfo,Pro2Queues

2021/06/19 00:01:57.483-04:00 dc/pro2qmon.p has been initialized as Pro2Queues

2021/06/19 00:02:25.959-04:00 postData: pro2q 5 records, 6 lines, length= 401
. . .

The final line should repeat many times. It is the line in the log file showing that the pro2q data is actually going to the portal.
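
For a platform-neutral check, the same three log lines can be looked for programmatically. This is only a sketch: the function name check_pro2q_log and the result keys are hypothetical, and the matching strings are taken from the sample log lines shown above, so they may need adjusting if your log format differs.

```python
def check_pro2q_log(lines):
    """Report which of the three Pro2Queues markers appear in pt3agent log lines."""
    found = {"listed": False, "initialized": False, "posting": False}
    for line in lines:
        tokens = line.split()
        if "has been initialized as Pro2Queues" in line:
            # dc/pro2qmon.p was loaded as the Pro2Queues collector
            found["initialized"] = True
        elif "postData: pro2q" in line:
            # pro2q data is being posted to the portal
            found["posting"] = True
        elif tokens and "Pro2Queues" in tokens[-1].split(","):
            # startup line whose last field is the comma-separated collector list
            found["listed"] = True
    return found
```

Typical use would be to open log/pt3agent.friendlyName.log and pass the file object to check_pro2q_log; all three values should come back True once the agent has been running for a short while.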


Background:

Procedure dc/pro2mon.p is the old data collector that was previously bound to the "2" key in ProTop and known as "Pro2Activity" (pt3agent.p can still use it but protop.p cannot). Procedure dc/pro2qmon.p is the new data collector bound to "2" and known to pt3agent as "Pro2Queues".

Here's an example of the Pro2 Queue panel brought up when the "2" key is invoked in ProTop. Each column is described below.

[Image: Pro2 Queue panel]

Each queue (QNum) is monitored individually.

The only known Status is “Running”.

Possible values of Action are:

The Enabled/Disabled/Paused and Orphans columns correspond to the number of tables classified as such.

Orphaned means that we found a record in the queue for a table that is NOT listed as belonging to that queue. This might happen if someone changes their mind and decides to remove a table from Pro2. Orphans can disrupt the queue record counts and processing, so they need to be addressed when they occur; thus there is an alarm for orphans > 0 (see below).

If you do have “orphan” replQueue records you can add PRO2QSKIP2SEQ=n to your bin/localenv file, where "n" is determined by running ad-hoc queries on replQueue. This will skip past the orphans and start counting at the next sequence beyond the orphans.

Depth is the number of records in that queue that are waiting to be processed. It is calculated in one of three ways; the single-character column to the right tells you how it was calculated:

The Queue Lag is the age of the oldest record in the queue. By default we alert at 30 minutes and alarm at one hour (see below). Some users may prefer more aggressive alerts for lag time. Use zLagTime as the alert metric: zLagTime is seconds as an integer rather than an hh:mm:ss string and is therefore much easier to write alerts for. If the oldest record is more than a day old, the number of days will appear to the left of the hh:mm:ss and zLagTime will be a very large number of seconds. Don’t worry, zLagTime is an int64 ;)
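
To illustrate how a zLagTime value in seconds relates to the displayed lag, here is a small sketch. The function name format_lag is hypothetical, and the exact spacing between the day count and the hh:mm:ss part is an assumption; the panel may lay it out differently.

```python
def format_lag(z_lag_time):
    """Render a lag in seconds as hh:mm:ss, prefixed by a day count past 24 hours."""
    days, rest = divmod(int(z_lag_time), 86400)   # 86400 seconds per day
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    # Assumed layout: day count, a space, then hh:mm:ss
    return f"{days} {clock}" if days else clock
```

With this sketch, the default alert threshold of 1800 seconds renders as 00:30:00 and the alarm threshold of 3600 seconds as 01:00:00, which is why comparing the integer is simpler than comparing strings.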

At times very small queue depths may show a lag of zero (or blank) and an oldest table of n/a. This is because the queue catches up while ProTop is in the process of collecting data about it. This is not considered to be a problem.

The Oldest Table in the queue is just for information purposes. It rarely makes sense to alert on it, but it does give a sense of which table’s records a queue is processing.

Ditto the Source DB. The source db name is the "Pro2 name" for the db. It is neither the ProTop friendly name nor the db physical name. This name might be handy if you need to talk to a Pro2 admin or use the Pro2 admin console.

The Pro2 schema is not in every database, and the replicated data might come from multiple databases. So the data collector uses dynamic queries, and it will complain politely if you try to run it in a database that does not support Pro2:

[Image: no Pro2 schema message]

Default alerts in etc/alert.cfg:

rq_pausedTbls  num   >         0  ""  "daily" "&1 &2 &3" alert
rq_orphanTbls  num   >         0  ""  "daily" "&1 &2 &3" alarm

rq_qStatus     char <> "Running"  ""  "hourly" "&1 &2 &3" alert

rq_Depth       num   >    100000  ""  "hourly" "&1 &2 &3" alert
rq_Depth       num   >   1000000  ""  "hourly" "&1 &2 &3" alarm

zLagTime       num   >      1800  ""  "hourly" "&1 &2 &3" alert
zLagTime       num   >      3600  ""  "hourly" "&1 &2 &3" alarm
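
As a way to reason about which of these defaults a given queue would trip, the rules can be modeled in a few lines. This is an illustrative sketch only, not how ProTop evaluates alerts: the names DEFAULT_RULES and triggered are hypothetical, only the metric, operator, threshold and severity columns are modeled, and the remaining fields on each line are ignored.

```python
# Metric, operator, threshold and severity copied from the default entries above.
DEFAULT_RULES = [
    ("rq_pausedTbls", ">", 0, "alert"),
    ("rq_orphanTbls", ">", 0, "alarm"),
    ("rq_qStatus", "<>", "Running", "alert"),
    ("rq_Depth", ">", 100000, "alert"),
    ("rq_Depth", ">", 1000000, "alarm"),
    ("zLagTime", ">", 1800, "alert"),
    ("zLagTime", ">", 3600, "alarm"),
]


def triggered(sample, rules=DEFAULT_RULES):
    """Return (metric, severity) for every rule that the sample trips."""
    hits = []
    for metric, op, threshold, severity in rules:
        if metric not in sample:
            continue  # metric not collected for this queue
        value = sample[metric]
        tripped = value > threshold if op == ">" else value != threshold
        if tripped:
            hits.append((metric, severity))
    return hits
```

For example, a running queue with a depth of 250000 and a zLagTime of 4000 seconds would trip the rq_Depth alert plus both the zLagTime alert and the zLagTime alarm.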