Real-time Status Monitoring
15.1 Introduction
NetVigil offers two types of reporting: real-time status reports and periodic trend reports. Immediately upon login to the NetVigil system, by default you see the real-time status of your monitored devices on the Device Summary page, including any current failures and performance losses. In a single click NetVigil provides you with test details on any monitored device, a 24-hour graphical snapshot of performance and event history, and test results for the last 30 days.
15.1.1 NetVigil Terms
NetVigil monitors the availability and performance of your network and application systems, and their underlying components. These systems and components may be routers, switches, servers, databases, networks, or applications.
A test is the measure of device functioning. Tests are used to monitor your devices, and NetVigil reports the status of each test. Test status (shown on the Status | Tests page) is the current status category (ok, warning, critical, unknown, unreachable, suspended, or not configured) for a test. Device status (shown on the Status | Device page) is the worst current test status for a device.
NetVigil uses boundaries called thresholds to determine a test's status. A threshold violation occurs whenever a test result crosses a threshold.
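The relationship between thresholds, test status, and device status can be sketched in a few lines of Python. This is only an illustration, not NetVigil's implementation: the names `Status`, `classify_result`, and `device_status` are hypothetical, only three of the status categories are modeled, and the sketch assumes that higher values are worse and that a result equal to a threshold counts as crossing it.

```python
from enum import IntEnum


class Status(IntEnum):
    # Assumed severity order; the other categories are omitted here.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def classify_result(value, warning, critical):
    """Classify one test result against its warning and critical thresholds."""
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.OK


def device_status(test_statuses):
    """A device's status is the worst current status among its tests."""
    return max(test_statuses)
```

For instance, `classify_result(15, 10, 20)` yields `Status.WARNING`, and a device whose tests are OK, WARNING, and CRITICAL is reported as CRITICAL.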
An action is an activity that is automatically triggered by a threshold violation. Actions can be designed to take place immediately when a single violation occurs or after the same violation occurs repeatedly. For instance, an E-mail notification can be sent whenever a test crosses the warning threshold, or it can be sent after a test has crossed the warning threshold five consecutive times.
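The difference between an immediate action and one that waits for repeated violations can be pictured with a small counter. The following Python sketch is an assumption-laden illustration: the class `ConsecutiveTrigger` is hypothetical, and it assumes the action fires once when the required count is reached and that any non-violating result resets the count.

```python
class ConsecutiveTrigger:
    """Fires once a violation has repeated `required` times in a row."""

    def __init__(self, required):
        self.required = required  # 1 means "act on the first violation"
        self.count = 0

    def observe(self, violated):
        """Record one test result; return True when the action should fire."""
        # Assumption: a result without a violation resets the streak.
        self.count = self.count + 1 if violated else 0
        return self.count == self.required
```

With `ConsecutiveTrigger(5)`, the fifth consecutive violation returns `True`; with `ConsecutiveTrigger(1)`, the first violation does.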
15.2 NetVigil Status View
Figure 15.1 displays the NetVigil icons used to display device and test status.
Test status can be UNKNOWN for one of several reasons:
- When a new test is added to a device, the provisioning database notifies the relevant DGE. When the DGE gets this notification, it retrieves the test details from the provisioning database and schedules the test for monitoring. A scheduled test is added to a queue at the interval configured for the test. Items from the queues are tested (often in large batches) in sequential order. Depending on how many tests you have configured on a particular DGE/location, a newly added test may remain in the queue for seconds or minutes. While the test is waiting in the queue, the Web Application shows UNKNOWN state for the test, since it does not yet have any polled results to display.
- Some tests do a rate calculation ([result1 - result2] / time_elapsed_between_tests), which requires two polled results. For example, most network interface tests (Traffic In/Out, Util In/Out) are in this category. Until the second result is polled, these tests show UNKNOWN state. If a test is configured for a five-minute polling interval, it remains in UNKNOWN state for approximately ten minutes, until two results are received and the rate is calculated.
- If the flap-prevention feature is enabled (configured via etc/dge.xml), any test that is in the process of changing its state shows TRANSIENT state for the configured number of cycles. For example, if the flap-prevention cycle count is 2 and a ping test is configured for a 3-minute interval, then when the ping test switches from OK to WARNING, the Web Application shows the test in TRANSIENT state until it has remained in the new state for 2 additional cycles (6 minutes).
- If the DGE process is not running, newly added tests are shown in UNKNOWN state because they have not yet been polled for any results. When you drill down into devices with older tests, the values under the TEST TIME and DURATION columns appear in light blue, indicating outdated results.
- If a device is not reachable (e.g., it's been turned off, there are network problems, etc.), tests for that device appear in UNKNOWN state, indicating that no polled value could be retrieved.
- In the case of SNMP tests, if the OID is no longer valid (ifIndex has changed), the test appears in UNKNOWN state, indicating that no polled value could be retrieved.
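The rate-based tests described in the list above need two polled results before they can report a value. A minimal Python sketch of that idea follows; the function `rate` is hypothetical, and it assumes the later result minus the earlier one is divided by the elapsed seconds, with `None` standing in for "no result yet".

```python
def rate(previous, current, elapsed_seconds):
    """Return the per-second rate between two polled values, or None."""
    if previous is None or current is None:
        # Fewer than two polled results: nothing to compute yet,
        # which corresponds to the test showing UNKNOWN.
        return None
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (current - previous) / elapsed_seconds
```

For example, counter values of 1000 and 4000 polled 300 seconds apart give `rate(1000, 4000, 300) == 10.0`, while `rate(None, 1000, 300)` returns `None`.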
Although not represented by a particular icon, a test can have a status of FAIL, which means that the device was reached but the test failed to be performed. An example is when a POP3 port test is performed and the supplied login/password combination fails. This is monitor dependent.
Figure 15.1 NetVigil Symbols Used to Report Status
Test Timeouts
If a standard test does not return a result within a certain timeout interval, test status is FAILED. There are three types of timeouts:
- The timeout value is always the same (e.g., 10 seconds).
- The timeout value changes depending on some user-configured value (e.g., threshold + 5 seconds).
- The timeout value is specified in a configuration file and does not change frequently.
15.2.1 Device Status Summary View
- To view the Device Summary for your department, do one of the following:
- Log in to NetVigil. You will be taken to the Device Summary page for your department.
- If you are already logged in, click on the STATUS tab and your Device Summary or Container Summary screen will load (depending on your user preferences).
The Device Summary View is the default view after the STATUS tab is selected. There is one row for each device in your department that is being monitored. Each row gives the device name and the status for each of three categories of tests: Network, System, and Application.
The modify icon links you to a page for modifying a device's settings.
If the device status for one category of tests is warning, at least one current test result in that category is in the warning range. Similarly, if the device status for one category of tests is critical, at least one current test result in that category is in the critical range. The worst status of all tests in the category determines the icon displayed. The rule for displaying the icons (from most to least severe) is:
Tests are sorted by severity and time-in-state while devices/containers are sorted by severity and then alphabetically.
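The two sort orders can be expressed as composite sort keys. The Python sketch below is illustrative only: the field names, the `SEVERITY_RANK` ordering, and the choice to list the longest time-in-state first are assumptions, not documented NetVigil behavior.

```python
# Assumed ranking: lower rank sorts first (most severe at the top).
SEVERITY_RANK = {"critical": 0, "warning": 1, "ok": 2}


def sort_tests(rows):
    """Sort tests by severity, then by time in the current state."""
    return sorted(
        rows,
        key=lambda r: (SEVERITY_RANK[r["status"]], -r["seconds_in_state"]),
    )


def sort_devices(rows):
    """Sort devices or containers by severity, then alphabetically."""
    return sorted(
        rows,
        key=lambda r: (SEVERITY_RANK[r["status"]], r["name"].lower()),
    )
```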
A sample Device Status Summary page is shown below.
Figure 15.2 Device Summary Page
- To modify the settings for a device:
- Click on the STATUS tab on the main navigation bar to go to the Status Summary page.
- Click on the modify icon for the desired device and you will be taken to the Update Device page.
- See Section 16.1, "Adding Devices For Monitoring" on page 199 for instructions on managing devices.
15.2.2 Test Summary View
The Test Summary page contains one row for each test being conducted. Each row contains test status, test name, current test value, the warning and critical thresholds, the time the last test was conducted, and the time the test has remained in the current state.
Figure 15.3 Test Summary Page
Click on the STATUS tab on the main navigation bar to go to the Device Summary page. Click on the device name link for the device of interest and you will be taken to the Device Status Details page.
15.2.3 Test Details View
The Test Details page graphically displays performance and threshold violation history for a single test over the last 6-24 hours. Figure 15.4 below illustrates the four graphs on the Test Details page:
By default, the date/time displayed in the upper left-hand corner of the Web Application is based on GMT. You can change the time zone used for displaying date/time from your user preference settings:
Now the date and time should reflect your local date and time.
Figure 15.4 Test Details Page
- To view the details for one specific test:
- Click on the STATUS tab on the main navigation bar to go to the Device Summary page.
- Click on the device name link for the device of interest and you will be taken to the Test Summary page.
- Click on the test name link for the test of interest and you will be taken to the Test Details page for that test.
From the Test Details page, users also have access to the following information for that test:
This behavior is by design. When you update some test parameters, a notification is sent to the DGE that is performing this test. Upon receiving the notification, it reloads the (modified) configuration information and re-schedules the test, which should cause the test to be executed almost immediately.
While the DGE is updating its information and re-scheduling the test, the Web Application continues to use the threshold values that were in effect the last time the test was executed. The reason is to avoid confusion. Let's assume that in the last pass the warning threshold was 10, the critical threshold was 20, and the polled result was 15, so the Web Application shows a warning state for the test. Now you change the warning and critical thresholds to 25 and 50. If the new threshold values were displayed immediately (before the test has been executed again), the Web Application would show a polled result of 15 against thresholds of 25/50, which suggests an ok state, yet the test would remain in warning state. To avoid such a scenario, the threshold values for the last polled result are displayed until the test is executed again.
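One way to picture this behavior is to store the thresholds alongside each polled result and display those stored values until the next poll. The Python sketch below is a hypothetical model, not the Web Application's code; `PolledResult` and `thresholds_to_display` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PolledResult:
    value: float
    warning: float   # thresholds in force when this result was evaluated
    critical: float
    status: str


def thresholds_to_display(
    last: Optional[PolledResult], configured: Tuple[float, float]
) -> Tuple[float, float]:
    """Show the thresholds that produced the last status until a new poll."""
    if last is None:
        # No result yet, so only the configured values are available.
        return configured
    return (last.warning, last.critical)
```

Using the numbers from the example above, a last result of 15 evaluated against 10/20 keeps displaying 10/20 even after the configured thresholds change to 25/50, so the displayed status and thresholds stay consistent.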
15.2.4 Service Container Summary View
The Service Container Summary View, available via STATUS | containers, displays the consolidated view of logical systems or applications by grouping together tests, devices, or nested containers. The status of a container is the 'worst' status of any of its components. Hence, if any device or nested container within a container turns critical, the status will 'bubble up' and turn the top-level container critical as well.
In addition to viewing the real-time status of Service Containers, you can generate reports on containers that show the downtime, which element caused a container to be unavailable, and so on.
Please see Section 18.1, "Overview" on page 241 for more detailed information.
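The 'bubble up' behavior of container status amounts to taking the worst status over a tree. The following Python sketch assumes a hypothetical dictionary layout (a `members` list for containers, a `status` string for leaves), an assumed severity order, and an assumed status of ok for an empty container.

```python
# Assumed severity order; higher numbers are worse.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def container_status(node):
    """Return the worst status among a container's members, recursively."""
    if "members" not in node:
        return node["status"]  # a leaf: a single test or device
    return max(
        (container_status(member) for member in node["members"]),
        key=SEVERITY.__getitem__,
        default="ok",  # assumption: an empty container reports ok
    )
```

A critical device inside a nested container therefore makes every enclosing container critical as well.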
15.2.5 Device Display Filters
Via NetVigil's device summary views (i.e. Device Summary and Device Groups Summary pages), users can set default filters in order to only view devices in specific states. For example, users may elect to filter out devices that are in an 'OK' status. Additionally, users can specify how many devices are displayed on a single page. Especially for large deployments, these two features can dramatically cut down on the number of entries a user must scroll through to get a quick snapshot of system health. A toggle switch on the Device Summary & Device Groups Summary pages quickly disables or enables the filter(s).
- To set device filter and paging preferences:
- Click MANAGE | prefs.
- Select the device states you want to view on the summary pages. Device states that are not selected are filtered out by default.
- Change the number of devices to view on each page in the Maximum Summary Screen field.
- When you have finished configuring preferences, click Update User to save your changes. These changes become part of your user profile and serve as defaults each time you log in to NetVigil.
The device/test filter was built with support for full Perl5 regular expressions. The text you enter is matched as a complete pattern (instead of as a substring), so if you enter only a partial device name, Perl5 regexp rules yield no match. To display filtered results, you need to enter Perl5-compatible patterns. For example:
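As a rough illustration of whole-pattern matching, the following Python sketch uses the standard `re` module, whose syntax is close to Perl5 for simple patterns like these. The device names and the `filter_devices` helper are hypothetical, and the use of `fullmatch` is an assumption chosen to model the behavior described above rather than a statement of how NetVigil evaluates filters.

```python
import re

# Hypothetical device names, for illustration only.
DEVICES = ["nyc-router-01", "nyc-switch-01", "sfo-router-02"]


def filter_devices(pattern, names):
    """Keep names that match the whole pattern, not just a substring."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]
```

Under these assumptions, the partial name `router` matches nothing, `.*router.*` matches both routers, and `nyc-.*` matches the two devices whose names begin with `nyc-`.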
15.2.6 Device Comment Field
A user can enter a comment that will display on the Device Summary page. This could be used in any way by the user to communicate device-specific information, such as to identify why a device is being suspended or as general information on the current state of the device.
- To create a comment for a specific device:
- Click on the MANAGE tab on the main navigation bar to go to the Manage Devices page.
- Click on the Comments link for the device of interest and you will be taken to an Update Device page.
- Add the comments and click the Update Device button to save changes (this can also be accomplished when suspending a device).
- Navigate to the Device Summary page and confirm that the comment appears for the device you updated.
15.2.7 Context-sensitive Help or Action
NetVigil's Test Summary view displays a HELP link used to provide context-sensitive help to users. Selecting the link displays a pop-up window with information configured by your administrator or operations personnel to address device or test help topics. Although completely customizable, one suggested use of this functionality is to provide online help documentation for a specific device or test in the absence of senior administration personnel (e.g. nighttime operations).
An alternative to providing text-based help is to enable an action (e.g., a server restart) via the HELP link. This is a powerful option, as an administrator can configure any number of files to work in this fashion, enabling a large number of background processes via the Web Application. Please contact your NetVigil administrator for details of how the functionality is being deployed in your organization.
15.3 Threshold Violation Reports
A Threshold Violation report shows every time a test status has changed state in the past 24 hours. Each entry gives the device name, the time the event occurred, the test name, the type of test, the low (warning) and high (critical) thresholds, and the actual test value. This report can be viewed in aggregate for all devices and tests in a department, or filtered for a specific device or test.
- To view the Threshold Violations for all devices:
- Go to Reports->Custom on the main menu.
- Click on the Event Log report.
- To view the Threshold Violations for only one device:
- Click STATUS | devices.
- Click a device to view its Test Summary page.
- Click the threshold violations link to see the past 24 hours of violations for the device.
- To view the Threshold Violations for only one test:
- Click STATUS | tests.
- Click a test to view its Test Details page.
- Click Events for the last 24 hours to see events for only that test.
15.4 Messages & Traps
Messages are text alerts generated by syslogs, Windows logs, or SNMP traps. They are matched against text patterns; a match triggers an action and is also recorded in the Event Manager window.
For details on how to set up log file and trap monitoring, see Chapter 7, "Message Handler for Traps & Logs" on page 85.
Fidelia Technology, Inc. NetVigil v4.0 www.fidelia.com