Monitoring System Help

So, you want to know what this does and how?


Most monitoring systems use a central server that performs checks. This model doesn't work well with RADIUS because of the trust relationships that have to be made between pairs of systems.

Background

The basic premise of this system is that all checks are being done from the actual RADIUS servers which proxy requests. This approach means that there's a high degree of confidence that the test is representative of reality.

The most common problems seen with configuration are around the trust relationships and firewalling. These two are very specific to the path between two particular machines so it makes no sense to perform the tests from a third system.

What happens here is that a monitoring system runs independently on each of the four NRPS. Each uses its own set of RADIUS configuration to populate the monitoring system, then runs its checks against each server.

There are four NRPS, so the number of Ping tests for a particular site will be 4 x the number of ORPS, e.g. if you have 3 ORPS, that's 12 separate tests. The same applies for each test.
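As a quick illustration of that arithmetic (the function and constant names here are made up for the example, they aren't part of the monitoring system):

```python
NRPS_COUNT = 4  # there are four NRPS, each running its own checks


def results_per_check(orps_count: int, nrps_count: int = NRPS_COUNT) -> int:
    """Number of separate results for one type of check at one site."""
    if orps_count < 0:
        raise ValueError("orps_count must not be negative")
    # Every NRPS tests every ORPS, so the counts multiply.
    return nrps_count * orps_count


print(results_per_check(3))  # a site with 3 ORPS
```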

Key:

  • Red text indicates that there is a problem e.g. Host Down, or a breach of the Tech Spec e.g. ICMP Blocked. These need rectifying as soon as possible.
  • Yellow text indicates that something isn't quite right but isn't necessarily a problem that needs fixing. For instance, attribute checks often show that the RFC syntax isn't being met, but this isn't necessarily an impediment to the service.
  • Green text indicates no problems and the Tech Spec is being met.
  • Blue text indicates that there's not enough information to make a decision on.

Per host checks

There are a number of tests which have been built specifically to spot issues in different places.

Ping

Or ICMP echo/response. The most basic network check that simply checks if the server sends back a response to an ICMP request. We'll see how this is used later.

RADIUS Port

This amounts to a scan of just the RADIUS port (1812/udp) and can tell if the port is open or not. Things aren't quite as simple as 'Open means alive' but, again, we'll see how it's used later.

Simple Authentication

This test doesn't expect to get an Access-Accept. It sends a simple (non-EAP) authentication with a generic username (jisctest) and password. If the server responds with anything at all (normally a Reject) then it's an indication that there's a working RADIUS server present.
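To give a feel for what a simple (non-EAP) request looks like on the wire, here's a minimal sketch that builds an Access-Request following the packet layout and User-Password hiding described in RFC 2865. It isn't the code the monitoring system runs. The function names, the example password and the example shared secret are all made up, and some servers may also expect a Message-Authenticator attribute, which this sketch leaves out.

```python
import hashlib
import os
import struct

ACCESS_REQUEST = 1  # packet code (RFC 2865)
USER_NAME = 1       # attribute type (RFC 2865)
USER_PASSWORD = 2   # attribute type (RFC 2865)


def hide_password(password: bytes, secret: bytes, authenticator: bytes) -> bytes:
    """Obfuscate the password as RFC 2865 section 5.2 describes."""
    # Pad with NULs to a multiple of 16 octets (at least one block).
    width = max(16, -(-len(password) // 16) * 16)
    padded = password.ljust(width, b"\x00")
    hidden = b""
    previous = authenticator
    for i in range(0, len(padded), 16):
        digest = hashlib.md5(secret + previous).digest()
        block = bytes(p ^ d for p, d in zip(padded[i:i + 16], digest))
        hidden += block
        previous = block  # each block is chained to the one before it
    return hidden


def build_access_request(username, password, secret, identifier=0, authenticator=None):
    """Return the raw bytes of a PAP-style Access-Request."""
    if authenticator is None:
        authenticator = os.urandom(16)  # random Request Authenticator
    name = username.encode()
    hidden = hide_password(password.encode(), secret, authenticator)
    attributes = (
        bytes([USER_NAME, 2 + len(name)]) + name
        + bytes([USER_PASSWORD, 2 + len(hidden)]) + hidden
    )
    length = 20 + len(attributes)  # 20-octet header plus attributes
    header = struct.pack("!BBH", ACCESS_REQUEST, identifier, length)
    return header + authenticator + attributes


# Hypothetical values; only the username 'jisctest' comes from the text above.
packet = build_access_request("jisctest", "not-a-real-password", b"example-secret")
print(len(packet), "octets ready to send to 1812/udp")
```

Any reply at all to a packet like this, even an Access-Reject, is what the test counts as a sign of life.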

Status Server

Available on some RADIUS servers, this test effectively replaces the Simple Authentication one. Status Server can be used as a 'RADIUS Ping' test.
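For the curious, a Status-Server packet is tiny. The sketch below builds one following RFC 5997 (packet code 12, with a Message-Authenticator attribute computed as an HMAC-MD5 over the packet). It's an illustration only: the function name and the example shared secret are made up, and it isn't the monitoring system's own code.

```python
import hashlib
import hmac
import os
import struct

STATUS_SERVER = 12          # packet code (RFC 5997)
MESSAGE_AUTHENTICATOR = 80  # attribute type, 16-octet HMAC-MD5 value


def build_status_server(secret: bytes, identifier: int = 0, authenticator=None) -> bytes:
    """Return the raw bytes of a Status-Server request."""
    if authenticator is None:
        authenticator = os.urandom(16)  # random Request Authenticator
    attr_header = bytes([MESSAGE_AUTHENTICATOR, 18])  # type, length (2 + 16)
    length = 20 + 18
    header = struct.pack("!BBH", STATUS_SERVER, identifier, length) + authenticator
    # The HMAC is computed with the attribute value set to 16 zero octets,
    # then the zeros are replaced by the digest.
    digest = hmac.new(secret, header + attr_header + bytes(16), hashlib.md5).digest()
    return header + attr_header + digest


packet = build_status_server(b"example-secret")  # hypothetical shared secret
print(len(packet), "octets")
```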

Local Authentication

Similar to Simple Authentication but, instead, uses a username and password supplied by the site. A successful authentication indicates a proper configuration.

Server Shared Secret

A check is done in the logs to see if there are any signs that the sending client (ORPS or RRPS) has the wrong shared secret for the NRPS.

Zombie

A check is done in the logs to see if there are any signs of ORPS being marked as 'down' (aka a Zombie) because of a lack of response to an authentication request.

User in Realm *

Testing RADIUS servers beyond the directly connected ones is impossible, but it is possible to use a supplied username and password to perform an EAP authentication test.

Certificate in Realm *

The same test as User in Realm * above but this time the check is on the certificate(s) being sent by the IdP.

A number of tests are done to see if the certificates are valid, contain the appropriate attributes for govroam, have suitable encryption etc.

Tunnel Type in Realm *

The same test as User in Realm * above but this time checks are done on the returned response to see if the 'Tunnel-Type' attribute is present. See VLAN Check for details.

Per realm checks

These tests look at the logs of incoming traffic to determine any issues within realms.

Called Station ID Checks

Called Station ID is an attribute sent in a Request packet identifying which device the client is communicating with and which SSID it's talking to.

The device is identified by its MAC address and must be in the format of AA-BB-CC-DD-EE-FF (separated by '-' not ':' or anything else). The SSID should, for Govroam, ALWAYS be 'govroam'. All lower case. This is vital because devices will only automatically connect to the SSID of 'govroam'; anything else will be treated as a different network.
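A check along these lines could be sketched as below. Two things here are assumptions rather than anything stated above: that the SSID follows the MAC address after a ':' (the layout RFC 3580 recommends), and that the hex digits must be upper case. The function name and messages are made up for the example.

```python
import re

# Six pairs of hex digits separated by '-'. Upper case only is an assumption.
MAC_RE = re.compile(r"[0-9A-F]{2}(?:-[0-9A-F]{2}){5}")
EXPECTED_SSID = "govroam"


def check_called_station_id(value: str) -> list[str]:
    """Return a list of problems found; an empty list means it looks fine."""
    problems = []
    # Assumed layout: MAC, then ':', then SSID. Split on the last ':'.
    mac, sep, ssid = value.rpartition(":")
    if not sep:
        mac, ssid = value, None
    if not MAC_RE.fullmatch(mac):
        problems.append("MAC address is not in AA-BB-CC-DD-EE-FF format")
    if ssid is None:
        problems.append("no SSID present")
    elif ssid != EXPECTED_SSID:
        problems.append("SSID is not exactly 'govroam'")
    return problems


print(check_called_station_id("AA-BB-CC-DD-EE-FF:govroam"))
print(check_called_station_id("AA:BB:CC:DD:EE:FF:Govroam"))
```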

Calling Station ID Checks

Calling Station ID is an attribute sent in a request packet identifying the client making the request. This is used for audit and to de-duplicate authentication attempts in logs.

The device is identified by its MAC address and must be in the format of AA-BB-CC-DD-EE-FF (separated by '-' not ':' or anything else).

Operator Checks

If a site is capable of adding the Operator-Name attribute then they ought to. It's very useful in audit trails, abuse cases and for stats gathering. It's the only way to really identify the site sending auth requests.

The value of the attribute should be in the format "1" followed by the realm. For 'holby.nhs.uk' that would be '1holby.nhs.uk'. The '1' identifies that what follows is a unique realm. There are other prefixes but they aren't relevant here.
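In code terms the expected value is just the prefix glued onto the realm. The helper names below are made up for the example:

```python
REALM_PREFIX = "1"  # identifies that what follows is a realm


def operator_name_for(realm: str) -> str:
    """Build the Operator-Name value expected for a realm."""
    return REALM_PREFIX + realm


def operator_name_matches(value: str, realm: str) -> bool:
    """True if a received Operator-Name is exactly the prefix plus the realm."""
    return value == operator_name_for(realm)


print(operator_name_for("holby.nhs.uk"))
```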

Realm Syntax Checks

Sites forward all authentication requests for unknown realms to Jisc by default. However, we ask that realms that aren't syntactically correct aren't proxied. Users enter this information into their devices so typos are common.

What should be filtered out.

VLAN Checks

Sites often will add Tunnel attributes (Tunnel-Type etc.) to responses which force their wireless system to place devices on specific VLANs. A good idea. However, if the response is being sent to a remote site (for one of their users) then it's dangerous. If the remote site accepts the response and also uses these attributes to assign VLANs then confusion will ensue.

So, it's really important to put filters in place to remove these attributes as they leave your site and as they enter.
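In practice this filtering is done in your RADIUS server's own configuration, but the logic amounts to something like the sketch below. The attribute names are the standard tunnel attributes commonly used for VLAN assignment; treating exactly these three as the set to remove is an assumption, and the function name is made up.

```python
# Attributes commonly used for VLAN assignment (assumed set, adjust as needed).
TUNNEL_ATTRIBUTES = {
    "Tunnel-Type",
    "Tunnel-Medium-Type",
    "Tunnel-Private-Group-Id",
}


def strip_tunnel_attributes(attributes: dict[str, str]) -> dict[str, str]:
    """Return a copy of the response attributes without the VLAN-related ones."""
    return {
        name: value
        for name, value in attributes.items()
        if name not in TUNNEL_ATTRIBUTES
    }


response = {
    "User-Name": "someone@holby.nhs.uk",
    "Tunnel-Type": "VLAN",
    "Tunnel-Medium-Type": "IEEE-802",
    "Tunnel-Private-Group-Id": "42",
}
print(strip_tunnel_attributes(response))
```

The same filter applies in both directions: on responses leaving your site for a remote one, and on responses arriving from a remote site.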

How are these tests used?

Using 'Decision Tree'-like algorithms it's possible to make inferences from the results of the tests.

For instance, if the site is a Full member then the tests performed will be 'Ping', 'RADIUS port' and 'Simple Authentication'. If they're all successful then we're good. If Ping fails but the rest succeed then there's a very good chance that the server is up, the RADIUS server works and that ICMP is being blocked by a firewall. If none work then there's a good chance that the server is down or just not routable.
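That example can be sketched as a small function. It only covers the three cases described above; the function name and the 'Undetermined' fallback are made up for the illustration, and the real system takes more information into account.

```python
def infer_issue(ping: bool, radius_port: bool, simple_auth: bool) -> str:
    """Infer a likely state from the three basic test results."""
    if ping and radius_port and simple_auth:
        return "OK"
    if not ping and radius_port and simple_auth:
        # The server clearly answers RADIUS, so only ICMP is being dropped.
        return "ICMP Blocked"
    if not ping and not radius_port and not simple_auth:
        # Nothing responds: the server is down or not routable.
        return "Host Down"
    # Any other combination needs more tests before deciding (hypothetical label).
    return "Undetermined"


print(infer_issue(ping=False, radius_port=True, simple_auth=True))
```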

We're building a set of Issues from these results, taking into account as much information about the systems as we can. Some of the Issues (like ICMP Blocked) would need fixing because they conflict with the requirements of the Tech Spec. Others, like no response to a Simple Auth test, don't breach a requirement but could indicate something wrong, and it would be better if the servers were configured so that the test succeeds (in lieu of a Status Server check).

FAQ

Why can't I see the IP/Hostname of my servers?

This site is public so we didn't want to publicise these details. Generally the host identifiers e.g. holby-nhs-uk-0 are in the same order as they were submitted to us. If you need to know which is which then email us at govroam@jisc.ac.uk.

What's the difference between Client and Server Shared Secrets?

RADIUS software is configured to have a trust relationship between two hosts so that a RADIUS client can communicate securely with a RADIUS server. In this explanation a 'client' is the system sending an authentication request and a 'server' is the system responding.

This client/server relationship is the key because ORPS and NRPS will be both. When an ORPS proxies a request to the NRPS, it's a client; when an ORPS receives a request from an NRPS, it's a server.

Technically, it's possible to have different shared secrets for each 'direction' i.e. a different shared secret for an ORPS when it's a server than when it's a client (for a particular NRPS).

In practice, we keep the shared secret the same for ease (and sanity). However, it's feasible that a typo, a bad cut and paste or a partial change could mean that they're different at one end.

In this scenario different tests are needed to spot the problem. If an ORPS has a different shared secret to the NRPS when sending, logs can be checked at this end. If an ORPS has a different shared secret when receiving... well, this looks the same to the NRPS as if the RADIUS software on the ORPS is down, or the host is down, or there's no route, or a number of different situations.

Thus there's a test specifically for a bad Server Shared Secret, and the Client Shared Secret has to be inferred from a number of tests.