Expert Panel: No Single Point of Failure?

What it means, how it’s done, and how you decide what you need — a panel of three Tait experts share their knowledge on the subject.

The spectral efficiency, interoperability and data transmission benefits of P25 are just part of the story; network operators also face much higher expectations of network robustness, tougher KPIs and the risk of litigation. In public safety, near enough just isn’t good enough. That’s why public safety network operators now specify networks with no single point of failure. But what exactly does that mean?

Meet the experts:

Russell Watson: Solutions Marketing Manager, Public Safety, Tait Communications
Steve Penny: P25 Architectural Design Engineer, Tait Communications
David Jenks: Product Business Manager, P25 Infrastructure, Tait Communications

What are the likely points of failure on my network, and what are the consequences?
Steve Penny: In order of increasing seriousness:

When a base station fails, you lose capacity on that site. In a trunked network your radio users may not even notice, as the trunking controller will simply select another frequency for their calls.

When a site controller fails, you lose that site. The impact varies, depending on your network design and exactly where the failure occurs. It might take out all coverage over that area, or coverage may be picked up by an adjacent site so communications are maintained.

When the trunking node controller, central switch or IP network fails, all your sites and channels are lost and all your comms are down. What would that mean to your organization?

What’s the solution?
Russell Watson: In one word, redundancy. Once you understand and can calculate your acceptable level of risk, you can build in redundancy—duplicating (or triplicating) network elements to the level your network needs, and all levels below it. Clearly, this is not a new concept. Networks have commonly been protected against failure by duplicated “cold standby” equipment, traditionally specified to protect communications for disaster recovery. The downside of this is the delay involved in re-establishing communications. Cold standby equipment must be manually switched over, either remotely or directly. These days, hot standby provides automated switchover to the backup equipment, so you no longer need to send someone on-site to where the failure occurs.
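To make the idea of calculating risk and then duplicating or triplicating elements more concrete, here is a minimal Python sketch of the standard availability arithmetic for redundant units. It assumes failures are independent and that switchover is instantaneous, which real networks only approximate. The function names and the 99.9% figure are illustrative assumptions, not values from Tait.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one can carry service.

    Assumes independent failures and instantaneous switchover.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one unit")
    # Service is lost only if every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected annual downtime in hours for a given availability."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one controller
    for copies in (1, 2, 3):
        a = parallel_availability(single, copies)
        print(f"{copies} unit(s): availability {a:.9f}, "
              f"downtime {downtime_hours_per_year(a):.5f} h/year")
```

Under these assumptions, each added copy multiplies the probability of a total outage by the failure probability of a single unit, which is why duplication is usually the largest single step in reducing risk.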

Where should this redundancy occur in my network? What exactly can I protect?
David Jenks: Typically, a network might have redundancy at the subscriber level, in mobiles and portables. Site equipment may also be duplicated, with redundant site controllers, RFSS controllers and network switches. Linking equipment can be arranged in a ring topology, so that there are always two possible paths back to the comms centre. Alternatively, you can retain your existing copper or microwave and, with one level of new microwave, fibre or MIMO linking, provide a fully redundant linking solution. The OSPF (Open Shortest Path First) routing protocol means the network will select the shortest (lowest-cost) route between the routers and switches and, if that path is unavailable, will use another.
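The ring-plus-OSPF behaviour can be illustrated with a small Python sketch. It runs Dijkstra's shortest-path algorithm, the calculation OSPF routers perform over their link-state database, on a hypothetical four-node ring, then repeats it with one link removed. The site names and link costs are assumptions for illustration only; this is not router configuration.

```python
import heapq


def shortest_path(links, source, target):
    """Dijkstra's algorithm over an undirected graph.

    `links` maps frozenset({a, b}) -> cost. Returns (total_cost, path),
    or (float("inf"), []) if the target cannot be reached.
    """
    # Build an adjacency list from the undirected link table.
    neighbours = {}
    for link, cost in links.items():
        a, b = tuple(link)
        neighbours.setdefault(a, []).append((b, cost))
        neighbours.setdefault(b, []).append((a, cost))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, link_cost in neighbours.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical ring: comms centre and three sites, equal link costs.
RING = {
    frozenset({"comms_centre", "site_a"}): 10,
    frozenset({"site_a", "site_b"}): 10,
    frozenset({"site_b", "site_c"}): 10,
    frozenset({"site_c", "comms_centre"}): 10,
}

if __name__ == "__main__":
    print(shortest_path(RING, "site_a", "comms_centre"))
    # Simulate losing the direct link between site_a and the comms centre.
    failed = frozenset({"site_a", "comms_centre"})
    degraded = {link: cost for link, cost in RING.items() if link != failed}
    print(shortest_path(degraded, "site_a", "comms_centre"))
```

With the direct link in place the route is a single hop. With it removed, traffic from site_a still reaches the comms centre the long way round the ring, at a higher cost.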

Don’t forget about the possibility of a power outage. Loss of power is a significant threat to your network, so planning some redundancy at each level is critical. For example, you can preserve power to your server by specifying dual redundant power supplies, which allow the server to handle an internal power supply failure or be simultaneously sourced from two separate supplies. Base stations can also be powered from mains with a separate DC backup power source such as a battery system.

It sounds costly. Why should I specify this to my network provider?
Russell Watson: Well yes, it does cost, but you need to weigh this against the risk. What might network failure mean, in terms of operational continuity? Delayed responses to equipment failure or incidents? Lost time and the associated cost? Safety? Potential loss of life of officers or the public?

This is the most effective insurance you can buy. And like insurance, individual operators need to be well informed to analyze and assess the level of acceptable risk. From there, they will select a network provider they can trust to mitigate that risk sufficiently, by designing and delivering a network that incorporates the right level of robustness.

What about the P25 Open Standards? Is there a prescribed, standard approach from all vendors, or does each vendor develop their own solutions?
Steve Penny: The only failure protection defined in the P25 standard is “Fallback Mode” (similar to the widely used but proprietary “Failsoft Mode”) on the P25 base stations. This means that, in the event of a failure—most likely a site failure—the trunked base station will still process communications as if it were a conventional repeater.

The failures that invoke Fallback Mode typically mean that the site is cut off from the rest of the network, limiting comms in that area. Subscriber units are programmed to favour sites that are not in Fallback Mode, so they would typically end up registered on sites that are still part of the remaining network. The subscriber units that can “see” alternative sites will not notice the failure. Those that can only see the site in Fallback Mode will have very limited comms, and the network will have no visibility of them. However, in most digital networks Fallback Mode will rarely be necessary, because of higher levels of protection specified and built into the network, such as highly-available site controllers and linking redundancy. The actual solution is tailored to customer needs and very much depends on the individual levels of acceptable risk to each operator, and the experience and expertise of your network designers.
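The site preference described here can be sketched in a few lines of Python. This is only an illustration of the stated rule (prefer sites that are not in Fallback Mode). Actual subscriber roaming logic is vendor-specific, and the `Site` fields, the `choose_site` function and the signal threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Site:
    name: str
    rssi_dbm: float  # signal strength as seen by the subscriber unit
    fallback: bool   # True if the site is operating in Fallback Mode


def choose_site(visible: Sequence[Site],
                min_rssi_dbm: float = -110.0) -> Optional[Site]:
    """Pick a site, favouring networked sites over Fallback Mode sites.

    Returns None if no visible site meets the (assumed) signal threshold.
    """
    usable = [s for s in visible if s.rssi_dbm >= min_rssi_dbm]
    if not usable:
        return None
    # Networked sites sort first (False < True), then strongest signal.
    return min(usable, key=lambda s: (s.fallback, -s.rssi_dbm))


if __name__ == "__main__":
    seen = [Site("hilltop", -70.0, fallback=True),
            Site("valley", -95.0, fallback=False)]
    print(choose_site(seen).name)      # weaker but still networked site
    print(choose_site(seen[:1]).name)  # only the Fallback Mode site is visible
```

A unit that can see a networked site registers there even if its signal is weaker. A unit that can see only the Fallback Mode site stays on it, with the limited service described above.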

I can see that there are many ways to get the level of failure protection I need. What do I need to think about in terms of network design?
Steve Penny: It helps to think of this in terms of network survivability—what you are protecting against, rather than redundancy, which focuses on the equipment.

Once you have defined your level of risk, we can start to look at where redundancy and protection are needed. For instance, if you specify 100% redundancy at controller level, you also need to state what you are protecting against. If you are concerned about technical failure, a co-located backup controller and alternative power source will meet that requirement.

However, if you need to protect against potential terrorist attack on your network, you need to specify geo-redundancy, where databases are distributed, mirrored and duplicated and redundant/backup critical infrastructure is located entirely separately.

You will need to work closely with your network provider. Use their insight and expertise to help define risk, and the appropriate response to it. This will come down to selecting your network elements for their redundancy features and capabilities.

If this fails… Then…

RFSS controller
If there is only one RFSS controller, the site controllers at each site provide reduced trunking capability, known as site trunking. Local individual and group calls can be made. When communications with the RFSS controller resume, normal function is restored, though subscriber units (SUs) must all re-register.

If the RFSS controller is part of a cluster, the standby RFSS controller takes over automatically within one minute, once the configured number of heartbeat messages have failed. Current calls are dropped but information about SUs, location and group is retained.

If there is a major catastrophe at the RFSS controller location and the network has a geographically remote disaster recovery node, this can be activated locally or remotely. As the disaster recovery node replicates the database, information about SUs, their location and group is retained.

NMS
Central monitoring of the network is unavailable, but operators can monitor the RFSS controller, site controllers and base stations. Syslog messages can be sent to syslog collectors.

Site controller
If the site controller is standalone, the control channel base station enters failsoft mode, if configured to do so. When SUs lose contact with the site, they first hunt for another control channel and, if unable to acquire one, they join failsoft mode.

If the site controller is part of a cluster, the inactive site controller takes over. Calls in progress at the site will end, but service then continues as normal.

Server hardware
If a server power supply fails, the second power supply takes over, with no degradation or interruption of service. If a server hard disk fails, the second disk drive continues to perform all disk activities. This affects performance but not in a way that is visible to network users.

Control channel
The site controller automatically takes the channel out of service and raises an alarm. A traffic channel takes over as control channel. (Any traffic channel can take over. If it is carrying a call, that call is dropped.) Capacity is reduced by one channel (two, if a dual sub-rack fails).

Traffic channel
Channel capacity is reduced by one (or two, if a full sub-rack fails). The site controller automatically takes the RF channel out of service.

Base station in a simulcast channel group
The site controller degrades the “health” of the channel group by one. If the channel group is the control channel, another group takes over. If it is a traffic channel, other traffic channels with better “health” will be selected.

1 PPS timing signal
In a simulcast system, if the external frequency reference loses its GPS timing, each base station at the physical site loses its 1 PPS timing signal but retains synchronization for a time by phase locking to the frequency reference.

If the frequency reference itself fails, the base stations are immediately un-synchronized. Base stations in a simulcast channel group can be configured to transmit or not to transmit after they are un-synchronized. If they do transmit, coverage and signal quality may be degraded in overlap zones.
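The clustered RFSS controller case above, where the standby takes over once a configured number of heartbeat messages have failed, can be illustrated with a minimal Python sketch. The class, method names and timing values are hypothetical and do not describe Tait's implementation. The sketch also assumes there is no automatic fail-back once the standby is active.

```python
class StandbyMonitor:
    """Counts consecutive missed heartbeats and promotes the standby."""

    def __init__(self, max_missed: int):
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self.max_missed = max_missed
        self.missed = 0
        self.active = False  # True once the standby has taken over

    def heartbeat_received(self) -> None:
        # Any successful heartbeat resets the consecutive-miss counter.
        self.missed = 0

    def heartbeat_missed(self) -> bool:
        """Record a missed heartbeat; return True if the standby is active."""
        if not self.active:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = True
        return self.active


def approx_takeover_seconds(interval_s: float, max_missed: int) -> float:
    """Rough detection time: one heartbeat interval per required miss."""
    return interval_s * max_missed


if __name__ == "__main__":
    monitor = StandbyMonitor(max_missed=5)  # hypothetical setting
    for _ in range(5):
        took_over = monitor.heartbeat_missed()
    print("standby active:", took_over)
    # Hypothetical 10 s interval and 5 misses give 50 s, inside one minute.
    print(approx_takeover_seconds(10, 5), "seconds")
```

Requiring several consecutive misses is a common trade-off: a higher count avoids switching over on a single lost packet, while a lower count shortens the outage before the standby takes control.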

This article is taken from Connection Magazine, Edition 3. Connection is a collection of educational and thought-leading articles focusing on critical communications, wireless and radio technology.

Share your views, comments and suggestions in the Tait Connection Magazine LinkedIn group.
