The VoIP Problems
Although the benefits of packetized voice outweigh the disadvantages, it should
be noted that these disadvantages have contributed to the slow
adoption of the technology. Among the disadvantages are:
Data networks have had difficulty providing
the Quality of Service (QoS) and reliability that customers expect.
For any given compression algorithm, voice packets take
extra bytes. The IP, UDP, and RTP headers add
bytes not normally needed in TDM voice systems.
To get lower bandwidth, the voice compression
algorithms and echo cancellation require additional processing
power, which makes digital phones more expensive than analog phones.
As a practical matter, VoIP can't be deployed
instantly everywhere. Connecting VoIP networks with the existing
phone networks involves many complexities. These complexities
can be costly and have required standardization.
The presence of certain types of network equipment
may make it difficult to establish VoIP calls. NATs make it
difficult to locate the destination, local IP addresses may
not be useable in the public IP network, and firewalls can
keep voice traffic from getting through the network.
Quality of Service
Quality of Service (QoS) means having a conversation with the quality
of today's public phone system. The factors that go into QoS are:
Low delay (latency) from end-to-end. Long delays
(>400 milliseconds) make conversations difficult.
Accurate delivery of the data. This means the
path is predictable and doesn't have a large variation in
the arrival time of the packets. If there is a large variation,
then packets need to be buffered to avoid the times of not
having data. These buffers add to the end-to-end delay time
that needs to be kept to a minimum.
Low packet loss. Since a packet contains 10 milliseconds
of data, the loss of a packet is a much more severe problem
than the loss of one byte of data when an error in the network occurs.
Having enough bandwidth to send the data. When
data networks are used for transmitting real-time data packets
such as voice, enough bandwidth needs to be available on each
link to ensure that these packets can get through.
Minimizing the errors in the speech coding algorithm.
There are many factors in VoIP systems
that contribute to high latency. As a point of reference, the
current public phone system delay is kept under 10 milliseconds.
When the delay approaches 50 milliseconds, echo cancellers need
to be inserted into the voice path. People tolerate up to 250
milliseconds of delay. Earlier implementations of VoIP had 400
milliseconds of delay, and 77% of users found this delay unacceptable.
Serialization - the time to sample the data
Algorithm computation - the time to examine the
data and compress it
Jitter buffering - variations in arrival time
require that enough data be available until the next packet arrives.
Delay causes two problems: echo and talker
overlap. Echo is caused by the signal being reflected back from
the listener's equipment, and that delayed signal is fed back into
the speaker's ear. Since echo is perceived as a significant quality
problem, Voice over Packet systems must insert echo cancellers
in the path of calls that exceed the 50-millisecond delay.
Talker overlap (or the problem of one talker stepping on the other
talker's speech) becomes significant if the one-way delay becomes
greater than 250 msec. The end-to-end delay budget is therefore
the major constraint and driving requirement for reducing delay
through a packet network.
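The delay budget can be sketched as a simple sum of one-way delay components checked against the two thresholds given above (50 milliseconds for echo cancellation, 250 milliseconds for talker overlap). This is a minimal illustration: the function name is hypothetical, and the per-component figures, including the `network_transit` entry, are assumed values rather than measurements.

```python
# Thresholds taken from the text above (one-way delay, milliseconds).
ECHO_CANCELLER_THRESHOLD_MS = 50
TALKER_OVERLAP_THRESHOLD_MS = 250


def assess_delay(components_ms):
    """Sum one-way delay components and flag the problems they cause."""
    total = sum(components_ms.values())
    return {
        "total_ms": total,
        "needs_echo_canceller": total > ECHO_CANCELLER_THRESHOLD_MS,
        "talker_overlap_likely": total > TALKER_OVERLAP_THRESHOLD_MS,
    }


# Hypothetical figures, for illustration only.
budget = {
    "serialization": 10,
    "algorithm_computation": 15,
    "network_transit": 60,
    "jitter_buffering": 40,
}
print(assess_delay(budget))
```

With these assumed figures the total is 125 milliseconds: high enough to need an echo canceller, but still below the point where talker overlap becomes significant.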
VoIP Packet Overhead
The two conflicting goals of minimizing
delay and removing jitter have required various schemes to adapt
the jitter buffer size to match the time varying requirements
of network jitter removal. This adaptation has the explicit goal
of minimizing the size and delay of the jitter buffer, while at
the same time preventing buffer underflow caused by jitter.
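One possible way to adapt the jitter buffer is sketched below. It is not taken from any particular product. The running jitter estimate uses the smoothing gain of 1/16 defined for RTP in RFC 3550; the class name, the safety factor, and the minimum and maximum depths are assumptions chosen for illustration.

```python
class AdaptiveJitterBuffer:
    """Tracks network jitter and suggests a playout buffer depth."""

    def __init__(self, min_ms=10.0, max_ms=200.0, safety_factor=4.0):
        # All three parameters are illustrative assumptions.
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.safety_factor = safety_factor
        self.jitter_ms = 0.0
        self._last_transit = None

    def on_packet(self, sent_ms, received_ms):
        """Update the jitter estimate from one packet's timestamps."""
        transit = received_ms - sent_ms
        if self._last_transit is not None:
            d = abs(transit - self._last_transit)
            # Smoothed running estimate (gain 1/16, as in RFC 3550).
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self._last_transit = transit
        return self.target_depth_ms()

    def target_depth_ms(self):
        """Buffer depth: a multiple of the jitter, clamped to the bounds."""
        depth = self.safety_factor * self.jitter_ms
        return max(self.min_ms, min(self.max_ms, depth))
```

When the transit time is steady, the estimate decays and the buffer shrinks toward its minimum, which keeps delay low. A sudden change in transit time raises the estimate and deepens the buffer, which guards against underflow.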
Trying to choose the proper number of voice bytes to send in an
IP packet is very difficult because it becomes a tradeoff of network
bandwidth vs. end-to-end delay. In addition to the voice data,
every VoIP packet transmitted has at least 40 extra bytes required
for the IP, UDP, and RTP headers. If the packet size is small
so that there is short voice delay, then the 40-byte header becomes
a significant drain on the network bandwidth.
The effects of keeping the packetization delay to 10ms are shown
in Table 1. For example, using G.729A coding, there are 10
bytes of voice in the 50-byte packet being sent. The resulting
40 kbits/sec of network bandwidth shows how little benefit
remains from the additional complexity of compressing the voice
to 8 kbits/sec.
Table 1. Voice Coding Data Rates
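The arithmetic behind the G.729A example can be reproduced with a short calculation. This is a sketch under the assumptions stated in the text: a fixed 40-byte header per packet and a constant codec rate. The function name is hypothetical.

```python
HEADER_BYTES = 40  # combined per-packet header size quoted in the text


def network_bandwidth_kbps(codec_rate_kbps, packet_ms, header_bytes=HEADER_BYTES):
    """Return (payload_bytes, packet_bytes, on-the-wire kbit/s) for one voice stream."""
    payload_bytes = codec_rate_kbps * packet_ms / 8  # kbit/s * ms = bits
    packet_bytes = payload_bytes + header_bytes
    packets_per_second = 1000 / packet_ms
    kbps = packet_bytes * 8 * packets_per_second / 1000
    return payload_bytes, packet_bytes, kbps


for ms in (10, 20, 40):
    payload, packet, kbps = network_bandwidth_kbps(8, ms)
    print(f"{ms} ms: {payload:.0f} voice bytes in a {packet:.0f}-byte packet -> {kbps:.0f} kbit/s")
```

At 10 ms per packet the 8 kbits/sec codec needs 40 kbits/sec on the network. Doubling the packetization time to 20 ms brings that down to 24 kbits/sec, at the cost of 10 ms more delay, which is the tradeoff described above.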
To help reduce the overall bandwidth,
an algorithm is implemented to detect periods of silence
and avoid sending that data. One of the simplest such
detectors is the Voice Activity Detector (VAD), which detects active
speech, stops sending the input signal during a sufficiently long
silence interval, and substitutes low-bandwidth noise.
Figure 1 illustrates the simplest VAD
mechanism. The input speech signal is compared with some threshold
value. If the amplitude is above the threshold (the red
line), as in the T2 to T3 interval, the data is sent. If it is
below (i.e. in the T1 to T2 interval), then low-volume, low-bandwidth
noise is sent.
Figure 1. Example of how VAD works
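The threshold comparison can be sketched in a few lines. This is a minimal illustration of the simplest mechanism only: the function name, the frame layout, and the threshold value are assumptions, and practical detectors typically add smoothing so that speech is not clipped at word boundaries.

```python
def simple_vad(frames, threshold):
    """Classify each frame as speech or silence by peak amplitude.

    frames: list of frames, each a list of signed PCM sample values.
    Returns a list of ("speech", frame) or ("silence", None) tuples; for a
    silent frame the sender would transmit low-bandwidth noise instead.
    """
    decisions = []
    for frame in frames:
        peak = max((abs(s) for s in frame), default=0)
        if peak > threshold:
            decisions.append(("speech", frame))
        else:
            decisions.append(("silence", None))
    return decisions


# Hypothetical sample values: quiet, loud, quiet.
frames = [[3, -5, 2], [900, -1200, 700], [4, 1, -2]]
labels = [label for label, _ in simple_vad(frames, threshold=100)]
print(labels)
```

Only the middle frame exceeds the threshold, so only that frame's voice data would be sent.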
In addition to VAD, there are several
methods of reducing the size of the header and thus reducing the
bandwidth. Many of the header bytes do not change from one packet
to the next on each individual data stream. Knowing that many
bytes do not change, the Van Jacobson header compression algorithm
is sometimes used to reduce the size of the header.
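The idea that unchanged header bytes need not be resent can be illustrated with a simple delta scheme. This is not the Van Jacobson algorithm itself, only a sketch of the underlying principle: the first header is sent in full and later packets carry only the fields that differ from the previous one. The field names and values are hypothetical.

```python
def compress_headers(headers):
    """Yield the first header in full, then only the fields that changed."""
    previous = None
    for header in headers:
        if previous is None:
            yield dict(header)
        else:
            yield {k: v for k, v in header.items() if previous.get(k) != v}
        previous = header


def decompress_headers(deltas):
    """Rebuild full headers by applying each delta to the previous header."""
    current = {}
    for delta in deltas:
        current = {**current, **delta}
        yield current


# Hypothetical voice stream: addresses and ports stay fixed,
# only the sequence number and timestamp advance.
stream = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "src_port": 5004,
     "dst_port": 5004, "seq": n, "timestamp": n * 80}
    for n in range(3)
]
deltas = list(compress_headers(stream))
print(deltas[1])
```

After the first packet, each delta holds two fields instead of six. The sketch assumes fields are never removed mid-stream and that no packet is lost; a real scheme has to handle both.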
In data networks, there is often a need
to have one external IP address shared by many internal addresses inside
a company. For the home network, a family may buy a single cable
or DSL modem and want that connection to be used by all the computers
in the home. A Network Address Translator (NAT) is a device often
contained within a router that performs that function.
For many applications, the presence of a NAT does not cause a
problem. For VoIP though, the presence of a NAT is problematic.
The problem arises when an external caller tries to
place a call to the IP address assigned to the home. Since this
IP address appears to be a single device, where should the NAT translate
the address (e.g. which of the six computers should receive the call)?
Another network control device that can cause a problem is the firewall
present on many networks. Firewalls are necessary to
protect premises networks from hackers on the outside. One
of the ways a firewall protects the premises network is that
unless a device inside is expecting to receive data from a particular
external address, the firewall will refuse the connection coming
from the outside.
For someone trying to call a phone on the inside of the protected
network, this is a problem. Many of the calls initiated from the
outside are unexpected, and thus the internal phone can't notify
the firewall that this incoming connection should be allowed.
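Both problems can be seen in a toy model of a NAT that also filters like a firewall. This is a simplified sketch, not how any particular device is implemented: the class name, the starting port number, and the addresses (taken from documentation ranges) are assumptions.

```python
class SimpleNat:
    """Toy NAT that also behaves like a stateful firewall."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self._next_port = 40000  # arbitrary starting port (assumption)
        self._by_public_port = {}  # public port -> (private ip, private port, remote)
        self._by_flow = {}         # (private ip, private port, remote) -> public port

    def outbound(self, private_ip, private_port, remote):
        """An inside host sends to `remote`; create or reuse a mapping."""
        flow = (private_ip, private_port, remote)
        if flow not in self._by_flow:
            self._by_flow[flow] = self._next_port
            self._by_public_port[self._next_port] = flow
            self._next_port += 1
        return self.public_ip, self._by_flow[flow]

    def inbound(self, remote, public_port):
        """Deliver an outside packet only if an inside host expects it."""
        flow = self._by_public_port.get(public_port)
        if flow is None or flow[2] != remote:
            return None  # unsolicited: no way to pick an inside host
        return flow[0], flow[1]


nat = SimpleNat("203.0.113.1")
server = ("198.51.100.7", 5060)
print(nat.outbound("192.168.1.10", 5060, server))  # mapping created
print(nat.inbound(server, 40000))                  # reply is delivered
print(nat.inbound(("198.51.100.99", 5060), 5060))  # unexpected call: None
```

A reply to traffic that an inside phone started is delivered, but an incoming call from an address nobody inside has contacted finds no mapping and is dropped.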
The gateway between the VoIP network and the Public Switched Telephone Network
(PSTN) needs to link two very different types of networks together.
Some of the important tasks required of this gateway device are:
Map 10-digit phone numbers to IP addresses (and vice versa).
Translate PSTN network signaling to VoIP messages
(and vice versa).
Translate VoIP compressed voice to PSTN digital/analog
If the call is from the PSTN to the VoIP network,
the gateway needs to establish a connection to the destination
that meets the QoS of the call and ensure that the underlying
data network will support the requested level of service.
Exchange information with the PSTN that meets
the needs of emergency services (e.g. 911), Federal Security
guidelines (e.g. wiretap info), and billing systems (e.g.
Call Detail Records).
There are also physical link mappings. For example,
to take many VoIP calls and put them on a large-capacity PSTN
trunk line, the interface is typically GR-303. In other cases,
it may be a simpler T1 interface or, in its simplest form,
a single analog telephone line.
Data networks work on addresses. In IP
systems, there is often a name-to-address translation system (DNS)
that can assist in taking names (e.g. www.microsoft.com) and translating
them to the IP address (e.g. 18.104.22.168). In going from a
VoIP network to the PSTN, the destination for the call can be
a phone on the PSTN or it may be another VoIP phone that requires
the use of the PSTN to get to the VoIP phone.
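The routing decision a gateway makes can be sketched as a directory lookup. This is an illustration only: the directory contents, the function names, and the rule for stripping a leading country code are assumptions, and a real gateway would consult a provisioning or directory service rather than an in-memory table.

```python
import re

# Hypothetical directory of VoIP subscribers: 10-digit number -> IP address.
VOIP_DIRECTORY = {
    "5551230001": "192.0.2.10",
    "5551230002": "192.0.2.11",
}


def normalize(number):
    """Strip punctuation and a leading country code 1, leaving 10 digits."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit number: {number!r}")
    return digits


def route_call(number, directory=VOIP_DIRECTORY):
    """Return ("voip", ip) for a known VoIP phone, else ("pstn", digits)."""
    digits = normalize(number)
    ip = directory.get(digits)
    return ("voip", ip) if ip else ("pstn", digits)


print(route_call("(555) 123-0001"))
print(route_call("1-555-987-6543"))
```

The first number is in the directory, so the call stays on the IP network. The second is unknown and would be handed to the PSTN.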
Should the call require using the long distance services of the
PSTN, calling information and duration of the call need to be
recorded so that accurate billing records can be kept and provided
back to the caller.
Once the mappings between the PSTN number
and IP address are resolved, the gateway initiates a message requesting
to start a call. As this signaling message goes through the system,
it may need to be rerouted. For example, people subscribe to call
VoIP call progress indicators need to be sent back to the originator.
The special "fast busy" sound indicates the network
is congested and a path to the destination is not currently available.
The destination may already be on a call and a busy signal might
be returned. In another case, the destination may have call waiting
and a false ringing signal needs to be returned until the destination
decides to answer it or perhaps goes to a voice mail system. While
waiting, the caller ID information needs to be sent to the destination.
When the call is terminated, another set of messages needs to
be exchanged to tear down the PSTN connection and the data network
connection. Any Call Detail Records (CDR) need to have the information
for this call logged for billing purposes.
It is also possible to have many combinations of multi-way calling.
Any mixture of PSTN and VoIP phones is possible. In the analog
world, this function can be done by adding all the analog signals
together and sending them back down the line. VoIP bridging can
be more involved when various compression algorithms are used
at each phone.
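Once every leg of the call has been decoded to linear PCM, the digital equivalent of adding the analog signals is a per-sample sum. The sketch below assumes 16-bit samples and equal-length frames, and omits the codec decoding and re-encoding steps that make VoIP bridging more involved. Sending each party a mix of everyone except themselves is a common design choice, assumed here rather than taken from the text.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix(streams):
    """Sum equal-length lists of 16-bit PCM samples, clipping on overflow."""
    mixed = []
    for samples in zip(*streams):
        total = sum(samples)
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


def bridge(decoded_by_party):
    """For each party, mix everyone else's decoded audio (no self-echo)."""
    return {
        party: mix([pcm for other, pcm in decoded_by_party.items() if other != party])
        for party in decoded_by_party
    }


# Hypothetical decoded frames for a three-way call.
parties = {"a": [100, 200], "b": [1000, -1000], "c": [30000, 30000]}
print(bridge(parties))
```

Each output frame would then be re-encoded with the compression algorithm that the receiving phone uses.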
The major approaches in implementing
these functions in a system are to use gateway protocols between
network devices. The major gateway protocols are H.323 (the video
conferencing standard), MGCP (Media Gateway Control Protocol),
and SIP (Session Initiation Protocol).
The most widely embraced standard is the ITU-T's H.323 umbrella
standard. Originally designed as an end-to-end communications
standard for videoconferencing over packet networks, H.323 was
adapted for voice-over-IP applications. The result was a standard
that defines far more functionality than is necessary for most
voice-over-IP environments. Its complexity makes it hard to implement
efficiently and causes interoperability problems, since there are
various ways to interpret the standard.
MGCP is a protocol that addresses control of media gateways, but
it does not, as H.323 does, specify complete end-to-end communication.
MGCP uses simple endpoints called media gateways (MGs). An intelligent
media gateway controller (MGC) or call agent (CA) provides services.
The endpoint provides user interactions and interfaces, while
the MGC provides centralized call intelligence. A master/slave
relationship is preserved at all times between the MGC and the
MGs. In fact, all changes of state are forwarded to the MGC via
a series of relatively simple messages. The MG can then execute
simple actions based on commands from the MGC.
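The division of labor can be sketched with two small classes. The event and command names below are hypothetical strings, not MGCP wire messages, and the four-digit dial plan is an assumption made to keep the example short.

```python
class MediaGateway:
    """Simple endpoint: reports events, executes commands, holds no call logic."""

    def __init__(self, controller):
        self.controller = controller
        self.actions = []  # record of commands executed, for inspection

    def event(self, name, detail=None):
        # Every change of state is forwarded to the controller.
        for command in self.controller.notify(self, name, detail):
            self.actions.append(command)


class MediaGatewayController:
    """Centralized call intelligence, including the dial plan."""

    def __init__(self, number_length=4):
        self.number_length = number_length  # toy dial plan (assumption)
        self.digits = {}

    def notify(self, gateway, name, detail):
        if name == "off-hook":
            self.digits[gateway] = ""
            return ["play-dial-tone"]
        if name == "digit":
            collected = self.digits.get(gateway, "") + detail
            self.digits[gateway] = collected
            if len(collected) == 1:
                return ["stop-dial-tone"]
            if len(collected) == self.number_length:
                return ["connect:" + collected]
        return []


mgc = MediaGatewayController()
mg = MediaGateway(mgc)
mg.event("off-hook")
for digit in "1234":
    mg.event("digit", digit)
print(mg.actions)
```

The gateway never decides anything on its own. Only the controller knows how many digits make a complete number.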
When the user goes off-hook, the gateway notifies the MGC, which
then instructs the MG to play the dial tone. When the user enters
digits (DTMF) to dial a number, each digit is relayed to the MGC
individually because the MG has no concept of a dial plan. It
doesn't know when the user has dialed enough digits to complete
a call. In a sense, the MG becomes a logical extension of the
Typically, MGCP messages are sent over IP/UDP between the MG and
the MGC. The media connection (voice path) itself is usually over
IP/RTP. For security, MGCP uses IPSec to protect the signaling
SIP is an application-layer signaling protocol that specifies
call control for multiparty sessions, IP phone calls or multimedia
distribution. Unlike H.323, which is based on binary encoding,
SIP is a text-based protocol that is much easier to implement.
Much like H.323, SIP is a peer-to-peer architecture (vs. master/slave
SIP depends on relatively intelligent endpoints, which require
little or no interaction with servers. Each endpoint manages its
own signaling, both to the user and to other endpoints. SIP is
more scalable than H.323 because it is inherently a distributed,
stateless call model.
Perhaps the key advantage of SIP is that it is an Internet-model
protocol from inception. It uses simple ASCII messaging based
on HTTP/1.1. This means that SIP messaging is easy to decode and
troubleshoot. Web-type applications can support SIP services with
minimal changes. In fact, SIP fully supports URL (with DNS) naming
in addition to standard E.164 North American Numbering Plan addressing.
That means that in a SIP model, a user's e-mail address and phone
address can be the same. It also means that the session is abstracted
so that very different endpoints can communicate with each other.
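Because SIP messages are plain text, building and reading one takes little code. The sketch below assembles a minimal INVITE with the header fields that RFC 3261 requires in a request and then parses it back. The user names, domains, and IP address are hypothetical, and the message carries no session description body, which a real call setup would normally include.

```python
import uuid


def build_invite(caller, callee, local_host, local_port=5060):
    """Build a minimal, body-less SIP INVITE as CRLF-delimited text."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch prefix
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag={uuid.uuid4().hex[:8]}",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller.split('@')[0]}@{local_host}:{local_port}>",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_message(text):
    """Split a SIP message into its start line and a dict of headers."""
    head = text.split("\r\n\r\n", 1)[0]
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers


message = build_invite("alice@example.com", "bob@example.org", "192.0.2.5")
print(message)
```

The addresses in the To and From fields have the same user@domain shape as e-mail addresses, which is what allows a user's e-mail address and phone address to match.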