Voice over IP (VoIP) Problems

The VoIP Problems

While the benefits of packetized voice outweigh the disadvantages, it should be noted that these disadvantages have contributed to the slow adoption of the technology. Among the disadvantages are:

  • Data networks have had difficulty providing the Quality of Service (QoS) and the reliability that customers expect.

  • Whatever compression algorithm is used, each voice packet carries extra header bytes. The IP, UDP, and RTP headers add bytes that TDM voice systems do not need.

  • Voice compression algorithms (needed to reduce bandwidth) and echo cancellation require additional processing power, which makes digital phones more expensive than analog phones.

  • As a practical matter, VoIP can't be deployed everywhere at once. Connecting VoIP networks to the existing phone networks involves many complexities, which can be costly and have required standardization.

  • Certain types of network equipment can make VoIP calls difficult to establish. Network Address Translators (NATs) make it difficult to locate the destination, local (private) IP addresses may not be usable on the public IP network, and firewalls can keep voice traffic from getting through the network.

Quality of Service

For VoIP, Quality of Service (QoS) means being able to hold a conversation with the quality of today's public phone system. The factors that go into QoS are:

  • Low end-to-end delay (latency). Long delays (over 400 milliseconds) make conversation difficult.

  • Low jitter (predictable delivery). The path should be predictable, without large variation in packet arrival times. If arrival times vary widely, packets must be buffered so that the receiver does not run out of data, and these buffers add to the end-to-end delay, which needs to be kept to a minimum.

  • Low packet loss. Since a packet typically contains 10 milliseconds of speech, losing a packet is a much more severe problem than losing a single byte of data to a network error.

  • Having enough bandwidth to send the data. When data networks are used for transmitting real-time data packets such as voice, enough bandwidth needs to be available on each link to ensure that these packets can get through.

  • Minimizing the distortion introduced by the speech coding algorithm.

Many factors in VoIP systems contribute to high latency. As a point of reference, delay in the current public phone system is kept under 10 milliseconds. When the delay approaches 50 milliseconds, echo cancellers need to be inserted into the voice path. People tolerate up to 250 milliseconds of delay. Earlier VoIP implementations had 400 milliseconds of delay, and 77% found this delay unacceptable.

The delay sources are:

  • Sampling and packetization - the time to sample the voice and collect enough data to fill a packet

  • Algorithm computation - the time to examine the data and compress it

  • Jitter buffering - because arrival times vary, enough data must be held in a buffer to keep playing until the next packet arrives.

Delay causes two problems: echo and talker overlap. Echo occurs when the signal is reflected back from the listener's equipment and the delayed signal is fed back into the speaker's ear. Since echo is perceived as a significant quality problem, Voice over Packet systems must insert echo cancellers in the path of any call whose delay exceeds 50 milliseconds.

Talker overlap (one talker stepping on the other talker's speech) becomes significant when the one-way delay exceeds 250 milliseconds. The end-to-end delay budget is therefore the major constraint and the driving requirement for reducing delay through a packet network.
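As a rough illustration of the delay budget, the sketch below sums a set of one-way delay components and checks the total against the two thresholds given above (50 ms for echo cancellation, 250 ms for talker overlap). The component names and millisecond values are hypothetical. The network transit term is an added assumption, since the list of delay sources above covers only the endpoints.

```python
# Thresholds taken from the text (one-way delay, milliseconds).
ECHO_CANCELLER_MS = 50    # echo cancellers needed beyond this
TALKER_OVERLAP_MS = 250   # talker overlap becomes significant beyond this


def one_way_delay_ms(components: dict[str, float]) -> float:
    """Total one-way delay is the sum of the individual delay sources."""
    return sum(components.values())


def assess(delay_ms: float) -> list[str]:
    """Return the problems the text associates with a given one-way delay."""
    problems = []
    if delay_ms > ECHO_CANCELLER_MS:
        problems.append("echo cancellation required")
    if delay_ms > TALKER_OVERLAP_MS:
        problems.append("talker overlap likely")
    return problems


# Hypothetical budget: every value here is an illustrative assumption.
budget = {
    "sampling": 10,       # time to sample the voice and fill one packet
    "algorithm": 15,      # time to examine and compress the data
    "network": 60,        # assumed transit time across the IP network
    "jitter_buffer": 40,  # playout buffering to absorb arrival variation
}

total = one_way_delay_ms(budget)
print(total, assess(total))
```

With these assumed values the total is 125 ms: echo cancellers are needed, but the call stays inside the 250 ms talker-overlap limit.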

VoIP Packet Overhead

The two conflicting goals of minimizing delay and removing jitter have led to various schemes that adapt the jitter buffer size to the time-varying amount of network jitter. This adaptation explicitly aims to minimize the size and delay of the jitter buffer while still preventing the buffer underflow that jitter would otherwise cause.
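One way to drive such an adaptive buffer, sketched here under stated assumptions, is to track interarrival jitter with the smoothed estimator defined in RFC 3550 (the RTP specification) and size the playout delay as a multiple of it. The safety factor `k`, the clamping limits, and the class and method names are hypothetical choices. Times are in milliseconds rather than RTP timestamp units, and packet loss and reordering are ignored.

```python
class AdaptiveJitterBuffer:
    """Tracks interarrival jitter and derives a playout (buffer) delay."""

    def __init__(self, k: float = 4.0, min_ms: float = 10.0, max_ms: float = 200.0):
        self.k = k            # assumed safety factor applied to the jitter estimate
        self.min_ms = min_ms  # never buffer less than this
        self.max_ms = max_ms  # never buffer more than this (protects the delay budget)
        self.jitter = 0.0     # smoothed jitter estimate, ms
        self._prev = None     # (send_ms, arrival_ms) of the previous packet

    def on_packet(self, send_ms: float, arrival_ms: float) -> None:
        if self._prev is not None:
            prev_send, prev_arrival = self._prev
            # Change in transit time between consecutive packets (RFC 3550 "D").
            d = (arrival_ms - prev_arrival) - (send_ms - prev_send)
            # Exponential smoothing with gain 1/16, as in RFC 3550.
            self.jitter += (abs(d) - self.jitter) / 16
        self._prev = (send_ms, arrival_ms)

    def playout_delay_ms(self) -> float:
        """Target buffer delay: k times the jitter, clamped to [min_ms, max_ms]."""
        return min(self.max_ms, max(self.min_ms, self.k * self.jitter))
```

A larger `k` trades extra delay for fewer underflows; the upper clamp keeps the buffer from consuming the whole delay budget.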

Choosing how many voice bytes to send in each IP packet is difficult because it is a tradeoff between network bandwidth and end-to-end delay. In addition to the voice data, every VoIP packet carries at least 40 bytes of IP (20), UDP (8), and RTP (12) headers. If packets are kept small so that voice delay stays short, the 40-byte header becomes a significant drain on network bandwidth.

The effects of keeping the packetization delay to 10 ms are shown in Table 1. For example, with G.729A coding, each 50-byte packet carries only 10 bytes of voice. The resulting 40 kb/s of network bandwidth shows that little is gained from the extra compression complexity needed to reach the 8 kb/s voice coding rate.

Table 1. Voice Coding Data Rates

Coding           Bit Rate    Network Bandwidth (per second)
G.711 PCM        64 kb/s     96 kb/s
G.723.1 MPMLQ    6.3 kb/s    14.6 kb/s
G.723.1 ACELP    5.3 kb/s    12.3 kb/s
                 32 kb/s     64 kb/s
                 16 kb/s     48 kb/s
G.729A           8 kb/s      40 kb/s
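The arithmetic behind the 10 ms rows of Table 1 can be reproduced with a short calculation, assuming the minimum 40-byte IPv4/UDP/RTP header and ignoring link-layer framing. The function name is illustrative. G.723.1 codes 30 ms frames, so the 10 ms assumption does not apply to its rows.

```python
# Minimum per-packet header: IPv4 (20) + UDP (8) + RTP (12) = 40 bytes.
# Link-layer framing (Ethernet, PPP, ...) is ignored in this sketch.
HEADER_BYTES = 20 + 8 + 12


def network_kbps(codec_kbps: float, packet_ms: float = 10,
                 header_bytes: int = HEADER_BYTES) -> float:
    """IP-level bandwidth of one voice stream, in kbit/s."""
    payload_bytes = codec_kbps * packet_ms / 8  # kbit/s x ms = bits; /8 = bytes
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


for name, rate in [("G.711", 64), ("32 kb/s codec", 32),
                   ("16 kb/s codec", 16), ("G.729A", 8)]:
    print(f"{name:14s} {rate:>3} kb/s -> {network_kbps(rate):.1f} kb/s")
```

Raising the packetization interval trades delay for bandwidth: `network_kbps(8, packet_ms=20)` returns 24.0 instead of 40.0, at the cost of 10 ms of extra delay per packet.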


To reduce the overall bandwidth further, an algorithm can detect periods of silence and stop sending voice data during them. One of the simplest such detectors is the Voice Activity Detector (VAD), which detects active speech, suspends transmission of the input signal after a long enough silence interval, and substitutes low-bandwidth noise.

Figure 1 illustrates the simplest VAD mechanism. The input speech signal is compared with a threshold value (the red line). Where the amplitude exceeds the threshold, as in the interval from T2 to T3, the voice data is sent. Where it is below the threshold (for example, in the interval from T1 to T2), low-volume, low-bandwidth noise is sent instead.


Figure 1. Example of how VAD works
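A minimal sketch of the threshold comparison shown in Figure 1, assuming frames of integer PCM samples. The peak-amplitude test follows the text. The `hangover` counter is an added assumption that implements the "appropriate silence interval" by waiting a few quiet frames before switching to noise; production VADs (for example, the one in G.729 Annex B) use more elaborate features.

```python
def vad(frames: list[list[int]], threshold: int, hangover: int = 2) -> list[bool]:
    """Return True for frames to send as speech, False for frames to replace
    with low-bandwidth noise. `hangover` keeps sending for a few quiet frames
    after speech so that word endings are not clipped."""
    quiet_run = hangover + 1  # start in the silent state
    decisions = []
    for frame in frames:
        peak = max((abs(s) for s in frame), default=0)
        # Reset the quiet counter on speech, otherwise count one more quiet frame.
        quiet_run = 0 if peak > threshold else quiet_run + 1
        decisions.append(quiet_run <= hangover)
    return decisions
```

For example, `vad([[0, 1], [100, -200], [5, 5], [3, 2], [1, 1], [0, 0]], threshold=50)` gives `[False, True, True, True, False, False]`: the two quiet frames right after speech are still sent because of the hangover.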


In addition to VAD, there are several methods of reducing the size of the header, and thus the bandwidth. Many header bytes do not change from one packet to the next within an individual voice stream. Exploiting this, header compression based on the Van Jacobson algorithm is sometimes used; for RTP streams, Compressed RTP (CRTP, RFC 2508) can shrink the 40-byte IP/UDP/RTP header to as little as 2 to 4 bytes.
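The idea of not resending unchanged header fields can be shown with a toy compressor that keeps a shared context on both ends. This is an illustration only, not the Van Jacobson or CRTP wire format; the simplified field set and the class names are hypothetical. Real schemes go further, for example by not sending sequence number and timestamp changes that follow the expected constant increment.

```python
from dataclasses import dataclass, asdict


@dataclass
class Header:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    ssrc: int       # RTP stream identifier: constant for the whole call
    seq: int        # RTP sequence number: changes every packet
    timestamp: int  # RTP timestamp: changes every packet


class Compressor:
    """Toy scheme: the first packet carries the full header, later ones only
    the fields that changed since the previous header (the shared context)."""

    def __init__(self) -> None:
        self.context = None

    def compress(self, hdr: Header) -> dict:
        fields = asdict(hdr)
        if self.context is None:
            msg = {"full": fields}
        else:
            msg = {"delta": {k: v for k, v in fields.items() if self.context[k] != v}}
        self.context = fields
        return msg


class Decompressor:
    def __init__(self) -> None:
        self.context = None

    def decompress(self, msg: dict) -> Header:
        # Rebuild the full header by applying the delta to the saved context.
        if "full" in msg:
            self.context = dict(msg["full"])
        else:
            self.context.update(msg["delta"])
        return Header(**self.context)
```

After the first packet only `seq` and `timestamp` are sent, because addresses, ports, and SSRC do not change within the stream.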


NATs and Firewalls

In data networks, there is often a need for one external IP address to be shared by many devices with internal IP addresses inside a company. In a home network, a family may buy a single cable or DSL modem and want that connection to be used by all the computers in the home. A Network Address Translator (NAT), a device often built into a router, performs this function.

For many applications, a NAT causes no problem. For VoIP, though, it is problematic. The problem arises when a caller outside tries to place a call to the IP address assigned to the home. That single public address is shared by every device in the home, so to which internal address should the NAT forward the call (for example, which of six computers should receive it)?
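The NAT problem can be illustrated with a toy port-mapping table. Mappings are created only when an inside device sends first, so an unsolicited incoming call finds no entry and cannot be delivered. The class, the port numbering starting at 40000, and the addresses (private and documentation ranges) are illustrative assumptions; real NATs also time mappings out and differ in how they assign ports.

```python
from typing import Optional


class Nat:
    """Toy port-mapping NAT: one public IP shared by many private hosts."""

    def __init__(self, public_ip: str, first_port: int = 40000) -> None:
        self.public_ip = public_ip
        self.next_port = first_port
        self.by_public: dict[int, tuple[str, int]] = {}
        self.by_private: dict[tuple[str, int], int] = {}

    def outbound(self, private_ip: str, private_port: int) -> tuple[str, int]:
        """An inside host sends first: create (or reuse) a mapping."""
        key = (private_ip, private_port)
        if key not in self.by_private:
            self.by_private[key] = self.next_port
            self.by_public[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.by_private[key]

    def inbound(self, public_port: int) -> Optional[tuple[str, int]]:
        """Find the inside host for an arriving packet, or None if unknown."""
        return self.by_public.get(public_port)
```

An incoming call to the public address on any port nobody inside has used yet returns `None`: the NAT has no way to choose among the home's devices.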

Another network control device that can cause problems is the firewall found on many networks. Firewalls are necessary to protect premises networks from hackers on the outside. One way a firewall protects the premises network is by refusing connections from the outside unless a device inside is expecting to receive data from that particular external address.

For someone trying to call a phone inside the protected network, this is a problem. Many calls that originate outside are unexpected, so the internal phone cannot tell the firewall in advance that the incoming connection should be allowed.
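The firewall rule described above can be sketched the same way: inbound packets are admitted only from addresses that an inside device has already contacted. The class name and addresses are hypothetical, and real firewalls also track protocols, connection state, and timeouts.

```python
Addr = tuple[str, int]  # (IP address, port)


class StatefulFirewall:
    """Admits inbound traffic only from peers an inside device contacted first."""

    def __init__(self) -> None:
        self.expected: set[tuple[Addr, Addr]] = set()

    def outbound(self, inside: Addr, outside: Addr) -> None:
        """Inside device contacts an outside address: expect its replies."""
        self.expected.add((inside, outside))

    def allow_inbound(self, outside: Addr, inside: Addr) -> bool:
        """Admit a packet only if the inside device is expecting this sender."""
        return (inside, outside) in self.expected
```

An unexpected caller is refused even though the phone would accept the call, while replies from a server the phone has already contacted are admitted.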

PSTN/IP Gateway

The connection between the VoIP network and the Public Switched Telephone Network (PSTN) needs to link two very different types of networks together. Some of the important tasks required of this gateway device are to:

  • Map 10-digit phone numbers to IP addresses (and vice versa).

  • Translate PSTN network signaling to VoIP messages (and vice versa).

  • Convert compressed VoIP voice to PSTN digital or analog voice (and vice versa).

  • For calls from the PSTN to the VoIP network, establish a connection to the destination that meets the call's QoS requirements, and ensure that the underlying data network can support the requested level of service.

  • Exchange the information with the PSTN that is needed by emergency services (e.g. 911), federal security guidelines (e.g. wiretap information), and billing systems (e.g. Call Detail Records).

  • Map physical links. For example, to carry many VoIP calls on a high-capacity PSTN trunk line, the interface is typically GR-303. In other cases, it may be a simpler T1 interface or, in its simplest form, a single analog telephone line.

Address Mapping

Data networks work on addresses. In IP systems, a name-to-address translation system (DNS) often assists by taking names (e.g. www.microsoft.com) and translating them to IP addresses. In going from a VoIP network to the PSTN, the destination for the call can be a phone on the PSTN, or it may be another VoIP phone that can only be reached through the PSTN.
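The gateway's routing decision can be sketched as a lookup: normalize the dialed digits to a 10-digit number, then route to an IP address if the number belongs to a known VoIP phone and to the PSTN otherwise. The directory, its entries, and the function names are hypothetical; a production gateway would consult a database or directory service rather than a hard-coded table.

```python
# Hypothetical directory of numbers served by VoIP phones (documentation
# addresses from 192.0.2.0/24); a real gateway would query a database.
VOIP_DIRECTORY = {
    "5555550100": "192.0.2.10",
    "5555550101": "192.0.2.11",
}


def normalize(number: str) -> str:
    """Reduce a dialed string to a 10-digit number (dropping a leading 1)."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit number: {number!r}")
    return digits


def route(number: str) -> tuple[str, str]:
    """Return ('voip', ip) for a known VoIP phone, else ('pstn', digits)."""
    digits = normalize(number)
    ip = VOIP_DIRECTORY.get(digits)
    return ("voip", ip) if ip is not None else ("pstn", digits)
```

For example, `route("(555) 555-0100")` returns `("voip", "192.0.2.10")`, while `route("1-555-555-0199")` falls through to `("pstn", "5555550199")`.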

Should the call require the long-distance services of the PSTN, the calling information and call duration need to be recorded so that accurate billing records can be kept and provided back to the caller.


Once the mapping between the PSTN number and the IP address is resolved, the gateway sends a message requesting call setup. As this signaling message passes through the system, it may need to be rerouted, for example when the called party subscribes to a call forwarding service.

Call progress indicators need to be sent back to the originator. The special "fast busy" tone indicates that the network is congested and no path to the destination is currently available. If the destination is already on a call, a busy signal may be returned. Alternatively, if the destination has call waiting, a ringback signal needs to be returned to the caller until the called party answers or the call goes to a voice mail system. Meanwhile, the caller ID information needs to be sent to the destination.

When the call ends, another set of messages must be exchanged to tear down both the PSTN connection and the data network connection, and the call's information must be logged in a Call Detail Record (CDR) for billing purposes.
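The call progress and teardown steps above can be summarized as a small state machine that produces a CDR when a connected call ends. The state names, fields, and methods are hypothetical; real signaling protocols each define their own messages. As a simplification, only answered calls produce a record here, and the billable duration runs from answer to hang-up.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    SETUP = auto()
    RINGING = auto()     # ringback returned to the caller
    BUSY = auto()        # destination already on a call
    FAST_BUSY = auto()   # network congested, no path available
    CONNECTED = auto()
    TERMINATED = auto()


@dataclass
class Call:
    caller: str
    callee: str
    state: State = State.SETUP
    answer_time: Optional[float] = None

    def progress(self, path_available: bool, callee_busy: bool,
                 call_waiting: bool = False) -> State:
        """Pick the call progress indication to send back to the caller."""
        if not path_available:
            self.state = State.FAST_BUSY
        elif callee_busy and not call_waiting:
            self.state = State.BUSY
        else:
            self.state = State.RINGING
        return self.state

    def answer(self, t: float) -> None:
        if self.state is not State.RINGING:
            raise RuntimeError("only a ringing call can be answered")
        self.state = State.CONNECTED
        self.answer_time = t

    def hang_up(self, t: float) -> Optional[dict]:
        """Tear the call down; return a CDR if the call was connected."""
        connected = self.state is State.CONNECTED
        self.state = State.TERMINATED
        if not connected:
            return None
        return {"caller": self.caller, "callee": self.callee,
                "duration_s": t - self.answer_time}
```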

Multi-way calling is also possible in many combinations, with any mixture of PSTN and VoIP phones. In the analog world, this can be done by adding all the analog signals together and sending the result back down the line. VoIP bridging is more involved when each phone uses a different compression algorithm, because each stream must be decoded before the signals can be combined.
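Once every stream has been decoded to linear PCM (the per-codec decoding and re-encoding steps are not shown), the bridge's addition step can be sketched as below. Clipping to the 16-bit range is needed because sums can overflow. The "mix-minus" variant, in which each participant hears everyone except themselves, is an added assumption reflecting common conference-bridge practice; the function names are hypothetical.

```python
def mix(streams: list[list[int]]) -> list[int]:
    """Add time-aligned 16-bit linear PCM streams, clipping to the 16-bit range."""
    if not streams:
        return []
    length = min(len(s) for s in streams)
    return [max(-32768, min(32767, sum(s[i] for s in streams)))
            for i in range(length)]


def mix_minus(streams: list[list[int]]) -> list[list[int]]:
    """For each participant, mix everyone else's audio but not their own."""
    return [mix([s for j, s in enumerate(streams) if j != i])
            for i in range(len(streams))]
```

Each output stream would then be re-encoded with the codec of the phone that receives it.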

Gateway Protocols

These functions are implemented by gateway protocols running between network devices. The major gateway protocols are H.323 (the videoconferencing standard), MGCP (Media Gateway Control Protocol), and SIP (Session Initiation Protocol).

The most widely embraced standard is the ITU-T H.323 umbrella standard. Originally designed as an end-to-end communications standard for videoconferencing over packet networks, H.323 was adapted for voice-over-IP applications. The result is a standard that defines far more functionality than most voice-over-IP environments need. Its complexity makes it hard to implement efficiently and causes interoperability problems, since the standard can be interpreted in various ways.

MGCP is a protocol for controlling media gateways, but unlike H.323 it does not specify complete end-to-end communication. MGCP uses simple endpoints called media gateways (MGs), while an intelligent media gateway controller (MGC), also called a call agent (CA), provides the services. The endpoint provides user interactions and interfaces; the MGC provides centralized call intelligence. A master/slave relationship is preserved at all times between the MGC and the MGs: all changes of state are forwarded to the MGC as a series of relatively simple messages, and the MG executes simple actions based on commands from the MGC.

When the user goes off-hook, the gateway notifies the MGC, which then instructs the MG to play the dial tone. When the user enters digits (DTMF) to dial a number, each digit is relayed to the MGC individually because the MG has no concept of a dial plan. It doesn't know when the user has dialed enough digits to complete a call. In a sense, the MG becomes a logical extension of the MGC.
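The master/slave division of labor can be modeled in a few lines: the gateway only reports events and obeys commands, while the controller holds the dial plan and decides when enough digits have been collected. Real MGCP (RFC 3435) exchanges text commands such as NotificationRequest (RQNT) and Notify (NTFY) over UDP; this sketch does not reproduce that wire format. The class and method names and the 10-digit dial-plan rule are hypothetical.

```python
class MediaGateway:
    """Simple endpoint: reports every event and only acts on commands."""

    def __init__(self, controller: "MediaGatewayController") -> None:
        self.controller = controller
        self.tone = None          # tone currently being played, if any
        self.connected_to = None  # number the MGC told us to connect to

    # Events reported to the controller
    def off_hook(self) -> None:
        self.controller.notify(self, ("off_hook", None))

    def press(self, digit: str) -> None:
        self.controller.notify(self, ("digit", digit))

    # Commands received from the controller
    def play(self, tone) -> None:
        self.tone = tone

    def connect(self, number: str) -> None:
        self.tone = None
        self.connected_to = number


class MediaGatewayController:
    """Holds all call intelligence, including the (hypothetical) dial plan."""

    DIGITS_NEEDED = 10  # assumed dial plan: a 10-digit number completes a call

    def __init__(self) -> None:
        self.dialed = {}  # MediaGateway -> digits collected so far

    def notify(self, mg: MediaGateway, event: tuple) -> None:
        kind, value = event
        if kind == "off_hook":
            self.dialed[mg] = ""
            mg.play("dial_tone")
        elif kind == "digit":
            mg.play(None)  # stop the dial tone once dialing starts
            self.dialed[mg] += value
            if len(self.dialed[mg]) == self.DIGITS_NEEDED:
                mg.connect(self.dialed.pop(mg))
```

The gateway never inspects the digits itself; only the controller knows that the tenth digit completes the number.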

Typically, MGCP messages are sent over UDP/IP between the MG and the MGC, while the media connection (voice path) itself usually uses RTP over IP. For security, MGCP relies on IPsec to protect the signaling information.

SIP is an application-layer signaling protocol that specifies call control for multiparty sessions, IP phone calls or multimedia distribution. Unlike H.323, which is based on binary encoding, SIP is a text-based protocol that is much easier to implement. Much like H.323, SIP is a peer-to-peer architecture (vs. master/slave for MGCP).

SIP depends on relatively intelligent endpoints that require little or no interaction with servers. Each endpoint manages its own signaling, both to the user and to other endpoints. SIP is more scalable than H.323 because its call model is inherently distributed and stateless.

Perhaps the key advantage of SIP is that it has been an Internet-model protocol from its inception. It uses simple ASCII messaging modeled on HTTP/1.1, so SIP messages are easy to decode and troubleshoot, and Web-type applications can support SIP services with minimal changes. SIP fully supports URL naming (resolved through DNS) in addition to standard E.164 telephone-number addressing, such as North American Numbering Plan numbers. This means that in a SIP model, a user's e-mail address and phone address can be the same. It also means that the session is abstracted, so that very different endpoints can communicate with each other.
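To show how readable SIP's text messaging is, the sketch below builds a minimal INVITE request with the headers RFC 3261 requires (Via, Max-Forwards, To, From, Call-ID, CSeq, plus Contact for INVITE) and splits it back apart. The function names, addresses, tag, and Call-ID are illustrative. The Via branch value starts with the "z9hG4bK" prefix RFC 3261 requires. Using the caller's address-of-record as Contact is a simplification; a real phone would put its own reachable address there.

```python
def build_invite(caller: str, callee: str, call_id: str, from_tag: str,
                 branch: str, via_host: str, body: str = "") -> str:
    """Build a minimal SIP INVITE request as CRLF-delimited text."""
    headers = [
        ("Via", f"SIP/2.0/UDP {via_host};branch={branch}"),
        ("Max-Forwards", "70"),
        ("To", f"<{callee}>"),
        ("From", f"<{caller}>;tag={from_tag}"),
        ("Call-ID", call_id),
        ("CSeq", "1 INVITE"),
        ("Contact", f"<{caller}>"),  # simplification: normally the device's own address
    ]
    if body:
        headers.append(("Content-Type", "application/sdp"))
    headers.append(("Content-Length", str(len(body.encode("utf-8")))))
    lines = [f"INVITE {callee} SIP/2.0"] + [f"{k}: {v}" for k, v in headers]
    return "\r\n".join(lines) + "\r\n\r\n" + body


def parse(message: str) -> tuple[str, dict[str, str], str]:
    """Split a SIP message into start line, headers, and body."""
    head, _, body = message.partition("\r\n\r\n")
    start, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start, headers, body


msg = build_invite("sip:alice@example.com", "sip:bob@example.net",
                   call_id="a84b4c76e66710@pc33.example.com",
                   from_tag="1928301774", branch="z9hG4bK776asdhds",
                   via_host="pc33.example.com")
print(msg)
```

Note that the caller and callee are named by e-mail-style SIP URIs, which is what allows a user's e-mail and phone addresses to coincide.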

More Information

Additional VoIP seminars:

An Introduction to VoIP - An overview of the VoIP technology, architecture, and the interconnection issues.

Voice Coding Algorithms - A description of the various methods for digitizing speech.

VoIP Applications - The VoIP technology only becomes useful when compelling applications meet the needs of customers. The corporate, cable telephony, and video conferencing applications are examined.


In Summary:

  • Existing data networks have had difficulty providing the quality and reliability customers expect.

  • Packet technology requires additional bytes for packet headers, which increases the bandwidth required.

  • Interconnection with the existing phone networks is complex.

  • Firewalls and NATs make VoIP calls difficult to establish.


2018 NextGen Datacom, Inc.