Cisco MC3810 Multiservice Access Concentrators

Introduction to Packet Voice Networking


Solutions Guide

Introduction to Packet Voice Networking in the Wide Area Network


The telephone is the most pervasive of all technology instruments, particularly in business. Every day, businesses make literally thousands of calls, and though the cost of an individual call is often low, the accumulated cost to business is significant. For many companies, a portion of that cost is avoidable. Traditional public voice telephony is a complex tapestry of tariffs and subsidies, often resulting in situations where calling from "A" to "B" costs a fraction of the rate from "B" to "A." Companies have long relied on private, leased-line networks to bypass public telephone charges, but rates applied to leased lines (especially international lines) are also often high. Many have looked for alternative strategies.

In today's networking, there are several attractive alternatives both to conventional public telephony and to private leased line networks between PBXs. Among the most interesting are networking technologies based on a different kind of voice transmission called packet voice. Packet voice appears to a network as "data"; thus it can be transported over networks normally viewed as data networks, where costs are often far less and multiservice applications are easily enabled.

Integrating voice calls with "data" technology is also becoming increasingly important for business productivity as well as E-commerce and Contact Center applications where employee interaction or customer reach is extended not only via a phone number but also via multimedia web-based and unified messaging applications. Many integrated voice and data—or multiservice—applications become imperative to business competitiveness and productivity. Deploying these applications requires a "data" network infrastructure that is ready to carry multiservice traffic.

Packet voice uses less transmission bandwidth than conventional voice, so more calls can be carried on a given connection. Where telephony requires as much as 64,000 bits per second (bps), packet voice needs as little as 10,000 to 15,000 bps. For many companies, there is sufficient reserve capacity on their campus data networks to transport considerable voice traffic, making voice essentially "free." On national and international lines, a modest increase in the current data transport capacity can often accommodate the voice traffic sent today over a dedicated voice leased line.

Like all good things, packet voice has a price. While network designers are familiar with quality of service (QoS) requirements for specialized data applications like online transaction processing, packet voice often has more stringent QoS requirements. If the network is not properly conditioned to meet these requirements, the quality of the speech may be impacted. This is particularly true if voice is carried on public data networks like the Internet, where voice users have few options for securing end-to-end quality of service. However, on private networks, QoS can be tightly controlled and toll quality voice can be achieved with appropriate network design techniques.

Packet voice technologies have been deployed in both enterprise and service provider networks since 1998, and there are a number of books and web resources available on these technologies. In this paper, Cisco Systems provides a short, "no-nonsense" introductory exploration of packet voice technology. A good, in-depth resource is "Voice over IP Fundamentals" by Jonathan Davidson and James Peters (ISBN 1578701686).


Many packet voice systems follow a common model, as shown in Figure 1. The packet voice transport network, which may be IP-based, Frame Relay, or Asynchronous Transfer Mode (ATM), forms the traditional WAN "cloud." At the edges of this network are devices or components that can be called "voice gateways". It is the responsibility of these devices to change the voice information from its traditional telephony form to a form suitable for packet transmission. The network then forwards the packet data to a voice gateway serving the destination or called party.

Figure 1: Packet Voice Model

The voice gateway connection model in Figure 1 shows that there are three issues in packet voice networking that must be explored to ensure that packet voice services will meet user needs. The first issue is that of voice coding—how voice information is transformed into packets, and how the packets are used to recreate the voice. Another issue is the signaling associated with identifying who the calling party is trying to reach and where the called party is in the network. The last is how the packet network will provide for the transmission of the voice packets such that acceptable voice quality is achieved. We will explore these issues further in the sections that follow.

Voice Coding

Human speech, and in fact everything we hear, is naturally in analog form, and early telephone systems were likewise. Analog signals are often depicted as smooth "sine waves," but voice and other signals contain many frequencies and have more complex structures.

While humans are well equipped for analog communications, analog transmission is not particularly efficient. When analog signals become weak because of transmission loss, it's hard to separate the complex analog structure from the structure of random transmission noise. Amplifying analog signals also amplifies noise, and eventually analog connections become too noisy to use.

Digital signals, having only "one-bit" and "zero-bit" states, are more easily separated from noise and can be regenerated without corruption. Over time, it became obvious that digital coding was more immune to noise corruption on long-distance connections, and the world's communications systems converted to a digital transmission format called Pulse Code Modulation or PCM. PCM converts voice into digital form by sampling the voice signals 8000 times per second and converting each sample into a code. The sampling rate of 8000 times per second (125 microseconds between samples) was chosen because virtually all of human speech intelligence is carried at frequencies below 4000 Hz (4 kHz). Sampling the voice waveform every 125 microseconds is sufficient to capture frequencies below 4 kHz, because a signal can be reconstructed from samples taken at a rate of at least twice its highest frequency.

After the waveform is sampled, the samples are converted into digital form, with a code representing the amplitude of the waveform at the time the sample was taken. Standard telephone PCM uses 8 bits for the code and thus consumes 64,000 bps per call. Another telephone voice standard called Adaptive Differential PCM or ADPCM codes voice into 4-bit values and so consumes only 32,000 bps. ADPCM is often used on long-distance connections.
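The bit rates quoted above follow directly from the sampling rate and the code size. The short Python sketch below works through that arithmetic; the function and constant names are illustrative only and do not come from any standard.

```python
# Illustrative arithmetic only: bit rate = samples per second x bits per sample.
SAMPLE_RATE_HZ = 8000  # telephone-quality sampling rate described in the text


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Return the constant bit rate, in bps, of a fixed-sampling voice coder."""
    return sample_rate_hz * bits_per_sample


# Time between successive samples, in microseconds.
sample_interval_us = 1_000_000 / SAMPLE_RATE_HZ

pcm_bps = pcm_bit_rate(SAMPLE_RATE_HZ, 8)    # standard 8-bit PCM
adpcm_bps = pcm_bit_rate(SAMPLE_RATE_HZ, 4)  # 4-bit ADPCM

print(sample_interval_us, pcm_bps, adpcm_bps)
```

Eight-bit codes at 8000 samples per second give the 64,000 bps figure, and four-bit codes give 32,000 bps.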

In traditional telephony applications, PCM or ADPCM is used on synchronous digital channels, which means that there is a constant stream of bits generated at the specified rate, whether there is conversation or not. There are, in fact, hundreds of brief silent periods in the average call, and each of them wastes bandwidth and money. On standard telephone connections, there is no alternative to this waste.

There is an alternative if packet voice transport is used. In packet voice applications, speech is transported as "data" packets, and these packets are generated only when there is actual speech to transport. The elimination of wasted bandwidth during periods of silence will, by itself, reduce the effective bandwidth required for speech transport by approximately one-third.
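As a rough illustration of the silence-suppression saving, the Python sketch below applies the approximate one-third reduction quoted above to a coded bit rate. The one-third figure is taken from the text as an average; the actual saving on any given call depends on the conversation, and the names used here are hypothetical.

```python
# Approximate fraction of bandwidth recovered by not sending silence,
# as quoted in the text. Real calls will vary around this figure.
SILENCE_SAVINGS = 1 / 3


def effective_rate_bps(coded_rate_bps: float,
                       savings: float = SILENCE_SAVINGS) -> float:
    """Average bandwidth actually consumed once silent periods are not sent."""
    return coded_rate_bps * (1.0 - savings)


for rate in (64_000, 32_000, 8_000):
    print(rate, round(effective_rate_bps(rate)))
```

Under this assumption a 64,000 bps stream averages roughly 42,700 bps, and an 8,000 bps stream roughly 5,300 bps, before any packet header overhead is counted.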

Voice Coding Standards

Other strategies can reduce bandwidth requirements even further. The International Telecommunication Union (ITU) has defined a series of standards for voice coding, including the 64 and 32 kbps PCM and ADPCM already briefly discussed. Anyone considering packet voice transmission should be aware of the characteristics of each of the voice coding strategies these standards cover.

The first set of these standards is the "fixed-sampling" standards, belonging to the G.711 family. These standards use the 8000-samples-per-second strategy for voice coding described earlier. For each sample, the voice coder stores the amplitude of the voice signal at that point. The result of the sampling is a rough, block-like representation of the original voice signal, as Figure 2 shows. The samples can be used (by smoothing) to reconstruct the analog voice signal at the other end of the call.

Figure 2: Pulse Code Modulation

The problem with the sampling strategies is that to reduce the bandwidth utilized for transport of the digital speech, it is necessary to code the voice signals into fewer bits. Using 8 bits for a sample allows recognizing 256 different levels of amplitude. To reduce bandwidth to 32 kbps, only 4 bits (16 values) are used, and the bit value represents the change from the prior value (this is the "differential" in ADPCM). ADPCM can be taken to 16 kbps by using only 2 bits (4 values), but each time the number of different voice amplitude values is reduced, the block representation of voice created is more unlike the original signal, and voice quality is degraded.
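The effect of using fewer bits can be seen with a simple experiment. The Python sketch below quantizes a test tone with 8, 4, and 2 bits and measures the average error. It is a simplification under stated assumptions: it uses evenly spaced levels, whereas real G.711 coders use logarithmic companding and ADPCM codes differences rather than absolute amplitudes. The 440 Hz tone and all names are arbitrary choices made for illustration.

```python
import math


def quantize(sample: float, bits: int) -> float:
    """Round a sample in [-1.0, 1.0] to the nearest of 2**bits evenly
    spaced levels (a uniform quantizer, assumed here for simplicity)."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)          # spacing between adjacent levels
    index = round((sample + 1.0) / step)
    return index * step - 1.0


def mean_abs_error(bits: int, n: int = 800) -> float:
    """Average quantization error for a 440 Hz tone sampled at 8000 Hz."""
    tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(n)]
    return sum(abs(s - quantize(s, bits)) for s in tone) / n


for bits in (8, 4, 2):
    print(bits, 2 ** bits, mean_abs_error(bits))
```

Each reduction in code size leaves fewer levels and a larger average error, which is the loss of fidelity described above.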

A second group of standards provides better voice compression and at the same time better quality. In these standards, the voice coding uses a special algorithm, called linear predictive coding (LPC), that models the way human speech actually works. Because LPC can take advantage of an understanding of the speech process, it can be much more efficient without sacrificing voice quality. Most LPC devices take as input the 64 kbps PCM discussed earlier for two reasons:

  • This form of voice is the standard output of digital PBXs and telephone switches.

  • PCM coding chips are inexpensive because of their broad usage in telephone networks.

Both LPC and PCM/ADPCM coding of voice information are standardized by the ITU in its G-series recommendations. The most popular voice coding standards for telephony and packet voice include the following:

  • G.711, which describes the 64 kbps PCM voice coding technique outlined earlier. G.711-encoded voice is already in the correct format for digital voice delivery in the public phone network or through PBXs.

  • G.726, which describes ADPCM coding at 40, 32, 24, and 16 kbps. ADPCM voice may also be interchanged between packet voice and public phone or PBX networks, providing the latter has ADPCM capability.

  • G.728, which describes code-excited linear-predictive (CELP) voice compression, requiring only 16 kbps of bandwidth. CELP voice coding must be transcoded to a public telephony format for delivery to or through telephone networks.

  • G.729, which describes conjugate-structure algebraic CELP (CS-ACELP) compression that enables voice to be coded into 8 kbps streams. There are four forms of this standard, and all provide speech quality as good as that of 32 kbps ADPCM.

  • G.723.1, which describes a coded representation that can be used for compressing speech or other audio signal component of multimedia services at a very low bit rate as part of the overall H.324 family of standards. This coder has two bit rates associated with it—5.3 and 6.3 kbps. The higher bit rate has greater quality; the lower bit rate gives good quality and provides system designers with additional flexibility.
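To compare the coders listed above, the Python sketch below tabulates their bit rates as given in this section and computes the compression ratio relative to 64 kbps PCM, along with the voice payload carried per packet. The 20 ms packetization interval is an assumption made for illustration, not something the standards in the list mandate, and the figures exclude all packet header overhead.

```python
PCM_RATE_KBPS = 64.0

# Bit rates as listed in this section (kbps).
CODEC_RATES_KBPS = {
    "G.711 PCM": 64.0,
    "G.726 ADPCM (32K)": 32.0,
    "G.728 LD-CELP": 16.0,
    "G.729 CS-ACELP": 8.0,
    "G.723.1 (6.3K)": 6.3,
    "G.723.1 (5.3K)": 5.3,
}


def compression_ratio(rate_kbps: float) -> float:
    """How many times smaller the stream is than 64 kbps PCM."""
    return PCM_RATE_KBPS / rate_kbps


def payload_bytes(rate_kbps: float, interval_ms: float) -> float:
    """Coded voice bytes produced in one packetization interval.
    kbps x ms gives bits; dividing by 8 gives bytes."""
    return rate_kbps * interval_ms / 8.0


for name, rate in CODEC_RATES_KBPS.items():
    print(f"{name:20s} {rate:5.1f} kbps  ratio {compression_ratio(rate):5.2f}")
```

With the assumed 20 ms interval, `payload_bytes(64.0, 20)` gives 160 bytes for G.711 and `payload_bytes(8.0, 20)` gives 20 bytes for G.729.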

Compression Quality

One might wonder why compressed voice isn't simply a standard concept; why not compress? The answer is that compression can only approximate the analog waveform. While the approximation may be very good for some standards, such as G.729, other standards will suffer somewhat from the compression approximation distortion, particularly if the voice is coded to digital form, recoded to analog, and then coded back to digital again. This practice of "tandem coding" should be avoided wherever possible in any compressed voice system.

The voice quality of a compression strategy has traditionally been measured by survey; the Mean Opinion Score (MOS) was the first commonly available measurement. On the MOS scale, where one is poor quality and five is excellent, standard PCM has a quality of about 4.4. G.726 ADPCM is rated at 4.2 for the 32 kbps version. G.728 CELP coding achieves a rating of 4.2, and G.729 a score of 4.2. MOS results are not standardized across surveys; they depend on the particular survey cited, as well as the language and gender mix of the participants. As these figures show, modern linear-predictive model voice coders often have better quality ratings than the older, sampling-based coders.

A newer, more objective measurement has become available and is quickly overtaking MOS scores as the industry quality measurement of choice for coding algorithms. Perceptual Speech Quality Measurement (PSQM), as per ITU standard P.861, also provides for a rating on a scale of zero to five, but here a rating closer to zero is better and five is the worst. Various vendors' test equipment is now capable of providing a PSQM score for a test voice call over a particular packet network.


Another factor, delay, may have a greater impact on compressed voice than the coding algorithm. Compressing voice for packet transport induces a delay; the table below shows the average delay associated with each of the popular coding standards described earlier. As shown in the table, the delay associated with voice code/decode can be as high as 25 ms for two CS-ACELP voice frames (an initial 5 ms of look-ahead plus 20 ms for the two 10 ms, 10-byte frames). This delay by itself does not affect speech quality, although it may increase the need for "echo cancellation" so that an objectionable ringing or reverberation effect is not exacerbated. Most voice compression devices for packet voice include some form of echo cancellation. However, other delay sources in the network may add to this base coding delay to induce an end-to-end delay sufficient to interfere with speech.

Compression Method        MOS Score    Delay (msec)
PCM (G.711)
32K ADPCM (G.726)
16K LD-CELP (G.728)
8K CS-ACELP (G.729)
8K CS-ACELP (G.729a)
6.3K MP-MLQ (G.723.1)
5.3K ACELP (G.723.1)
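The 25 ms figure quoted for CS-ACELP can be reproduced from its parts. The Python sketch below adds the look-ahead to the time needed to collect the frames carried in one packet. The 10 ms frame length and 5 ms look-ahead are the G.729 values referred to in the text; the function name is illustrative.

```python
def coding_delay_ms(frame_ms: float, frames_per_packet: int,
                    lookahead_ms: float) -> float:
    """Delay before a packet's worth of coded voice is available:
    the coder's look-ahead plus the time spanned by the frames it packs."""
    return lookahead_ms + frame_ms * frames_per_packet


# Two 10 ms CS-ACELP frames per packet with a 5 ms look-ahead.
g729_two_frames = coding_delay_ms(frame_ms=10, frames_per_packet=2,
                                  lookahead_ms=5)
print(g729_two_frames)
```

Packing more frames into each packet reduces header overhead but raises this base delay by one frame time per added frame.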



There are two sources of delay in both traditional telephone voice networks and packet voice networks: propagation delay and handling delay. The former is imposed by the finite speed of light in fiber or microwave networks, or of electrical signals in copper networks. The latter is caused by handling of the voice by devices along the route.

Light travels 186,000 miles per second in a vacuum, and electrical signals travel about 100,000 miles per second in copper. A microwave or fiber network halfway around the world would span about 13,000 miles and induce a one-way delay of about 70 milliseconds. This level of delay is barely perceptible and almost never a problem.
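The 70-millisecond figure is simply distance divided by propagation speed. The Python sketch below repeats the calculation with the speeds quoted above; the constant names are illustrative, and the calculation ignores the somewhat lower speed of light in real fiber.

```python
LIGHT_MILES_PER_SEC = 186_000    # speed quoted in the text for a vacuum
COPPER_MILES_PER_SEC = 100_000   # approximate figure quoted for copper


def propagation_delay_ms(distance_miles: float,
                         speed_miles_per_sec: float) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_miles / speed_miles_per_sec * 1000.0


half_world_miles = 13_000
print(round(propagation_delay_ms(half_world_miles, LIGHT_MILES_PER_SEC), 1))
print(round(propagation_delay_ms(half_world_miles, COPPER_MILES_PER_SEC), 1))
```

At the vacuum speed of light the one-way delay over 13,000 miles is just under 70 ms; at the copper figure it would be about 130 ms.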

Handling delays may impact traditional voice networks. Each T1/E1/J1 frame requires 125 microseconds to assemble in a switch and route to the destination line, assuming that each frame is being sent at its native speed (1.544 or 2.048 Mbps). This "transmission delay" accumulates as the frames are handled through the voice network; the total transmission delay can grow to 20 or more milliseconds on transcontinental links.
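The 125-microsecond frame time can be checked against the line rates. The Python sketch below assumes standard framing, which is not spelled out in the text: a T1 frame carries 24 eight-bit channels plus one framing bit (193 bits), and an E1 frame carries 32 eight-bit timeslots (256 bits).

```python
def frame_time_us(bits_per_frame: int, line_rate_bps: int) -> float:
    """Time to clock one frame onto the line, in microseconds."""
    return bits_per_frame * 1_000_000 / line_rate_bps


T1_FRAME_BITS = 24 * 8 + 1   # 24 channels plus one framing bit (assumed standard framing)
E1_FRAME_BITS = 32 * 8       # 32 timeslots (assumed standard framing)

t1_us = frame_time_us(T1_FRAME_BITS, 1_544_000)
e1_us = frame_time_us(E1_FRAME_BITS, 2_048_000)
print(t1_us, e1_us)
```

Both work out to 125 microseconds, which matches the interval between PCM samples: each frame carries one sample for every channel.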

It is in handling delay that packet voice and traditional voice networks begin to differ. Handling delays in data networks can be considerable, particularly when networks become congested and traffic must be queued for transmission on busy trunk lines. When added to propagation delay, these factors can create a one-way delay exceeding 200 ms, which is perceptible to most listeners although still not objectionable. The Internet sometimes experiences end-to-end delays approaching one second on international routes. Where delays of this magnitude are possible, conversations may depend on a formal structure of "you-talk, I-talk" to ensure that both parties don't take advantage of a delay-induced "silent" period to begin talking at the same time. Private networks (both enterprise and service provider networks), however, are engineered to tightly control delays that impair voice quality, using the QoS measures discussed in more detail in a later section.

The reason delay in data networks can cause voice quality problems is that voice information has a characteristic "timing." A particular syllable of a word is uttered with an interval of time between it and the following syllable. This tiny pause is as much a part of speech as the verbalized parts, and its timing must be preserved. In traditional voice networks, the voice channel is a synchronized bit stream that preserves timing of all speech elements precisely. In data networks, variable delay can be inserted by congestion or handling and can distort speech.

It should be clear from these comments that delay is a problem in two ways. Delay in an absolute sense can interfere with the natural rhythm of human conversation, the give and take of inquiry and reply. Delay variations (called "jitter") can create unexpected pauses between utterances that may impact the intelligibility of the speech itself. Both fixed delay and jitter are problems that packet voice networks must address.

Eliminating jitter in a network with variable handling and congestion is a matter of holdback. Voice applications monitor the average delay of a network, and hold back at the destination voice gateway enough compressed voice data to equal the average jitter. This feature ensures voice packets are released to be converted to real analog voice at a constant rate, regardless of the variation in network delay. Holdback (also called "dejitter buffering") creates greater absolute delay, of course, and networks with significant delay may thus have a total delay large enough to be perceptible to the parties of the conversation.

When dejitter buffering is used to control jitter, it is often necessary to provide a time stamp on each voice packet to ensure that it is released to the listener with the same timing relative to other voice elements that was found in the input voice signal. In IP transport of voice, for example, this time stamp is provided by the Real-time Transport Protocol (RTP).
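The holdback idea can be sketched in a few lines of Python. The example below is a simplified model under stated assumptions, not the algorithm of any particular gateway: the holdback is fixed rather than adapted to measured delay, packets arrive in order, and a packet that arrives after its scheduled release time is simply discarded. All names and sample values are hypothetical.

```python
def playout_schedule(packets, holdback_ms):
    """Compute release times for voice packets at the destination gateway.

    packets: list of (timestamp_ms, arrival_ms) tuples in send order, where
             timestamp_ms is the sender's time stamp carried in the packet.
    Returns a list of (timestamp_ms, playout_ms) for packets played out.
    """
    first_ts, first_arrival = packets[0]
    # The first packet is held back; later ones keep the sender's spacing.
    base = first_arrival + holdback_ms
    schedule = []
    for ts, arrival in packets:
        playout = base + (ts - first_ts)
        if arrival <= playout:
            schedule.append((ts, playout))
        # Otherwise the packet missed its slot and is dropped.
    return schedule


# Packets sent every 20 ms; network delays of 50, 65, 52 and 80 ms.
arrivals = [(0, 50), (20, 85), (40, 92), (60, 140)]
print(playout_schedule(arrivals, holdback_ms=20))
print(playout_schedule(arrivals, holdback_ms=40))
```

With a 20 ms holdback the last packet arrives too late and is lost; with 40 ms all four are released at a steady 20 ms spacing, at the cost of 20 ms more absolute delay. This is the tradeoff between jitter tolerance and total delay described above.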

Anything in the network that impacts delay also impacts voice quality, and this fact is critical to the design of packet voice networks. For example, in packet voice applications it is normally better to risk a lost or corrupted voice packet than to introduce an error recovery strategy that would increase jitter. This is why packet voice protocols are almost never provided with any form of error recovery. However, voice quality is also sensitive to packet loss and this aspect of the network must be tightly bounded to provide good quality voice.

In summary, packet voice coding improves network economics in two ways: first by reducing the bandwidth consumed by voice traffic, and second by eliminating silent periods. To take advantage of these benefits, the underlying transport network must be able to support small-bandwidth traffic streams and to interleave other traffic into silent periods in the voice calls, recovering the idle bandwidth that packet voice transport frees up. The facilities provided to ensure these capabilities vary depending on the type of network.

Packet Voice Transport Options

From the previous section on voice compression, it is clear that "tandem codings," delay, and packet loss are the most serious problems in packet voice networks. While all packet voice networks must address these issues, the sensitivity of the packet voice network options to each issue is different, and packet voice application design must consider the transport technology before deciding on how to deal with specific delay or loss concerns.

Packet voice can be transported over any of the following wide-area network connection types:

  • Leased-line networks. These networks are often based on T1/E1/J1 trunks leased from carriers, providing fixed synchronous bandwidth. Protocol encapsulations such as HDLC, PPP and MLPPP are typically used on leased lines.

  • ATM constant-bit-rate (CBR) or circuit emulation connections. These connections emulate circuit-switched network connections and are sometimes called ATM Class A services.

  • ATM connections based on what in ATM are called Variable Bit Rate (VBR), Available Bit Rate (ABR), or Unspecified Bit Rate (UBR) classes of service.

  • Frame Relay networks, both those provided by public service providers and those built by corporations as private networks. These provide public data services in many international locations and are also widely used as a network access technology from smaller branch office locations in the U.S. and Europe.

  • Public IP networks, including the Internet.

These many choices, fortunately, can be grouped into broad classes for voice application consideration.

Frame/Cell Networks

Frame/cell networks, using Frame Relay or ATM variable bit-rate services, transport packetized voice in compressed, coded form. For these networks, a voice gateway is necessary to code the voice into cells or frames for transport and decode these cells or frames at the destination. The voice gateway must also understand any telephone signaling used by the voice source and destination to receive the called party's number and deliver call progress signals. Finally, the voice gateway must understand the signaling or addressing needed within the frame/cell network cloud to reach the various destination voice gateways. This capability is important when translating between a traditional voice network and a frame/cell network.

ATM networks typically have low overall jitter and VBR real-time service from the network ensures that voice packets are transported efficiently within the backbone to minimize delay and jitter. Voice over ATM networks come in various flavors:

  • AAL5-based—packetized voice using ATM AAL5-type cells. These are pre-standard Cisco solutions and can be deployed using separate voice and data virtual circuits or mixed voice and data traffic on the same virtual circuit. This method of Voice over ATM is most often used for "toll bypass" enterprise networks.

  • AAL2-based—the ATM Forum standardized Voice over ATM as using AAL2-based cells. These solutions are standards based and most often are deployed for PSTN access where voice and data use separate virtual circuits, as data traffic typically does not use AAL2 cells.

  • Circuit-emulation services—this is a mode of using ATM to provide trunking between two points, and typically uses AAL1 cells with CBR service. The bandwidth savings typical of packetized voice solutions are not realized over circuit-emulation services, but high quality voice and video can be transported across an ATM network in this fashion.

Frame Relay networks may have higher delay and jitter than ATM, but are eminently workable for voice applications. Unlike ATM, Frame Relay is often deployed as a low-speed (from about 56 kbps to 512 kbps) remote access technology, which brings its own challenges in terms of preserving voice quality over these links. Key in Frame Relay is the contracted data rate (called Committed Information Rate, or CIR) purchased for the link. To ensure proper voice quality, traffic shaping to CIR must be strictly adhered to.

For Voice over ATM and Voice over Frame Relay networks, the voice gateway must be attached to the telephony equipment on one side as well as to the ATM or Frame Relay network on the other side. For this reason, these technologies are WAN (Wide Area Network) voice technologies and usually do not fit well in campus or LAN environments.

IP Networks

With connectionless data networks such as IP intranets and the Internet, the same voice coding and addressing issues apply as stated for the frame/cell networks. With this type of network, however, there is normally no network-guaranteed level of delay or jitter, and it is often necessary to take special steps to ensure that the delay "budget" for the network is kept within reasonable bounds. For example, high-level protocols like Transmission Control Protocol (TCP) provide flow control and error recovery that combine to create significant jitter. For this reason, TCP is not used as a Layer 4 voice transport protocol.

Instead, voice traffic is carried within IP's User Datagram Protocol (UDP). Unfortunately, UDP provides no time stamp to control output timing, and even small amounts of jitter can interfere with speech comprehension. To prevent this problem, most IP-based voice standards call for voice packet transport using the Real-time Transport Protocol (RTP), which resides on top of UDP. RTP provides time-stamp services, and also (via the RTP Control Protocol, or RTCP) allows for the establishment of point-to-multipoint voice connections, a feature rarely available with other packet voice transport options.
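To make the time-stamp service concrete, the Python sketch below builds the 12-byte fixed RTP header that precedes each block of coded voice inside a UDP datagram. It is a minimal illustration under stated assumptions: no padding, no header extension, and no contributing-source list. The function names and the sample SSRC value are hypothetical; payload type 18 is the static assignment for G.729 in the RTP audio/video profile, and the time stamp counts ticks of the 8000 Hz sampling clock.

```python
import struct

RTP_VERSION = 2
PT_G729 = 18      # static RTP payload type for G.729
CLOCK_HZ = 8000   # RTP time-stamp clock for telephone-band audio


def timestamp_ticks(elapsed_ms: int) -> int:
    """Convert elapsed media time to RTP time-stamp ticks."""
    return elapsed_ms * CLOCK_HZ // 1000


def build_rtp_header(seq: int, timestamp: int, ssrc: int,
                     payload_type: int = PT_G729,
                     marker: bool = False) -> bytes:
    """Pack the fixed RTP header in network byte order."""
    byte0 = RTP_VERSION << 6  # padding=0, extension=0, CSRC count=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)


# Second packet of a stream carrying 20 ms of voice per packet.
header = build_rtp_header(seq=1, timestamp=timestamp_ticks(20),
                          ssrc=0x1234ABCD)
print(header.hex())
```

The receiving gateway uses the sequence number to detect loss and the time stamp to schedule playout, independent of when each datagram happened to arrive.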

IP is a Layer 3 protocol, so every IP network rides on an underlying Layer 2 network. This network may be a Frame/Cell network as discussed above, or it may be based on leased line technology (using protocols such as HDLC or PPP for packet transmission), or it may be an Ethernet-based network (IEEE 802.3).


In most packet voice applications, delay plays a critical role in voice quality. With LPC voice coding, such as that provided in G.729, packet voice quality in a low-delay environment equals the standard toll-quality voice experienced in public telephony. Maintaining the low delay is the critical problem.

Frame Relay and ATM networks are designed to deliver the lowest practicable transport delay; special delay management measures are rarely required here except where the user's voice gateway attaches to the network. Utilization on these connections should be maintained below 70 percent to ensure that delay is manageable; 50 percent and lower is even better. Often this means selecting an access line speed higher than would normally be selected for data-only applications.

In connectionless networks like IP, delay can be managed in a variety of ways that will be discussed in a later section on QoS. If network bandwidth is sufficient to carry traffic at all times, then special measures are only needed on the typically slow access links into the backbone of the network. However, if points within the network are vulnerable to congestion, special QoS measures should be deployed there to prioritize voice packets over data packets and to selectively drop packets if packet loss is necessary to resolve the congestion.

In general, utilization levels above 70 percent result in sharp increases in congestion delay for small increases in traffic, so high-utilization networks are more likely to experience speech quality problems. However, network economy depends in large part on securing high levels of utilization, so router-based delay management strategies (QoS) are preferable.
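The sharp rise in delay at high utilization can be illustrated with a textbook queueing model. The Python sketch below uses the M/M/1 single-queue formula, in which average time in the system equals the service time divided by one minus the utilization. This model is an assumption introduced here for illustration; real router queues and traffic patterns differ, so the numbers indicate the shape of the curve rather than predicted delays.

```python
def delay_multiplier(utilization: float) -> float:
    """Average time in an M/M/1 queue as a multiple of the bare
    transmission (service) time: 1 / (1 - utilization)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
    print(f"{rho:.0%} utilization -> {delay_multiplier(rho):.1f}x service time")
```

Under this model, moving from 50 to 70 percent utilization raises average delay from about 2 to about 3.3 times the service time, while moving from 70 to 90 percent triples it again. This is why prioritizing voice ahead of data in the queue matters most on heavily used links.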

Which strategy is best for packet voice transport? Usually it is one that a business already has in use to transport data, but all forms of public frame and cell networking should be examined and compared for price and service quality. A compelling issue is often the pricing policies of public services. In many countries, public data services like Frame Relay are priced without a distance-sensitive component. For long-haul connections, these services may be significantly more economical than voice telephony or leased lines, both of which are usually priced based partly on the distance between connection points.

Public network tariffs may also subsidize residential services by pricing business calls and international calls higher. These subsidies create an artificial economy for any form of packet voice, and despite the fact that the value is created by pricing policy and not network technology, the savings are just as real. However, national administrations may view attempts to evade these higher rates as contrary to public policy, and networks that do this may be illegal in some countries. Issues relating to the legality of packet voice must be examined on a country-by-country basis.

Signaling: Making the Voice Connection

In many ways the questions of how voice is coded and how voice quality is ensured are the easy parts of packet voice. A useful packet voice application like the one shown in Figure 1 requires callers in one area to be able to connect to a voice gateway that serves them using their standard dialing mechanism and to place calls to users who are accessible on other voice gateways or the PSTN.

External Signaling: Gateway to Telephony Network Signaling

Voice gateways must be interconnected to the voice source and destination in a manner consistent with normal phone system operations, or packet voice will not appear to work the same as traditional telephony. There are four options for external signaling commonly supported in packet voice systems:

  • Standard dual-tone multifrequency (DTMF) or pulse analog signaling of the type used on telephone sets or small telephone key systems. This type of signaling is appropriate for packet voice applications where standard phone instruments are to be plugged into the voice gateway directly using the telephone jack type appropriate to the national administration.

  • Analog tie-line signaling, also called E&M signaling, used most often on four-wire analog trunks, where the voice gateway connects to a PBX.

  • Digital in-band signaling, called channel-associated signaling (CAS) and used on T1/E1 digital trunks. Standards for CAS vary among the major world geographies, for example between North America and Europe. With CAS, signaling information travels on the same paths as voice information.

  • Digital out-of-band signaling, called common-channel signaling (CCS), in which all signaling for a multiconnection digital trunk (T1/E1/J1) is combined onto one or more common channels, separate from the voice information. This type is employed by PBXs and the PSTN, for example Q.931 ISDN.

Another form of signaling is used in conjunction with public phone networks when connecting as a peer to the PSTN, as opposed to connecting as a user (residential phone set, or a business connection from a PBX). Signaling System 7, or SS7, is an internal signaling protocol to such networks, operating out of band between network elements to connect calls and request special services such as 800 numbers or free phone number decoding. Certain packet voice products support SS7 as an external signaling protocol to connect into the PSTN as a peer.

When a packet voice network interprets telephony signaling protocols at the voice gateway and connects calls across the packet transport network based on signaled numbers, the packet voice network is acting as part of the switching fabric in the voice network. These "infrastructure" switches and the source switches (PBXs, for example) combine to create the numbering plan or dialing plan for a network. This plan associates specific stations with specific dialed numbers, and it must be complete and consistent for the network to function properly. It is important that a packet voice network have adequate facilities for managing its part of the dial plan and that it is consistent with the numbering plan of the voice networks it serves.
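One common way to express the packet voice network's part of a dial plan is a table that maps dialed-number prefixes to destination voice gateways, with the longest matching prefix winning. The Python sketch below shows that idea only; the prefixes and gateway names are invented for illustration and do not reflect any product's configuration syntax.

```python
# Hypothetical dial plan: dialed-digit prefix -> destination voice gateway.
DIAL_PLAN = {
    "1408": "gw-sanjose",
    "1": "gw-pstn-us",
    "44": "gw-pstn-uk",
    "4420": "gw-london",
}


def route_call(dialed_digits: str, plan: dict):
    """Return the gateway for the longest prefix matching the dialed
    number, or None if the dial plan has no entry that covers it."""
    for length in range(len(dialed_digits), 0, -1):
        gateway = plan.get(dialed_digits[:length])
        if gateway is not None:
            return gateway
    return None


print(route_call("14085551234", DIAL_PLAN))
print(route_call("442071234567", DIAL_PLAN))
print(route_call("33123456789", DIAL_PLAN))
```

A number that no prefix covers returns None, which is the kind of gap that makes a dial plan incomplete; two entries that map the same prefix to different gateways would make it inconsistent.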

Internal Signaling: Gateway to Gateway Signaling across the Voice Packet Network

Internal signaling within the packet voice network must provide two features: connection control, and call progress or status information. Connection control signaling is used to create relationships or paths between voice gateways to route a packet flow to the proper voice gateway after the identity of that gateway is known via a dial plan function. Call progress or status information is exchanged by voice gateways to signal the state of the call, such as ringing or busy.

The individual transport network options such as ATM, Frame Relay, and IP all have their own signaling standards; ATM's is Q.2931, and Frame Relay packet voice signaling is described in FRF.11. These differing standards could be supported on the same packet voice gateway, but seldom interwork.

There are several different standards for internal signaling, both connection and call progress, for IP packet voice networks, including H.323, SIP (Session Initiation Protocol) and MGCP (Media Gateway Control Protocol). H.323 is an ITU standard, while SIP and MGCP are IETF standards. H.323 is the most mature of these and enjoys considerable vendor support in the industry and a large installed base in both enterprise and service provider networks. SIP, although still in an exploratory phase, allows much easier tie-ins to web-based protocols such as MIME, HTTP, LDAP and others, and promises much easier integration with true multiservice web-based applications.

H.323 defines a complete multimedia network, from devices to a suite of protocols. Linking all of the entities within H.323 is H.245, which is defined to negotiate facilities among participants and H.323 network elements. A scaled-down version of ISDN's Q.931 call protocol is used to provide for connection setup, defined in H.225. In H.323 terms, there are voice gateways and terminals (the common usage of this concept suggests a single user) as well as a gatekeeper function that performs the address translations and lookups required for scalable implementation of dial plans in the packet voice network.

Like H.323, SIP is a distributed or peer-to-peer protocol, with enough intelligence in the endpoints and voice gateways to set up and control a session directly between them, while the architecture allows for servers to aid with functions such as dial plans and locating the destination party's IP address. The SIP architecture defines end-user devices, called SIP User-Agents (UA), and services like device registration, call routing, call redirection, feature invocation and device notification that may be provided by a SIP server.

MGCP is a centralized or master-slave architecture where the call processing knowledge is centralized in the Call Agent. The endpoints and gateways are controlled by the Call Agent via low-level stimulus messages. The Call Agent interprets all signaling, receives the digits, interprets the dial plan and makes connections by instructing the stateless endpoints and gateways to make a physical connection between each other. The protocol assumes that multiple call control elements will synchronize with each other, sending coherent commands to the gateways under their control. MGCP does not define the mechanism for synchronizing between multiple Call Agents, and SIP has recently emerged as a protocol that can provide this function in MGCP-based networks. While H.323 and SIP are strictly IP-based protocols and architectures, MGCP can be deployed with Voice over IP or Voice over ATM technologies.

One of the major advantages of voice over IP protocols is that they can be carried over virtually any transport network, including ATM, Frame Relay or Ethernet. For voice media packets, they typically use RTP (Real-time Transport Protocol) together with its companion control protocol, RTCP, both of which are carried on UDP. Signaling information is carried in a variety of ways, over both UDP and TCP.
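
As a rough illustration of how a voice payload is framed for transport, the sketch below builds the 12-byte fixed RTP header (version 2, no padding, extension, or CSRC entries) in front of one voice frame. The function names are hypothetical, and the sample values (20-byte payload, payload type 18) are assumptions chosen to resemble a low-bit-rate codec frame; a real gateway would also handle RTCP, jitter buffering, and sending the result in a UDP datagram.

```python
import struct

RTP_VERSION = 2
RTP_HEADER_LEN = 12  # fixed header only: no CSRC list, no extension


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int, marker: bool = False) -> bytes:
    """Prepend a fixed RTP header to one voice payload."""
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed RTP header fields of a received packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(
        "!BBHII", packet[:RTP_HEADER_LEN])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

The sequence number and timestamp are what allow the receiving gateway to detect lost or reordered packets and to play the voice samples out at the correct pace.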

Where multiple populations of voice users are to be reachable via the same network, meaning that there could be many voice gateways on the packet voice transport cloud, there is a chance that these voice gateways will be provided and supported by different organizations and/or vendors. For these networks it is easiest if all the endpoints and gateways use the same protocol, say H.323. However, as networks migrate to the newer protocols SIP and MGCP, mixed networks will be a fact of life and gateways, or translation points between these, are currently under development in the industry to provide for interworking of disparate networks.

Quality of Service (QoS) in Packet Networks

Quality of Service is an indispensable part of network design to ensure the packet voice network meets end user voice quality expectations. The need for this was clearly illustrated by early experiments with Internet phone calls where no QoS was available. With QoS capabilities developed and deployed since those early experimental Internet phone calls—predominantly implemented on today's enterprise and service provider networks and to a much lesser extent in the Internet although this will surely follow—current packet voice networks can be designed to offer PSTN or toll-quality voice.

However, it is a myth that QoS measures should be deployed only to ensure voice quality. QoS is a broader concept: treating traffic in such a way that end user expectations for that traffic are met in terms of throughput and response time. Real-time traffic like voice has much stricter latency requirements, but it is not the only traffic type that must be engineered for bandwidth and throughput to meet business goals. QoS methods have a somewhat lesser role to play when there is no congestion in the network, although they can improve bandwidth utilization and throughput at any level of traffic. It is at the points of congestion that bandwidth guarantees and priority for certain packets over others become critical.

The major factors determining voice quality (other than voice coding choice) include:

  • Echo—this appears in any network whenever analog voice is used and must be compensated for by using echo cancellation techniques.

  • Delay—earlier sections explored types and causes of delay.

  • Packet loss—this results when there is congestion in the network. To compensate for this, the network must be designed to selectively drop packets only from traffic types that can tolerate loss.

Many of the QoS tools in a packet voice network are geared towards minimizing delay and jitter as well as eliminating loss for voice packets. This requires the network to recognize a voice packet and to apply differentiated treatment to voice packets vs. data packets.

The IETF's Architecture for Differentiated Services (RFC 2475) describes three broad types of traffic:

  • Low Latency, Guaranteed Delivery—this includes voice as well as any other traffic sensitive to delay and packet loss.

  • Guaranteed Delivery—this includes mission-critical traffic. This traffic is not latency sensitive (within reason) and can tolerate controlled amounts of packet loss.

  • Best Effort—traffic that is either irrelevant to business needs (e.g. employee personal web surfing), or can be delayed significantly with minimal business impact (e.g. nightly backup traffic), or can recover from loss with reasonable end user results (restart a failed file transfer session).

For the network components to provide the appropriate treatment to these categories of traffic, packets must first be classified into the category where they belong. This classification is often based on business policy as much as on the characteristics of the application under consideration.
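
A classification policy of this kind might be sketched as follows. The three classes come from the list above; the matching rules (a UDP port range assumed to carry voice media, and a set of TCP ports assumed to belong to mission-critical applications) are hypothetical and would in practice be set by business policy.

```python
from dataclasses import dataclass
from enum import Enum


class TrafficClass(Enum):
    LOW_LATENCY = "low latency, guaranteed delivery"
    GUARANTEED = "guaranteed delivery"
    BEST_EFFORT = "best effort"


@dataclass(frozen=True)
class Packet:
    protocol: str  # "udp" or "tcp"
    dst_port: int


# Hypothetical policy values, for illustration only.
VOICE_UDP_PORTS = range(16384, 32768)       # assumed voice media port range
MISSION_CRITICAL_TCP_PORTS = {1521, 3200}   # assumed business application ports


def classify(packet: Packet) -> TrafficClass:
    """Place a packet into one of the three broad traffic categories."""
    if packet.protocol == "udp" and packet.dst_port in VOICE_UDP_PORTS:
        return TrafficClass.LOW_LATENCY
    if packet.protocol == "tcp" and packet.dst_port in MISSION_CRITICAL_TCP_PORTS:
        return TrafficClass.GUARANTEED
    return TrafficClass.BEST_EFFORT  # everything not named by policy
```

Anything the policy does not name falls through to best effort, which mirrors the usual design choice of protecting only traffic that has been explicitly identified.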

Cisco IOS® software includes a wide array of QoS mechanisms and tools. A sampling of those relevant to designing networks that carry voice traffic includes:

  • Packet Classification—mechanisms such as Access Lists to recognize particular streams of traffic.

  • Packet Marking—tools to write a value into the IP Precedence or DiffServ Code Point (DSCP) field of an IP packet, or into the 802.1p Class of Service field of an Ethernet frame.

  • Queuing—software features that will buffer packets while they are waiting to be transmitted on an egress interface. Low Latency Queuing (LLQ) is the technique used to guarantee preference to packets that are sensitive to latency (e.g. voice packets) over other packets.

  • Traffic Shaping—a technique used to rate-limit traffic, usually to comply with a service provider's service level agreement such as Frame Relay's CIR. Rate limiting traffic also implies buffering of packets, so the traffic shaping and queuing techniques work hand-in-hand.

  • Packet Drop decisions—features to drop packets selectively when congestion occurs, for example Weighted Random Early Detection (WRED). These techniques preferentially drop packets from traffic flows that are unimportant to the business goals of a company, that can recover from packet loss via retransmission, or whose application is insensitive to packet loss. Voice packets should never be dropped.

  • Link Fragmentation and Interleaving (LFI)—mechanisms needed on slow WAN access links (sub T1 rates) where the delay to transmit a large data packet onto the link is long enough to delay a waiting voice packet unnecessarily. These techniques (including FRF.12 and MLPPP) fragment data packets into smaller sizes so that voice packets can be interleaved in between.

  • Call Admission Control (CAC)—most of the above QoS mechanisms are designed to protect voice packets from data packets. But voice quality can also be impaired if there is congestion in the network due purely to voice traffic. CAC techniques make a decision prior to call setup to determine whether sufficient resources are available in the network to carry the voice call with acceptable quality. If not, the call is denied access to the network and an alternate path, perhaps through the PSTN, can be selected. One example CAC technique is RSVP (Resource Reservation Protocol).
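
The reasoning behind Link Fragmentation and Interleaving in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 64-kbps link, the 1500-byte data packet, and the 10-ms per-fragment delay target are assumed values, and the interleaving routine is a simplified stand-in for what FRF.12 or MLPPP implementations do.

```python
from typing import List, Tuple


def serialization_delay_ms(packet_bytes: int, link_bps: int) -> float:
    """Time needed to clock one packet onto the link, in milliseconds."""
    return packet_bytes * 8 * 1000 / link_bps


def fragment_size_bytes(link_bps: int, target_delay_ms: float) -> int:
    """Largest fragment that can be transmitted within the target delay."""
    return int(link_bps * target_delay_ms / 1000 / 8)


def fragment(packet: bytes, size: int) -> List[bytes]:
    """Split a data packet into fragments of at most `size` bytes."""
    return [packet[i:i + size] for i in range(0, len(packet), size)]


def interleave(fragments: List[bytes],
               voice_packets: List[bytes]) -> List[Tuple[str, bytes]]:
    """Send any waiting voice packet ahead of each data fragment."""
    waiting = list(voice_packets)
    schedule: List[Tuple[str, bytes]] = []
    for frag in fragments:
        if waiting:
            schedule.append(("voice", waiting.pop(0)))
        schedule.append(("data", frag))
    schedule.extend(("voice", v) for v in waiting)
    return schedule
```

With the assumed values, an unfragmented 1500-byte packet occupies a 64-kbps link for 187.5 ms, far longer than a voice packet can afford to wait, whereas 80-byte fragments hold the link for no more than 10 ms each.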

In summary, QoS techniques ensure the proper treatment of traffic to deliver the desired quality of service and/or response time to all classes of traffic on the network.

Applying Packet Voice

Packet voice networks can be used in two broad contexts—differentiated by geography or by the types of users to be served. The economics and technology of the network may be unaffected by these factors, but there may be legal constraints in some areas for some combinations of these two contexts, and network users or operators should be aware of them.

Telecommunications is regulated within countries by "national administrations" or arms of the governments, based on local regulations. In some countries, such as the U.S., there may be multiple levels of regulatory authority. In all cases, treaties define the international connection rules, rates, and so forth. It is important for any business planning to use or build a packet voice network to assure itself that it is operating in conformance with all laws and regulations in all the areas the network serves. This normally requires some direct research, but the current state of the regulations can be summarized as follows:

  • Within a national administration or telephony jurisdiction, it is almost always proper for a business to employ packet voice to support its own voice calling among its own sites.

  • In such applications, it is normally expected that some of the calls transported on the packet voice network will have originated in the public phone network. Such "outside" calling over packet voice is uniformly tolerated in a regulatory sense, on the basis that the calls are from employees, customers, or suppliers and represent the company's business.

  • When a packet voice connection is made between national administrations to support the activities of a single company—to connect two or more company locations in multiple countries—the application is uniformly tolerated in a regulatory sense.

  • In such a situation, an outside call placed from a public phone network in one country and terminated in a company site within another via packet voice may be a technical violation of national monopolies or treaties on long-distance service. Where such a call is between company employees or between employees and suppliers or customers, such a technical violation is unlikely to attract official notice, however.

  • When a packet voice network is used to connect public calls within a country, the packet voice provider is technically providing a local or national telephone service and is subject to regulation as such.

  • When a packet voice network is used to connect public calls between countries, the packet voice provider is subject to the national regulations in the countries involved and also to any treaty provisions for international calling to which any of the countries served are signatories.

Thus, it is safe to say that companies could employ packet voice networking for any applications where traditional leased-line, PBX-to-PBX networking could be legally employed. A good model for deploying packet voice without additional concerns on regulatory matters is to duplicate an existing PBX trunk network or tie-line network using packet voice facilities.


The public telephone network of today is in many ways unchanged from the networks of the early 1980s. During this period, there have been great advances in data network technology that have both improved network economics and improved control over network quality of service. It is those advances that have spawned today's increasing deployment of packet voice.

Packet voice transport using advanced compression algorithms like G.729 can transport as much as 5 times the voice traffic per unit of network bandwidth as the PCM-based networks used in public telephony. Users with existing data networks can often interleave their voice traffic with data at little or no additional transport cost and little or no impact on application performance. Users with circuit-switched T1/E1/J1 voice networks can, with packet voice transmission, often free up enough bandwidth from existing voice trunks to carry their entire data load.
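
The "as much as 5 times" figure can be checked with a short calculation. In the sketch below, the 64,000-bps PCM rate comes from the text and the 8,000-bps rate is the G.729 codec rate; the 20-ms packet interval and the 10 bytes of per-packet overhead are assumptions for illustration, since actual overhead depends on the transport and on any header compression in use.

```python
PCM_BPS = 64_000  # one conventional PCM voice channel


def packet_voice_bps(codec_bps: int, packet_ms: int, overhead_bytes: int) -> float:
    """Per-call bandwidth in one direction: codec payload plus packet overhead."""
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_bps * packet_ms / 1000 / 8
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second


def calls_per_pcm_channel(per_call_bps: float) -> float:
    """How many packet voice calls fit in the bandwidth of one PCM channel."""
    return PCM_BPS / per_call_bps


# Assumed example: G.729 at 8 kbps, 20-ms packets, 10 bytes of overhead each.
per_call = packet_voice_bps(codec_bps=8_000, packet_ms=20, overhead_bytes=10)
ratio = calls_per_pcm_channel(per_call)
```

Under these assumptions each call needs about 12,000 bps, so a little more than five calls fit in the space of one 64,000-bps channel. Larger per-packet overhead lowers that ratio, which is why the figure is best read as an upper range rather than a guarantee.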