This document reviews video call quality and provides a tutorial on what to keep in mind when you configure Quality of Service (QoS) on a Cisco Unified Border Element (CUBE) or a Time-Division Multiplexing (TDM) gateway.
Contributed by Baktha Muralidharan, Cisco TAC Engineer, edited by Anoop Kumar.
This document is most beneficial for engineers familiar with voice over IP (VoIP), although others might find it useful.
No specific hardware or software was used to write this document.
Digitized audio in its simplest form is a set of audio samples, each sample describing the sound pressure during that period. Conversational audio can be captured and reproduced to a high degree of accuracy, with just 8000 samples per second. This then means that as long as the network is able to transport the samples without excessive delay, jitter and packet loss, audio can be faithfully reproduced at the other end.
In contrast, the presentation, processing, and transport of video are much more complex. Brightness, contrast, color saturation, responsiveness (to motion), and lip-sync are just some of the attributes that determine the quality of the video. Video samples generally require much more space, so, not surprisingly, video places a much larger demand on the bandwidth of the transport network.
Audio quality is determined by:
Microphone
Speaker in the headset
Codec (compression)
Transport network
Video call quality is affected by:
Camera
Display device
Video codec
Transport network
Compatibility/interoperability
Note: It is important to understand that unlike audio, quite a bit goes on at video endpoints, when it comes to tuning quality.
QoS in general is a vast and complex subject. It requires consideration of the overall traffic requirements (rather than just the traffic whose quality you wish to improve) and must be checked on every network component along the path of the media flow. Achieving video quality in a video conference is even more complex because, in addition to the network components, it involves review and tuning of the configuration at the endpoints. Broadly, video quality entails this:
Endpoint tuning - Optimizing the configuration of endpoints (e.g. resolution, frames per second)
Transport optimization - Optimizing the network to transport the video traffic per the network SLA.
Interoperability considerations - Quite often video calls involve endpoints of varied capabilities. Designing and configuring the systems to maximize interoperability can impact video quality.
The specific focus in this document will be the QoS considerations on the IOS gateway or CUBE when handling video calls.
Tuning at the endpoints involves adjusting a set of parameters on the video endpoints. This of course depends on the product, but here are a few general “knobs”:
Resolution (i.e. picture size)
Frame rate (i.e. motion sensitivity/reality)
Tagging (i.e. ToS marking)
Tuning the network for video generally involves the following :
Understanding the composition of traffic flowing through the CUBE- e.g. peak [call] volume etc.
Reviewing network link/pipe capacity
Designing appropriate QoS policies, to ensure SLA is met for each traffic class
Interoperability comes into play when heterogeneous systems (video telephony as well as TelePresence (TP)) participate in a conference call. The experiences provided by a TP system and a video phone system are fundamentally different. Interoperability between them is generally achieved by bridging them using a process known as cascading.
What this doesn’t cover
This is not a design document and not a comprehensive video QoS document either. Specifically this document does not cover these topics:
Signaling [protocols] of video calls, beyond what is required to illustrate QoS-related aspects.
Video Endpoint setup/configuration
Comprehensive review of QoS mechanisms including Policing, Queuing, Shaping and Burst
Review of QoS config on Layer 2 switches or trust boundary considerations.
Network traffic characteristics of video
Video, like audio, is real-time. Audio transmissions are constant bit rate (CBR). In contrast, video traffic tends to be bursty and is referred to as variable bit rate (VBR). Consequently, the bit rate of a video transmission will not necessarily be constant if a certain quality must be maintained.
Determination of bandwidth and bursting required for video is also more involved. This is discussed later in this document.
Video traffic is bursty.
Video packets can be quite large.
Audio is always CBR. Video is typically VBR.
Why is video bursty?
The answer lies in the way video is compressed. Remember that video is a sequence of images (frames) played to provide a visual motion effect. Compression techniques used by video codecs use an approach called Delta encoding, which works by storing values of bytes as differences (deltas) between sequential (samples) values rather than the values themselves. Accordingly video is encoded (and transmitted) as consecutive frames carrying just the “moving parts” rather than whole frames.
You are probably wondering: audio changes incrementally too, so why is it not bursty? True enough, but “motion” (or dynamics) does not impact audio nearly as much as it does video. The 8-bit audio samples do not compress better when delta encoded; video samples (frames) do. The relative change from sample to sample (frame to frame) in video is much smaller than that in audio. Depending on the nature and degree of motion, video samples can vary greatly in size. Image 2 illustrates video compression.
An I‑frame is an Intra-coded picture, in effect a fully specified picture, like a conventional static image file.
A P‑frame (Predicted picture) holds only the changes in the image from the previous frame. The encoder does not need to store the unchanging background pixels in the P‑frame, thus saving space. P‑frames are also known as delta‑frames.
A B‑frame (Bi-predictive picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
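The delta-encoding idea described above can be illustrated with a toy sketch. This is not how a real video codec works (real codecs add motion compensation and transform coding); it simply treats each frame as a flat list of pixel values, which is a hypothetical representation chosen for illustration. The first frame is sent whole, like an I-frame, and later frames carry only the pixels that changed, like P-frames.

```python
def encode_delta(frames):
    """Encode frames as one full frame plus per-frame lists of changed pixels."""
    encoded = [("I", list(frames[0]))]  # first frame is sent whole
    for prev, cur in zip(frames, frames[1:]):
        # keep only (index, new_value) pairs for pixels that changed
        changes = [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v]
        encoded.append(("P", changes))
    return encoded


def decode_delta(encoded):
    """Rebuild the full frames from the encoded stream."""
    frames = []
    for kind, payload in encoded:
        if kind == "I":
            frame = list(payload)
        else:
            frame = list(frames[-1])  # start from the previous frame
            for i, v in payload:
                frame[i] = v
        frames.append(frame)
    return frames


# Three 8-pixel frames: static background, one "moving" pixel (made-up data).
frames = [
    [0, 0, 0, 9, 0, 0, 0, 0],
    [0, 0, 0, 0, 9, 0, 0, 0],
    [0, 0, 0, 0, 0, 9, 0, 0],
]
encoded = encode_delta(frames)
sizes = [len(payload) for _, payload in encoded]
print(sizes)  # [8, 2, 2]
```

The payload size of each delta frame tracks how much of the picture moved, which is why the resulting stream is bursty rather than constant.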
Video quality measurement
Cisco video gear does not measure or report on video quality as such, so video quality is perceived rather than measured. There are standardized algorithms that measure quality by means of a Mean Opinion Score (MOS). However, if issues reported on audio quality are any indication, video quality (TAC) cases are more likely to be opened because users perceive quality issues than because of reports by a tool.
Controls at endpoints
Factors that affect video quality include:
the video codec (MPEG-4, H.261, H.263, H.264, and H.265)
size (1/8th screen, 1/4 screen, full screen)
frame rate (1 to 30 frames per second, 6 default)
the compression quality setting (low, medium, high)
Generally each of the above is selectable/controllable at endpoints.
Quilting, combing, and banding: get used to these terms, which are part of the video impairment taxonomy. Refer to this document for details on the common video impairments:
Recommended network SLA for video is as follows:
Latency ≤ 150–300 ms
Jitter ≤ 10–50 ms
Loss ≤ 0.5%
Incidentally, the recommended network SLA for transporting audio is:
Latency ≤ 150–300 ms
Jitter ≤ 20–50 ms
Loss ≤ 1%
Note: Clearly, video is more sensitive to packet loss than voice. This should be expected once you understand that interframes require information from previous frames, which means that loss of interframes can be devastating to the process of reconstructing the video image.
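The two sets of SLA figures above can be compared against measured path statistics with a short sketch. The code treats the upper end of each latency and jitter range as the hard limit, which is an interpretation rather than something the figures state, and the measurement values are made up for illustration.

```python
# Limits taken from the SLA figures above. Assumption: the high end of each
# range is treated as the hard limit.
SLA_LIMITS = {
    "video": {"latency_ms": 300, "jitter_ms": 50, "loss_pct": 0.5},
    "audio": {"latency_ms": 300, "jitter_ms": 50, "loss_pct": 1.0},
}


def sla_violations(kind, measured):
    """Return the names of the metrics that exceed the limit for this media type."""
    limits = SLA_LIMITS[kind]
    return sorted(name for name, limit in limits.items() if measured[name] > limit)


# Hypothetical measurement of one network path.
sample = {"latency_ms": 120, "jitter_ms": 18, "loss_pct": 0.8}
print(sla_violations("video", sample))  # ['loss_pct']
print(sla_violations("audio", sample))  # []
```

The same path passes for audio and fails for video, purely because of the tighter loss budget.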
Controls in the transport network
Generally the SLA for video transport can be delivered using QoS policies that are very similar to those used for audio transport. There are some differences however owing to the nature of video traffic.
Note: Although the scope of this document is limited to the CUBE component, remember QoS is end-to-end.
Is all video the same? Well, not quite. The variations of video as a medium include:
Video telephony/Video conferencing
Relatively lower bandwidth. Up to approx. 1Mbps
Requires very high bandwidth
Can be unicast or multicast
Not delay sensitive (the video can take several seconds to queue up)
Largely insensitive to jitter (because of application buffering)
Loss should be no more than 5 percent.
Latency should be no more than 4 to 5 seconds (depending on the video application's buffering capabilities)
Some video (e.g. entertainment) might be considered for Scavenger service.
Note: In the interest of brevity, illustrations are not extensively provided for each type of video listed above.
Video traffic codecs
H.261- Codec was originally designed for transmission over ISDN lines. With the use of video bonding, video bit rates are multiples of 64 Kbit/s.
H.263 - Codec is used in IP-based videoconferencing as well as in ISDN networks. H.263 requires half the bandwidth of H.261 to achieve the same video quality. As a result, H.263 has largely replaced H.261. H.263 has been optimized for a large range of bit rates, not just multiples of 64 Kbit/s as with H.261.
H.264/MPEG-4 - Is currently one of the most commonly used formats and uses half or less of the bit rate of MPEG-2, H.263, or MPEG-4 Part 2.
H.265 - One of several potential successors to the widely used H.264, based on an extension of the same concepts. It supports resolutions up to 8192×4320, including 8K UHD.
Note: Video, like audio, is carried in Real-time Transport Protocol (RTP).
QoS mechanisms for Video
In principle the QoS mechanisms employed to deliver the SLAs for a video transport network are mostly the same as those for audio. There are some differences however, mostly due to the bursty nature of video and VBR transmission.
There are two approaches to QoS, namely Integrated Services (IntServ) and Differentiated Services (DiffServ).
Think of IntServ as operating at the signaling level and DiffServ at the media level. In other words, the IntServ model ensures quality by operating at the control plane; DiffServ aims to ensure quality by operating at the data plane.
In the IntServ architecture, network devices make requests for static bandwidth reservations and maintain the state of all reserved flows while they perform classification, marking, and queuing services for these flows. The IntServ architecture operates, and integrates, both the control plane and the data plane, and as such has been largely abandoned due to inherent scaling limitations. The protocol used to make the bandwidth reservations is RSVP (Resource reSerVation Protocol).
There is also the IntServ/DiffServ model, which is a mix of the two. This model separates control plane operations from data plane operations. RSVP operation is limited to admission control only, with DiffServ mechanisms handling classification, marking, policing, and scheduling operations. As such, the IntServ/DiffServ model is highly scalable and flexible.
Note: This document focuses only on the DiffServ approach (that is, the prioritization scheme, LLQ).
Bandwidth is obviously the most fundamental QoS parameter. The amount required depends on several parameters, most notably:
Call volume (peak and average)
The old trick of throwing bandwidth at the problem is not always the solution. This is especially true for video quality. For example, with CUVA (Cisco Unified Video Advantage) there is no synchronization mechanism between the two devices (phone and PC) involved. Thus QoS should be configured to minimize jitter, latency, fragmented packets, and out-of-order packets.
Note: Interactive video has the same service level requirements as VoIP because a voice call is embedded within the video stream. Streaming video has much laxer requirements because of the high amount of buffering built into the applications.
Finally it is important to understand that unlike VoIP there are no clean formulas for calculating the required incremental bandwidth. This is because video packet sizes and packet rates vary significantly and are largely a function of the degree of motion within the video images being transmitted. More on this later.
Low Latency Queuing (LLQ) is the preferred queuing policy for VoIP audio. Given the stringent delay and jitter requirements of TP and the need to synchronize audio and video for CUVA, priority (LLQ) queuing is recommended for all video traffic as well. Note that, for video, the priority bandwidth is generally padded by 20% to account for the overhead.
Not recommended for video.
Link fragmentation and interleaving
LFI is a popular mechanism to keep jitter under control on slow links, where serialization delays can be high.
But then again, interactive video is not recommended for slow links. This is because the LLQ, to which the video traffic is assigned, is not subject to fragmentation. This means that large interactive-video packets (such as 1500-byte full-motion I-frames) could cause serialization delays for smaller interactive-video packets.
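To put a number on the serialization delay mentioned above, the sketch below computes how long it takes to clock a single packet onto a link. The link speeds are hypothetical examples chosen for illustration; only the 1500-byte packet size comes from the text.

```python
def serialization_delay_ms(packet_bytes, link_kbps):
    """Time to clock one packet onto the wire, in milliseconds."""
    # bits divided by kilobits-per-second yields milliseconds
    return packet_bytes * 8 / link_kbps


# Example link speeds in kbps (illustrative values only).
for link_kbps in (256, 768, 1536, 10_000):
    delay = serialization_delay_ms(1500, link_kbps)
    print(f"{link_kbps:>6} kbps: {delay:.2f} ms")
```

On a 768 kbps link a single unfragmented 1500-byte packet occupies the wire for about 15.6 ms, which is a large share of the jitter budget for any small packet queued behind it.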
Selective discard based on RTCP
This QoS mechanism is important for video traffic, which, as mentioned earlier, is bursty.
The optional burst parameter can be configured as part of the priority command.
With H.264, the worst-case burst would be the full screen of (spatially-compressed) video. Based on extensive testing on TP systems, this is found to be 64 KB. Therefore the LLQ burst parameter should be configured to permit up to 64 KB of burst per frame per screen. Thus the CTS-1000 system running at 1080p-Best (with the optional support of an auxiliary video stream) would be configured with an LLQ with an optimal burst parameter of 128 (2x64) KB.
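The burst arithmetic above can be sketched as follows. The 64 KB per-stream figure comes from the text, and the 200 ms default comes from the notes at the end of this document; treating 1 KB as 1024 bytes and the example priority rate are assumptions made for illustration.

```python
# 64 KB of burst per frame per screen, as stated in the text.
# Assumption: 1 KB is taken as 1024 bytes.
PER_STREAM_BURST_BYTES = 64 * 1024


def llq_burst_bytes(video_streams):
    """Burst allowance for a system that sends this many video streams."""
    return video_streams * PER_STREAM_BURST_BYTES


def default_burst_bytes(priority_kbps, interval_ms=200):
    """Bytes carried by `interval_ms` of traffic at the priority rate."""
    # kbps multiplied by ms yields bits; divide by 8 for bytes
    return priority_kbps * interval_ms // 8


# Main video plus one auxiliary video stream, as in the CTS-1000 example.
print(llq_burst_bytes(2))        # 131072 bytes, i.e. 128 KB
# Default burst for a hypothetical 384 kbps priority class.
print(default_burst_bytes(384))  # 9600 bytes
```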
How much Bandwidth?
So, how much bandwidth is required to transport a video call faithfully? Before we get down to the calculations, it is important to understand the following concepts, which are unique to video.
This basically refers to the size of the image. Other commonly used terms for this include video format and screen size. Commonly used video formats are shown below.
Video Resolution (pixels)
The vast majority of video conferencing equipment runs at CIF or 4CIF formats.
How is video traffic identified/distinguished? One way to classify packets on CUBE is using DSCP markings.
The following table illustrates DSCP markings per Cisco QoS baseline as well as RFC 4594.
Layer 3 PHB
Layer 3 DSCP
PHB - Per Hop Behavior. Refers to what the router does as far as packet classification and traffic conditioning functions, such as metering, marking, shaping, and policing.
By default, prior to version 9.0, CUCM (Cisco Unified Communications Manager) marked any and all video traffic (including TelePresence) to AF41. Starting from version 9.0, CUCM preconfigures the following DSCP values:
TelePresence (immersive video) calls at CS4 and
Video (IP video telephony) calls at AF41
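For reference when reading packet captures, the two markings named above translate to numeric values as sketched below. The formulas follow the standard DiffServ code point definitions (AFxy = 8x + 2y, CSx = 8x); the helper names are made up for this example.

```python
def af(cls, drop):
    """DSCP value of Assured Forwarding class AF<cls><drop>."""
    return 8 * cls + 2 * drop


def cs(cls):
    """DSCP value of Class Selector CS<cls>."""
    return 8 * cls


def tos_byte(dscp):
    """DSCP occupies the upper six bits of the IP ToS byte."""
    return dscp << 2


for name, dscp in (("AF41", af(4, 1)), ("CS4", cs(4))):
    print(name, dscp, format(dscp, "06b"), hex(tos_byte(dscp)))
# AF41 34 100010 0x88
# CS4 32 100000 0x80
```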
Tuning the configuration for audio quality entails calculating the priority bandwidth and implementing an LLQ policy on a WAN link. This is generally based on the anticipated call volume and the audio codec used.
While the principles are the same, video bandwidth through a CUBE is not so easily calculable. This is due to a number of factors, including:
How does one calculate the total bandwidth required given different TP calls (flowing through the CUBE) might involve different number of screens and of different resolutions?
The bursty nature and VBR
Another dimension of complexity [in bandwidth computation] has to do with “interoperability calls”. Interoperability calls use the Telepresence Interoperability Protocol (TIP). TIP is used to multiplex multiple screens, multiple audio streams, as well as an auxiliary-data screen into two RTP flows, one each for video and audio. It enables point-to-point and multipoint sessions as well as a mix of multi-screen and single-screen endpoints. TIP is a Cisco proprietary protocol and is based on RTCP.
Therefore, the bandwidth provisioning for video systems sometimes happens in the reverse order: the amount of bandwidth that the transport network can deliver with an LLQ policy is determined first, and the endpoint is configured based on that. Endpoint video systems are smart enough to adjust the various video parameters to the pipe size. Accordingly, the endpoints signal the call.
CUBE bandwidth handling
So, how does CUBE handle bandwidth in its (SIP) offers/answers when signaling video calls? CUBE populates the video bandwidth fields in SDP as follows:
1. From the bandwidth attribute in the incoming SDP. In SDP, there is a bandwidth attribute with a modifier that specifies what type of bit rate the value refers to. The attribute has the following form: b=<modifier>:<value>
2. From the video bandwidth configured on the CUBE. For example, the estimated maximum bandwidth is calculated based on the features used by the CTS user, and the estimate is preconfigured on CUBE with the CLI:
<bandwidth video tias-modifier> or
<bandwidth video as-modifier>
3. Default video bandwidth (384 Kbps)
The call flow shown below illustrates how CUBE populates bandwidth in call signaling messages-
Specifically, the CUBE uses the following logic:
On offers (to DO calls), CUBE uses the configured bandwidth.
On answers (to EOs), CUBE sends a bandwidth value that is the minimum of the offer and the local configuration.
At the SDP session level, the TIAS value is the maximal amount of bandwidth needed when all declared media streams are used.
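The answer-side rule above (minimum of the offer and the local configuration) can be sketched in a few lines. This is an illustration of the rule only, not CUBE's implementation: the function names are hypothetical, the fallback to the 384 Kbps default when nothing is configured is an assumption, and the sketch ignores the fact that TIAS excludes transport overhead while AS includes it.

```python
DEFAULT_VIDEO_KBPS = 384  # default video bandwidth stated in the text


def parse_bandwidth_kbps(sdp_lines):
    """Return bandwidth in kbps from the first b=AS or b=TIAS line, else None."""
    for line in sdp_lines:
        if not line.startswith("b="):
            continue
        modifier, _, value = line[2:].partition(":")
        if modifier == "AS":    # AS is expressed in kilobits per second
            return int(value)
        if modifier == "TIAS":  # TIAS is expressed in bits per second
            return int(value) // 1000
    return None


def answer_bandwidth_kbps(offer_lines, configured_kbps=None):
    """Bandwidth to place in an answer: min(offer, local configuration)."""
    # Assumption: fall back to the default when nothing is configured.
    local = configured_kbps if configured_kbps is not None else DEFAULT_VIDEO_KBPS
    offered = parse_bandwidth_kbps(offer_lines)
    return local if offered is None else min(offered, local)


# Hypothetical early-offer SDP fragment asking for 4 Mbps of video.
offer = ["m=video 5000 RTP/AVP 97", "b=TIAS:4000000"]
print(answer_bandwidth_kbps(offer, configured_kbps=2000))  # 2000
print(answer_bandwidth_kbps(["b=AS:256"], configured_kbps=1000))  # 256
```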
Video codec Payload types
This is another area in which video differs from audio. Audio codecs use static payload types. Video codecs, in contrast, use dynamic RTP payload types, which use the range 96 to 127.
The reason for the use of dynamic payload type has to do with the wide applicability of video codecs. Video codecs have parameters that provide a receiver with the properties of the stream that will be sent. Video payload types are defined in SDP, using the a=rtpmap parameter. Additionally, the "a=fmtp:" attribute MAY be used to specify format parameters. The fmtp string is opaque and is just passed to the other side.
Note that the two endpoints involved in a call might use different payload types for the same codec. CUBE responds to each side with the a=rtpmap line received on the other leg. This means that the config "asymmetric payload full" is needed for video calls to work.
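When troubleshooting, it can help to check whether the two call legs really do use different payload types for the same codec. The sketch below compares the a=rtpmap lines of two SDP bodies; the SDP fragments and payload type numbers are made-up examples, and the function names are hypothetical.

```python
def payload_map(sdp_lines):
    """Map codec name to payload type, based on a=rtpmap lines."""
    mapping = {}
    for line in sdp_lines:
        if line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            codec = encoding.split("/")[0].upper()  # drop clock rate
            mapping[codec] = int(pt)
    return mapping


def asymmetric_codecs(leg_a, leg_b):
    """Codecs present on both legs but mapped to different payload types."""
    a, b = payload_map(leg_a), payload_map(leg_b)
    return {c: (a[c], b[c]) for c in a.keys() & b.keys() if a[c] != b[c]}


# Hypothetical SDP fragments from the two legs of one call.
leg_a = [
    "m=video 20000 RTP/AVP 97",
    "a=rtpmap:97 H264/90000",
    "a=fmtp:97 profile-level-id=42e01f",
]
leg_b = [
    "m=video 30000 RTP/AVP 126",
    "a=rtpmap:126 H264/90000",
]
print(asymmetric_codecs(leg_a, leg_b))  # {'H264': (97, 126)}
```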
Unlike voice, real-time IP video traffic in general is a somewhat bursty, variable bit rate stream. Therefore video, unlike voice, does not have clear formulas for calculating network overhead because video packet sizes and rates vary proportionally to the degree of motion within the video image itself. From a network administrator's point of view, bandwidth is always provisioned at Layer 2, but the variability in the packet sizes and the variety of Layer 2 media that the packets may traverse from end-to-end make it difficult to calculate the real bandwidth that should be provisioned at Layer 2. However, the conservative rule that has been thoroughly tested and widely used is to over-provision video bandwidth by 20%. This accommodates the 10% burst and the network overhead from Layer 2 to Layer 4.
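The 20% rule above reduces to simple arithmetic. The sketch below applies it to a list of nominal video stream rates; the stream rates are illustrative values, not recommendations.

```python
OVERPROVISION_PCT = 20  # the 20% rule from the text


def provisioned_kbps(stream_kbps_list):
    """Priority bandwidth to provision for the given nominal video rates."""
    total = sum(stream_kbps_list)
    return total * (100 + OVERPROVISION_PCT) / 100


# One 384 kbps call, then a hypothetical mix of three concurrent calls.
print(provisioned_kbps([384]))            # 460.8
print(provisioned_kbps([768, 768, 384]))  # 2304.0
```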
As mentioned earlier video endpoints do not report a MOS as such. However the following tools could be used to measure/monitor the transport network performance, and to monitor video quality.
IP SLAs Video
A feature embedded in IOS, IP SLAs (Service Level Agreements) performs the active monitoring of the network performance. IP SLAs video operation differs from other IP SLA operations in that all traffic is one way only, with a responder required to process the sequence numbers and time stamps locally and to wait for a request from the source before sending the calculated data back.
The source sends a request to the responder when the current video operation is done. This request signals the responder that no more packets will arrive, and that the video sink function in the video operation can be turned off. When the response from the responder arrives at the source, the statistics are read from the message, and the relevant fields in the operation are updated.
CiscoWorks IPM (Internetwork Performance Monitor) uses IP SLA probes and MediaTrace to measure and report on user traffic performance.
The VQM (Video Quality Monitor) feature, available on the CUBE, is a great tool to monitor video quality between two points of interest. Results are presented as MOS.
This is available from IOS 15.2(1)T and above. Note that VQM uses DSP resources.
 The default value for this parameter is 200ms of traffic at priority bandwidth. The Cisco LLQ algorithm has been implemented to include a default burst parameter equivalent to 200 ms worth of traffic. Testing has shown that this burst parameter does not require additional tuning for a single IP Videoconferencing (IP/VC) stream. For multiple streams, this burst parameter may be increased as required.
 An auxiliary video stream is a 5 fps video channel for sharing presentations or other collateral via the data projector.
 Note that some systems use the “AS” (Application Specific) modifier to convey maximum bandwidth. Interpretation of this attribute is dependent on the application's notion of maximum bandwidth.
CUBE is agnostic as to the specific bandwidth modifier (TIAS or AS).
 Mediatrace is an IOS Software feature that discovers the routers and switches along the path of an IP flow.