Cisco MediaSense Solution Reference Network Design Guide, Release 10.0(1)
Characteristics and features

This section provides design-level details on compliance recording, direct inbound and outbound recording, and monitoring using MediaSense.

Compliance recording

In compliance recording, calls are configured to always be recorded.

For IP phone recording, all calls received by or initiated by designated phones are recorded. Individual lines on individual phones are enabled for recording by configuring them with an appropriate recording profile in Unified Communications Manager.

For CUBE recording, all calls passing through the CUBE that match particular dial peers (typically selected by dialed number pattern) are recorded. MediaSense itself does not control which calls are recorded (except to the limited extent described under Incoming call handling rules).

Compliance recording differs from selective recording because in selective recording, the recording server determines which calls it will record. MediaSense itself does not support selective recording, but the effect can be achieved by deploying MediaSense in combination with certain partner applications.

Recording is accomplished by media forking, in which the phone or CUBE sends a copy of the incoming and outgoing media streams to the MediaSense recording server. When a call originates or terminates at a recording-enabled phone, Unified Communications Manager sends a pair of SIP invitations to both the phone and the recording server. The recording server prepares to receive a pair of real-time transport protocol (RTP) streams from the phone. Similarly, when a call passes through a recording-enabled CUBE, the CUBE device sends a SIP invitation to the recording server and the recording server prepares to receive a pair of RTP streams from the CUBE.

This procedure has several implications:

  • Each recording session consists of two media streams (one for media flowing in each direction). These two streams are captured separately on the recorder, though both streams (or tracks) end up on the same MediaSense recording server.
  • Most, but not all, Cisco IP phones support media forking. Those which do not support media forking cannot be used for phone-based recording.
  • Though the phones can fork copies of media, they cannot transcode. Whatever codec the phone negotiates during its initial call setup is the codec used in the recording. MediaSense supports a limited set of codecs; if the phone negotiates a codec that is not supported by MediaSense, the call is not recorded. The same is true for CUBE recordings.
  • The recording streams are set up only after the phone's primary conversation is fully established, which could take some time to complete. Therefore, there is a possibility of clipping at the beginning of each call. Clipping is typically limited to less than two seconds, but it can be affected by overall CUBE, Unified Communications Manager, and MediaSense load; as well as by network performance characteristics along the signaling link between CUBE or Unified Communications Manager and MediaSense. MediaSense carefully monitors this latency and raises alarms if it exceeds certain thresholds.

MediaSense does not initiate compliance recording. It only receives SIP invitations from Unified Communications Manager or CUBE and is not involved in deciding which calls do or do not get recorded. The IP phone configuration and the CUBE dial peer configuration determine whether media should be recorded. In some cases, calls may be recorded more than once, with neither CUBE, Unified Communications Manager, nor MediaSense being aware that it is happening.

This would be the case if, for example, all contact center agent IP phones are configured for recording and one agent calls another agent. It might also happen if a call passes through a CUBE which is configured for recording and lands at a phone which is also configured for recording. The CUBE could end up creating two recordings of its own. However, MediaSense stores enough metadata that a client can invoke a query to locate duplicate calls and selectively delete the extra copy.
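Because this metadata is retained for every session, a client can detect duplicates by grouping sessions on a shared call identifier such as the Cisco-GUID. The minimal sketch below assumes each session record exposes hypothetical sessionId and ciscoGuid fields; the real metadata schema should be taken from the MediaSense API documentation.

```python
from collections import defaultdict

def find_duplicate_sessions(sessions):
    """Group session metadata records by a shared call identifier
    (assumed here to be exposed as 'ciscoGuid') and return any groups
    that contain more than one recording of the same call."""
    by_call = defaultdict(list)
    for session in sessions:
        by_call[session["ciscoGuid"]].append(session["sessionId"])
    return {guid: ids for guid, ids in by_call.items() if len(ids) > 1}

# Example: two recordings of the same call share one Cisco-GUID.
sessions = [
    {"sessionId": "s1", "ciscoGuid": "guid-A"},
    {"sessionId": "s2", "ciscoGuid": "guid-A"},
    {"sessionId": "s3", "ciscoGuid": "guid-B"},
]
print(find_duplicate_sessions(sessions))  # {'guid-A': ['s1', 's2']}
```

A client would then decide which copy of each duplicated call to delete.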

At this time, only audio streams can be forked by Cisco IP phones and CUBE. Compliance recording of video media is not supported; video recording is available only through the direct inbound and outbound ("blogging") modes of recording. CUBE is capable of forking the audio streams of a video call and MediaSense can record those, but video-enabled Cisco IP phones do not offer this capability.

MediaSense can record calls of up to eight hours in duration.

Conferences and transfers

MediaSense recordings are made up of one or more sessions, where each media forking session contains two media streams: one for incoming and one for outgoing data. A simple call consisting of a straightforward two-party conversation is represented entirely by a single session. MediaSense uses metadata to track which participants are recorded in which track of the session, as well as when they entered and exited the conversation, but it cannot always do so when conferences are involved.

When sessions include transfer and conference activities, MediaSense does its best to retain the related information in its metadata. If a recording is divided into multiple sessions, metadata is also available to help client applications correlate those sessions.


A multi-party conference is also represented by a single session with one stream in each direction, with the conference bridge combining all but one of the parties into a single MediaSense participant. There is metadata to identify that one of the streams represents a conference bridge, but MediaSense does not receive the full list of parties on the conference bridge.


Transfers behave differently depending on whether the call is forked from a Unified Communications Manager phone or from a CUBE.

With Unified Communications Manager recordings, the forking phone anchors the recording. Transfers that drop the forking phone terminate the recording session but transfers that keep the forking phone in the conversation do not.

With CUBE forking, the situation is more symmetric. CUBE is an intermediary network element and neither party is an anchor. Transfers on either side of the device are usually accommodated within the same recording session. (See Solution-level deployment models for more information.)

Hold and pause

Hold and pause are two concepts that sound similar, but they are not the same.

  • Hold (and resume) takes place as a result of a user pressing a key on his or her phone. MediaSense is a passive observer.
  • Pause (and resume) takes place as a result of a client application issuing a MediaSense API request to temporarily stop recording while the conversation continues.

Hold behavior differs depending on which device is forking media. In Unified Communications Manager deployments, one party places the call on hold, which blocks all media to and from that party's phone while the other phone typically receives music on hold (MOH). If the forking phone is the one that invokes the hold operation, Unified Communications Manager terminates the recording session and creates a new recording session once the call is resumed. Metadata fields allow client applications to gather together all of the sessions in a given conversation.

If the forking phone is not the one that invokes the hold operation, the recording session continues without a break and even includes the music on hold—if it is unicast (multicast MOH does not get recorded).

For deployments where Unified Communications Manager phones are configured for selective recording, there must be a CTI (TAPI or JTAPI) client that proactively requests Unified Communications Manager to begin recording any given call. The CTI client does not need to retrigger recording in the case of a hold and resume.

For CUBE deployments, the SIP protocol has no direct concept of hold and resume. Instead, these operations are implemented in terms of media stream inactivity events. MediaSense captures these events in its metadata and makes them available to application clients, but the recording session continues uninterrupted.

The Pause feature allows applications such as Customer Relationship Management (CRM) systems or VoiceXML-driven IVR systems to automatically suppress recording of sensitive information based on the caller's position in a menu or scripted interaction. Pause is invoked by a MediaSense API client to temporarily stop recording, and the subsequent playback simply skips over the paused segment. MediaSense stores the pause information in its metadata and makes it available to application clients.

Pause behaves identically for CUBE and Unified Communications Manager recording.
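As a concrete illustration, a client might drive pause and resume with HTTP requests along the following lines. The service path, port, and body layout shown here are assumptions for illustration only; consult the MediaSense API guide for the exact request format.

```python
import json

def build_pause_request(host, session_id, pause=True):
    """Build a hypothetical MediaSense API request that pauses or
    resumes recording of an active session. The URL path and body
    field names are illustrative, not the documented schema."""
    verb = "pauseRecording" if pause else "resumeRecording"
    url = f"https://{host}:8440/ora/controlService/control/{verb}"
    body = json.dumps({"requestParameters": {"sessionId": session_id}})
    return url, body

url, body = build_pause_request("ms1.example.com", "1234")
print(url)  # ends with .../pauseRecording
```

The client would POST this body to the URL with HTTP-BASIC credentials, then issue the matching resume request after the sensitive segment.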

Direct inbound recording

In addition to compliance recording controlled by a CUBE or a Unified Communications Manager recording profile, recordings can be initiated by directly dialing a number associated with a MediaSense server configured for automatic recording. These recordings are not carried out through media forking technology and therefore are not limited to CUBE or Cisco IP phones, nor are they limited to audio media. This is how video blogging is accomplished.

Direct outbound recording

Using the MediaSense API, a client requests MediaSense to call a phone number. When the recipient answers, the call is recorded similarly to the way it is recorded when a user dials the recording server in a direct inbound call. The client can be any device capable of issuing an HTTP request to MediaSense, such as a 'call me' button on a web page. Any phone, even a non-IP phone (such as a home phone), can be recorded if its call is converted to IP using a supported codec. Supported IP video phones can also be recorded in this way.
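A 'call me' client of this kind might construct its HTTP request as sketched below. The service path and parameter names are hypothetical; only the general shape (an authenticated HTTP POST naming the number to dial) is implied by the text.

```python
import base64
import json
import urllib.request

def build_call_me_request(host, phone_number, user, password):
    """Sketch of the HTTP request a 'call me' button might issue to
    ask MediaSense to dial a number and record the answered call.
    The path and parameter names are illustrative assumptions."""
    url = f"https://{host}:8440/ora/controlService/control/callAndRecord"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps({"requestParameters": {"called": phone_number}}).encode(),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
```

A web page's 'call me' handler would build such a request with the visitor's number and send it with urllib.request.urlopen (or any HTTP client).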

Direct outbound recording is only supported if MediaSense can reach the target phone number through a Unified Communications Manager system. In CUBE-only deployments where Unified Communications Manager is not used for call handling, direct outbound recording is not supported.


Monitoring

While a recording is in progress, the session is monitored by a third-party streaming-media player or by the built-in media player in MediaSense.

To monitor a call from a third-party streaming-media player, a client must specify a real-time streaming protocol (RTSP) URI; the client must be prepared to supply HTTP-BASIC credentials and to handle a 302 redirect. The client can obtain the URI either by querying the metadata or by capturing session events.

MediaSense offers an HTTP query API that allows suitably authenticated clients to search for recorded sessions based on many criteria, including whether the recording is active. Alternatively, a client may subscribe for session events and receive MediaSense Symmetric Web Service (SWS) events whenever a recording is started (among other conditions). In either case, the body passed to the client includes a great deal of metadata about the recording, including the RTSP URI to be used for streaming.
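For example, a monitoring client that has issued such a query might extract the RTSP URIs of active sessions as follows. The JSON field names used here (responseBody, sessions, rtspUri, sessionState) are assumptions for illustration, not the literal MediaSense response schema.

```python
import json

def extract_rtsp_uris(response_body):
    """Pull the RTSP streaming URIs of active sessions out of a
    (hypothetically shaped) session-query response body."""
    sessions = json.loads(response_body)["responseBody"]["sessions"]
    return [s["rtspUri"] for s in sessions if s.get("sessionState") == "ACTIVE"]

sample = json.dumps({"responseBody": {"sessions": [
    {"rtspUri": "rtsp://ms/a", "sessionState": "ACTIVE"},
    {"rtspUri": "rtsp://ms/b", "sessionState": "CLOSED_NORMAL"},
]}})
print(extract_rtsp_uris(sample))  # ['rtsp://ms/a']
```

The same extraction applies whether the body arrives in a query response or in an SWS session event.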

The third-party streaming-media players that Cisco has tested for MediaSense are VLC and RealPlayer. Each of these players has advantages and disadvantages that should be taken into account when selecting which one to use.

Recording sessions are usually made up of two audio tracks. MediaSense receives and stores them that way and does not currently support real time mixing.

VLC is capable of playing only one track at a time. The user can alternate between tracks but cannot hear both simultaneously. VLC is open source and is easy to embed into a browser page.

RealPlayer can play the two streams as stereo (one stream in each ear), but its buffering algorithms for slow connections sometimes result in misleading periods of silence for the listener. People are more or less used to such delays when playing recorded music or podcasts, but call monitoring is expected to be real time, and significant buffering delays are inappropriate for that purpose.

None of these players can render AAC-LD, g.729, or g.722 audio. A custom application must be created in order to monitor or play streams in those formats.

MediaSense's built-in media player is accessed by a built-in Search and Play application. This player covers more codecs and can play both streams simultaneously, but it cannot play video, and it cannot support the AAC-LD codec. This applies to both playback of recorded calls and monitoring of active calls.

Only calls that are being recorded are available to be monitored. Customers who require live monitoring of unrecorded calls, or who cannot accept these other restrictions, may wish to consider Unified Communications Manager's Silent Monitoring capability instead.


Playback

Once a recording session has completed, it can be played back on a third-party streaming-media player or through the built-in media player in the Search and Play application. Playing it back through a third-party streaming-media player is similar to monitoring—an RTSP URI must first be obtained either through a query or an event.

Silence suppression

While recording a call, it is possible to create one or more segments of silence within the recording (for example by invoking the pauseRecording API). Upon playback, there are various ways to represent that silence. The requesting client uses a set of custom header parameters on the RTSP PLAY command to specify one of the following:

  1. The RTP stream pauses for the full silent period, then continues with a subsequent packet whose mark bit is set and whose timestamp reflects the elapsed silent period.
  2. The RTP stream does not pause. The timestamp reflects the fact that there was no pause, but the RTP packets contain "TIME" padding which includes the absolute UTC time at which the packet was recorded.
  3. The RTP stream compresses the silent period to roughly half a second; in all other respects it acts exactly like item 1. This is the default behavior and is how the built-in media player works.

In all cases, the file duration returned by the RTSP DESCRIBE command reflects the original record time duration. It is simply the time the last packet ended minus the time the first packet began.

The session duration returned by the MediaSense API and session events may differ because these are based on SIP activity rather than on media streaming activity.
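A practical effect of the default behavior (item 3) is that playback time no longer matches the recorded duration, even though DESCRIBE still reports the full record time. A small sketch of the arithmetic:

```python
def playback_duration(record_seconds, pauses, compressed_pause=0.5):
    """Estimate playback time under the default silence handling
    (item 3), where each paused segment is compressed to roughly
    half a second. 'pauses' lists the pause durations in seconds."""
    return record_seconds - sum(pauses) + compressed_pause * len(pauses)

# A 300-second recording containing two 30-second paused segments
# plays back in roughly 241 seconds; DESCRIBE still reports 300.
print(playback_duration(300, [30, 30]))  # 241.0
```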

Commercial media players such as VLC and RealPlayer elicit the default behavior described in item 3. However, these players are designed to play music and podcasts, not media streams that include silence, so they may hang, disconnect, or fail to seek backward and forward in the stream.

Conversion and download

Completed recording sessions can be converted on demand to .mp4 or .wav format via an HTTP request. Files converted this way carry two audio tracks, not as a mixed stream but as stereo. Alternatively, .mp4 files can also carry one audio and one video track.

After conversion, .mp4 and .wav files are stored for a period of time in MediaSense along with their raw counterparts and are accessible using their own URLs. (The files eventually get cleaned up automatically, but are recreated on demand the next time they are requested.) As with streaming, browser or server-based clients can get the URIs to these files by either querying the metadata or monitoring recording events. The URI is invoked by the client to play or download the file.

As with RTSP streaming, the client must provide HTTP-BASIC credentials and be prepared to handle a 302 redirect. In this way, conversion to .mp4 or .wav format provides a secure, convenient, and standards-compliant way to package and export recorded sessions.
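The credential-plus-redirect handshake can be sketched as follows. The transport is injected as a function so that the redirect logic stands on its own; this illustrates the pattern rather than being a MediaSense client.

```python
import base64

def download_with_redirect(url, user, password, fetch, max_hops=3):
    """Fetch a converted .mp4/.wav URL, supplying HTTP-BASIC
    credentials and following up to a few 302 redirects.
    'fetch(url, headers)' is a pluggable transport (for example a
    thin urllib wrapper) returning (status, headers, body)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Authorization": f"Basic {token}"}
    for _ in range(max_hops):
        status, response_headers, body = fetch(url, headers)
        if status == 302:                       # redirected to a media node
            url = response_headers["Location"]  # re-send credentials there
            continue
        return status, body
    raise RuntimeError("too many redirects")
```

Note that some HTTP libraries drop the Authorization header when following a redirect to a different host, which is why handling the 302 explicitly, as above, can be necessary.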

However, large scale conversion to .mp4 or .wav takes a lot of processing power on the recording server and may impact performance and scalability. To meet the archiving needs of some organizations, as well as to serve the purposes of those speech analytics vendors who would rather download recordings than stream them in real time, MediaSense offers a "low overhead" download capability.

This capability allows clients using specific URIs to download unmixed and unpackaged individual tracks in their raw g.722, g.711 or g.729 format. The transport is HTTP 1.1 chunked, which leaves it up to the client (and the developer's programming expertise) to reconstitute and package the media into whatever format best meets its requirements. As with the other retrieval methods, the client must provide HTTP-BASIC credentials and be prepared to handle a 302 redirect. Note that video streams and AAC-LD encoded audio streams cannot currently be downloaded in this way.
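On the client side, reconstituting a raw track is simply a matter of concatenating the chunks as they arrive, for example:

```python
import io

def save_raw_track(chunks, out):
    """Stream a raw (g.722/g.711/g.729) audio track delivered over
    HTTP 1.1 chunked transfer into a writable binary file object,
    returning the number of bytes written. 'chunks' is any iterable
    of byte chunks, e.g. an HTTP client's response body iterator."""
    total = 0
    for chunk in chunks:
        out.write(chunk)
        total += len(chunk)
    return total

# Example: two chunks reassembled into one buffer.
buf = io.BytesIO()
print(save_raw_track([b"\x00\x01", b"\x02"], buf))  # 3
```

Packaging the resulting raw bytes into a playable container (.wav, .mp4, and so on) is then up to the client application.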

Embedded Search and Play application

MediaSense provides a web-based tool used to search, download, and play back recordings. This Search and Play application is accessed using the API user credentials.

The tool searches both active and past recordings based on metadata characteristics such as time frame and participant extension. Recordings can also be selected using call identifiers such as Cisco-GUID or Unified CM call leg identifier. Once recordings are selected, they may be individually downloaded in .mp4 or .wav format or played using the application's built-in media player.

The Search and Play tool is built using the MediaSense REST-based API. Customers and partners interested in building similar custom applications can access this API through Cisco DevNet (formerly known as the Cisco Developer Network).

Support for the Search and Play application is limited to clusters with a maximum of 400,000 sessions in the database. Automatic pruning lets you adjust the retention period so that this limit is respected, using the following formula:

Retention Setting in Days = 400,000 / (avg # agents * avg # calls per hour * avg # hours per day)

For example, if you have 100 agents taking 4 calls per hour, 8 hours per day every day, you can retain these sessions for 125 days before exceeding the 400,000 session limit. This is acceptable for most customers, but if you have 1000 agents taking 30 calls per hour, 24 hours per day every day, your retention period is about half a day. The Search and Play application cannot be used in this kind of environment.
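The formula above can be expressed directly in code:

```python
def retention_days(agents, calls_per_hour, hours_per_day,
                   session_limit=400_000):
    """Retention period (in days) that keeps a MediaSense cluster
    under the Search and Play session limit, per the formula above."""
    return session_limit / (agents * calls_per_hour * hours_per_day)

print(retention_days(100, 4, 8))     # 125.0 days
print(retention_days(1000, 30, 24))  # ~0.56 days (about half a day)
```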


Additional reasons for limiting the retention period are described in Scalability and Sizing.

Embedded streaming media player

Telephone recording uses a different set of codecs than those typically used for music and podcasts. As a result, most off-the-shelf media players are not well suited to playing the kind of media that MediaSense records. This is why partner applications generally provide their own media players, and why MediaSense has the built-in Search and Play application.

The embedded player supports g.729, g.711, and g.722 codecs. This applies to both playback of recorded calls and monitoring of active calls.

The embedded media player can be accessed through the Search and Play application, or it can be used by a third-party client application. Such an application can present a clickable link to the user that, when clicked, loads the recording-specific media player for the selected recording session into the user's browser. This allows partners who do not have sophisticated user interface requirements to avoid the complexity of either developing their own media player or incorporating an off-the-shelf media player into their applications.

Uploaded videos to support ViQ, VoD and VoH features

MediaSense supports the Cisco Contact Center Video in Queue, Video on Demand, and Video on Hold features by enabling administrators to upload .mp4 video files for playback on demand.

To use these features, users must:

  1. Produce an .mp4 video that meets the technical specifications outlined below.
  2. Upload the .mp4 video to the MediaSense Primary node. The video is automatically converted into a form that can be played back to a supported video endpoint and distributed to all other nodes. Playback is automatically load balanced across the cluster.
  3. Create an "incoming call handling rule" that maps a particular incoming dialed number to the uploaded video. You may also specify whether this video should be played once or repeated continuously.

Administrative user interfaces are provided for uploading the file to MediaSense and creating the incoming call handling rule. These functions are not available through the MediaSense API.

An .mp4 file is a container that may contain many different content formats. MediaSense requires that the file content meet the following specifications:

  • The file must contain one audio track and one video track.
  • The video must be encoded using H.264.
  • The audio must be encoded using AAC-LC.
  • The audio must be monaural.
  • The entire .mp4 file size must not exceed 2GB.

The preceding information is known as the Studio Specification. It must be provided to any professional studio that is producing video content for this purpose. Most commonly available consumer video software products can also produce this format.


Video resolution and aspect ratio are not enforced by MediaSense. MediaSense will play back whatever resolution it finds in an uploaded file, so it is important to use a resolution that looks good on all the endpoints on which you expect the video to be played. Many endpoints are capable of up- or down-scaling videos as needed, but some (such as the Cisco 9971) are not. For the best compatibility with all supported endpoints, use standard VGA resolution (640x480).

Cisco endpoints do not support AAC-LC audio (which is the standard for .mp4), so MediaSense automatically converts the audio to AAC-LD, g.711 µ-law, and g.722 (note that g.711 a-law is not supported for ViQ/VoH). MediaSense automatically negotiates with the endpoint to determine which audio codec is most suitable. If MediaSense is asked to play an uploaded video to an endpoint that supports only audio, then only the audio track is played.

Video playback capability is supported on all supported MediaSense platforms, but there are varying capacity limits on some configurations. See the "Hardware Profiles" section below for details.

MediaSense comes with a sample video pre-loaded and pre-configured for use directly out of the box. After successful installation or upgrade, dial the SIP URL sip:SampleVideo@<mediasense-hostname> from any supported endpoint or from Cisco Jabber Video to see the sample video.

Integration with Unity Connection for video voice-mail

Beginning with Cisco Unity Connection (CUC) release 10.0(1), configured subscribers have the option to record video greetings in addition to audio greetings. Subscribers who are configured to record video greetings and who are calling from a video capable IP endpoint are presented with additional prompts to record their video greeting. These recordings (both the audio and video tracks) are stored and played back from MediaSense. A separate audio-only copy of the recording remains on Unity Connection as well.

If for any reason Unity Connection is not able to play a video greeting from MediaSense, it reverts to its locally stored audio greeting.

This is an introductory implementation and therefore contains a number of limitations.

More information about the Cisco Unity Connection integration, including deployment and configuration instructions, can be found in the Unity Connection documentation.

Integration with Finesse and Unified CCX

MediaSense is integrated with Cisco Finesse and Unified Contact Center Express (Unified CCX). The integration is both at the desktop level and at the MediaSense API level.

At the desktop level, MediaSense's Search and Play application has been adapted to work as an OpenSocial gadget that can be placed on a Finesse supervisor's desktop. When deployed this way, MediaSense can be configured to authenticate against Finesse rather than against Unified CM. Therefore, any Finesse user who has been assigned a supervisor role can search and play recordings from MediaSense directly from his or her Finesse desktop. (A special automatic sign-on has been implemented so that when the supervisor signs in to Finesse, he or she is also automatically signed into the MediaSense Search and Play application.) Note that other than this sign-in requirement, there are currently no constraints on access to recordings. Any Finesse supervisor has access to any and all recordings.

At the API level, Unified CCX subscribes for MediaSense recording events and matches the participant information it receives with the agent extensions that it knows about. It then immediately tags those recordings in MediaSense with the agentId, teamId, and if it was an ICD call, the contact service queue identifier (CSQId) of the call. This allows the supervisor, through the Search and Play application, to find recordings which are associated with particular agents, teams, or CSQs without having to know the agent extensions.
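The tagging step can be sketched as follows. The field names used here (participants, extension, agentId, teamId, csqId) are illustrative assumptions rather than the actual Unified CCX or MediaSense schemas.

```python
def tag_recording(session, extension_to_agent):
    """Sketch of the Unified CCX tagging step: match a recording's
    participant extensions against known agent extensions and return
    the tags (agentId/teamId, plus csqId for ICD calls) that would
    be attached to the session metadata. Returns None if no
    participant is a known agent."""
    for participant in session["participants"]:
        agent = extension_to_agent.get(participant["extension"])
        if agent:
            tags = {"agentId": agent["agentId"], "teamId": agent["teamId"]}
            if session.get("csqId"):  # present only for ICD calls
                tags["csqId"] = session["csqId"]
            return tags
    return None
```

A client receiving MediaSense recording events would run this match for each new session and write the resulting tags back through the API.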

This integration uses BiB forking, selectively invoked through JTAPI by Unified CCX. Because Unified CCX is in charge of starting recordings, it is also in charge of managing and enforcing Unified CCX agent recording licenses. However, other network recording sources (such as unmanaged BiB forking phones or CUBE devices) could still be configured to direct their media streams to the same MediaSense cluster, which could negatively impact Unified CCX's license counting.

For example, Unified CCX might think it has 84 recording licenses to allocate to agent phones as it sees fit, but it may find that MediaSense is unable to accept 84 simultaneous recordings because other recording sources are also using MediaSense resources. This also applies to playback and download activities—any activity that impacts MediaSense capacity. If you are planning to allow MediaSense to record other calls besides those that are managed by Unified CCX, then it is very important to size your MediaSense servers accordingly.

More information about this integration, including deployment and configuration instructions, can be found in the Unified CCX documentation.

Integration with Unified CM for Video on Hold and native queuing

Starting with Unified CM Release 10.0, customers can configure a Video on Hold source for video callers, similar to a Music on Hold source that is used for audio callers. The same facility is used to provide pre-recorded video to callers who are waiting for a member of a hunt group to answer. This is known as "CUCM native queuing."

MediaSense can be used as the video media server for both purposes. To use MediaSense in this way, administrators make use of the product's generic ability to assign incoming dialed numbers to various uploaded videos, which are then played back when an invitation arrives on those dialed numbers. Unified CM causes one of these videos to play by temporarily transferring the call to the corresponding dialed number on MediaSense.

See Uploaded Videos to support ViQ, VoD and VoH features for more information.

For instructions on configuring these features in Unified CM, see the relevant Unified CM documentation.

Integration with Cisco Remote Expert

MediaSense integrates with the Cisco Remote Expert product in two areas:

  • It can act as a video media server for ViQ, VoH, and Video IVR.
  • It can record the audio portion of the video call.

MediaSense's video media server capabilities satisfy Remote Expert's needs for ViQ, VoH, and Video IVR. See Uploaded Videos to support ViQ, VoD and VoH features for more information.

Calls that are to be recorded must be routed through a CUBE device that is configured to fork its media streams to MediaSense (because most of the endpoints used for Remote Expert cannot fork media themselves). All the codecs listed in Codecs supported are supported, except for the video codec, H.264; if your version of IOS forks video along with the audio streams, MediaSense captures only the audio. Consult the Compatibility Matrix to verify that your CUBE is running a supported version of IOS that incorporates several bug fixes in this area.

Remote Expert provides its own user interface portal for finding and managing recordings, and for playing them back. For AAC-LD audio calls (most common when using EX-series endpoints), there are no known RTSP-based AAC-LD streaming media players, so those calls can only be converted to .mp4 and downloaded for playback. Live monitoring of such calls is not possible.

For more information about this integration, including deployment and configuration instructions, see the Remote Expert documentation.

Incoming call handling rules

When MediaSense receives a call, it needs to know what action to take. In Releases 9.0(1) and earlier, all incoming calls would simply be recorded—irrespective of the dialed number to which the call was addressed. As of Release 9.1(1), you have the option to configure what action MediaSense takes for each call type. The following actions are available:

  • Record the call.
  • Reject the call.
  • Play a specified uploaded video once.
  • Play a specified uploaded video repetitively.

If your application is to record calls forked by a CUBE, then the dialed number in question is configured as the "destination-pattern" setting in the dial peer which points to MediaSense. If your application is to record calls forked by a Unified Communications Manager phone, then the dialed number in question is configured as the recording profile's route pattern.
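A minimal dial peer of this kind might look like the following sketch. The tag, pattern, and address are placeholders, and the commands that enable media forking vary by IOS release, so consult the CUBE configuration guide for your version.

```
! Sketch only: a dial peer whose destination-pattern selects the
! recorded dialed number and points at the MediaSense server.
! Tag (1000), pattern (5100), and address are placeholders.
dial-peer voice 1000 voip
 description MediaSense recording dial peer
 destination-pattern 5100
 session protocol sipv2
 session target ipv4:192.0.2.10
 codec g711ulaw
```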

For compatibility with earlier releases, all incoming addresses (except for SampleVideo) are configured to record.