27 March 2010

Vivox in SL: client, server and protocols

Server-side

As I wrote before, and based on Vivox's white paper, the key point is that the server side mixes all voices in real time and delivers the audio to each client as a single stream. I could not find much more information about the SL-specific Vivox server system (Linden Lab will not reveal their server-side architecture that easily), but I guess the Vivox server side does not differ much from one MMOG or virtual world to another. Unused but required parameters can be found in the SLVoice documentation. This suggests that either Vivox wanted to stay backward compatible with the SLVoice Application 1.0, or Vivox did not tailor their client API to SL's needs. I lean towards the latter, even though I could not find any documentation for the SLVoice Application 1.0 that confirms or refutes it.

Joe Miller explained how the Vivox server sends the audio stream to users and how the system can scale:

According to Miller, the VoIP product is unique because of the ability to project the sound in three dimensional space, as a function of distance and direction from one avatar to another. It takes a 32khz signal at 32kbps from clients, sends it to an Intel based audio server where the input signals are mixed and properly positioned, acoustically, in three dimensions, and a stereo stream is sent back to the client at 64kbps. Even with 100 people speaking at once, the bandwidth requirements are the same for each individual because the servers (dual quad-core Xeons) mix the voices together into a single data stream.
The codec used is Siren 14/G.722.1 Annex C, developed by Polycom but now an international standard. It was chosen because it uses relatively low bandwidth but can carry a wide and dynamic range of audio – not just human voices – making it an ideal codec to broadcast, say, a musical event.
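
To make the scaling argument concrete, here is a small back-of-the-envelope sketch in Python. It takes the 32 kbps upstream and 64 kbps downstream figures from the quote above and compares them with a hypothetical full-mesh setup in which every client exchanges a stream with every other speaker. The mesh model and the function names are my own assumptions for illustration, not something Vivox or Linden Lab describes.

```python
from typing import Tuple

# Figures quoted above: 32 kbps from each client, 64 kbps stereo mix back.
UPLINK_KBPS = 32
DOWNLINK_KBPS = 64


def mixed_bandwidth_kbps(speakers: int) -> Tuple[int, int]:
    """(up, down) per client when the server mixes all voices into one stream.

    The speaker count is deliberately ignored: that is the whole point.
    """
    return UPLINK_KBPS, DOWNLINK_KBPS


def mesh_bandwidth_kbps(speakers: int) -> Tuple[int, int]:
    """(up, down) per client in a hypothetical full mesh without server mixing.

    Each client would send its stream to, and receive one stream from,
    every other speaker.
    """
    others = max(speakers - 1, 0)
    return UPLINK_KBPS * others, UPLINK_KBPS * others


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, mixed_bandwidth_kbps(n), mesh_bandwidth_kbps(n))
```

With server-side mixing the per-client cost stays flat, while in the assumed mesh it grows linearly with the number of speakers.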

The range at which residents can hear each other is explained in the SL wiki article "How far does my voice carry". As with text chat, the server computes the distances between people to determine who hears whom, and sends the appropriate messages after this computation. Hence (and hopefully) it is impossible to use a modified Second Life Viewer to remove the hearing range limits.
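
A minimal sketch of what such a server-side range check could look like is given below. It is not the actual Vivox or Linden Lab algorithm: the `HEARING_RANGE_M` value, the `audible_speakers` function and the data layout are all hypothetical, and the real ranges are the ones listed in the wiki article.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]

# Hypothetical value, chosen only for the example.
HEARING_RANGE_M = 60.0


def audible_speakers(listener: str,
                     positions: Dict[str, Position],
                     hearing_range: float = HEARING_RANGE_M) -> List[str]:
    """Return the avatars whose voice reaches `listener`, sorted by name.

    The filter runs on the server, so a client never receives audio
    from avatars that are out of range.
    """
    origin = positions[listener]
    return sorted(
        name
        for name, pos in positions.items()
        if name != listener and math.dist(origin, pos) <= hearing_range
    )


if __name__ == "__main__":
    avatars = {"a": (0.0, 0.0, 0.0), "b": (10.0, 0.0, 0.0), "c": (100.0, 0.0, 0.0)}
    print(audible_speakers("a", avatars))
```

Because the decision is made before anything is sent, a modified viewer has nothing to un-filter.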

The OpenSim server architecture might not differ much from the SL one with regard to voice support. However, I could not find anything confirming this in the OpenSim wiki.

Client-side

On the client side, Linden Lab has chosen to keep the voice features outside of the Viewer: The Second Life Viewer handles configuration, control, and display functions, but the voice streams (from the microphone and from the Vivox voice server) do not enter the Viewer. In other words, "these [voice] technologies are contained in external daemon software that is started and stopped by the Second Life client."

The requestId can (and should) be a GUID so that each response matches a unique request. Each gateway response also contains the request it received. This lets the XML-based protocol be stateless: a response can be interpreted on its own, without remembering which request was sent. TCP provides reliable, ordered delivery by retransmitting lost packets, which is important for updating the UI reliably and in a timely manner.
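
The request/response matching described above can be sketched as follows. The element names (`Request`, `Response`, `ReturnCode`, `InputXml`), the `Example.Action.1` action and the three helper functions are placeholders I chose for the example; they should not be read as the real SLVoice schema.

```python
import uuid
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def build_request(action: str) -> str:
    """Serialize a request tagged with a fresh GUID."""
    request = ET.Element("Request", {"requestId": str(uuid.uuid4()), "action": action})
    return ET.tostring(request, encoding="unicode")


def build_response(request_xml: str, return_code: str = "0") -> str:
    """Hypothetical gateway side: answer and echo the request it received."""
    request = ET.fromstring(request_xml)
    response = ET.Element("Response", {"requestId": request.get("requestId", "")})
    ET.SubElement(response, "ReturnCode").text = return_code
    ET.SubElement(response, "InputXml").append(request)
    return ET.tostring(response, encoding="unicode")


def describe_response(response_xml: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Recover (requestId, action, return code) from the response alone.

    No table of pending requests is consulted: the echoed request
    carries everything needed.
    """
    response = ET.fromstring(response_xml)
    echoed = response.find("InputXml/Request")
    if echoed is None:
        raise ValueError("response does not echo its request")
    return echoed.get("requestId"), echoed.get("action"), response.findtext("ReturnCode")


if __name__ == "__main__":
    req = build_request("Example.Action.1")
    print(describe_response(build_response(req)))
```

Since the handler only looks at the response, the viewer does not have to keep a per-request record around.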

voipforvw is a GPL alternative to SLVoice for OpenSim. One of its developers wrote that it is "a snap-in replacement for this executable [SLVoice.exe] that communicates with the viewer and as you’d guess, does the heavy lifting and coding/decoding". However, the project started in February 2008 and has not received any commits since May 2009.

More about the client components in an upcoming article ...

Voice protocols (in a nutshell)

The following protocols or techniques are used by some components in the SL client.

SIP is an application-layer protocol that incorporates many elements of HTTP, such as headers, encoding rules and status codes. As its name (Session Initiation Protocol) indicates, SIP is only used to initiate communications between clients. Once a SIP server has paired them, clients communicate peer-to-peer and can start exchanging data. The SL Viewer uses ports 5060 (non-encrypted) and 5062 (TLS-encrypted) for SIP over UDP.
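
To show how close SIP looks to HTTP on the wire, here is a small Python sketch that parses a SIP response into a status code, a reason phrase and headers. The sample message is made up and simplified (a real Via header carries more parameters), and the parser ignores repeated headers, folded lines and the compact header forms, so take it as an illustration rather than a usable SIP stack.

```python
from typing import Dict, Tuple


def parse_sip_response(message: str) -> Tuple[int, str, Dict[str, str]]:
    """Split a SIP response into (status code, reason phrase, headers)."""
    head, _, _body = message.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError("not a SIP/2.0 response: " + status_line)
    headers: Dict[str, str] = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


# Made-up, simplified example message.
SAMPLE = (
    "SIP/2.0 200 OK\r\n"
    "Via: SIP/2.0/UDP client.example.com:5060\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

if __name__ == "__main__":
    print(parse_sip_response(SAMPLE))
```

Apart from the `SIP/2.0` version token, the status line and headers read just like an HTTP response.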

ICE is not a protocol but rather an initialization technique that facilitates peer-to-peer communications by reducing the NAT-traversal delay. It uses a STUN client-server strategy to pair agents. Once paired, agents no longer rely on the server.

RTP is an application-layer protocol that defines a packet format for delivering audio and video. The "RTP Use Scenarios" section of the RFC covers multicast, mixers and translators. Using UDP as the transport layer is the obvious choice in this real-time, "send-at-most-once" media-streaming context. The SL Viewer uses the 12000 to 15000 (or 13000?) port range for RTP.
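
As an illustration of the packet format, the following Python sketch decodes the 12-byte fixed RTP header defined in RFC 3550. The `RtpHeader` and `parse_rtp_header` names are mine, the sample packet is synthetic, and the sketch ignores CSRC entries and header extensions. I do not know which payload type SLVoice actually uses, so the value 96 below is only a placeholder from the dynamic range.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,               # top 2 bits
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,           # lets the receiver detect loss and reordering
        timestamp=timestamp,
        ssrc=ssrc,
    )


if __name__ == "__main__":
    # Synthetic packet: version 2, marker set, placeholder payload type 96.
    sample = struct.pack("!BBHII", 0x80, 0x80 | 96, 4660, 123456, 0xDEADBEEF) + b"audio"
    print(parse_rtp_header(sample))
```

The sequence number and timestamp are what allow a receiver to cope with UDP's lack of ordering and delivery guarantees without asking for retransmission.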
