Message Problem

Products / Plugins：Video Call / Voice Call

Platform / Framework：iOS / Android / macOS / Windows

Last updated：2021-09-09 19:57

Applicable Scenarios

When developers need to send messages at high frequency and in real time, and occasional message loss does not affect business logic, SEI (Supplemental Enhancement Information) is recommended. It is mainly used in the following scenarios:

One-way message delivery in high-concurrency IM scenarios
Scenarios that require text information to be synchronized with the media stream in real time, such as:
- Live quizzes
- Lyrics synchronization
- Custom sound-wave display for a single stream
- Precise control of layout changes in mixed-stream video

SEI has the following advantages and disadvantages:

Advantages: highly real-time; text messages stay synchronized with the media stream; there is no limit on the number of recipients, since any client that can receive the stream receives the messages.
Disadvantages: because the media stream is transported over UDP, packets may be lost. When a video frame is lost in transit, the SEI carried on that frame is lost with it.

Concept Explanation

H.264 Raw Bitstream Structure

A raw H.264 bitstream is a sequence of NALUs, one after another. To make it easy to extract NALUs from the byte stream, the protocol requires a start code (StartCode), 0x000001 or 0x00000001, in front of each NALU.
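As a rough illustration of how start codes delimit NALUs, the following Python sketch splits a byte stream on the 3-byte pattern 0x000001 (which also matches the tail of the 4-byte form). The function name `split_nalus` is made up for this example, and the sketch assumes that a NALU never ends in a zero byte, so trailing zeros can be treated as part of the next start code.

```python
START_CODE = b"\x00\x00\x01"  # the 4-byte form 0x00000001 ends with this pattern


def split_nalus(stream: bytes) -> list[bytes]:
    """Split a start-code-delimited H.264 byte stream into NALUs.

    The returned NALUs do not include their start codes.
    """
    nalus = []
    pos = stream.find(START_CODE)
    while pos != -1:
        begin = pos + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = len(stream) if nxt == -1 else nxt
        # Trailing zero bytes belong to the next (4-byte) start code,
        # assuming a NALU itself never ends in 0x00.
        nalu = stream[begin:end].rstrip(b"\x00")
        if nalu:
            nalus.append(nalu)
        pos = nxt
    return nalus


sample = (
    b"\x00\x00\x00\x01\x67\xaa"   # 4-byte start code + NALU
    + b"\x00\x00\x01\x68\xbb"     # 3-byte start code + NALU
    + b"\x00\x00\x00\x01\x06\x05"
)
print([n.hex() for n in split_nalus(sample)])
```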

NALU Composition Structure

NALU (NAL Unit) = a NALU header (NAL header) describing the coded video data + an RBSP (Raw Byte Sequence Payload)

NALU Header Composition Structure

The NALU header is 8 bits long: 1 bit of forbidden_zero_bit, 2 bits of nal_ref_idc, and the last 5 bits, which hold the NAL Unit Type.

Common NAL Unit Type values are as follows:

NAL Unit Type	NAL Unit Content
1	Coded slice of a non-IDR picture, without data partitioning.
5	Coded slice of an IDR picture.
6	Supplemental Enhancement Information (SEI).
7	Sequence Parameter Set (SPS).
8	Picture Parameter Set (PPS).
11	End of stream.
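The header layout can be sketched in a few lines of Python. The function name `parse_nalu_header` and the lookup table are made up for this example; the bit positions follow the standard H.264 NAL header layout (1 bit forbidden_zero_bit, 2 bits nal_ref_idc, 5 bits nal_unit_type).

```python
# Names for the NAL Unit Type values listed in the table above.
NAL_UNIT_TYPES = {
    1: "non-IDR slice",
    5: "IDR slice",
    6: "SEI",
    7: "SPS",
    8: "PPS",
    11: "End of stream",
}


def parse_nalu_header(header: int) -> dict:
    """Decode the single-byte H.264 NALU header."""
    return {
        "forbidden_zero_bit": (header >> 7) & 0x01,  # highest bit
        "nal_ref_idc": (header >> 5) & 0x03,         # next 2 bits
        "nal_unit_type": header & 0x1F,              # last 5 bits
    }


for byte in (0x67, 0x65, 0x06):
    fields = parse_nalu_header(byte)
    name = NAL_UNIT_TYPES.get(fields["nal_unit_type"], "other")
    print(hex(byte), fields, name)
```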

SEI Payload Type Calculation Method

When parsing a NAL unit of the SEI type, read the RBSP 8 bits at a time until a byte that is not 0xFF is read, and add up every byte read, including that last one. The sum is the SEI payload type.

An SEI RBSP contains one or more SEI messages, each consisting of a payload type, a payload size, and the payload itself.

SEI Payload Size Calculation Method

The SEI payload size is read in the same way as the SEI payload type: bytes are read and accumulated until a byte that is not 0xFF is reached, which allows an SEI payload of any length. For example, if the byte sequence following the SEI payload type is FF FF AA BB..., then FF FF AA is parsed as the SEI payload size, which is 255 + 255 + 170 = 680.
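The accumulation rule for both fields can be sketched as follows. The function names are made up for this example, and the sketch assumes the input is a complete SEI RBSP with emulation prevention bytes already removed; truncated input raises an `IndexError`.

```python
def read_ff_coded(data: bytes, pos: int) -> tuple[int, int]:
    """Read one 0xFF-extended value starting at pos.

    Returns (value, position of the next unread byte).
    """
    value = 0
    while data[pos] == 0xFF:
        value += 255
        pos += 1
    value += data[pos]  # the first non-0xFF byte ends the field
    return value, pos + 1


def parse_sei_message_header(rbsp: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Return (payload_type, payload_size, offset of the payload)."""
    payload_type, pos = read_ff_coded(rbsp, pos)
    payload_size, pos = read_ff_coded(rbsp, pos)
    return payload_type, payload_size, pos


# Payload type 5, followed by the size bytes FF FF AA from the example above.
example = bytes([0x05, 0xFF, 0xFF, 0xAA, 0xBB])
print(parse_sei_message_header(example))  # (5, 680, 4)
```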

Internal Key Implementation

1. Timing of SEI Injection and Extraction

Injection timing: after encoding, before network transmission.

Extraction timing: after Jitter buffering, before decoding.

The advantages are as follows:

A single code path implements both the injection and the extraction of SEI, for media information packaged in sub-media mode as well as in SEI mode.
SEI can be sent normally even without a video module or a video encoding module.
NALU assembly and unpacking in a custom format are easy to implement.

The disadvantages are as follows: normally, after an encoder finishes encoding a NALU, it checks whether the NALU body contains a reserved byte sequence, such as the start code (StartCode), and applies emulation prevention. ZEGO does not implement this step. Instead, the SDK reports an error internally and blocks the SEI from being sent.
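To make the difference concrete, the sketch below contrasts the two approaches. `add_emulation_prevention` follows the standard H.264 rule of inserting 0x03 after two zero bytes when the next byte is 0x03 or less. `contains_start_code` is only a guess at the kind of check that rejects a payload; the exact sequences the ZEGO SDK looks for are not stated here, so treat that function as an assumption.

```python
def contains_start_code(payload: bytes) -> bool:
    """Hypothetical check: does the payload contain a start-code pattern?

    The 4-byte start code 0x00000001 also contains 0x000001.
    """
    return b"\x00\x00\x01" in payload


def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Standard H.264 emulation prevention, shown for comparison.

    Inserts 0x03 after two consecutive zero bytes whenever the next
    byte is 0x00, 0x01, 0x02 or 0x03.
    """
    out = bytearray()
    zeros = 0  # consecutive zero bytes already written
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


payload = b"lyrics\x00\x00\x01line"
print(contains_start_code(payload))                             # True
print(contains_start_code(add_emulation_prevention(payload)))   # False
```

With a check-and-reject design, a sender has to keep such byte sequences out of the payload itself, for example by sending text rather than arbitrary binary data.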

2. 4K Limit

SEI is mainly intended for transmitting text information. To keep it from occupying too much bandwidth, the total length of inData that ZEGO transmits currently cannot exceed 4096 bytes.

3. Medium That Drives Transmission

Audio and Video

In audio and video scenarios, video drives SEI transmission, and the video timestamp is used. Each time the video encoder finishes encoding a frame, every SEI message in the sending queue is assembled into a NALU and sent. Because video encoding drives the transmission, SEI cannot be sent normally if the video encoder produces no frames (for example, when the camera is turned off).

Pure Audio

In pure audio scenarios, audio drives SEI transmission, and the audio timestamp is used. Each time the audio encoder finishes encoding a frame, every SEI message in the sending queue is assembled into a NALU and sent.

Different audio encoding formats have different encoding frame rates. For example, AAC-LC encodes 20-25 frames per second, and OPUS encodes about 50 frames per second. Driven by audio with OPUS encoding, SEI can therefore support about 50 messages per second. When audio drives SEI transmission, the video stream is ignored.

If audio drives SEI transmission while video is also being sent, each video frame ends up with two timestamps: one based on audio and one based on video. Because audio and video pass through different devices and processing paths, the two timestamps differ by some amount, and time may appear to go backwards.

The same reasoning explains why switching between audio-driven and video-driven SEI transmission midway is not supported.
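The effect can be reproduced with a toy model. The timestamps below are invented for illustration (a video clock that runs 30 ms behind the audio clock), and both function names are hypothetical; the point is only that taking timestamps from one clock and then from the other can produce a value smaller than the previous one.

```python
from typing import Optional


def timestamps_with_switch(audio_ts: list[int], video_ts: list[int],
                           switch_at: int) -> list[int]:
    """Use audio timestamps before index switch_at and video ones after."""
    return audio_ts[:switch_at] + video_ts[switch_at:]


def first_regression(timestamps: list[int]) -> Optional[int]:
    """Return the index of the first timestamp that goes backwards."""
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i
    return None


# Made-up clocks in milliseconds: video lags audio by 30 ms.
audio_ts = [1000, 1020, 1040, 1060, 1080]
video_ts = [970, 990, 1010, 1030, 1050]

mixed = timestamps_with_switch(audio_ts, video_ts, switch_at=3)
print(mixed)                    # [1000, 1020, 1040, 1030, 1050]
print(first_regression(mixed))  # 3
```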

The following figure shows how the sending time would go backwards if switching midway were allowed:

4. The Difference Between Single-Frame Sending and Sending Along with the Video Frame

Single-frame sending: each NALU carrying SEI is sent on its own, in the form of a separate video frame.
Sending along with the video frame: the SEI NALU to be transmitted is inserted into the raw bitstream of a video frame.

5. Sending Queue Size Limit

On the sending end, the internal sending queue holds at most 10 SEI messages. If no medium is driving the transmission, or if messages are submitted so frequently that more than 10 are inserted within one drive interval, the messages that cannot be inserted into the sending queue are discarded directly.
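The queue behavior described above can be sketched as a small bounded queue. The class and method names are hypothetical, not SDK API. The queue length of 10 and the 4096-byte limit come from this document; raising an exception for an oversized payload is an assumption about how such a limit might be enforced.

```python
from collections import deque

MAX_QUEUE_LEN = 10        # queue capacity stated in this document
MAX_PAYLOAD_BYTES = 4096  # inData length limit stated in this document


class SeiSendQueue:
    """Toy model of the sender-side queue (not the SDK implementation)."""

    def __init__(self) -> None:
        self._queue: deque = deque()
        self.dropped = 0

    def push(self, payload: bytes) -> bool:
        """Queue a message; return False if it was discarded."""
        if len(payload) > MAX_PAYLOAD_BYTES:
            # Assumption: oversized payloads are rejected outright.
            raise ValueError("payload exceeds 4096 bytes")
        if len(self._queue) >= MAX_QUEUE_LEN:
            self.dropped += 1  # queue full: discard directly
            return False
        self._queue.append(payload)
        return True

    def drain(self) -> list[bytes]:
        """Called when the driving encoder outputs a frame."""
        items = list(self._queue)
        self._queue.clear()
        return items


q = SeiSendQueue()
results = [q.push(f"msg{i}".encode()) for i in range(12)]
print(results.count(True), q.dropped)  # 10 2
print(len(q.drain()))                  # 10
```

In this model, a caller that submits messages faster than frames are encoded sees `push` return False, which matches the discard behavior described above.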

6. SideInfoZegoDefined Structure

7. Receiver Data Encapsulation

To stay compatible with the SideInfoZegoDefined structure, the receiving end encapsulates the SEI it receives into the following format:

Common MediaType values are as follows:

MediaType	Description
1001	Media information packaged by the SDK in sub-media mode.
1003	Layout information packaged by the mixed-stream server.
1004	Information carried in the SEI with payload type = 5, that is, sent with the "mediaInfoType" parameter set to "SeiUserUnregisted" when SEI sending is enabled.
1005	Information carried in the SEI with payload type = 243, that is, sent with the "mediaInfoType" parameter set to "SeiZegoDefined" when SEI sending is enabled.
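On the receiving side, the table can be expressed as a small lookup. The enum member names and the helper function are invented for this example; only the numeric values and the payload-type correspondence (5 to 1004, 243 to 1005) come from the table above.

```python
from enum import IntEnum
from typing import Optional


class MediaType(IntEnum):
    """MediaType values from the table above (member names are made up)."""
    SUB_MEDIA = 1001
    MIX_STREAM_LAYOUT = 1003
    SEI_USER_UNREGISTERED = 1004  # SEI payload type 5
    SEI_ZEGO_DEFINED = 1005       # SEI payload type 243


SEI_PAYLOAD_TYPE_TO_MEDIA_TYPE = {
    5: MediaType.SEI_USER_UNREGISTERED,
    243: MediaType.SEI_ZEGO_DEFINED,
}


def media_type_for_sei_payload(payload_type: int) -> Optional[MediaType]:
    """Map an SEI payload type to the MediaType reported to the receiver."""
    return SEI_PAYLOAD_TYPE_TO_MEDIA_TYPE.get(payload_type)


print(media_type_for_sei_payload(5))
print(media_type_for_sei_payload(243))
print(media_type_for_sei_payload(1))  # None: not covered by the table
```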

FAQs