Before jumping into the details, let us briefly review H.264 packetization. The encoder encodes a sequence of frames or pictures to generate the encoded stream, which is consumed by the decoder to re-create the video. The encoder generates what is called a NALU, or network abstraction layer unit. The decoder works on a single NALU at a time and needs a sequence of NALUs to decode the video. Each frame can have one or more slices. Each slice can be encoded in one or more NALUs. Certain pieces of information remain the same for all or many frames. For example, the sequence parameter set (SPS) and picture parameter set (PPS) are like configuration elements that need to be sent once or only occasionally instead of with every frame or NALU. The configuration parameters apply to the encoder, whereas the decoder should be able to decode any configuration.
Flash Player 11+ is capable of capturing from the camera and encoding in H.264 to send on an RTMP stream. Each RTMP message contains a header and data (or payload), where the header contains crucial information such as the timestamp and stream identifier, and the payload contains the encoded video NALUs or the actual configuration data. The format of the payload is the same as that of the F4V/FLV tag for H.264 video in an FLV file. Each RTMP message contains one frame but may contain more than one NALU. The first byte contains the encoding type, and for H.264 it is either 0x17 (for an intra frame) or 0x27 (for a non-intra frame). The second byte contains the packet type and is either 0x00 (configuration data) or 0x01 (picture data). The configuration data contains both SPS and PPS as described here.
rtmp-payload := enc-type[1B] | type[1B] | remaining
enc-type := is-intra[4b] | codec-type[4b]
is-intra := 1 if intra and 2 if non-intra
codec-type := 7 for H.264/AVC
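As a minimal sketch of the grammar above, the first two bytes can be split as follows. The function name `parse_video_header` and the tuple it returns are my own choices for illustration, not part of any RTMP library.

```python
def parse_video_header(payload: bytes):
    """Split an RTMP video payload into (is_intra, codec_type, packet_type, remaining)."""
    if len(payload) < 2:
        raise ValueError("payload too short for enc-type and type bytes")
    is_intra = payload[0] >> 4       # upper 4 bits: 1 = intra, 2 = non-intra
    codec_type = payload[0] & 0x0F   # lower 4 bits: 7 = H.264/AVC
    packet_type = payload[1]         # 0 = configuration data, 1 = picture data
    return is_intra, codec_type, packet_type, payload[2:]
```

For example, a payload starting with 0x27 0x01 yields is-intra 2, codec-type 7 and type 1, that is, a non-intra H.264 picture.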
If the type is configuration data then the next three bytes are zero. They occupy the same position as the delay field of the picture data described below, as in the AVC sequence header of an FLV file. The following four bytes are the configuration version (0x01), the profile index, the profile compatibility and the level index. This is followed by one byte whose two least-significant bits determine the number of bytes used for the length of each NALU in subsequent picture data messages. For example, if the bits are 11b then it indicates 3+1=4 bytes of NALU length, and if the bits are 01b then it indicates 1+1=2 bytes of NALU length. Let's call this the length-size; the possible values are 1, 2 or 4. This is followed by a byte whose five least-significant bits give the number of subsequent SPS blocks. Each SPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the SPS as per the H.264 specification. This is followed by a byte containing the number of subsequent PPS blocks. Each PPS block is prefixed by a 16-bit length followed by the bit-wise encoding of the PPS as per the H.264 specification. Typically only one SPS block and one PPS block are present.
remaining for config := zero[3B] | version[1B] | profile-idc[1B]
| profile-compat[1B] | level-idc[1B] | length-flag[1B]
| sps-count[1B] | sps0 ...
| pps-count[1B] | pps0 ...
length-flag := reserved[6b] | value[2b] where value + 1 is length-size
sps-count := reserved[3b] | count[5b] where count is number of sps
pps-count := number of pps elements
sps(n) := length[2B] | sps
pps(n) := length[2B] | pps
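The configuration fields can be read with a few lines of Python. This is only a sketch: the name `parse_avc_config` and the dictionary keys are hypothetical, and the function assumes it is handed the bytes starting at the version byte.

```python
import struct

def parse_avc_config(record: bytes) -> dict:
    """Parse configuration data that starts at the version byte."""
    version, profile_idc, profile_compat, level_idc, length_flag = \
        struct.unpack_from("!5B", record, 0)
    pos = 5
    parameter_sets = []
    # The SPS count uses only the low 5 bits; the PPS count uses the whole byte.
    for count_mask in (0x1F, 0xFF):
        count = record[pos] & count_mask
        pos += 1
        blocks = []
        for _ in range(count):
            (size,) = struct.unpack_from("!H", record, pos)  # 16-bit length prefix
            pos += 2
            if pos + size > len(record):
                raise ValueError("truncated parameter set")
            blocks.append(record[pos:pos + size])
            pos += size
        parameter_sets.append(blocks)
    return {
        "version": version,
        "profile_idc": profile_idc,
        "profile_compat": profile_compat,
        "level_idc": level_idc,
        "length_size": (length_flag & 0x03) + 1,  # low two bits, plus one
        "sps": parameter_sets[0],
        "pps": parameter_sets[1],
    }
```

The returned length-size is what a translator would remember for parsing later picture data messages.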
If the type is picture data then the next three bytes contain a 24-bit number for the decoder delay of the frame, which is applicable only to B-frames. The default baseline profile does not include B-frames. Thus the first five bytes of the picture data payload (the encoding type, the packet type and the delay) act like header data. This is followed by one or more NALU blocks. Each NALU block is prefixed by the length of the encoded bits of the NALU that follows. The number of bytes used to encode this length is determined by the length-size mentioned earlier. Then the NALU is encoded as per the H.264 specification.
remaining-picture := delay[3B] | nalu0 | nalu1 ...
nalu(n) := length | nalu
length := number in length-size bytes
nalu := NAL unit as per H.264
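The picture data layout can be walked as sketched below. The helper name `split_picture_nalus` is an assumption for illustration; it expects the bytes after the first two header bytes and reads the delay as an unsigned number for simplicity.

```python
def split_picture_nalus(remaining: bytes, length_size: int):
    """Return (delay, [nalu, ...]) from delay[3B] | nalu0 | nalu1 ..."""
    if len(remaining) < 3:
        raise ValueError("missing delay field")
    delay = int.from_bytes(remaining[:3], "big")
    pos = 3
    nalus = []
    while pos < len(remaining):
        if pos + length_size > len(remaining):
            raise ValueError("truncated NALU length")
        size = int.from_bytes(remaining[pos:pos + length_size], "big")
        pos += length_size
        if pos + size > len(remaining):
            raise ValueError("truncated NALU")
        nalus.append(remaining[pos:pos + size])
        pos += size
    return delay, nalus
```

The `length_size` argument is the value learned from the most recent configuration data.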
Each NALU starts with one byte of flags. The flags contain the forbidden bit as the most-significant bit, the next two bits of nri (NAL reference index) and the final five least-significant bits of nal-type. There are several nal-types, such as 0x01 for non-intra regular pictures and 0x05 for intra pictures. Please see the H.264 specification for the complete list.
The camera-captured and encoded data in Flash Player contains three NALUs in each RTMP message: the access unit delimiter (nal-type 0x09), the timing information (SEI, nal-type 0x06) and the picture slice (nal-type 0x01 or 0x05). The Flash Player is capable of decoding other nal-types as well, and does not require the access unit delimiter or timing-information NALUs for decoding. I haven't seen any support for aggregated or fragmented NALUs in the Flash Player.
The RTP payload format for H.264 is specified in RFC 6184 and is typically supported in SIP-based video phones. The RTP header contains the crucial information such as the payload type, the timing data, and the sequence number, whereas the actual configuration and picture NALUs are sent in the payload as specified by this RFC. The first byte is the type containing one bit forbidden, two bits of nri and 5 bits of nal-type.
nalu := nal-flags[1B] | encoded-data
nal-flags := forbidden[1b] | nri[2b] | nal-type [5b]
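In code, splitting the flags byte is three shifts and masks. The function name `parse_nal_flags` is just an illustrative choice.

```python
def parse_nal_flags(first_byte: int):
    """Return (forbidden, nri, nal_type) from the first byte of a NALU."""
    forbidden = (first_byte >> 7) & 0x01  # most-significant bit
    nri = (first_byte >> 5) & 0x03        # next two bits
    nal_type = first_byte & 0x1F          # five least-significant bits
    return forbidden, nri, nal_type
```

For instance, a first byte of 0x65 decodes to nri 3 and nal-type 5 (an intra picture), and 0x41 decodes to nri 2 and nal-type 1.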
In addition to the base nal-types of H.264, the RFC defines new nal-types for fragmentation and aggregation. Traditionally, the Internet, plagued by middle-boxes, NATs and firewalls, has imposed a limit on the size of the UDP packet that can be pragmatically used, and the typical MTU is around 1400-1500 bytes. The H.264 encoder is capable of generating much larger encoded frames, which in many cases cannot be sent successfully as one frame per RTP packet over UDP. On the other hand, some encoded frames may be much smaller than the MTU, so the RTP headers add proportionally more overhead. These small frames can be aggregated for efficiency.
Many SIP video phones configure their H.264 encoders to use multiple slice NALUs in a single frame, unlike Flash Player which generates one picture NALU per frame. Thus traditional SIP video phones can keep each encoded NALU small enough to send in a single RTP/UDP packet, without the fragmentation defined in RFC 6184.
When a large encoded frame is split into smaller fragments, the nal-type of 28 is used in the first byte of each fragment, followed by the second byte containing the actual nal-type of the frame as well as the start and end markers. This is followed by the actual encoded data. The RTP headers of all these fragments contain the same timestamp value. The last fragment of the frame has the marker set to true, whereas all the previous ones have it set to false. When multiple smaller frames are aggregated, the nal-type of 24 is used in the first byte of the aggregate payload, followed by one or more NAL data blocks. Each NAL data block is prefixed by the 16-bit length of the encoded NALU. There are non-trivial rules on how the nri is obtained and we refer you to the RFC for the details.
let encoded-data = fragment0 | fragment1 | fragment2...
encoded-data of fragment(n) := orig-nal-flags[1B] | fragment(n)
orig-nal-flags := start[1b] | end[1b] | ignore[1b] | orig-nal-type[5b]
start := 1 if first fragment else 0
end := 1 if last fragment else 0
encoded-data of aggregate := nalu0 | nalu1 | nalu2 ...
nalu(n) := length[2B] | orig-nalu(n)
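The two layouts above can be sketched in Python as follows. The helper names `fragment_nalu` and `aggregate_nalus` are hypothetical, and the nri handling is my reading of RFC 6184 (fragments keep the nri of the original NALU, and the aggregate uses the largest nri among its NALUs), so please check it against the RFC before relying on it.

```python
import struct

FRAGMENT_TYPE = 28   # nal-type used for fragments
AGGREGATE_TYPE = 24  # nal-type used for aggregates

def fragment_nalu(nalu: bytes, max_size: int):
    """Split one NALU into fragment payloads carrying at most max_size data bytes."""
    flags, data = nalu[0], nalu[1:]
    if len(data) <= max_size:
        raise ValueError("NALU is small enough to send unfragmented")
    first_byte = (flags & 0xE0) | FRAGMENT_TYPE   # keep forbidden and nri bits
    orig_type = flags & 0x1F                      # original nal-type goes in byte two
    chunks = [data[i:i + max_size] for i in range(0, len(data), max_size)]
    payloads = []
    for index, chunk in enumerate(chunks):
        start = 0x80 if index == 0 else 0
        end = 0x40 if index == len(chunks) - 1 else 0
        payloads.append(bytes([first_byte, start | end | orig_type]) + chunk)
    return payloads

def aggregate_nalus(nalus):
    """Pack several small NALUs into one aggregate payload."""
    nri = max((n[0] >> 5) & 0x03 for n in nalus)  # assumption: largest nri wins
    body = b"".join(struct.pack("!H", len(n)) + n for n in nalus)
    return bytes([(nri << 5) | AGGREGATE_TYPE]) + body
```

Note that the original flags byte is not repeated inside the fragment data; the receiver rebuilds it from the first two bytes of the first fragment.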
In addition to sending the SPS and PPS packets in RTP, the video phones also negotiate the configuration data via an external protocol such as SIP/SDP. Since Flash Player does not do that, we will not discuss it further.
Now that we understand the packetization of H.264 for Flash Player as well as SIP/RTP, let us go over the details of the translation process.
The configuration data is sent periodically by Flash Player before every intra-picture frame. However, SIP phones may not send the configuration data periodically. It is desirable to cache the configuration data received from both sides, and re-use it when the other side connects. The first packet sent must contain the configuration data. It is also desirable to periodically send the configuration data to both the Flash Player and SIP sides from the translator, irrespective of whether the configuration data is received periodically. In our translator we send the configuration data before every intra frame.
In the Flash Player to SIP/RTP direction, when the configuration data is received on RTMP, it is sent in two RTP packets, one for SPS and one for PPS. Both use the same timestamp and set the marker to true. When picture data is received on RTMP and needs to be sent to the RTP side, it is dropped unless configuration data has already been sent to the RTP side. If the picture data is not dropped, all the NALUs are extracted. The last of the three NALUs in each RTMP message is the actual picture NALU, which is sent to the RTP side as follows. Only the nal-types 1 and 5 are used, whereas others are ignored. If the NAL size is less than 1500 bytes, it is used as is in the RTP payload with the marker set to true. If the NAL size is more, it is split into smaller fragments, each of size at most 1500 bytes. Multiple fragmented RTP packets are generated as per the RFC. All but the last fragment have the marker set to false. An RTP marker of true indicates the end of the frame. All the fragments use the same timestamp value.
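The picture handling in this direction can be sketched as a single function. The name `picture_to_rtp`, its arguments and the (payload, marker) tuples are illustrative assumptions, not the actual translator code; the 1500-byte limit is the one from the text, and a deployment would likely pick a smaller value so that the IP, UDP and RTP headers also fit within the MTU.

```python
def picture_to_rtp(nalus, config_sent: bool, max_size: int = 1500):
    """Return a list of (rtp_payload, marker) for the NALUs of one RTMP message."""
    if not config_sent:
        return []  # drop picture data until SPS and PPS have gone out
    pictures = [n for n in nalus if n and (n[0] & 0x1F) in (1, 5)]
    if not pictures:
        return []  # only delimiter or timing NALUs, nothing to send
    nalu = pictures[-1]  # the picture NALU is the last one in the message
    if len(nalu) <= max_size:
        return [(nalu, True)]  # fits in one packet, marker ends the frame
    first_byte = (nalu[0] & 0xE0) | 28  # fragment nal-type with original nri
    orig_type = nalu[0] & 0x1F
    data = nalu[1:]
    step = max_size - 2  # leave room for the two fragment header bytes
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    packets = []
    for index, chunk in enumerate(chunks):
        last = index == len(chunks) - 1
        second_byte = (0x80 if index == 0 else 0) | (0x40 if last else 0) | orig_type
        packets.append((bytes([first_byte, second_byte]) + chunk, last))
    return packets
```

The caller would put every returned payload in an RTP packet with the same timestamp and copy the boolean into the RTP marker bit.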
In the SIP/RTP to Flash Player direction, the configuration data is received in multiple RTP packets, which are cached by the translator. When both SPS and PPS payloads have been received from the RTP side, we are ready to start streaming to the Flash Player side. Any incoming RTP packet is put in a queue. When the last packet in the queue (the one most recently received) has the marker set to true, the queue is examined and RTMP messages are created to be sent to the Flash Player side. Since Flash Player handles complete frames in each RTMP message, we need to wait until the marker is set to true so that we only send complete frames to Flash Player. If the RTMP stream is ready but we have not received the configuration data from RTP, or we have not yet started sending the first intra frame to RTMP, then the received packets are dropped. If no intra frames are received for 5 seconds, then we send a fast-intra-update (FIR) request to the SIP/RTP side, so that it triggers the SIP phone to send an intra frame.
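A rough sketch of this queueing logic is shown below. The class `RtpFrameQueue`, its method names and the choice to keep SPS and PPS packets out of the queue are my assumptions for illustration; the caller supplies the current time in seconds, and intra detection here looks only at plain and fragmented packets, not at aggregates.

```python
def is_intra(payload: bytes) -> bool:
    """True if the RTP payload is, or is a fragment of, an intra picture."""
    nal_type = payload[0] & 0x1F
    if nal_type == 28 and len(payload) > 1:
        nal_type = payload[1] & 0x1F  # original nal-type of a fragment
    return nal_type == 5

class RtpFrameQueue:
    INTRA_TIMEOUT = 5.0  # seconds without an intra frame before asking for one

    def __init__(self, now: float = 0.0):
        self.sps = self.pps = None
        self.queue = []
        self.started = False   # becomes True once the first intra frame is out
        self.last_intra = now

    def push(self, payload: bytes, timestamp: int, marker: bool, now: float):
        """Queue one RTP payload; return a complete frame or an empty list."""
        nal_type = payload[0] & 0x1F
        if nal_type == 7:
            self.sps = payload  # cache configuration instead of queueing it
            return []
        if nal_type == 8:
            self.pps = payload
            return []
        if self.sps is None or self.pps is None:
            return []  # drop until both SPS and PPS are known
        if is_intra(payload):
            self.last_intra = now
        self.queue.append((timestamp, payload))
        if not marker:
            return []  # frame not complete yet
        frame, self.queue = self.queue, []
        if not self.started:
            if not any(is_intra(p) for _, p in frame):
                return []  # wait for the first intra frame
            self.started = True
        return frame

    def needs_intra_request(self, now: float) -> bool:
        return now - self.last_intra >= self.INTRA_TIMEOUT
```

A translator loop would call `push` for each received packet and poll `needs_intra_request` on a timer to decide when to ask the phone for an intra frame.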
Once we decide that we can send packets to RTMP from the received RTP queue, we divide the queue into groups of packets with the same timestamp and the same nal-type while preserving the order of the packets. If the nal-type is 5, indicating that an intra frame is being sent to RTMP, then we also send the configuration data before the actual picture data. The configuration payload format was explained earlier and contains both SPS and PPS along with other elements. Each group of packets with the same timestamp and the same nal-type is sent as a single RTMP message, in the same order, containing one or more NALUs. If the nal-type is 1 or 5, the NALU from the RTP payload is used as is in the RTMP payload, with five bytes of header as explained earlier. If the nal-type is 28, indicating fragmented packets, then all the fragmented payloads are combined into a single NALU. If the nal-type is 24, indicating an aggregated packet, then it is split into individual NALU data. Then the sequence of NALUs generated from this group of packets is combined into a single RTMP payload to be sent to the Flash Player.
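The grouping and re-packetization steps can be sketched as three small functions. The names `group_packets`, `rtp_to_nalus` and `build_rtmp_picture` are hypothetical, the default length-size of 4 is an assumption, and passing other single NALU types through unchanged is my choice since the text only covers types 1 and 5.

```python
import struct
from itertools import groupby

def group_packets(frame):
    """Group consecutive (timestamp, payload) items by (timestamp, nal-type)."""
    key = lambda item: (item[0], item[1][0] & 0x1F)
    return [(k, [payload for _, payload in g]) for k, g in groupby(frame, key=key)]

def rtp_to_nalus(payloads):
    """Turn the RTP payloads of one group into complete NALUs."""
    nalus, partial = [], None
    for p in payloads:
        nal_type = p[0] & 0x1F
        if nal_type == 28:  # fragment
            second = p[1]
            if second & 0x80:  # start bit: rebuild the original flags byte
                partial = bytearray([(p[0] & 0xE0) | (second & 0x1F)])
            if partial is not None:
                partial += p[2:]
                if second & 0x40:  # end bit: the NALU is complete
                    nalus.append(bytes(partial))
                    partial = None
        elif nal_type == 24:  # aggregate: 16-bit length before each NALU
            pos = 1
            while pos + 2 <= len(p):
                (size,) = struct.unpack_from("!H", p, pos)
                pos += 2
                nalus.append(p[pos:pos + size])
                pos += size
        else:
            nalus.append(p)  # single NALU, used as is
    return nalus

def build_rtmp_picture(nalus, length_size: int = 4, delay: int = 0) -> bytes:
    """Build the picture data payload: five header bytes, then length-prefixed NALUs."""
    intra = any((n[0] & 0x1F) == 5 for n in nalus)
    header = bytes([0x17 if intra else 0x27, 0x01]) + delay.to_bytes(3, "big")
    return header + b"".join(len(n).to_bytes(length_size, "big") + n for n in nalus)
```

Fragments that arrive without a preceding start fragment are silently skipped here; a real translator may prefer to drop the whole frame in that case.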
As mentioned in my previous article, there are a few gotchas. You must use the new-style RTMP handshake, otherwise the Flash Player will not decode or display the received H.264 stream. You must use Flash Player 11.2 (beta) or later when using "live" mode, otherwise the Flash Player does not accept multiple slice NALUs of a single frame. If audio and video are both enabled, then the timestamp of video must be synchronized with the timestamp of audio sent to RTMP. Note that RTP picks a random initial timestamp for each media stream, so the audio and video RTP timestamp values are not easily correlated unless you use RTCP or an external mechanism. You need to correlate the RTP timestamps of audio and video to a single timestamp clock of RTMP.
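One simple external mechanism, sketched below, is to anchor each RTP stream to a shared millisecond clock when its first packet arrives. This is an assumption for illustration rather than what RTCP-based correlation would give you; the class name `TimestampMapper` is hypothetical, the 90000 Hz default is the usual RTP clock rate for H.264 video, and timestamps are assumed not to go backwards.

```python
class TimestampMapper:
    """Map the RTP timestamps of one stream onto a shared RTMP millisecond clock."""

    def __init__(self, clock_rate: int = 90000):
        self.clock_rate = clock_rate
        self.base_rtp = None  # RTP timestamp of the first packet seen
        self.base_ms = 0      # shared clock value when that packet arrived

    def to_rtmp(self, rtp_ts: int, now_ms: int) -> int:
        if self.base_rtp is None:
            self.base_rtp, self.base_ms = rtp_ts, now_ms
        # The modulo handles the wrap-around of the 32-bit RTP timestamp.
        delta = (rtp_ts - self.base_rtp) & 0xFFFFFFFF
        return self.base_ms + delta * 1000 // self.clock_rate
```

You would create one mapper per stream, for example `TimestampMapper(90000)` for video and `TimestampMapper(8000)` for narrowband audio, and pass both the same `now_ms` clock.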
It is possible to do re-packetization of H.264 between Flash Player's RTMP and standard SIP/RTP without having to do actual video transcoding. This article explains the tricks and gotchas of doing so!
The implementation works between Flash Player 11.2 and a few SIP video phones such as Ekiga and Bria 3.
 ITU-T recommendation H.264, "advanced video coding for generic audiovisual services", March 2010.
 ISO/IEC International Standard 14496-10:2008.