what if we could compress video streams up to 300x and then transport them over the internet straight to genai models?
that's exactly what i experimented with this weekend, using open-magvit-2 from tencent.
in the future, the main ingestor of video will be ai models, and they don't consume media like we do.
instead of consuming rgb values, they consume tokens.
in a traditional video streaming pipeline for inference, you still need to encode the video to h.264 on the client, then decode the h.264 on the server back into a tensor.
so why not skip the whole pipeline, go straight to tokenization, and eliminate h.264 to save bandwidth?
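a minimal sketch of what "shipping tokens instead of h.264" could look like on the wire. `pack_tokens` and `unpack_tokens` are hypothetical helpers i'm writing for illustration, assuming open-magvit-2's 2^18 codebook, i.e. 18 bits per token id:

```python
def pack_tokens(ids, bits=18):
    """pack a list of token ids (each < 2**bits) into a tight byte stream."""
    buf, acc, nbits = bytearray(), 0, 0
    for t in ids:
        acc = (acc << bits) | t
        nbits += bits
        while nbits >= 8:          # flush full bytes, most-significant first
            nbits -= 8
            buf.append((acc >> nbits) & 0xFF)
    if nbits:                      # pad the last partial byte with zeros
        buf.append((acc << (8 - nbits)) & 0xFF)
    return bytes(buf)

def unpack_tokens(data, n, bits=18):
    """recover n token ids from a byte stream produced by pack_tokens."""
    acc, nbits, out = 0, 0, []
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= bits and len(out) < n:
            nbits -= bits
            out.append((acc >> nbits) & ((1 << bits) - 1))
    return out
```

roundtrip check: `unpack_tokens(pack_tokens([0, 1, 262143, 12345]), 4)` gives back `[0, 1, 262143, 12345]`, and four 18-bit ids fit in exactly 9 bytes instead of the 16 you'd spend on naive 32-bit ints.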
the simple math:
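filling in a back-of-envelope version (all numbers here are my assumptions for illustration, not measurements: 256×256 input, a 17-frame chunk, a causal tokenizer with 4x temporal / 16x spatial downsampling, and open-magvit-2's 2^18 codebook, i.e. 18 bits per token):

```python
# assumed chunk geometry, not measured
frames, h, w = 17, 256, 256
raw_bits = frames * h * w * 3 * 8        # raw rgb, 8 bits per channel

# assumed tokenizer: 4x temporal, 16x spatial downsampling, 18 bits/token
latent_frames = 1 + (frames - 1) // 4    # causal tokenizer: 17 frames -> 5
tokens = latent_frames * (h // 16) * (w // 16)
token_bits = tokens * 18

ratio = raw_bits / token_bits
print(f"{tokens} tokens/chunk, {token_bits/8/1024:.1f} KiB, ~{ratio:.0f}x vs raw rgb")
# → 1280 tokens/chunk, 2.8 KiB, ~1161x vs raw rgb
```

note this ratio is against raw rgb; h.264 already compresses heavily, so the practical saving over a normal stream is smaller.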

so i went with a simple test setup:
on m4 pro (for 0.5s chunk): encoding takes ~3s, decoding takes ~3s
on l40 (for 0.5s chunk): encoding takes ~0.1s, decoding takes ~0.04s
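putting those timings in real-time terms (a quick sanity check, assuming each chunk really covers 0.5s of video and encode + decode run sequentially):

```python
chunk_s = 0.5            # seconds of video per chunk (assumed)

# measured wall-clock per chunk, encode + decode
m4_pro = 3.0 + 3.0       # ~6s total
l40    = 0.1 + 0.04      # ~0.14s total

for name, t in [("m4 pro", m4_pro), ("l40", l40)]:
    rtf = chunk_s / t    # >1 means faster than real time
    print(f"{name}: {rtf:.2f}x real time")
# → m4 pro: 0.08x real time
# → l40: 3.57x real time
```

so the l40 keeps up with a live stream comfortably, while the m4 pro falls behind by roughly 12x.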

1. bandwidth reduction is real, but buffering is the enemy
to process video, the model takes 17 frames per inference step. this gives it temporal attention and improves temporal consistency. however, it means the decoder lags the real world by 17 frames, since it has to buffer that much data before it can run.
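roughly, that 17-frame buffer alone costs over half a second before encoding even starts (frame rates assumed, since the source frame rate isn't stated):

```python
frames_per_step = 17
for fps in (24, 30):
    delay_ms = frames_per_step / fps * 1000
    print(f"at {fps} fps: {delay_ms:.0f} ms of buffering delay")
# → at 24 fps: 708 ms of buffering delay
# → at 30 fps: 567 ms of buffering delay
```

and that's before adding the encode/decode times measured above, so end-to-end latency lands well over a second even on the l40.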
2. it's not worth it atm
since the codec isn't optimized to run without a gpu, it's going to be slow on devices that can't accelerate it. it's more promising when there are multiple video feeds, since that's where the bandwidth constraint really bites.
you can find the repo here.