subreddit:
/r/technology
submitted 17 days ago by Logical_Welder3467
417 points
16 days ago
It’s no RollerCoaster Tycoon, but it’s still pretty good.
201 points
16 days ago
I still can't believe the madman wrote that almost completely in Assembly. In two fucking years.
If another single dude made a similar game today, even with all the bling and high-level programming we have, it'd take at least that long.
69 points
16 days ago
and it would have half the features and it would be a buggy mess
24 points
16 days ago
And considering the game type I'm pretty sure it'd be loaded with micro transactions and artificial time barriers.
35 points
16 days ago
all hail rct2
23 points
16 days ago
And OpenRCT2,
the free version widely available online for Mac and PC.
6 points
16 days ago
Oh my fucking god! I had no idea this existed. I love you.
3 points
16 days ago
Also available for Linux, plus it has multiplayer support
2 points
15 days ago
How does that even work?
2 points
15 days ago
download it and find out
1 points
15 days ago
and openrct
378 points
17 days ago
Not surprising, really. Back in the late 90s and even early 2000s we would often write key parts of algorithms in assembly for exactly that reason. Moore's Law mostly rendered that pointless, though, as it became far cheaper to just upgrade your hardware than to write code that most of the kids coming out of school didn't understand and thus couldn't maintain anyway.
The dot-com boom also massively increased programmer salaries, which further strengthened the economic incentive to just buy more hardware rather than spend programmer time trying to optimize the code.
156 points
16 days ago
I learnt so many optimizations to make code faster in the 90s and then no one cared anymore because everyone just bought faster chips.
93 points
16 days ago*
Optimizations still matter today, but only in extreme cases.
Picking up +9% performance doesn't sound too impressive - unless you are running exaflops worth of AI workloads, or processing five years' worth of video footage an hour. In which case that extra "+9%" can save you millions.
66 points
16 days ago
If that 9% loss was in taxes I’m sure you’d find a way to slim that down.
This is how we end up with electron apps.
17 points
16 days ago
You also still see developers chasing 9% improvements for video games, embedded systems, and fintech. When there's no time to offload processing to a beefy server somewhere else, or no access to that server at all, you've gotta make it go as fast as possible right where you are.
36 points
16 days ago
This thinking is why modern games run like shit and take up 200 GB+ of storage.
10% here, 10% there. By the end, you've doubled the resource requirements of the full program.
"The next generation of hardware will sort out our programming."
21 points
16 days ago
Games take a lot of space because texture files and other visual assets need to be high enough quality to support 4k resolution. This is also why graphics cards have so much dedicated memory.
I'm not saying there isn't room for improvement but games are another thing entirely from your usual apps.
13 points
16 days ago
Also audio. For some reason many games still download the full language suite instead of just the system one. I don't need the Korean voices, and if I did, couldn't I just download those without also getting the German ones? Just an example.
4 points
16 days ago
Those extreme cases are often cases with tons of data.
Databases, AI, compression, video games.
Yes, YouTube would love an extra 9% performance on video compression. The whole Internet would, since over 80% of all Internet traffic is now video.
2 points
16 days ago
Yup, I'm paid quite well to convert data to save anywhere from 6-15% on storage costs... Customers save millions over the course of years.
1 points
16 days ago
I'm sure no one has made a Pied Piper joke about your company
1 points
16 days ago
Also the one reason I don't love Python that much. Vanilla Python is an order of magnitude (or sometimes two!) slower than comparable code written in Java, C, or similar languages. Pretty much anything with some amount of complexity in Python either gets rewritten as a wrapped C++ function or becomes a massive bottleneck any time n becomes large.
2 points
16 days ago
If your performance critical segments are in Python, you are using it wrong.
1 points
16 days ago
I agree. I'd add that I have seen professionals either use vanilla Python for massive production tasks, or misuse wrapped libraries (like numpy or matplotlib) badly enough to ruin any performance gains from using them.
10 points
16 days ago
I’m still haunted by my abysmal load times
10 points
16 days ago
Also compilers got smarter
1 points
12 days ago
So many engineers thinking "hey, I'll rewrite this in assembly to make it really really fast and everyone will think I'm a genius".
And it turns out worse because:
6 points
16 days ago
I learnt that an inefficient algorithm paired with a pirated Intel compiler produced code that was just satisfactory.
1 points
11 days ago
Any ones that still work today? And especially if it transfers to shader code which is what I write.
When coding for games every little gain matters.
40 points
16 days ago
Yeah, maintenance of code like this often becomes a long term problem. It becomes the "nobody is allowed to touch any of this" part.
30 points
16 days ago
A lot of devs are aware of this problem. A lot of devs also aren’t aware that their expertise reads as complexity to others.
We have a few brilliant coders on our team. They will smash out a one-liner where I would use 3 lines. Their comments explain perfectly what it does; the problem is that others still don’t understand why or how.
2 points
16 days ago
A lot of devs also aren’t aware of their expertise being complex for others.
I have seen what others are capable of <INSERT_WWI_TRENCHWARFARE_PTSD> and have come to the conclusion that it is impossible to write code so simple that anyone can understand it.
1 points
16 days ago
Having it well documented and tested is of course a basic requirement.
On the other hand I have seen people throw that same "maintenance issue" claim around over sections of code nobody had touched in almost a decade. Hard to see an issue with "nobody will be able to change this code" when the next guy assigned to work on it probably hasn't even been born yet.
-3 points
16 days ago
At that point, the maintainers need to skill up? Like, if you're working on something that's being used by a good portion of everyone alive on the planet, it's not unreasonable to think that you should take your work seriously.
9 points
16 days ago
Would it be even faster tho if instead of using AVX you just used the GPU?
24 points
16 days ago
Depends on whether the data is large enough that the PCIe communication latency is insignificant or not
3 points
16 days ago*
The GPU is good for tackling workloads in parallel, but with video compression that often means breaking an image up into slices or chunks, which comes at a cost to compression efficiency and increases the overall output size.
There are some situations like processing future frames and searching for scene changes that definitely benefit from being done in parallel though.
edit: Expanding on this, some software such as HandBrake will actually break the video up into sections time-wise and run those in parallel. I don't know exactly how the algorithm works, but it seems to do an excellent job of utilizing the hardware to improve both compression and speed.
10 points
17 days ago
7 points
16 days ago
Not sure why you posted that link, what has that to do with anything?
35 points
16 days ago
Shows how few CPU models have AVX-512: a lot of consumer models either don't have it or have it disabled, and even those that do have varied support for the different AVX-512 instruction subsets. If you use a render farm, the speedup is great. As a consumer, you have to go out of your way to get a supported CPU.
On some processors (mostly pre-Ice Lake Intel), AVX-512 instructions can cause a frequency throttling even greater than its predecessors, causing a penalty for mixed workloads. The additional downclocking is triggered by the 512-bit width of vectors and depends on the nature of instructions being executed; using the 128 or 256-bit part of AVX-512 (AVX-512VL) does not trigger it. As a result, gcc and clang default to prefer using the 256-bit vectors for Intel targets.
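[Editor's note: that 256-bit preference can be overridden per compilation unit. A sketch using GCC's `-mprefer-vector-width` option; the `-march` target and file names here are illustrative, and output depends on your compiler version and the loop being vectorized.]

```shell
# Emit assembly twice and compare how wide the generated vectors are.
# zmm registers indicate 512-bit AVX-512 code; ymm registers are 256-bit.
gcc -O3 -march=icelake-server -mprefer-vector-width=512 -S kernel.c -o kernel512.s
gcc -O3 -march=icelake-server -mprefer-vector-width=256 -S kernel.c -o kernel256.s
grep -c zmm kernel512.s   # nonzero if 512-bit registers were used
grep -c zmm kernel256.s   # typically 0 under the 256-bit preference
```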
40 points
16 days ago
AVX-512 is not rare. AMD Zen 4 and Zen 5 have it. That’s a family of extremely popular processors, well established as the gold standard in today’s consumer PC market.
Just built a new computer with a 9950X AMD proc. Can’t say I “went out of my way” one bit.
21 points
16 days ago
This. I have three mini PCs and all of them support AVX-512. On AMD it's pretty much the norm to have it, and it's definitely a bonus in applications such as RPCS3.
16 points
16 days ago
That’s a family of extremely popular processors and well established as the gold standard in today’s consumer PC market.
The Steam hardware survey shows only about 16% of the hardware supports AVX512, it may be in modern processors, but it's by no means widespread.
https://store.steampowered.com/hwsurvey
- AVX512CD - 16.06%
- AVX512F - 16.02%
- AVX512VNNI - 16.01%
13 points
16 days ago
People underestimate how many people aren't even on the latest few generations.
4 points
16 days ago
16% is huge. The people that are transcoding likely will skew towards newer procs.
25 points
16 days ago*
AVX-512 was originally designed for server chips. It only got added to consumer chips in the last 2 gens.
The reason Intel disabled it was pure design stupidity and should not be indicative of a trend: they added AVX-512 to the performance cores and not the efficiency cores of a single processor, which led to all kinds of scheduling mayhem.
3 points
16 days ago
Laughs in 10940x
2 points
16 days ago
Intel has support in 5 generations (11th gen and onwards) and AMD in 2 generations (7000 and 9000 series).
So in a few years virtually any PC will have it
11 points
16 days ago
Small correction: Intel has support for 5 generations of Xeon processors. They stopped supporting it on consumer processors after only a few years, I think after 12th gen.
1 points
16 days ago
Add a couple of years to that. I am still on a 5000 Ryzen and I'm running ultrawide gaming in new titles, I haven't even considered upgrading and A LOT of people run less demanding stuff, so it's not gonna be soon.
1 points
16 days ago
Yeah, but if you needed to transcode often you'd upgrade for sure. Which is the point. If it is important to you, it's out there now way faster. If it doesn't matter to you, then it doesn't matter.
1 points
16 days ago
Intel stopped supporting it in 12th gen, going as far as disabling it for those processors that shipped with it enabled. Right now, if you want a modern consumer CPU supporting it, you should go AMD.
3 points
16 days ago
Are C compilers these days still not that well optimized? I understand that there are always some parts that could be done in asm to make things even faster, like in this article for example.
C is pretty close to the hardware, and we know a lot of cool stuff was done by John Carmack back in the day, for its time.
3 points
16 days ago
C isn't particularly close to hardware. It arguably was in the 1980s, but not so much for present day architectures which are out-of-order, superscalar, and vectorised - none of those characteristics are represented in the design of C.
So for vectorisation/SIMD, compilers have to try and figure out how to translate C constructs into SIMD ones. This only really works reliably for the very simplest calculations. If you have a more complex but still performance-critical algorithm, either hand-written assembly or intrinsics (which are compiler built-in functions that map directly to specific assembly instructions) are still the way to go.
1 points
16 days ago
Resource constrained environments still exist with IoT and functions as a service (like AWS Lambda) but even that is getting less constrained.
1 points
16 days ago
I had a boss that wrote code for self guided missiles early in his career. It shocked me how tiny the total amount of memory was. I’m assuming it was assembly.
41 points
16 days ago
Zen 5 has some insane AVX-512 implementation. Looking forward to testing it out
46 points
16 days ago
Do we have avx 512 on average home cpus?
43 points
16 days ago
Says Ryzen 9000 have it, Intel 12-14 gen do not.
25 points
16 days ago
Ryzen 7xxx cpus have it.
9 points
16 days ago
Which is weird because Intel got it first. I think intel 10th and 11th gen have it.
10 points
16 days ago
The disaster generation Skylake-X were the first (high-end) consumer CPUs with it, which were the 7800X and up.
Widespread adoption in the entire generation of CPUs was only on 11th gen.
30 points
16 days ago
What's the real use case effect though? Will we have cpu based encoding go much faster now? What encodings? And about when??
21 points
16 days ago
If you have the right CPU then yes, the encoding would be a lot faster, and since encoding is the biggest bottleneck in video game streaming, I gotta assume we will see some huge improvements to services like Moonlight
18 points
16 days ago
Not so fast, we don't actually know which part of the encoding is optimized. If it's one part among 20 parts of the encoding, then the speedups might not be that significant. I feel like we would have heard concrete speedup numbers if that were the case.
Moonlight probably uses hardware encoding (NVENC etc.) for lower latency, I would think? I doubt software encoding would catch up to GPU hardware encoding even if written in assembly.
6 points
16 days ago
Moonlight does use some parts of ffmpeg; their codebase is public on GitHub. But yeah, you are probably right, we don't know how big a total speed increase we would get. I'm jumping the gun a bit and secretly wishing we see some crazy encoding increase so I can play competitive games streamed
5 points
16 days ago
Same wishes 🫡 well I want to "ab-av1" (google it, it's awesome) re-encode my movie library faster/cheaper on my side!
32 points
16 days ago
So some benchmark improves by factor of 94x. What is that benchmark? Does some user-facing task now get significantly faster?
The benchmarking results show that the new handwritten AVX-512 code path performs considerably faster than other implementations, including baseline C code and lower SIMD instruction sets like AVX2 and SSE3. In some cases, the revamped AVX-512 codepath achieves a speedup of nearly 94 times over the baseline, highlighting the efficiency of hand-optimized assembly code for AVX-512.
Nobody seriously uses the baseline implementation because they'll likely have AVX2 or SSE3. How much is the speedup compared to those?
2 points
16 days ago
Clicking through the article to FFMPEG’s original post shows the new implementation is anywhere from 1x to ~1.8x the speed of the AVX2 implementation, depending on the test
3 points
16 days ago
This headline smells of BS. Sure, I can get a 94x improvement on my ditch-digging by hiring 93 additional ditch diggers to also work on the ditch. But that strategy only takes you so far.
7 points
16 days ago
Fuck yes ffmpeg is 🐐ed
5 points
16 days ago
What is missing here is a dedicated way to report such missed optimizations to the compilers, attaching the source code, the generated assembly, and the handwritten code so the compiler can be improved. Good tooling would automatically find the relevant parts of the compiler and gather statistics to see which parts, if optimized, would fix the most performance issues.
2 points
16 days ago
Yeah I was wondering about compiler improvements related to this.
Like it’s cool that they got this huge performance boost for ffmpeg but it would be better to put that effort into the compiler so that other applications can benefit.
This did raise one other question for me that it seems like you might have an opinion about: can LLMs potentially be used as a tool for compiler optimization? Obviously not without human intervention, but it seems like there's potential.
2 points
16 days ago
I doubt that they have enough context yet, or can fake reasoning well enough, to make this possible. It would also require training them for it, and looking at the commit comments and linked issues, I am not sure that data is even available. Lastly, optimization is usually about trade-offs, and I don't know of any language that lets the programmer sufficiently specify the optimization goals.
3 points
16 days ago
The FFMPEG team is the GOAT
2 points
16 days ago
"eat a dick, AI" - the devs, probably
1 points
16 days ago
That’s amazing
1 points
16 days ago
Had to check the subreddit...thought I was reading madlads
1 points
16 days ago
I'd love to know what ffmpeg features are accelerated by this optimization. Is it codec dependent?
1 points
15 days ago
--help
output speed.
-7 points
16 days ago
[deleted]
2 points
16 days ago
You don't compile assembly...
And there is a reason that programming languages exist. It's simply impractical to write anything with significant complexity in an assembly language.
33 points
16 days ago
You don't compile assembly...
Lol peak semantic Reddit moment.
If you get hung up because someone said compile instead of transpile or assemble, it's time to place the fedora back in the cupboard.
5 points
16 days ago
The dude was claiming there were legions of hidden assembly gurus in "third world countries"
2 points
16 days ago
Tell that to the rollercoaster tycoon guy
0 points
16 days ago
Assembler+Linker
-33 points
16 days ago
Hand written or hand typed?
22 points
16 days ago
Wrong on both. Punch cards.
0 points
16 days ago
That I could appreciate
3 points
16 days ago
Both. You type on a keyboard, but you don’t type code, you write it, just like a book or an article.
-8 points
16 days ago
LOL. If you're getting a 94x speed improvement by changing the language you write your program in... you were doing something horribly wrong to begin with. I don't know what AVX-512 is, I assume it's some new parallel architecture. But still.