Regarding autovectorization:
> The other drawback of this method is that the optimizer won’t even touch anything involving floats (f32 and f64 types). It’s not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it’s currently nightly-only).
Ah - this makes a lot of sense. I've had zero trouble getting excellent performance out of Julia using autovectorization (from LLVM) so I was wondering why this was such a "thing" in Rust. I wonder if that nightly feature is a per-crate setting or what?
It's not something you seem to be able to just enable globally. From what I gather this is what is being referenced:
https://doc.rust-lang.org/std/intrinsics/index.html
Specifically the *_fast intrinsics.
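For concreteness, a minimal sketch of what using one of those looks like (nightly-only; fadd_fast is a real core intrinsic, but the surrounding function is my own example):

    // Nightly-only: core_intrinsics is a perma-unstable internal feature.
    #![feature(core_intrinsics)]
    use std::intrinsics::fadd_fast;

    fn sum_fast(xs: &[f32]) -> f32 {
        let mut acc = 0.0f32;
        for &x in xs {
            // SAFETY: the *_fast intrinsics are undefined behavior if an
            // input or the result is NaN or infinite, so this assumes
            // finite data throughout.
            acc = unsafe { fadd_fast(acc, x) };
        }
        acc
    }

Because the relaxed semantics are scoped to these specific calls, the optimizer is free to reorder (and thus vectorize) this particular reduction without any global fast-math switch.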
Does Julia ignore the problem of floating point not being associative, commutative nor distributive?
The reason it’s a thing is from LLVM and I’m not sure you can “language design” your way out of this problem as it seems intrinsic to IEEE 754.
No, it only uses the same LLVM compiler passes; you enable certain optimizations locally via macros if you want to allow reordering in a given expression.
Nitpick, but IEEE float operations are commutative (when relevant and appropriate). Associative and distributive they indeed are not.
Unless I’m having a brain fart it’s not commutative or you mean something by “relevant and appropriate” that I’m not understanding.
a+b+c != c+b+a
That’s why you need techniques like Kahan summation.
We’re in very nitpicky terminology weeds here (and I’m not the person you’re replying to), but my understanding is “commutative” is specifically about reordering operands of one binary op (4+3 == 3+4), while “associative” is about reordering a longer chain of the same operation (1+2+3 == 1+3+2).
Edit: Wikipedia actually says associativity is definitionally about changing parens[0]. Mostly amounts to the same thing for standard arithmetic operators, but it’s an interesting distinction.
[0]: https://en.wikipedia.org/wiki/Associative_property
It is not a nit, it is fundamental: a•b•c is associativity, specifically operator associativity.
Rounding and eventual underflow in IEEE means an expression X•Y for any algebraic operation • produces, if finite, a result (X•Y)·(1 + β) + μ, where |μ| cannot exceed half the smallest gap between numbers in the destination's format, |β| < 2^-N, and β·μ = 0 (μ ≠ 0 only when underflow occurs).
And yes, that is a binary operation only.
a•b•c is really (a•b)•c assuming left operator associativity, one of the properties that IEEE doesn't have.
"a+b+c" doesn't describe a unique evaluation order. You need some parentheses to disambiguate which changes are due to associativity vs commutativity. a+(b+c)=(c+b)+a should be true of floating point numbers, due to commutativity. a+(b+c)=(a+b)+c may fail due to the lack of associativity.
It is not, due to precision. Consider a=1.00000, b=-0.99999, and c=0.00000582618.
No, the two evaluations will give you exactly the same result: https://play.rust-lang.org/?version=stable&mode=debug&editio...
IEEE 754 operations are nonassociative, but they are commutative (at least if you ignore the effect of NaN payloads).
You still need to specify an evaluation order …
Does (1.00000+-0.99999)+0.00000582618 != 0.00000582618+(-0.99999+1.00000) ? This would disprove commutativity. But I think they're equal.
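For a concrete check (my own values, not the ones above, chosen so that regrouping visibly differs while swapping operands does not):

    fn main() {
        let (a, b, c) = (1.0f32, -1.0f32, 1e-8f32);
        // Commutativity: swapping the operands of each binary op changes nothing.
        assert_eq!(a + (b + c), (c + b) + a);
        // Associativity: regrouping does change the result here (1e-8 vs 0.0,
        // because -1.0 + 1e-8 rounds back to -1.0 in f32).
        assert_ne!((a + b) + c, a + (b + c));
    }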
IEEE 754 floating-point addition and multiplication are commutative in practice, even if there are exceptions with NaNs etc..
But remember that commutativity is about the operations (+, ×), which are binary operations: a+b=b+a and ab=ba. You can get accumulated rounding errors on iterated forms of those binary operations.
For vectorizing, that quote is only true for loops with dependencies between iterations, e.g. summing a list of numbers (..that's basically the only case where this really matters).
For loops without such dependencies Rust should autovectorize just fine as with any other element type.
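As a sketch of the distinction (my own example): the element-wise loop below has independent iterations and vectorizes even for floats, while the reduction threads its accumulator through every iteration, so vectorizing it would reorder the float additions:

    // Independent iterations: the optimizer can vectorize this freely.
    pub fn scale(dst: &mut [f32], src: &[f32], k: f32) {
        for (d, s) in dst.iter_mut().zip(src) {
            *d = *s * k;
        }
    }

    // The accumulator creates a dependency between iterations; vectorizing
    // would change the rounding, so the optimizer leaves it scalar.
    pub fn sum(xs: &[f32]) -> f32 {
        xs.iter().fold(0.0, |acc, &x| acc + x)
    }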
We used to tweak our scalar product simulator code to match the SIMD arithmetic order so we could hash the outputs for tests.
I wonder if it could autovec the simd-ordered code.
Odd that c# has a better stable SIMD story than Rust! It has both generic vector types across a range of sizes and a good set of intrinsics across most of the common instruction sets
Why would that be odd? C# is an older and mature language backed by a corporation, while Rust is younger and has been run by a small group of volunteers for years now.
not just any corporation.. the largest software corporation on the planet
not just any largest software corporation, one of my two least favourite largest software corporations on the planet.
not just any least favourite largest software corporation of yours...
the one that contributes most to open source among the largest corporations, so one of my favourites because of that
they were also one of the first of the large corps to show interest in Rust
C# portable SIMD is very nice indeed, but it's also not usable without unsafety. On the other hand, Rust compiler (LLVM) has a fairly competent autovectorizer, so you may be able to simply write loops the right way instead of the fancy API.
Unsafety means different things. In C#, SIMD is possible via `ref`s, which maintains GC safety (no GC holes), but removes bounds safety (array length check). The API is called appropriately Vector.LoadUnsafe
You are not "forced" into unsafe APIs with Vector<T>/Vector128/256/512<T>. While it is a nice improvement and helps with achieving completely optimal compiler output, you can use it without unsafe. For example, ZLinq even offers .AsVectorizable LINQ-style API, where you pass lambdas which handle vectors and scalars separately. It the user code cannot go out of bounds and the resulting logic even goes through (inlined later by JIT) delegates, yet still offers a massive speed-up (https://github.com/Cysharp/ZLinq?tab=readme-ov-file#vectoriz...).
Another example, note how these implementations, one in unsafe C# and another in safe F# have almost identical performance: https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
C# is blessed on that front. Java’s SIMD state is still sad, and golang is not as great either.
Yeah, golang is a particular nightmare for SIMD. You have to write plan 9 assembly, look up what they renamed every instruction to, and then sometimes find that the compiler doesn't actually support that instruction, even though it's part of an ISA they broadly support. Go assembly functions are also not allowed to use the register-based calling convention, so all arguments are passed on the stack, and the compiler will never inline it. So without compiler support I don't believe there's any way to do something like intrinsics even. Fortunately compiler support for intrinsics seems to be on its way! https://github.com/golang/go/issues/73787
Go has been using register based calling for a while now?
GP comment said it's not used for FFI, not that it's not used.
Why isn’t std::simd in stable yet? Why do so many great features seem stuck in the same nightly-forever limbo land - like generators?
I’m sure more people than ever are working on the compiler. What’s going on?
There really aren't that many people working on the compiler. It's mostly volunteers.
The structure is unlike a traditional company. In a traditional company, the managers decide the priorities and direct the employees what to work on while facilitating that work. While there are people in more managerial-type positions working on the Rust compiler, their job is not to tell the volunteers what to work on (they cannot), but instead to help the volunteers accomplish whatever it is they want to do.
I don't know about std::simd specifically, but for many features, it's simply a case of "none of the very small number of people working on the rust compiler have prioritized it".
I do wish there was a bounty system, where people could say "I really want std::simd so I'll pay $5,000 to the rust foundation if it gets stabilized". If enough people did that I'm sure they could find a way to make it happen. But I think realistically, very few people would be willing to put up even a cent for the features they want. I hear a lot of people wishing for better const generics, but only 27 people have set up a donation to boxy (lead of the const generics group https://github.com/sponsors/BoxyUwU ).
> There really aren't that many people working on the compiler. It's mostly volunteers.
Seems smart to put the language as a requirement for compiling the linux kernel and a bunch of other core projects then!
I think it seems just right. Languages these days are either controlled by volunteers or megacorps. Because Linux is about freedom and is not aligned with megacorps, I think they'd prefer a volunteer-driven language like Rust or C++ rather than the corporate ones.
I’m not sure you can argue that Rust and C++ have anything like a similar story around being volunteer oriented, given the number of places that have C++ compiler groups that contribute papers / implementations.
I'm not sure you can claim that Linux is about freedom. Linux is run by a bunch of corps and megacorps who are otherwise competing, not by volunteers.
> Why isn’t std::simd in stable yet?
Leaving aside any specific blockers:
- It's a massively hard problem to build a portable abstraction layer over the SIMD capabilities of various CPUs.
- It's a massive balancing act between performance and usability, and people care deeply about both.
- It's subject to Rust's stability guarantee for the standard library: once we ship it, we can't fix any API issues.
- There are already portable SIMD libraries in the ecosystem, which aren't subject to that stability guarantee as they can ship new semver-major versions. (One of these days, I hope we have ways to do that for the standard library.)
- Many people already use non-portable SIMD for the 1-3 targets they care about, instead.
> Many people already use non-portable SIMD for the 1-3 targets they care about, instead.
This is something a lot of people (myself included) have gotten tripped up by. Non-portable SIMD intrinsics have been stable under std::arch for a long time. Obviously they aren't nearly as nice to hold, but if you're in a place where you need explicit SIMD speed-ups, that probably isn't a killer.
Exactly. Many parts of SIMD are entirely stable, for x86, ARM, WebAssembly...
The thing that isn't stable in the standard library is the portable abstraction layer atop those. But several of those exist in the community.
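For example, a sketch of what's already stable today (my own example; SSE intrinsics under std::arch, which are always callable on x86-64 since SSE2 is baseline there):

    #[cfg(target_arch = "x86_64")]
    fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        use std::arch::x86_64::*;
        // SAFETY: SSE2 is part of the x86_64 baseline, so these intrinsics
        // are always available on this target.
        unsafe {
            let sum = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
            let mut out = [0.0f32; 4];
            _mm_storeu_ps(out.as_mut_ptr(), sum);
            out
        }
    }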
> we can't fix any API issues.
Can’t APIs be fixed between editions?
Partially (with upcoming support for renaming things across editions), but it's a pain if the types change (because then they're no longer common vocabulary), and all the old APIs still have to exist.
There is a GitHub issue that details what's blocking stabilization for each feature. I've read a few recently and noticed some patterns:
1. A high bar for quality in std
2. Dependencies on other unstable features
3. Known bugs
4. Conflicts with other unstable features
It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.
I think there is also some sampling bias. Tons of features get stabilized, but you are much more likely to notice a nightly feature that is unstable for a long time and complex enough to be excited about.
> It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.
Yep, and this is why many features die or linger on forever. Getting the trait solving working correctly across types and soundly across lifetimes is complicated enough to have killed several features previously (like specialization/min_specialization). It was the reason async trait took so long and why GATs were so important.
> Dependencies on other unstable features
AFAIK that’s not a blocker for Rust - the std library is allowed to use unstable at all times.
I think they meant on unstable features which might yet change their semantics. A stable API relying on unstable implementation is common in Rust (? operator, for example), but that is entirely dependent on having a good idea of what the eventual stable version is going to look like, in such a way that the already stable feature won't break in any way.
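The `?` operator is a good illustration: the operator itself is stable, while the `Try` trait it desugars to is still unstable, so stable code like this rides on unstable machinery:

    // Stable surface over an unstable implementation detail (the Try trait).
    fn parse_next(s: &str) -> Result<i32, std::num::ParseIntError> {
        let n = s.parse::<i32>()?; // `?` desugars via the unstable Try trait
        Ok(n + 1)
    }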
Usually when I go and read the GitHub and Zulip threads, the reason for paused work comes down to the fact that no one has come up with a design that maintains every existing promise the compiler has made. The most common ones I see are: the feature conflicts with safety or semver/encapsulation, interacts weirdly with object safety, causes post-monomorphization errors, or breaks perfect type class coherence (see Haskell's unsound specialization).
Too many promises have been made.
Rust needs more unsafe opt outs. Ironically simd has this so it does not bother me.
Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.
However, like other commenters I assume it’s because it’s hard, not all that many users of Rust really need it, and the compiler team is small and only consists of volunteers.
Don’t forget that autovectorization does a lot too. This is only for when you want to ensure you get exactly what you want, for many applications, they just kinda get it for free sometimes.
Would love this. I've heard it's not planned to be in the near future. Maybe "perfect is the enemy of good enough"?
Rust doesn’t have a BDFL so there’s nobody with the power to push things through when they’re good enough.
And since Rust basically sells itself on high standards (zero-cost abstractions, etc.) the devs go back and forth until it feels like the solution is handed down from the heavens.
And somehow it has ended up feeling more pleasant and consistent than most languages with a BDFL, even though it was designed by committee. I don't really understand how that happened, but I appreciate the cautious and conservative approach they've taken
std::arch::* intrinsics for SIMD are stable and you can use them today. The situation is only slightly worse than C/C++ because the Rust compiler cares a lot about undefined behavior, so there's some safe-but-technically-unsafe/annoying cfg stuff to make sure the intrinsics are actually emitted as you intend.
There is nothing blocking high quality SIMD libraries on stable in Rust today. The bar for inclusion in std is just much higher than the rest of the ecosystem.
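A sketch of that cfg/runtime-detection dance (my own example, stable Rust): compile a path with the feature enabled, verify support at runtime, and fall back otherwise.

    fn sum(xs: &[f32]) -> f32 {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // SAFETY: we just verified at runtime that AVX2 is available.
                return unsafe { sum_avx2(xs) };
            }
        }
        xs.iter().sum()
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(xs: &[f32]) -> f32 {
        // Within a #[target_feature] function LLVM may emit AVX2 for this
        // loop; explicit std::arch intrinsics could be used here as well.
        xs.iter().sum()
    }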
I would love generators too but I think the more features they add the more interactions with existing features they have to deal with, so it's not surprising that its slowing down.
Generators in particular have been blocked on the AsyncIterator trait. There are also open questions around consuming those (`for await i in stream`, or just keep to `while let Some(i) = stream.next().await`? What about parallel iteration? What about pinning obligations? Do that as part of the desugaring or make it explicit?). It is a shame because it is almost orthogonal, but any given decision might not be compatible with different approaches for generators. The good news is that some people are working on it again.
SIMD was one I thought we needed. Then I started benchmarking using iter with chunks and a nested if statement to check the chunk size. If it was necessary to do more, it was typically time to drop down to asm rather than worry about another layer in between the code and the machine.
This is the most surprising comment to me. It’s that bad? I haven’t benchmarked it myself.
Zig has @Vector. This is a builtin, so it gets resolved at comptime. Is the problem with Rust here too much abstraction?
I think you misinterpreted GP; he's saying that with some hints (explicit chunking with a branch on the chunk size), the compiler's auto-vectorization can handle the rest, inferring SIMD instructions in a manner that's 'good enough'.
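Something like this sketch (my own reconstruction of the chunking hint, not GP's exact code): fixed-size chunks give the compiler a constant trip count it can turn into SIMD, with a scalar loop for the remainder.

    pub fn add_assign(dst: &mut [f32], src: &[f32]) {
        let mut d = dst.chunks_exact_mut(8);
        let mut s = src.chunks_exact(8);
        for (dc, sc) in (&mut d).zip(&mut s) {
            for i in 0..8 {
                dc[i] += sc[i]; // fixed trip count: a prime autovectorization target
            }
        }
        // Scalar tail for the leftover elements.
        for (dv, sv) in d.into_remainder().iter_mut().zip(s.remainder()) {
            *dv += *sv;
        }
    }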
Of interest, I've written my own core::simd mimic so I don't have to make all my libs and programs use nightly. It started as me just making my Quaternion and Vec lib (lin-alg) have their own SoA SIMD variants (Vec3x16 etc), but I ended up implementing and publicly exposing f32x16 etc. Will remove those once core::simd is stable. Downside: These are x86 only; no ARM support.
I also added packing and unpacking helpers that assist with handling final lane 0 values etc. But there is still some subtlety, as the article pointed out, compared to using Rayon or non-SIMD CPU code, related to packing and unpacking. E.g. you should try to keep things in their SIMD form throughout the whole pipeline, and think about how you pair them with non-SIMD values (like pairing [T; 8] with f32x8), etc.
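A rough sketch of the SoA idea (names hypothetical, not the actual lin-alg API): eight vectors stored as three lane arrays rather than eight x/y/z structs, so each component maps directly onto a SIMD register.

    #[derive(Clone, Copy)]
    struct Vec3x8 {
        x: [f32; 8],
        y: [f32; 8],
        z: [f32; 8],
    }

    impl Vec3x8 {
        fn dot(self, other: Vec3x8) -> [f32; 8] {
            let mut out = [0.0f32; 8];
            for i in 0..8 {
                // Per-lane dot product; with core::simd these arrays would
                // be f32x8 and this loop a single expression.
                out[i] = self.x[i] * other.x[i]
                    + self.y[i] * other.y[i]
                    + self.z[i] * other.z[i];
            }
            out
        }
    }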
I'm not a rust programmer.
Can't you just make a local copy of the existing package and use that? Did you need to re-implement?
The nightly built-in core::simd makes use of a bunch of intrinsics to "implement" the SIMD ops (or, rather, directly delegate the implementation to LLVM which you otherwise cannot do from plain Rust), which are as much if not more volatile than core::simd itself (and also nightly-only).
> or, rather, directly delegate the implementation to LLVM which you otherwise cannot do from plain Rust
I thought the intrinsics specifically were available in plain safe Rust and the alignment-required intrinsics were allowed in unsafe Rust. I’m not sure I understand this “direct to LLVM dispatch” argument or how that isn’t accessible to stable Rust today.
You can indeed use intrinsics to make a SIMD library in plain safe stable rust today to some extent; that just isn't what core::simd does; rather, on the Rust-side it's all target-agnostic and LLVM (or whatever other backend) handles deciding how to lower any given op to the target architecture.
e.g. all core::simd addition ends up invoking the single function [1] which is then directly handled by rustc. But these architecture-agnostic intrinsics are unstable[2] (as they're only there as a building block for core::simd), and you can't manually use "#[rustc_intrinsic]" & co in stable rust either.
[1]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...
[2]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...
This is what I ended up doing as a stopgap.
Good question. Probably, but I don't know how and haven't tried.
> TL;DR: use std::simd if you don’t mind nightly, wide if you don’t need multiversioning, and otherwise pulp or macerator.
This matches the conclusion we reached for Chromium. We were okay with nightly, so we're using `std::simd` but trying to avoid the least stable APIs. More details: https://docs.google.com/document/d/1lh9x43gtqXFh5bP1LeYevWj0...
Do you compile the whole project with nightly or just specific components?
I'm curious about the uptake of SIMD and other assembly-level usage through high-level code. I'd assume most is done either by people writing very low-level code that directly manages the data, or by using very high-level libraries that are prescriptive about what data they work with?
How many people are writing somewhat bog-standard Rust/C and expect optimal assembly to be created?
It's really only comparable to assembly level usage in the SIMD intrinsics style cases. Portable SIMD, like std::simd, is no more assembly level usage than calling math functions from the standard library.
Usually one only bothers with the intrinsic level stuff for the use cases you're saying. E.g. video encoders/decoders needing hyper-optimized, per architecture loops for the heavy lifting where relying on the high level SIMD abstractions can leave cycles on the table over directly targeting specific architectures. If you're just processing a lot of data in bulk with no real time requirements, high level portable SIMD is usually more than good enough.
My understanding was that the difficulty with the intrinsics was more in how restrictive they are in what data they take in. That is, if you are trying to be very controlling of the SIMD instructions getting used, you have backed yourself into caring about the data that the CPU directly understands.
To that end, even "calling math functions" is something that a surprising number of developers don't do. Certainly not with the standard high level data types that people often try to write their software into. No?
More than that: many of the intrinsics can be unsafe in standard Rust. This situation got much better this year but it's still not perfect. Portable SIMD has always been safe, because they are just normal high-level interfaces. The other half is that intrinsics are specific to the arch. Not only do you need to make sure the CPUs support the type of operation you want to do, but you need to redo all of the work to e.g. compile to ARM for newer MacBooks (even if they support similar operations). This is also not a problem using portable SIMD: the compiler will figure out how to map the lanes to each target architecture. The compiler will even take portable SIMD and compile it for a scalar target for you, so you don't have to maintain a SIMD vs non-SIMD path.
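A minimal sketch of that portability (nightly-only; this is the real std::simd API, but the trivial example is my own):

    #![feature(portable_simd)]
    use std::simd::f32x8;

    fn add(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
        // The compiler lowers f32x8 to whatever the target offers
        // (AVX on x86, NEON on ARM) or to plain scalar code as a fallback.
        (f32x8::from_array(a) + f32x8::from_array(b)).to_array()
    }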
By "calling math functions" I mean things like:
    let x = 5.0f64;
    let result = x.sqrt();

Where most CPUs have a sqrt instruction but the program will automatically compile with a (good) software substitution for targets that don't. Neither SIMD nor these kinds of functions work with high-level data types; the only way to play is to write the object to understand how to break it down so the compiler knows what you want to do. With intrinsics you need to go a step further beyond that and tell the compiler what CPU instructions should be used for each step directly.
I am torn -- while I love the bitter critique of std::simd's nightly builds (why bother with any public release if it is never stable?), I cringed at the critique of "(c)urrently things are well fleshed out for i32, i64, f32, and f64 types". f64 and i64 go a long way for most numerical applications -- the OP seemed snowflaky to me with that entitled concern.
Not supporting other types (particularly smaller ones) can be quite limiting on portable SIMD, especially when it doesn't support AVX512 either, but those are certainly a good core group - just not the whole story. Regardless, I'm not sure how painting the OP as an entitled snowflake helps anything over just asking the question.
Somewhat related, does rust handle the riscv vector extension in a similar way to simd?
Scalable vectors as in RVV & SVE aren't available in Rust currently; see https://github.com/rust-lang/rust/issues/145052
(that said autovectorization should work, and fixed-width SIMD should map to RVV as best as possible, though of course missing out on perf if ran on wider-than-minimum hardware not on a native build)
> Fortunately, this problem only exists on x86.
Also RISC-V, where you can't even probe for extension support in user space unfortunately.
Linux of course does have an interface for RISC-V extension probing via hwprobe. And there's a C interface[1] for probing that's OS-agnostic (though it's rather new).
[1]: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s...
It's not strictly x86 either; the other case you care about is fp16 support on ARM. But it is included in the M1 target, so really it's only an issue on other ARM.
I really dislike those articles that are language focused. Why not try to share them in a way that is language agnostic?
This article is specifically about the implementation of SIMD in Rust, not other languages.