While what the article describes is clever, it is needlessly complex, and filled with various compiler switches and extensions.
In contrast, here's a stupid simple approach:
https://www.digitalmars.com/articles/C-biggest-mistake.html
where bounds-checkable arrays are declared as:
int a[..];
`a` consists of two fields, a `length` and a `pointer`. Indexing it means the compiler can (optionally) insert a bounds check:
char s[..] = "string";
s[10] = 'x'; // fatal runtime error
We can turn a pointer into a bounds checked array by "slicing" it:
int *p = (int*) malloc(10 * sizeof(int));
int a[..] = p[0 .. 10];
A bounds checked array can be turned into a pointer:
int *p = &a[3]; // point to 3rd element of a[..]
That's all there is to it. No pages and pages of compiler switches and extensions.
Does it work? We've been doing that with D for over 20 years. Hell yeah, it works. It works fantastically well. It does not disturb any existing C code.
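For readers without a compiler that accepts `[..]`, the length-plus-pointer idea can be approximated in standard C. This is only a sketch: `int_slice`, `slice_from_ptr` and `slice_at` are hypothetical names invented here, not part of D, the linked article, or any proposal, and the check is a library call rather than something the compiler inserts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for `int a[..]`: a length and a pointer. */
typedef struct {
    size_t length;
    int   *ptr;
} int_slice;

/* Rough equivalent of p[lo .. hi]. The caller vouches that p really
   points to at least hi elements; nothing here can verify that. */
static int_slice slice_from_ptr(int *p, size_t lo, size_t hi)
{
    assert(p != NULL && lo <= hi);
    int_slice s = { hi - lo, p + lo };
    return s;
}

/* Checked indexing: returns NULL instead of forming an out-of-range access. */
static int *slice_at(int_slice s, size_t i)
{
    return (i < s.length) ? &s.ptr[i] : NULL;
}
```

Going back to a raw pointer is just `s.ptr + i`, at which point the length is no longer attached, the same trade-off the comment describes for `&a[3]`.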
Also what you're proposing...would be an extension!
With the [..] proposal, it is easy enough to convert it back and forth between pointers and [..] to conform to required interfaces. One could even make the [..] implicitly convertible to a pointer.
> We can turn a pointer into a bounds checked array by "slicing" it:
> int *p = (int*) malloc(10 * sizeof(int));
> int a[..] = p[0 .. 10];
When `p` is a parameter in a function, the function cannot know that it can create a slice of up to 10 elements (I assume that the `p[0 .. 10]` creates an array indexed from 0 - 9).
What if the line was:
int a[..] = p[0..12]
Do we still get undefined behaviour?
> A bounds checked array can be turned into a pointer:
> int *p = &a[3]; // point to 3rd element of a[..]
Assuming that a indexes from 0 to 9, what happens when we use p with an out of range index, for example:
int *p = &a[8];
blah = p[3];
My main concern is how to tell other functions that the array has a maximum size, and how to determine (inside a function) what the maximum length of its parameters is.
That's right, when a bounds checked array is converted to a pointer, the bounds do not go with it. Presumably, the function receiving the p has some way to determine the length (such as strlen, or via another parameter) from which the correct array can be reconstructed by doing a slice.
> What if the line was: int a[..] = p[0..12] Do we still get undefined behaviour?
Yes, if the 12 extends past the end of the data p points to.
> Assuming that a indexes from 0 to 9, what happens when we use p with an out of range index, for example: int *p = &a[8]; blah = p[3];
You get undefined behavior.
> My main concern is how to tell other functions that the array has a maximum size
The same way it's done now, by strlen, passing another argument with the length, or the function is able to get the length by other means. When a bounds checked array is converted to a pointer, the bounds are not part of the pointer.
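To make the answer above concrete, here is a sketch in plain C of a callee re-establishing the bound from information passed alongside the pointer. The function names `read_checked` and `char_checked` are made up for illustration; with the proposed syntax the comparison would come from the slice rather than being written by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The length arrives as a separate argument; the callee rebuilds the bound. */
static bool read_checked(const int *p, size_t len, size_t i, int *out)
{
    assert(p != NULL && out != NULL);
    if (i >= len)
        return false;           /* out of range: refuse instead of reading */
    *out = p[i];
    return true;
}

/* The length is recovered "by other means": strlen on a NUL-terminated string. */
static bool char_checked(const char *s, size_t i, char *out)
{
    assert(s != NULL && out != NULL);
    if (i >= strlen(s))
        return false;
    *out = s[i];
    return true;
}
```

For the `int *p = &a[8]; blah = p[3];` case, a caller that knows `a` has 10 elements would pass a length of 2, and `read_checked(p, 2, 3, &blah)` would return false rather than read past the end. If the caller passes the wrong length, the check is only as good as that argument.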
int (*p)[n] = malloc(sizeof *p);
(*p)[i] = 1; // run-time bounds check
https://godbolt.org/z/vb8dqx1od
But yes, having a type that included the bound makes sense. But I do not think using array syntax for pointers as in your proposal makes any sense.
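A small sketch of the pointer-to-array form, assuming a compiler that supports C99 variably modified types (they are optional in C11 and C17). Because `n` is part of the type of `p`, `sizeof *p` is evaluated at run time. The helper names are hypothetical, and the explicit comparison stands in for whatever check a compiler or sanitizer might insert on its own.

```c
#include <assert.h>
#include <stddef.h>

/* The bound n is part of p's type, so sizeof *p is computed at run time. */
static size_t element_count(size_t n, int (*p)[n])
{
    assert(p != NULL);
    return sizeof *p / sizeof (*p)[0];
}

/* Hand-written check using the bound carried in the type. */
static int store_checked(size_t n, int (*p)[n], size_t i, int value)
{
    if (i >= element_count(n, p))
        return -1;              /* index outside [0, n) */
    (*p)[i] = value;
    return 0;
}
```

Note that the bound still travels as a separate parameter `n`; the type only ties the two together within the callee.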
Dennis Ritchie got it right: https://www.bell-labs.com/usr/dmr/www/vararray.pdf
Ritchie DM. Variable-size arrays in C. The Journal of C Language Translation 1990;2:81-86.
void foo(int n, int (*p)[n]) {
(*p)[n] = 1;
}
which has failed to catch on, because it still stores the pointer and the length as two separately handled objects.
> Dennis Ritchie got it right
"This paper proposes to extend C by allowing pointers to adjustable arrays and arranging that the pointers contain the array bounds necessary to do subscript calculations and compute sizes."
It appears to be phat pointers.
In that aspect it's like when prototypes were added to C. Nothing changed for existing code, but prototypes are so advantageous people would retrofit existing code incrementally when doing routine maintenance.
Meanwhile, unlike in C99, this construction is not allowed by any version of the C++ standard; any such use would be a non-standard extension. I think this is unfortunate. I only write C, so I wonder if any C++ guru out there can answer this question: does modern C++ have a better solution to implement the same thing?
[1] https://devblogs.microsoft.com/oldnewthing/20040826-00/?p=38...
I'm no guru, but I know from experience you can do it in C++:
{0}[calvin ~] cat test.cpp
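#include <cstdlib>
#include <iostream>
#include <memory>
struct foo {
    int len;
    int v[]; // flexible array member: a compiler extension in C++
};
// malloc'd memory must be released with free, not delete
struct free_deleter {
    void operator()(void *p) const { std::free(p); }
};
int main(void) {
    auto p = std::unique_ptr<foo, free_deleter>(static_cast<foo *>(
        std::malloc(sizeof(foo) + sizeof(int) * 2)));
    p->len = 2;
    p->v[1] = 99;
    std::cerr << p->v[1] << std::endl;
    return 0;
}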
{0}[calvin ~] g++ -Wall -Wextra -std=c++17 test.cpp -o test
{0}[calvin ~] ./test
99
{0}[calvin ~] clang++ -Wall -Wextra -std=c++17 test.cpp -o test
{0}[calvin ~] ./test
99
EDIT: Remove unnecessary extern block, as pointed out by wahern in the replies.
But in this case, as others said, it may be accepted as an extension. It is still not part of C++.
MyStructA *a; MyStructB *b;
a = malloc((sizeof *a) + (sizeof *b)); b = (MyStructB *)&a[1];
You need to make sure that the second struct doesn't have stricter alignment requirements than the one preceding it, but using this technique you can stack any number of structures or arrays of structures in one allocation.
(I would generally not recommend this coding style unless you have very specific requirements of memory usage)
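The alignment caveat above can be handled by rounding the header size up to the payload's alignment before placing the second object. This is a sketch assuming C11 (`<stdalign.h>`); `header_t`, `point_t` and the function names are hypothetical, and overflow of the size computation is not checked.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header and payload types, for illustration only. */
typedef struct { int count; } header_t;
typedef struct { double x, y; } point_t;

/* Offset of the payload: sizeof(header_t) rounded up to point_t's alignment. */
static size_t payload_offset(void)
{
    size_t align = alignof(point_t);
    return (sizeof(header_t) + align - 1) / align * align;
}

/* One malloc holding a header followed by n points. Free via the header. */
static header_t *alloc_stacked(size_t n, point_t **points_out)
{
    assert(points_out != NULL);
    header_t *h = malloc(payload_offset() + n * sizeof(point_t));
    if (h == NULL)
        return NULL;
    h->count = (int)n;
    *points_out = (point_t *)((unsigned char *)h + payload_offset());
    return h;
}
```

Since `malloc` returns memory aligned for any fundamental type, an offset that is a multiple of `alignof(point_t)` keeps the payload correctly aligned. A single `free(h)` releases both parts.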
MyStructA {
...
MyStructB b[];
};
MyStructA* a = malloc(sizeof(MyStructA) + sizeof(MyStructB));
b = &a->b[0];
(Except, of course, that the syntax for locating 'b' is nicer this way, because you don't have to explicitly address the memory after 'a' and cast it to 'MyStructB'.)
No, the only standard way is to allocate a buffer large enough, and placement new the header and the bulk payload separately. The manual handling of alignment makes this very cumbersome.
Flexible array members would be nice, but I don't like that it is yet another overload of the array declaration syntax (other meanings of "T x[]": 1) declare an array of unknown size, 2) in a definition, deduce the size). For some reason C likes to overload the array declaration syntax with widely different meanings (looking at you, VLAs).
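For reference, this is roughly what the C99 flexible array member under discussion looks like on the C side, with the length stored next to the data. It is only a sketch; `struct int_vec` and its helpers are hypothetical names, and the bounds check is written by hand since the language does not connect `len` to `v`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Length stored next to the data it describes. */
struct int_vec {
    size_t len;
    int    v[];     /* C99 flexible array member: must be the last member */
};

/* One zero-initialized allocation for the header plus len elements. */
static struct int_vec *int_vec_new(size_t len)
{
    struct int_vec *p = calloc(1, sizeof *p + len * sizeof p->v[0]);
    if (p != NULL)
        p->len = len;
    return p;
}

/* Checked access: NULL when i is outside [0, len). */
static int *int_vec_at(struct int_vec *p, size_t i)
{
    assert(p != NULL);
    return (i < p->len) ? &p->v[i] : NULL;
}
```

The whole object is released with a single `free(p)`, which is the main attraction of the pattern compared with a separately allocated array.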
For the latency to be significant, there must be a lot of repeated calls which only read/write a very small number of elements in the array. Otherwise the accumulation of read/write operations performed when iterating over the data would dwarf the single pointer dereference.
So I'm curious now-- do OS kernels spend most of their time doing lots of calls that dive into such dynamically-allocated data just to extract a single datum or two?
https://gist.github.com/cozzyd/efda739301bb7eb3a4a63a145c93e...
I'd say it's also not more dangerous (or equally dangerous, depending on your camp) than a pointer from malloc().
And interestingly, COBOL handled this in a cleaner way. I forget some of the specifics, but there was a way to tell the compiler that one field of a record specified the length of the following array, allowing the same pattern in a type-safe way.
> If the size of the space requested is zero, the behavior is implementation-defined: either a null pointer is returned to indicate an error, or the behavior is as if the size were some nonzero value, except that the returned pointer shall not be used to access an object
So it may actually allocate (although the allocation is unusable).
If malloc(0) gets called as the first malloc in the program, the system break does not need to be moved, as there are always 0 bytes of space available... but malloc does like to move sysbreak by a large amount at a time to reduce the need for repeated calls...
I'm guessing malloc(0) does not move sysbreak and simply returns a pointer to the bottom of the heap?
I wish this trope would die. It really never was one.
I am not an expert on C nor assembly and would be curious if you could expand on this. The statement makes sense to me because my impression is that most of what happens in C code gets translated fairly straightforwardly to machine code, with the compiler taking care of bridging differences in the instruction sets of targeted architectures. I guess the reason this is simplistic is the inlining and loop unrolling done by an optimizing compiler. Is this what you mean?
[1] I don't know enough about the early history of C to be able to assert that it was never true, but it certainly hasn't been true since at least 1989.
Not saying C is a great standard, but that idea means that using C instead of assembly doesn't generate a big overhead, and it's still much easier to write C than assembly, especially when compiler support is very common.
I'm currently making a language that translates directly to C.
int some_int;
int some_array[] __attribute__((__element_count__(some_int)));
to store the size of some_array in some_int.
struct
{
int a;
/* ... */
int b[];
} foo;
struct
{
struct foo;
/* ... */
int c;
} bar;
I would expect to have issues when trying to access the other members of struct bar like c.
In other words the C standard could make bounds-checking of arrays possible with a good interop story if the standards committee believed it was worth doing. Compilers would have a flag to enable or disable the runtime checks based on safety/perf tradeoffs. Libraries would slowly add the relevant annotations. Eventually most code would have the option of having all array accesses bounds checked.
[1] https://learn.microsoft.com/en-us/cpp/code-quality/annotatin...
i give up, what does sizeof say? and why would it be sized by bytes?
> ...due to yet more historical situations (e.g. struct sockaddr, which has a fixed-size trailing array that is not supposed to actually be treated as fixed-size), GCC and Clang actually treat all trailing arrays as flexible arrays.
But I don't know, that doesn't seem to match the result I am getting with clang 13.1.6. It does seem to respect the array size declared in the struct, not treat it as a flexible array. I get -Warray-bounds warnings if I try to access anything past o->variable[3]. Maybe I'm misunderstanding what they're saying or my example is screwed up.
Edit: Actually, I guess it does end up treating it like a flexible array -- it produces -Warray-bounds warnings when compiling, but the resulting binary works (and doesn't trigger asan). Not sure I entirely understand it though.
Then fix codebases recursively until it's all ok. At ABI boundaries that don't use such types create values of such types corresponding to the given arguments (e.g., you could count the elements of `argv[]` then create a wrapper for the `argv`).
It seems that a memory safe C would be faster, in that you wouldn't have to learn yet another language and runtime to deploy your stuff.
Does anyone know where I can follow these discussions?