[Rhi] [bug] Fix the Unified Allocator to no longer return first two allocations as dupes by hughperkins · Pull Request #8705 · taichi-dev/taichi

hughperkins · 2025-04-30T23:50:03Z

Issue: #

Brief Summary

The unified allocator always returned the same memory address for its first two allocations, so those two allocations always clobbered each other.


New walkthrough

Looking at the unified allocator code, we can see that when we first allocate a memory chunk, we do not add size to head, and thus the next allocation receives exactly the same address. As a result, two structs (or similar) end up at the same place in memory and clobber each other, which can plausibly lead to a plethora of hard-to-debug crashes.

High level overview of how allocator works

The allocator can work with two types of request:

exclusive
not exclusive

For exclusive requests:

a buffer is allocated from the system:
- taichi/taichi/rhi/common/unified_allocator.cpp
  
  Lines 74 to 75 in 562e05f
  
  void *ptr =
      HostMemoryPool::get_instance().allocate_raw_memory(allocation_size);
the size of the buffer matches the requested bytes (to within alignment bytes)
a new chunk is created
- chunk.data is set to the start of this buffer
- chunk.head is too, but it's not really used for exclusive access
- chunk.tail is set to the end of the buffer, but again not really used for exclusive access

Exclusive access requests are thus fairly straightforward

For non-exclusive, it is slightly more complex

for the first request, we allocate a much larger buffer than the request
- by default 1GB
  
  taichi/taichi/rhi/common/unified_allocator.cpp
  
  Lines 9 to 10 in 562e05f
  
  const std::size_t UnifiedAllocator::default_allocator_size =
      1 << 30;  // 1 GB per allocator
we create a new chunk
- chunk.data is set to the start of this buffer
- chunk.head is set to the start of unused space in this buffer
  - it should be set to chunk.data + size
    - prior to this PR, it is incorrectly being set to point to chunk.data though
  - meaning that the next request will incorrectly return the start of this chunk, again
  - then we return chunk.head
for subsequent requests, we look for a chunk that has enough available space (tail - head >= requested size)
- when we find such a chunk:
  - we add size to head (to within alignment)
  - we return the old head (to within alignment)
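The overview above can be modelled in a few lines. This is only a minimal sketch, not the actual UnifiedAllocator code: the Chunk struct, the function names, the use of integer offsets instead of raw pointers, and the omission of alignment are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one non-exclusive chunk. Integer offsets stand in for
// the raw pointers used by the real allocator; alignment is ignored.
struct Chunk {
  std::size_t data;  // start of the buffer
  std::size_t head;  // start of the unused space
  std::size_t tail;  // one past the end of the buffer
};

// Behaviour described as the bug: the new chunk's head is left at data.
inline std::size_t first_alloc_buggy(Chunk &chunk, std::size_t base,
                                     std::size_t chunk_size,
                                     std::size_t size) {
  (void)size;  // the requested size is never added to head
  chunk = Chunk{base, base, base + chunk_size};
  return chunk.data;
}

// Behaviour described as the fix: head is advanced past the first allocation.
inline std::size_t first_alloc_fixed(Chunk &chunk, std::size_t base,
                                     std::size_t chunk_size,
                                     std::size_t size) {
  chunk = Chunk{base, base + size, base + chunk_size};
  return chunk.data;
}

// Re-use path: hand out the old head and bump head by size, if it fits.
// Returns false when the chunk has too little space left.
inline bool alloc_from_chunk(Chunk &chunk, std::size_t size,
                             std::size_t &out) {
  if (chunk.tail - chunk.head < size) {
    return false;
  }
  out = chunk.head;
  chunk.head += size;
  return true;
}
```

With the buggy first allocation, a following call to alloc_from_chunk hands out the same offset again; with the fixed version, the second offset starts where the first allocation ends.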

Proposed fix

The proposed fix is to set head to data + size for newly allocated chunks

thinking about it, an alternative fix is to split the function into two parts:
- first part searches for an existing chunk, or makes a new one
  - does not return the allocated address
  - does not update head etc
- second part is always executed
  - updates head
  - returns old head

I don't really have a strong opinion on which fix we prefer. The second approach seems mildly cleaner perhaps, since it decouples 'finding/creating a chunk' from 'updating the chunk and returning the requested memory pointer'.
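As a rough illustration of the second approach, here is a sketch of what the split could look like. It is not the code in this PR: the class name, the fake integer addresses (starting at an arbitrary 4096), and the omission of alignment, locking and exclusive chunks are all assumptions for the sake of a short example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunk record; offsets stand in for real pointers.
struct PoolChunk {
  std::size_t data;
  std::size_t head;
  std::size_t tail;
};

class TwoPhaseAllocator {
 public:
  explicit TwoPhaseAllocator(std::size_t chunk_size)
      : chunk_size_(chunk_size) {}

  std::size_t allocate(std::size_t size) {
    // Phase 1: find or create a chunk. Never touches head.
    PoolChunk &chunk = find_or_create_chunk(size);
    // Phase 2: always executed, for new and existing chunks alike.
    std::size_t ret = chunk.head;
    chunk.head += size;
    return ret;
  }

  std::size_t num_chunks() const { return chunks_.size(); }

 private:
  PoolChunk &find_or_create_chunk(std::size_t size) {
    for (PoolChunk &chunk : chunks_) {
      if (chunk.tail - chunk.head >= size) {
        return chunk;
      }
    }
    // Stand-in for a system allocation: each new chunk gets a fresh,
    // non-overlapping range of fake addresses.
    std::size_t capacity = size > chunk_size_ ? size : chunk_size_;
    chunks_.push_back(PoolChunk{next_base_, next_base_, next_base_ + capacity});
    next_base_ += capacity;
    return chunks_.back();
  }

  std::size_t chunk_size_;
  std::size_t next_base_ = 4096;
  std::vector<PoolChunk> chunks_;
};
```

Because head is only ever updated in one place, a freshly created chunk cannot skip the update, which is the failure mode described above.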

Low level details

In more detail, and assuming non-exclusive mode:

let's say client requests size bytes
we allocate a chunk much larger than that, default_allocator_size bytes
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L65-L75
- the address of this chunk is stored in ptr
we create a chunk structure to store information about the chunk we just allocated
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L63
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L76-L79
- ptr is stored in chunk.data
- head is set to ptr too, via chunk.data
- tail is set to ptr + allocation size, via chunk.data
we return ptr
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L85
we should have added allocation_size to chunk.head
we can look at what happens when we re-use this chunk later, to confirm this:

When we re-use a chunk:

we loop over all allocated chunks, looking for non-exclusive chunks
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L39-L45
we add allocation size to head, adjusting for alignment, store that in ret, and check whether ret is less than or equal to tail
- https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L49-L53
if ret is less than or equal to tail, then we:
- update head to be equal to ret (so, we've updated it to be old head + allocation_size, adjusted for alignment)
- return ret
- (and break out of the loop, by virtue of the return)
otherwise, we ignore, and keep looping over available chunks
- (if no suitable chunks found, then we will allocate a brand new chunk)
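The "adjusting for alignment" step in the re-use path can be sketched as follows. This follows the high-level overview (return the aligned old head, then advance head past the allocation) and is an assumption-laden illustration rather than the real code: align_up, AlignedChunk and try_allocate are hypothetical names, and integer offsets stand in for pointers.

```cpp
#include <cassert>
#include <cstddef>

// Round addr up to the next multiple of alignment (alignment must be > 0).
inline std::size_t align_up(std::size_t addr, std::size_t alignment) {
  return (addr + alignment - 1) / alignment * alignment;
}

// Hypothetical chunk record holding only what the re-use path needs.
struct AlignedChunk {
  std::size_t head;  // start of the unused space
  std::size_t tail;  // one past the end of the buffer
};

// Tries to carve size bytes, aligned to alignment, out of chunk.
// On success, stores the aligned start in out, advances head past the
// allocation, and returns true. On failure, leaves the chunk untouched.
inline bool try_allocate(AlignedChunk &chunk, std::size_t size,
                         std::size_t alignment, std::size_t &out) {
  std::size_t start = align_up(chunk.head, alignment);
  std::size_t end = start + size;
  if (end > chunk.tail) {
    return false;  // does not fit; the caller moves on to the next chunk
  }
  out = start;
  chunk.head = end;
  return true;
}
```

A caller would loop over its non-exclusive chunks, call try_allocate on each, and fall back to allocating a brand new chunk if none of them succeeds.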

Original Walkthrough

High level summary:

both the LLVMRuntime and the result_buffer are allocated to the same memory address
this results in the return error code from running a kernel overwriting the address of snode tree 19
this results in subsequent access to any field having snode tree 19 crashing Taichi

Reproducing the bug

This bug was initially reproduced in #8569, but knowing what the bug is, we can reproduce it using the following much simpler code:

import taichi as ti

ti.init(arch=ti.arm64, debug=True)

fields = []
for i in range(20):
    fields.append(ti.field(float, shape=()))
    ti.sync()

@ti.kernel
def foo():
    fields[19][None] = 1

foo()
foo()

What this code does:

allocates snode trees 0 through 19, by creating fields indexed 0 through 19, and immediately calling ti.sync, to materialize the snode tree
- you can optionally print out the snode tree ids as long as you have a version of master that includes the PR at [lang] Add SNode.snode_tree_id #8697, to verify this assertion
following the creation of snode trees 0 through 19, we call a kernel twice
- the first kernel runs without issue
  - however, the address of snode tree 19 will be set to 0, following this kernel call, because it is overwritten by the return code of this call
- the second kernel call uses the address of snode tree 19, which is now 0, to access values in snode tree 19, and this causes a segmentation fault:
[E 04/30/25 19:00:30.022 3136495] Received signal 11 (Segmentation fault: 11)

Detailed walkthrough

LLVMRuntime and result_buffer are allocated the same memory address

When we first initialize the LLVMRuntime, we:

allocate a result_buffer from the unified allocator, via the host allocator

result_buffer allocated here

taichi/taichi/runtime/llvm/llvm_runtime_executor.cpp

Lines 699 to 700 in 562e05f

    
           *result_buffer_ptr = (uint64 *)HostMemoryPool::get_instance().allocate( 
        
               sizeof(uint64) * taichi_result_buffer_entries, 8);

call runtime_initialize

here

taichi/taichi/runtime/llvm/llvm_runtime_executor.cpp

Lines 706 to 711 in 562e05f

    
           runtime_jit 
        
               ->call<void *, void *, std::size_t, void *, int, void *, void *, void *>( 
        
                   "runtime_initialize", *result_buffer_ptr, host_memory_pool, 
        
                   runtime_objects_prealloc_size, runtime_objects_prealloc_buffer, 
        
                   num_rand_states, (void *)&host_allocate_aligned, (void *)std::printf, 
        
                   (void *)std::vsnprintf);

passing in the result_buffer
and the host allocator

inside runtime_initialize, we:
- allocate the LLVMRuntime, using the same allocator
  - here
    
    taichi/taichi/runtime/llvm/runtime_module/runtime.cpp
    
    Lines 932 to 933 in 562e05f
    
    runtime =
        (LLVMRuntime *)host_allocator(memory_pool, sizeof(LLVMRuntime), 128);
interestingly, the address allocated for the LLVMRuntime memory is identical to the address of the result_buffer memory
- verifiable by printing out the two addresses. Over multiple runs, they consistently have the same address as each other (though the exact addresses vary between runs)
these are both allocated from the exact same allocator
- if you print out the address of the allocator in each location, they are identical
- and no deallocations take place between the allocations
- so, how is this possible?
looking at the unified allocator, there is a concept of 'exclusive'
- here
  
  taichi/taichi/rhi/common/unified_allocator.cpp
  
  Line 32 in 562e05f
  
  bool exclusive) {
if a request for memory is not marked as exclusive, previously allocated buffers can be re-used, and allocated to new requests
- here
  
  taichi/taichi/rhi/common/unified_allocator.cpp
  
  Line 57 in 562e05f
  
  return (void *)ret;
the default is exclusive = false
- here
  
  taichi/taichi/rhi/common/unified_allocator.h
  
  Line 31 in 562e05f
  
  bool exclusive = false);
therefore, by default, memory chunks allocated can be re-used/returned/allocated across multiple requests

Let's first walk through the effects of LLVMRuntime and result_buffer occupying the same space.

The return code of a kernel overwrites snode tree address 19

following a kernel launch, the method runtime_retrieve_and_reset_error_code is run, in runtime.cpp

here

taichi/taichi/runtime/llvm/runtime_module/runtime.cpp

Lines 727 to 730 in 562e05f

    
           void runtime_retrieve_and_reset_error_code(LLVMRuntime *runtime) { 
        
             runtime->set_result(taichi_result_buffer_error_id, runtime->error_code); 
        
             runtime->error_code = 0; 
        
           }

this method calls runtime->set_result(taichi_result_buffer_error_id, runtime->error_code);
the first parameter is a constant
- defined here
  
  taichi/taichi/inc/constants.h
  
  Line 21 in 562e05f
  
  constexpr std::size_t taichi_max_num_ret_value = 30;

set_result:

is here

taichi/taichi/runtime/llvm/runtime_module/runtime.cpp

Lines 600 to 604 in 562e05f

    
           void set_result(std::size_t i, T t) { 
        
             static_assert(sizeof(T) <= sizeof(uint64)); 
        
             ((u64 *)result_buffer)[i] = 
        
                 taichi_union_cast_with_different_sizes<uint64>(t); 
        
           }

sets result_buffer[i] to t
- here
  
  taichi/taichi/runtime/llvm/runtime_module/runtime.cpp
  
  Line 602 in 562e05f
  
  ((u64 *)result_buffer)[i] =
in this case, i is taichi_max_num_ret_value
- which is 30
t is the return code
- empirically this has a value of 0, in the test cases described above
i is used to index into an array of u64
- here
  
  taichi/taichi/runtime/llvm/runtime_module/runtime.cpp
  
  Line 602 in 562e05f
  
  ((u64 *)result_buffer)[i] =
therefore each element of the array has 8 bytes
and therefore to get the address of the element which will be set to 0, we should multiply the index, which is 30, by 8
- thus, we will zero out 8 bytes at byte offset 30 * 8 = 240
the base address for this offset is result_buffer
- however, result_buffer has the same address as LLVMRuntime
- (as discussed in the first section)
- so we are going to clobber 8 bytes in LLVMRuntime with zeros, at offset 240
let's now look at where byte offset 240 is in LLVMRuntime

LLVMRuntime struct:

is here

taichi/taichi/runtime/llvm/runtime_module/runtime.cpp

Lines 552 to 562 in 562e05f

    
           struct LLVMRuntime { 
        
             PreallocatedMemoryChunk runtime_objects_chunk; 
        
             PreallocatedMemoryChunk runtime_memory_chunk; 
        
             host_allocator_type host_allocator; 
        
             assert_failed_type assert_failed; 
        
             host_printf_type host_printf; 
        
             host_vsnprintf_type host_vsnprintf; 
        
             Ptr memory_pool; 
        
             Ptr roots[kMaxNumSnodeTreesLlvm];

there are two PreallocatedMemoryChunks, each of which contains two pointers and a size_t

taichi/taichi/runtime/llvm/runtime_module/runtime.cpp

Lines 546 to 550 in 562e05f

    
           struct PreallocatedMemoryChunk { 
        
             Ptr preallocated_head = nullptr; 
        
             Ptr preallocated_tail = nullptr; 
        
             std::size_t preallocated_size = 0; 
        
           };

each pointer is 8 bytes, and size_t likely 8 bytes, for 24 bytes each
48 bytes total

host_allocator_type is a pointer to function -> 8 more bytes
assert_failed_type, host_printf_type, host_vsnprintf_type, and Ptr are also all pointers, so 8 bytes each, for a total for them of 32 bytes
now we arrive at roots, which is the snode tree roots address array
- at this point, we are at an offset of 48 + 8 + 32 = 88
- so our offset into roots will be 240 - 88 = 152
- each element of roots is also a pointer
- so size 8 bytes
- 152 bytes / 8 bytes = 19
- thus when we write the return code of 0 to result_buffer[30], we clobber the address of snode tree 19 with 0
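The offset arithmetic above can be checked with a small stand-alone sketch. It is not taken from the Taichi sources: RuntimeMirror and ChunkMirror are simplified stand-ins that model every leading member of LLVMRuntime as a plain pointer or size_t, kNumRoots is a hypothetical placeholder for kMaxNumSnodeTreesLlvm, and the numbers assume a 64-bit platform where pointers and size_t are 8 bytes and a null pointer is all-zero bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for kMaxNumSnodeTreesLlvm; the real value is not
// needed for the arithmetic, it only has to be larger than 19.
constexpr std::size_t kNumRoots = 32;

// Simplified mirror of PreallocatedMemoryChunk: two pointers and a size_t.
struct ChunkMirror {
  void *preallocated_head;
  void *preallocated_tail;
  std::size_t preallocated_size;
};

// Simplified mirror of the leading fields of LLVMRuntime.
struct RuntimeMirror {
  ChunkMirror runtime_objects_chunk;
  ChunkMirror runtime_memory_chunk;
  void *host_allocator;
  void *assert_failed;
  void *host_printf;
  void *host_vsnprintf;
  void *memory_pool;
  void *roots[kNumRoots];
};

// Index of the roots entry hit by a write to result_buffer[slot], when
// result_buffer aliases the start of the runtime struct. Assumes the write
// lands inside roots.
inline std::size_t clobbered_root(std::size_t slot) {
  std::size_t byte_offset = slot * sizeof(std::uint64_t);
  return (byte_offset - offsetof(RuntimeMirror, roots)) / sizeof(void *);
}

// Simulates the aliasing write of a zero error code into slot, and reports
// whether roots[index] ended up null afterwards.
inline bool write_clobbers_root(std::size_t slot, std::size_t index) {
  static int dummy = 0;
  RuntimeMirror runtime{};
  for (void *&root : runtime.roots) {
    root = &dummy;  // every root starts out non-null
  }
  std::uint64_t zero = 0;
  std::size_t byte_offset = slot * sizeof(zero);
  assert(byte_offset + sizeof(zero) <= sizeof(runtime));
  std::memcpy(reinterpret_cast<unsigned char *>(&runtime) + byte_offset,
              &zero, sizeof(zero));
  return runtime.roots[index] == nullptr;
}
```

Under those assumptions, offsetof(RuntimeMirror, roots) is 88 and clobbered_root(30) is 19, matching the walkthrough.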

kernel access of snode tree 19

when a kernel is initialized, and that kernel uses a field that is allocated on snode tree 19:
- the lowered kernel calls %10 = call ptr @LLVMRuntime_get_roots(ptr %9, i32 19) (the exact statement index varies depending on the kernel, of course)
  - this will return 0
- then, when we access a field based on offset 0, we crash.

Proposed fix

~~We need to ensure that the allocator does not allocate the same memory block twice, both to the LLVMRuntime and to the result_buffer~~

~~my proposed fix is to expose the exclusive parameter to the LLVMRuntime~~
~~and to set this parameter to true, when used from the runtime~~

Questions

~~A question in my mind is, why we would ever want exclusive to not be true. And by default, it is in fact set to false. I feel like there is some knowledge or insight that is missing to me.~~


hughperkins · 2025-05-01T09:09:39Z

(updated with fixed allocator code. Looks like the first two allocations from the allocator always get same address, in master. This PR fixes that. I'll also add some unit tests somehow/somewhere).

hughperkins · 2025-05-01T09:55:02Z

(added unit tests for allocator)

YilingQiao · 2025-05-01T17:52:01Z

Here is the script I used previously to reproduce a similar bug. The PR does indeed fix it.

import taichi as ti
ti.init(arch=ti.cpu, debug=True)
for _ in range(100):
    tmp = ti.field(dtype=ti.i32, shape=1)
    print(tmp[0])
    print(tmp[0])

hughperkins · 2025-05-01T22:49:39Z

@YilingQiao Great! Also, nice work on such a compact minimum reproducible example 🙌 Just 6 lines 😮

hughperkins · 2025-05-02T11:23:12Z

Note that whilst there's a failing test, it is not for a 'required' check. I expect it's because I'm allocating 1GB of test memory 😅 My idea is to upgrade the Allocator to be able to be passed a 'default allocator size', so we don't have to leak 1GB memory for the tests. But since this makes the allocator more complicated, and might break stuff in itself, I'd prefer to do that in a second PR, ideally, unless anyone has strong opinions about my breaking the iOS build? (It will basically triple the complexity of this PR, I feel.)

hughperkins · 2025-05-02T23:31:25Z

A couple of other options occur to me:

make the test not run on iOS
move the allocator test to a separate test executable
- it will still allocate 1GB
- but the main test executable allocates 1GB anyway
  - it's just that we currently allocate an additional 1GB...

bobcao3

Nice work, thanks!

hughperkins · 2025-05-07T07:47:22Z

Great! 🙌 Thank you :)

hughperkins changed the title ~~[Llvm] [Bug] Fix crash bug causing snode tree root address to be overwritten by return error code~~ Apr 30, 2025

hughperkins changed the title ~~[Llvm] [Bug] Fix crash bug causing snode tree root 19 address to be overwritten by return error code~~ Apr 30, 2025

hughperkins changed the title ~~[Llvm] [Bug] Fix crash bug caused by overwriting snode tree root 19 address with return error code~~ Apr 30, 2025

hughperkins changed the title ~~[Llvm] [Bug] Fix crash bug caused by overwriting snode tree root 19 address with return code~~ Apr 30, 2025

hughperkins force-pushed the hp/fix-allocate-same-memory-twice branch from 5d334f1 to 9c98bd5 Compare May 1, 2025 02:00

fix allocator

5a33be3

hughperkins force-pushed the hp/fix-allocate-same-memory-twice branch from 666dce9 to 5a33be3 Compare May 1, 2025 09:08

add host memory pool test

2d14344

dummy

c94b1b8

hughperkins changed the title ~~[Llvm] [bug] Fix crash bug caused by overwriting snode tree root 19 address with return code~~ May 1, 2025

hughperkins changed the title ~~[Rhi] [bug] Fix the Host Allocator~~ May 1, 2025

hughperkins added 2 commits May 1, 2025 06:44

dummy

8ca2458

dummy

ac2ddaf

hughperkins changed the title ~~[Rhi] [bug] Fix the Unified Allocator~~ May 1, 2025

see if we can skip running on ios

9c763d6

bobcao3 reviewed May 5, 2025


Comment thread tests/cpp/rhi/common/host_memory_pool_test.cpp Outdated

hughperkins added 3 commits May 6, 2025 06:40

Modify default allocation size during allocator test

d7fa823

remove old comment

b2482eb

remove ios conditional

9adf116

hughperkins mentioned this pull request May 6, 2025

Debug flag causes segfault on macos, arc=cpu, simple example #8706

Open

dummy

725fd13

bobcao3 approved these changes May 7, 2025


bobcao3 merged commit 8fe47c7 into taichi-dev:master May 7, 2025

hughperkins deleted the hp/fix-allocate-same-memory-twice branch May 7, 2025 07:47


bobcao3 merged 10 commits into taichi-dev:master from hughperkins:hp/fix-allocate-same-memory-twice
