tensorflow::Tensor - PaddlePaddle/Paddle GitHub Wiki
tensorflow::Tensor represents an n-dimensional array of values, like caffe2::Tensor.
Unlike caffe2::Tensor<Context>, which is a class template, tensorflow::Tensor is an ordinary class.
caffe2::Tensor<Context>'s constructor doesn't allocate memory; instead, allocation is deferred until mutable_data is called. In contrast, tensorflow::Tensor's constructor allocates the memory.
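The difference between deferred and immediate allocation can be illustrated with a minimal sketch. LazyTensor and EagerTensor are hypothetical stand-ins written for this page; they are not the real Caffe2 or TensorFlow classes and support only float elements.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the Caffe2 style: the constructor only records
// the element count; storage is created on the first mutable_data() call.
class LazyTensor {
 public:
  explicit LazyTensor(std::size_t n) : size_(n) {}
  bool allocated() const { return data_ != nullptr; }
  std::size_t size() const { return size_; }
  float* mutable_data() {
    if (!data_) data_.reset(new float[size_]());  // allocate on first use
    return data_.get();
  }

 private:
  std::size_t size_;
  std::unique_ptr<float[]> data_;
};

// Hypothetical stand-in for the TensorFlow style: the constructor allocates.
class EagerTensor {
 public:
  explicit EagerTensor(std::size_t n) : size_(n), data_(new float[n]()) {}
  bool allocated() const { return data_ != nullptr; }
  std::size_t size() const { return size_; }
  float* mutable_data() { return data_.get(); }

 private:
  std::size_t size_;
  std::unique_ptr<float[]> data_;
};
```

With these stand-ins, a freshly constructed LazyTensor reports allocated() == false until mutable_data() is called, while an EagerTensor owns its storage as soon as it is constructed.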
caffe2::Tensor<Context>'s template methods data<T> and mutable_data<T> can return an array of elements of any type; caffe2::Tensor::meta_ records the most recently returned (and allocated) element type. In contrast, tensorflow::Tensor's constructor accepts a DataType parameter that specifies the element type.
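As a rough illustration of fixing the element type at construction time, the sketch below stores a runtime type tag and checks it on every typed access. SketchDataType, SketchDataTypeOf, SketchElementSize, and TypedAtConstruction are hypothetical names invented for this example; the real DataType enum has many more members.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-member type tag, for illustration only.
enum class SketchDataType { kFloat, kInt32 };

// Maps a C++ type to its runtime tag at compile time.
template <typename T> struct SketchDataTypeOf;
template <> struct SketchDataTypeOf<float> {
  static constexpr SketchDataType value = SketchDataType::kFloat;
};
template <> struct SketchDataTypeOf<int> {
  static constexpr SketchDataType value = SketchDataType::kInt32;
};

inline std::size_t SketchElementSize(SketchDataType t) {
  switch (t) {
    case SketchDataType::kFloat: return sizeof(float);
    case SketchDataType::kInt32: return sizeof(int);
  }
  return 0;
}

// The element type is chosen once, in the constructor, and never changes.
class TypedAtConstruction {
 public:
  TypedAtConstruction(SketchDataType t, std::size_t n)
      : dtype_(t), bytes_(SketchElementSize(t) * n, 0) {}
  SketchDataType dtype() const { return dtype_; }

  // Typed access is only legal for the type given at construction.
  template <typename T>
  T* data() {
    assert(SketchDataTypeOf<T>::value == dtype_);
    return reinterpret_cast<T*>(bytes_.data());
  }

 private:
  SketchDataType dtype_;
  std::vector<unsigned char> bytes_;
};
```

The design choice shown here is that a mismatched data<T>() call is a programming error caught by the assertion, rather than a request to reallocate the buffer with a new element type.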
caffe2::Tensor<Context> supports only numeric element types, whereas tensorflow::Tensor also supports string-typed elements.
caffe2::Tensor<Context> doesn't support accessing data in protobuf messages, whereas tensorflow::Tensor does.
caffe2::Tensor<Context>'s destructor doesn't free memory; instead, its data member shared_ptr<T> data_ does. In contrast, tensorflow::Tensor's destructor takes responsibility for freeing the memory. In addition, tensorflow::Tensor does its own reference counting on the memory, whereas caffe2::Tensor<Context> relies on shared_ptr for that.
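The two ownership styles can be contrasted with a small sketch. RefCountedBuffer is a hypothetical class that keeps its own count and deletes itself when the count reaches zero; SharedCountAfterCopy shows the shared_ptr alternative, where the count lives in the smart pointer's control block. Neither is taken from the TensorFlow or Caffe2 sources.

```cpp
#include <cassert>
#include <memory>

// Hypothetical self-counting buffer: the object tracks its own references.
class RefCountedBuffer {
 public:
  // freed_flag is set when the buffer is destroyed, so callers can observe it.
  explicit RefCountedBuffer(bool* freed_flag) : refs_(1), freed_(freed_flag) {}
  void Ref() { ++refs_; }
  // Drops one reference; returns true if this call destroyed the object.
  bool Unref() {
    assert(refs_ > 0);
    if (--refs_ == 0) {
      delete this;
      return true;
    }
    return false;
  }
  int refs() const { return refs_; }

 private:
  ~RefCountedBuffer() { *freed_ = true; }  // only Unref() may destroy
  int refs_;
  bool* freed_;
};

// shared_ptr style: copying the pointer bumps a count held outside the data.
inline long SharedCountAfterCopy() {
  std::shared_ptr<float> a(new float[4], std::default_delete<float[]>());
  std::shared_ptr<float> b = a;
  return b.use_count();
}
```

This sketch is not thread-safe; a production reference count would typically use an atomic counter.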
The shape of a tensor is represented by tensorflow::TensorShape, which can be constructed from a list of int64 values or from a TensorShapeProto protobuf message.
TensorShape supports several internal representations of a shape because most tensors are low dimensional. This brings more complexity than Caffe2's vector<int64_t>. Indeed, tensor_shape.h and tensor_shape.cc take 759 lines of C++ code in total, more than the very candy majel::Dim, which takes 498 lines.
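To give a feel for why multiple representations add complexity, here is a minimal sketch of a shape class that stores up to four dimensions inline and falls back to a heap-allocated vector for larger ranks. SmallShape and its threshold of four are assumptions made for this example and do not reflect the actual TensorShape layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical shape with two representations: inline for low rank,
// heap-backed for high rank.
class SmallShape {
 public:
  static constexpr std::size_t kInline = 4;

  SmallShape(std::initializer_list<std::int64_t> dims) : ndims_(dims.size()) {
    if (ndims_ <= kInline) {
      std::size_t i = 0;
      for (std::int64_t d : dims) inline_[i++] = d;
    } else {
      heap_.assign(dims.begin(), dims.end());
    }
  }

  bool is_inline() const { return ndims_ <= kInline; }
  std::size_t dims() const { return ndims_; }

  // Every accessor has to branch on the active representation.
  std::int64_t dim_size(std::size_t i) const {
    assert(i < ndims_);
    return is_inline() ? inline_[i] : heap_[i];
  }

  std::int64_t num_elements() const {
    std::int64_t n = 1;
    for (std::size_t i = 0; i < ndims_; ++i) n *= dim_size(i);
    return n;
  }

 private:
  std::size_t ndims_;
  std::int64_t inline_[kInline] = {0, 0, 0, 0};
  std::vector<std::int64_t> heap_;
};
```

Low-rank shapes avoid a heap allocation, at the cost of a branch in each accessor; a plain vector<int64_t> needs neither the branch nor the extra code.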
The constructor of tensorflow::Tensor accepts a parameter Allocator* a and passes it to a newly created tensorflow::Buffer object, tensorflow::Tensor::buf_:
Tensor::Tensor(Allocator* a, DataType type, const TensorShape& shape)
    : shape_(shape), buf_(nullptr) {
  set_dtype(type);
  CHECK_NOTNULL(a);
  if (shape_.num_elements() > 0 || a->ShouldAllocateEmptyTensors()) {
    CASES(type, buf_ = new Buffer<T>(a, shape.num_elements()));
  }
  ...
}
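The CASES call in the constructor turns a runtime type value into a compile-time type T. A simplified sketch of how such a macro could work is shown below; SketchType, SKETCH_CASE, SKETCH_CASES, and ElementSizeViaCases are hypothetical names for this example, and the macro bodies are an assumption rather than a copy of TensorFlow's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical three-member type tag, for illustration only.
enum SketchType { kSketchFloat, kSketchInt32, kSketchString };

// Emits one switch case that binds T to TYPE and then runs STMTS.
#define SKETCH_CASE(TYPE, ENUM, STMTS) \
  case ENUM: {                         \
    typedef TYPE T;                    \
    STMTS;                             \
    break;                             \
  }

// Expands STMTS once per supported type inside a switch on TYPE_ENUM.
#define SKETCH_CASES(TYPE_ENUM, STMTS)             \
  switch (TYPE_ENUM) {                             \
    SKETCH_CASE(float, kSketchFloat, STMTS)        \
    SKETCH_CASE(int, kSketchInt32, STMTS)          \
    SKETCH_CASE(std::string, kSketchString, STMTS) \
  }

// Example use: the statement refers to T, which differs in each case.
inline std::size_t ElementSizeViaCases(SketchType type) {
  std::size_t size = 0;
  SKETCH_CASES(type, size = sizeof(T));
  return size;
}
```

In the constructor above, the statement passed to the macro is the Buffer<T> allocation, so each case instantiates Buffer for a different element type.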
tensorflow::Buffer then saves a into its parent class tensorflow::BufferBase's alloc_ field and calls Allocator::Allocate<T>:
template <typename T>
Buffer<T>::Buffer(Allocator* a, int64 n)
    : BufferBase(a), data_(a->Allocate<T>(n)), elem_(n) {}
Allocator::Allocate<T> calls Allocator::AllocateRaw and then calls type T's constructors via Allocator::RunCtor<T>:
template <typename T>
T* Allocate(size_t num_elements,
            const AllocationAttributes& allocation_attr) {
  ...
  void* p = AllocateRaw(kAllocatorAlignment, sizeof(T) * num_elements,
                        allocation_attr);
  T* typed_p = reinterpret_cast<T*>(p);
  if (typed_p) RunCtor<T>(typed_p, num_elements);
  return typed_p;
}
By default, Allocator::RunCtor<T> is a no-op, so it doesn't construct basic types. A specialization runs the string type's constructor:
template <>
inline void Allocator::RunCtor(string* p, size_t n) {
  RunStringCtor(p, n);
}
Similarly, there are corresponding Allocator::RunDtor<T> definitions.
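The allocate-then-construct pattern described above can be sketched in isolation. In this example, SketchRunCtor, SketchRunDtor, AllocateTyped, and DeallocateTyped are hypothetical free functions, std::malloc stands in for AllocateRaw, and the string specializations use placement new and an explicit destructor call, which is an assumption about what a helper like RunStringCtor needs to do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <string>

// Default: nothing to do for types that need no construction.
template <typename T>
void SketchRunCtor(T*, std::size_t) {}
template <typename T>
void SketchRunDtor(T*, std::size_t) {}

// std::string needs real construction and destruction in raw memory.
template <>
inline void SketchRunCtor<std::string>(std::string* p, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) new (p + i) std::string();
}
template <>
inline void SketchRunDtor<std::string>(std::string* p, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) p[i].~basic_string();
}

// Raw allocation first, then element construction if the allocation worked.
template <typename T>
T* AllocateTyped(std::size_t n) {
  void* raw = std::malloc(sizeof(T) * n);
  T* typed = static_cast<T*>(raw);
  if (typed) SketchRunCtor<T>(typed, n);
  return typed;
}

// Mirror image: destroy the elements, then release the raw memory.
template <typename T>
void DeallocateTyped(T* p, std::size_t n) {
  if (p) SketchRunDtor<T>(p, n);
  std::free(p);
}
```

Separating raw allocation from construction lets one allocator serve both plain numeric buffers, where construction is skipped entirely, and string buffers, where every element must be constructed before use.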
An override of Allocator::AllocateRaw calls port::AlignedMalloc:
void* AllocateRaw(size_t alignment, size_t num_bytes) override {
  void* p = port::AlignedMalloc(num_bytes, alignment);
  ...
  return p;
}
Likewise, the Allocator::DeallocateRaw override calls port::AlignedFree:
void DeallocateRaw(void* ptr) override {
  ...
  port::AlignedFree(ptr);
}
port::AlignedMalloc, port::AlignedFree, and other platform-independent memory allocation functions are declared in tensorflow/core/platform/mem.h:
namespace tensorflow {
namespace port {
void* AlignedMalloc(size_t size, int minimum_alignment);
void AlignedFree(void* aligned_memory);
void* Malloc(size_t size);
void* Realloc(void* ptr, size_t size);
void Free(void* ptr);
}
}
There are two implementations:
- The POSIX implementation in tensorflow/core/platform/posix/port.cc just calls C runtime functions like malloc. For example:
void* Malloc(size_t size) {
#ifdef TENSORFLOW_USE_JEMALLOC
  return jemalloc_malloc(size);
#else
  return malloc(size);
#endif
}
- The Windows implementation in tensorflow/core/platform/windows/port.cc is almost identical to the POSIX one, because the C runtime functions are almost the same.
Both implementations allocate CPU memory, not GPU memory.
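For reference, an aligned CPU allocation on a POSIX system can be written in a few lines. The sketch below uses posix_memalign and free; SketchAlignedMalloc, SketchAlignedFree, and IsAligned are hypothetical names, and this is one plausible way to implement such a pair, not a copy of the code in port.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX only)

// Returns memory aligned to `alignment` bytes, or nullptr on failure.
// posix_memalign requires the alignment to be a power of two and a
// multiple of sizeof(void*), so small values are rounded up.
inline void* SketchAlignedMalloc(std::size_t size, std::size_t alignment) {
  if (alignment < sizeof(void*)) alignment = sizeof(void*);
  void* ptr = nullptr;
  if (posix_memalign(&ptr, alignment, size) != 0) return nullptr;
  return ptr;
}

// Memory from posix_memalign is released with plain free().
inline void SketchAlignedFree(void* aligned_memory) { free(aligned_memory); }

// Helper for checking the result.
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A Windows version would need a different pair of runtime calls, which is why the platform-specific code sits behind a common header.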
The TensorFlow codebase doesn't call cudaMalloc. Instead, there is one function, perftools::gputools::cuda::CUDADriver::DeviceAllocate, that calls cuMemAlloc:
/* static */ void *CUDADriver::DeviceAllocate(CudaContext *context,
                                              uint64 bytes) {
  ...
  CUresult res = cuMemAlloc(&result, bytes);
  ...
}
Class CUDADriver consists of a set of static methods, each corresponding to a CUDA driver API call. For example, CUDADriver::DeviceDeallocate calls cuMemFree:
/* static */ void CUDADriver::DeviceDeallocate(CudaContext* context,
                                               void *location) {
  ...
  CUresult res = cuMemFree(pointer);
  ...
}
Only CUDAExecutor::Allocate(uint64 size) calls CUDADriver::DeviceAllocate(context_, size):
void *CUDAExecutor::Allocate(uint64 size) {
  return CUDADriver::DeviceAllocate(context_, size);
}
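The layering described here, a class of static methods wrapping a C-style driver API with an executor on top, can be sketched without CUDA. Everything below is hypothetical: FakeMemAlloc and FakeMemFree stand in for a status-code driver API and simply use host memory, and FakeDriver and FakeExecutor only mirror the shape of CUDADriver and CUDAExecutor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical C-style API that reports errors through a status code.
// These are NOT CUDA functions; they allocate ordinary host memory.
enum FakeResult { FAKE_SUCCESS = 0, FAKE_ERROR_OUT_OF_MEMORY = 2 };

inline FakeResult FakeMemAlloc(void** out, std::size_t bytes) {
  *out = std::malloc(bytes);
  return *out ? FAKE_SUCCESS : FAKE_ERROR_OUT_OF_MEMORY;
}

inline FakeResult FakeMemFree(void* p) {
  std::free(p);
  return FAKE_SUCCESS;
}

// Static-method wrapper: one method per underlying API call, translating
// status codes into return values that are easier to use from C++.
class FakeDriver {
 public:
  static void* DeviceAllocate(std::size_t bytes) {
    void* result = nullptr;
    FakeResult res = FakeMemAlloc(&result, bytes);
    return res == FAKE_SUCCESS ? result : nullptr;
  }
  static bool DeviceDeallocate(void* location) {
    return FakeMemFree(location) == FAKE_SUCCESS;
  }
};

// Executor layer: the only caller of the driver wrapper's allocation method.
class FakeExecutor {
 public:
  void* Allocate(std::size_t size) { return FakeDriver::DeviceAllocate(size); }
  void Deallocate(void* p) { FakeDriver::DeviceDeallocate(p); }
};
```

Keeping all raw API calls inside one wrapper class means error handling and logging live in a single place, and higher layers never see the status codes.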
I haven't yet figured out how, or whether, Tensor calls CUDAExecutor::Allocate for GPU memory.