Strange behavior when allocating an OpenCL buffer using CL_MEM_READ_ONLY

Dear fellow developers,

it seems that creating an OpenCL buffer with both CL_MEM_READ_ONLY and CL_MEM_ALLOC_HOST_PTR causes the AMD platform to allocate write-combined host memory. A simple example that reproduces this behavior is posted below. (I am using a Radeon Pro WX 5100, Windows 10 (64-bit), and the latest Radeon Pro driver.)

Curiously, when CL_MEM_ALLOC_HOST_PTR is not passed and the buffer is simply mapped, the host allocation provided by the runtime is not write-combined (cf. the output generated by the program).

 

#define __CL_ENABLE_EXCEPTIONS

// C++ includes
#include <iostream>
#include <string>
#include <vector>

// Windows API
#include <Windows.h>

// OpenCL includes
#include <CL/cl.hpp>


int main( void ) {
    try {
        std::vector< cl::Platform > platforms;
        std::vector< cl::Device > devices;

        // Platform selection
        cl::Platform::get( &platforms );
        const cl::Platform &platform = platforms[ 0 ];

        // Device selection
        platform.getDevices( CL_DEVICE_TYPE_GPU, &devices );
        const cl::Device &device = devices[ 0 ];

        // Print platform information
        std::string name;
        std::string version;
        platform.getInfo( CL_PLATFORM_NAME, &name );
        platform.getInfo( CL_PLATFORM_VERSION, &version );
        std::cout << "(Using the platform " << name << " at version " << version << ")" << std::endl;

        cl_context_properties props[ 3 ] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platform) (), 0 };
        cl::Context ctx( device, props );
        cl::CommandQueue queue( ctx, device );

        size_t bufferSize = 2048 * 1024 * sizeof( float );

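        // Case 1: CL_MEM_READ_ONLY only; the runtime provides the host allocation when the buffer is mapped
        {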
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY, bufferSize );

            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }
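        // Case 2: CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR; the runtime allocates host-accessible memory up front
        {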
            cl::Buffer buffer = cl::Buffer( ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bufferSize );

            float *bufferHost = static_cast<float*>(queue.enqueueMapBuffer( buffer, CL_TRUE, CL_MAP_READ, 0, bufferSize ));

            MEMORY_BASIC_INFORMATION memInfo;
            if ( VirtualQuery( reinterpret_cast<void*>(bufferHost), &memInfo, sizeof( memInfo ) ) )
            {
                std::cout << "Host allocation as write-combined: " << ((memInfo.AllocationProtect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
                std::cout << "Host memory is write-combined: " << ((memInfo.Protect & PAGE_WRITECOMBINE) ? "Yes" : "No") << std::endl;
            }

            queue.enqueueUnmapMemObject( buffer, bufferHost );
        }

        queue.finish();
    } catch ( cl::Error &error ) {
        std::cerr << "OpenCL C++ API Exception during " << error.what() << ": " << error.err() << std::endl;
    }

    return 0;
}
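
To see more than a Yes/No answer, the protection value returned by VirtualQuery can be decoded into flag names. The following is only a minimal sketch: the namespace, the k-prefixed constants and the function names are hypothetical and not part of the program above. The constants repeat the documented numeric values of the PAGE_* macros so that the snippet compiles without <Windows.h>.

```cpp
#include <cstdint>
#include <string>

namespace protect_sketch {

// Hypothetical names; the values mirror the documented PAGE_* constants.
struct Flag { std::uint32_t value; const char *name; };

constexpr std::uint32_t kPageGuard        = 0x100;
constexpr std::uint32_t kPageNoCache      = 0x200;
constexpr std::uint32_t kPageWriteCombine = 0x400;

// Returns true if the write-combine modifier bit is set.
inline bool isWriteCombined( std::uint32_t protect ) {
    return ( protect & kPageWriteCombine ) != 0;
}

// Builds a readable description such as "PAGE_READWRITE | PAGE_WRITECOMBINE".
inline std::string describeProtection( std::uint32_t protect ) {
    // Base access values: exactly one of these occupies the low byte.
    static const Flag kBase[] = {
        { 0x01, "PAGE_NOACCESS" },          { 0x02, "PAGE_READONLY" },
        { 0x04, "PAGE_READWRITE" },         { 0x08, "PAGE_WRITECOPY" },
        { 0x10, "PAGE_EXECUTE" },           { 0x20, "PAGE_EXECUTE_READ" },
        { 0x40, "PAGE_EXECUTE_READWRITE" }, { 0x80, "PAGE_EXECUTE_WRITECOPY" },
    };
    // Modifier bits that may be combined with a base access value.
    static const Flag kModifiers[] = {
        { kPageGuard, "PAGE_GUARD" },
        { kPageNoCache, "PAGE_NOCACHE" },
        { kPageWriteCombine, "PAGE_WRITECOMBINE" },
    };

    std::string out;
    const std::uint32_t base = protect & 0xFFu;
    for ( const Flag &f : kBase ) {
        if ( base == f.value ) out = f.name;
    }
    for ( const Flag &f : kModifiers ) {
        if ( protect & f.value ) {
            if ( !out.empty() ) out += " | ";
            out += f.name;
        }
    }
    if ( out.empty() ) return "(none)";
    return out;
}

} // namespace protect_sketch
```

In the program above, one could pass memInfo.AllocationProtect or memInfo.Protect to protect_sketch::describeProtection and print the result next to the Yes/No lines.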

 

I would argue that automatically allocating the host memory behind a CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR buffer as write-combined is not a good idea, and I believe it should be classified as a bug. CL_MEM_READ_ONLY only restricts how kernels access the buffer; it says nothing about how the host accesses it. Consider, for example, one thread that fills the buffer and uses it in a computation while another thread reads the same buffer in order to save it to a log file. With a write-combined allocation, this read takes a long time (up to 26 times slower on my system). Instead, I would suggest that the runtime allocate the host memory as write-combined only if CL_MEM_HOST_WRITE_ONLY is specified in addition (as the OpenCL specification more or less suggests).
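
The suggested rule can be written down as a small predicate. This is only a sketch of the proposal, not the logic of the AMD runtime or of any other implementation. The namespace, constant names and function name are hypothetical; the numeric values are those of the corresponding CL_MEM_* flags in cl.h, repeated here so the snippet compiles without the OpenCL headers.

```cpp
#include <cstdint>

namespace policy_sketch {

// Hypothetical names; the values mirror the CL_MEM_* bit flags from cl.h.
constexpr std::uint64_t kMemReadOnly      = 1u << 2; // CL_MEM_READ_ONLY
constexpr std::uint64_t kMemAllocHostPtr  = 1u << 4; // CL_MEM_ALLOC_HOST_PTR
constexpr std::uint64_t kMemHostWriteOnly = 1u << 7; // CL_MEM_HOST_WRITE_ONLY

// Proposed policy: use write-combined host memory only when the runtime
// allocates the host memory itself AND the application has promised that
// the host will only ever write to the buffer.
inline bool wantsWriteCombinedHostMemory( std::uint64_t flags ) {
    const bool runtimeAllocatesHostMemory = ( flags & kMemAllocHostPtr ) != 0;
    const bool hostOnlyWrites             = ( flags & kMemHostWriteOnly ) != 0;
    return runtimeAllocatesHostMemory && hostOnlyWrites;
}

} // namespace policy_sketch
```

Under this rule, the flag combination from the example program (CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR) would get ordinary cached host memory, and only adding CL_MEM_HOST_WRITE_ONLY would opt in to write-combining.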

 

Any comments on this observation? Thanks in advance for your replies.

Kind regards

bcaf01

PS: I would appreciate it if someone could add me to the white-list and move this topic to the appropriate developer forum!