Reasons for glDrawElements(BaseVertex) having huge driver overhead?

I'm currently working on refactoring a complex and likely CPU/driver-bound renderer.

In a nutshell:

- all rendering is done on a background thread (consumer), which is fed through a blocking queue of "rendering tasks" (see the queue sketch after this list)

- the main thread (producer) issues lots of small-batch draw calls like "moveto(a, b) - lineto(c, d) - lineto(e, f) - ... - flush()", inserting a task into the queue when necessary

- I designed the abstract interface to resemble modern graphics APIs (Metal/Vulkan), so I have buffers, textures, render passes and graphics pipelines

- the corresponding GL objects (VAOs, programs, framebuffers) are cached whenever possible, so that GL object creation is minimized (a cache sketch follows the list)

- GL state changes are not tracked or filtered, with a few exceptions (framebuffer, VAO and program changes)

- buffer data uploads are optimized with GL_MAP_UNSYNCHRONIZED_BIT; should the buffer become full, I do a glFenceSync + glWaitSync (doesn't happen too often, though); see the mapping sketch below
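
To make the first two bullets concrete, here is a minimal sketch of the producer/consumer queue, assuming the tasks are simply captured callables (RenderTask and TaskQueue are made-up names, not the actual interface):

```cpp
// Minimal producer/consumer queue sketch; the consumer is the render thread.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

using RenderTask = std::function<void()>;  // e.g. a captured moveto/lineto batch

class TaskQueue {
public:
    // Producer side (main thread): enqueue a task and wake the render thread.
    void push(RenderTask task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Consumer side (render thread): block until a task is available.
    RenderTask pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        RenderTask task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<RenderTask> tasks_;
};
```

The render thread then just loops on pop() and executes each task against the GL context it owns.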

 
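For the object-caching bullet, a hedged sketch of one way cached VAOs could be keyed; PipelineKey, its fields, and GetOrCreateVAO are hypothetical placeholders for whatever the real renderer keys on:

```cpp
// Sketch of caching VAOs by the state they encapsulate, so repeated
// pipelines never re-create GL objects.
#include <GL/glew.h>  // assumption: GLEW for GL function loading
#include <cstddef>
#include <unordered_map>

struct PipelineKey {
    GLuint program      = 0;
    GLuint vertexBuffer = 0;
    GLuint indexBuffer  = 0;
    bool operator==(const PipelineKey& o) const {
        return program == o.program && vertexBuffer == o.vertexBuffer &&
               indexBuffer == o.indexBuffer;
    }
};

struct PipelineKeyHash {
    std::size_t operator()(const PipelineKey& k) const {
        // Crude hash combine; good enough for a sketch.
        std::size_t h = std::hash<GLuint>{}(k.program);
        h = h * 31u + std::hash<GLuint>{}(k.vertexBuffer);
        h = h * 31u + std::hash<GLuint>{}(k.indexBuffer);
        return h;
    }
};

using VaoCache = std::unordered_map<PipelineKey, GLuint, PipelineKeyHash>;

// Returns the cached VAO for this key, creating and configuring it only once.
GLuint GetOrCreateVAO(VaoCache& cache, const PipelineKey& key) {
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;  // cache hit: no GL object creation, no setup calls

    GLuint vao = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, key.vertexBuffer);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, key.indexBuffer);
    // ... glEnableVertexAttribArray / glVertexAttribPointer setup goes here ...
    glBindVertexArray(0);

    cache.emplace(key, vao);
    return vao;
}
```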

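Finally, a sketch of the GL_MAP_UNSYNCHRONIZED_BIT upload path from the last bullet, assuming a single ring buffer written at a moving offset (RingBuffer and AllocateFromRing are made-up names). One detail worth double-checking against the real code: glWaitSync only stalls the GPU command stream, so a CPU-side wait for the buffer to drain needs glClientWaitSync instead:

```cpp
#include <GL/glew.h>  // assumption: GLEW loads the GL 3.2 sync/map entry points
#include <cstdint>

struct RingBuffer {
    GLuint     buffer   = 0;  // created elsewhere, e.g. glBufferData(..., GL_DYNAMIC_DRAW)
    GLsizeiptr capacity = 0;
    GLintptr   head     = 0;  // next free byte
};

// Returns a write-only pointer into the buffer; the caller writes the vertex
// data there and then calls glUnmapBuffer(GL_ARRAY_BUFFER).
void* AllocateFromRing(RingBuffer& rb, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, rb.buffer);

    if (rb.head + size > rb.capacity) {
        // Buffer full: every draw issued so far may still read from it, so
        // put a fence behind them and block the CPU until the GPU passes it.
        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(fence);
        rb.head = 0;
    }

    // Unsynchronized map: tells the driver we take responsibility for not
    // touching bytes the GPU may still be reading.
    void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, rb.head, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    rb.head += size;
    return ptr;
}
```
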
Despite my efforts, rendering is still incredibly slow. VTune shows that the majority of the CPU time is spent in glDrawElementsBaseVertex:

 

[attached VTune screenshot: shot.png]

Same thing on Intel cards. It pretty much looks like a driver-limited case; however, the funny thing is that the old DX9 implementation is something like 200x faster (hell, even GDI is faster).

 

So my question is: what might cause a drawing command to have such overhead? Or, in other words, which state changes should I look out for to avoid this?