My Kd-Tree is pretty much working now. Tree is constructed with complexity O(n*log(n)) which is expected from a well designed Kd-Tree generator. My implemention goes like this:
First 3 arrays are created containing split-candidates for the 3 different axis x,y and z. Each split candidate is defined like this:
struct SSplit
{
real fSplit; // Split position of event
uint32 iItem; // Atom the split belongs to
bool bLeft; // true if event is a left event, false if it is a right event
uint32 nLeft; // number of objects left to split position
uint32 nRight; // number of objects right to split position
}

Next 3 other arrays are created which are simply pointers to the same split-candidates. I have implemented an < operator for the pointer, so we can sort the pointers instead of the split candidates. Handling of pointers are much more light weight than handling the split candidates themselves.

Now the split candidate pointers are sorted using standard C++ sort functions. This sorting is kept for the rest of the algorithm to avoid having to resort splits for every node that is computed.

Next the call to the recursive algorithm starts. For each set of split candidates, the best splitting candidate is chosen as the one with the lowest cost. The split candidates are then split into a left side and a right side, and each branch is split recursively until the algorithm decides that a leaf node is cheaper than a branch.

The cost function computes the cost by looking at the number of objects on each side of the split and the probability of hitting that side (surface area relative to parent area) and goes like this:
fCost = fCostTraversel + fCostIntersect*(nl*al + nr*ar)/fArea;

Since the split candidates are already sorted the number of objects on left and right side of each split candidate (this number is used in the cost function) is done by simply counting the objects while iterating through them from left to right and right to left.

Each KdTreeNode takes up 12 bytes which are spend like this:
struct KdTreeNode
{
real m_fSplit; // Splitting position according to split plane m_eAxis
union
{
KdTreeNode* m_pChild; // Pointer to two nodes
KdTreeItem* m_pItems; // Pointer to items in leaf node
};
struct
{
unsigned int m_eType : 1; // ENodeType, leaf=0, branch=1
unsigned int m_eAxis : 2; // ENodeAxis, x=0,y=1,z=2
unsigned int m_iItemCount : 29; // Number of KdTreeItems in leaf
};
};