When you have any kind of merge algorithm to merge two groups into one with one thread.
And now you want to merge two big groups with more then one thread you have to partition your two group in as many sub groups as threads you want to merge with.
And then merge these new pairs of sub-groups with your normal merge algorithm.
Little example, hope I can explain it.
group1:
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
group2:
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
in this example i want to merge with 4 threads, so i have to split up each group int 4 sub groups.
I do this like this, first:
from each group i pick 3 values, which partition the groups equally
group1: 7 15 23
group2: 8 16 24
now i sort these 6 values and i get 7 8 [color=red]15[/color] 16 23 24
Now i want to merge the 1st sub-group of group1 with all elements of group2 which are smaller then 7.
I know that those values from group2 will be in the first sub-group of group2, because of the sorting of the partition values. So i get with a binary search the index of the last value which is smaller then 7 from that sub-group. and merge now those two sub-groups and put them at the beginning of the output.
I do this with each sub-group of group1 and copy the remaining elements from group2 to the output.
To get the index of the sub-group of group2 in which to search for the border value from the partitioning of group1. One can just subtract the sorted index of the border value of the sub-group from its index in the group.
i.e. the border 15 from group1 has the index 1 in group1 but in the sorted list the index is 2. So 2-1 = 1 tells us that we will find the biggest value of group2 which is smaller then 15 in the 2nd sub-group of group2.
This should scale very well, because the overhead from sorting the border values and the binary searches are very small, because the number of values to sort/search are very small.
When doing a merge-sort the number of groups half each merge so one would double the number of sub-groups for each merge to keep all threads busy.