As data sets grow to exascale, automated data analysis and visualization are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. We report the first shared SMP algorithm for fully parallel contour tree computation, with formal guarantees of O(lg V lg t) parallel steps and O(V lg V) work for data with V samples and t contour tree supernodes, and implementations with more than 30× parallel speed up on both CPU using TBB and GPU using Thrust and up 70× speed up compared to the serial sweep and merge algorithm.