Accurate detection of hand joints from images is a fundamental problem that underpins many applications in computer vision and human-computer interaction. This paper presents a two-stage network that detects hand joints from a single unmarked image using serial-parallel multi-scale feature fusion. In stage I, an encoder-decoder network locates the hand regions, and a shallow spatial hand feature representation module extracts the features of each detected hand region. The extracted hand features are then fed into stage II, which consists of serially connected feature extraction modules with similar structures, called “multi-scale feature fusion” (MSFF) modules. Each MSFF module contains parallel multi-scale feature extraction branches, which generate initial hand joint heatmaps. The initial heatmaps are then mutually reinforced using the anatomical relationships between hand joints. In terms of hand joint detection accuracy, the proposed network outperforms state-of-the-art methods on four datasets, 1) RHD, 2) HS, 3) MPII & NZSL, and 4) DCD8-6000, with PCK@0.2 values of 0.94, 0.92, 0.84, and 0.97, respectively. Processing one hand in an image takes between 24 and 37 milliseconds, which is fast enough to support many real-time applications. Copyright © 2023 Elsevier B.V. All rights reserved.
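For readers unfamiliar with the reported metric, PCK@0.2 (percentage of correct keypoints) is commonly computed as the fraction of predicted joints that fall within 0.2 times a reference length of their ground-truth positions. The following is a minimal sketch of that computation, not the authors' evaluation code. The abstract does not state which reference length is used, so the sketch assumes the longer side of the hand bounding box; the function name `pck`, the parameter `ref_size`, and the coordinates are hypothetical and chosen only for illustration.

```python
import math


def pck(pred, gt, ref_size, alpha=0.2):
    """Fraction of joints predicted within alpha * ref_size of the ground truth.

    pred, gt: lists of (x, y) joint coordinates in pixels, in the same joint order.
    ref_size: normalization length in pixels (assumed here to be the longer
              side of the hand bounding box; the paper may use another choice).
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    threshold = alpha * ref_size
    # Count joints whose Euclidean error is within the threshold.
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(gt)


# Illustrative data only: four joints, one of which is predicted far off.
gt = [(10.0, 10.0), (50.0, 40.0), (90.0, 80.0), (30.0, 70.0)]
pred = [(12.0, 11.0), (50.0, 75.0), (95.0, 80.0), (30.0, 70.0)]
score = pck(pred, gt, ref_size=100.0)
print(score)
```

With a reference length of 100 pixels the threshold is 20 pixels, so three of the four illustrative joints count as correct. A dataset-level score such as those quoted above would average this quantity over all annotated hands.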