Although convolutional neural networks (CNNs) have been used successfully in many fields, including single-label scene classification, real-world scenes generally carry multiple semantics and therefore multiple labels. This is especially true of indoor scene classification, where scene content is complex. At the same time, most approaches make the network much deeper so that it can extract more detailed information. A deeper network, however, brings problems of its own, such as higher computational and network costs. To address these problems, this paper presents a novel framework called Joint-CNN, which is based on a proposed special label extraction scheme and network structure. Extensive experiments show that our method improves performance on the MIT indoor67 and SUN397 data sets.