In this paper, we propose a method for improving the performance of semantic video indexing. Our approach extracts features from multiple convolutional neural networks (CNNs), builds multiple classifiers from these features, and integrates the classifiers. We employ four measures to accomplish this: (1) exploiting the multiple pieces of evidence observed in each video and compressing them into a fixed-length vector; (2) introducing gradient and motion features into the CNNs; (3) enriching the variations of the training and test sets; and (4) extracting features from several CNNs trained on various large-scale datasets. We evaluated our method on the test set of the TRECVID 2014 evaluation benchmark in terms of the mean extended inferred average precision. Our system scored 35.7 on this measure, outperforming the state-of-the-art TRECVID 2014 performance of 33.2. Building on this work, our submission to TRECVID 2015 was ranked second among all submissions.
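To make the first measure concrete, the sketch below shows one simple way to compress per-frame CNN features into a fixed-length video-level vector via average pooling followed by L2 normalization. This is an illustrative assumption, not necessarily the aggregation scheme used in the paper; the function name and toy data are hypothetical.

```python
import numpy as np

def pool_frame_features(frame_features):
    """Compress per-frame CNN feature vectors into one fixed-length
    video-level vector by average pooling (an illustrative choice)."""
    feats = np.asarray(frame_features, dtype=np.float64)
    video_vec = feats.mean(axis=0)  # one value per feature dimension
    # L2-normalize so videos with different frame counts are comparable
    norm = np.linalg.norm(video_vec)
    return video_vec / norm if norm > 0 else video_vec

# Toy example: 3 frames, each with a 4-dimensional CNN feature
frames = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]
v = pool_frame_features(frames)
```

Regardless of how many frames a video has, the output has the same dimensionality as one frame's feature vector, which is what lets a single fixed-input classifier be trained over all videos.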