Compared with the great success of supervised learning, e.g., convolutional neural networks (CNNs), unsupervised feature learning remains a highly challenging task because no training labels are available. Without labels for reference, the central difficulty is reducing the gap between features and image semantics blindly. This paper proposes a Self-Taught Encoder-Decoder Network (STED-Net) for unsupervised feature learning, which consists of a representation sub-network and a classification sub-network. The representation sub-network maps images to a feature representation. Using the features generated by the representation sub-network, the classification sub-network simultaneously maps the feature representation to a class representation and estimates pseudo labels by clustering the feature representation. By minimizing the distance between the class representation and the estimated pseudo labels, STED-Net teaches the features to represent class information. Through this self-taught feature representation, the gap between features and image semantics is reduced, and the features become progressively more “class-aware”. The whole learning process of STED-Net does not refer to any ground-truth class labels. Experimental results on widely used image classification datasets show that STED-Net achieves state-of-the-art classification performance compared with existing supervised and unsupervised feature learning models.
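The training loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-layer encoder, the linear classification head, plain k-means as the clustering step, and cross-entropy as the "distance" are all assumptions made for illustration, and every name (`kmeans_pseudo_labels`, `self_taught_step`, `W_enc`, `W_cls`) is hypothetical.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=10, seed=0):
    """Estimate pseudo labels by clustering features (plain k-means, an assumption)."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every feature to every center: shape (n, k).
        dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_taught_step(images, W_enc, W_cls, k, lr=0.1):
    """One update: encode, cluster, and pull class outputs toward the pseudo labels."""
    n = len(images)
    feats = np.tanh(images @ W_enc)          # representation sub-network (hypothetical single layer)
    pseudo = kmeans_pseudo_labels(feats, k)  # pseudo labels; no ground-truth labels are used
    targets = np.eye(k)[pseudo]              # one-hot pseudo labels
    probs = softmax(feats @ W_cls)           # class representation (hypothetical linear head)
    # Cross-entropy stands in for the unspecified "distance" (an assumption).
    loss = -np.mean(np.sum(targets * np.log(np.clip(probs, 1e-12, 1.0)), axis=1))

    # Backpropagate through both sub-networks; pseudo labels are treated as constants.
    g_logits = (probs - targets) / n
    g_W_cls = feats.T @ g_logits
    g_pre = (g_logits @ W_cls.T) * (1.0 - feats ** 2)  # derivative of tanh
    g_W_enc = images.T @ g_pre
    return W_enc - lr * g_W_enc, W_cls - lr * g_W_cls, loss, pseudo

# Usage on random stand-in "images" (flattened to 8-dimensional vectors).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_cls = rng.normal(scale=0.1, size=(4, 3))
for _ in range(5):
    W_enc, W_cls, loss, pseudo = self_taught_step(X, W_enc, W_cls, k=3)
```

Because the pseudo labels are re-estimated from the current features at every step, cluster indices may permute between iterations; how the actual method handles this is not stated in the abstract, so the sketch leaves it unaddressed.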
ASJC Scopus subject areas
- Computer Networks and Communications