We consider the problem of automatically assigning a category to a question posted to a Community Question Answering (CQA) site when the question contains not only text but also an image. For example, a CQA user may post a photograph of a dress and ask the community "Is this appropriate for a wedding?"; an appropriate category for this question might be "Manners, Ceremonial occasions." We tackle this problem using Convolutional Neural Networks with a DualNet architecture for combining the image and text representations. Our experiments with real data from Yahoo Chiebukuro and crowdsourced gold-standard categories show that the DualNet approach outperforms a text-only baseline, a sum-and-product baseline, Multimodal Compact Bilinear pooling (MCB), and a combination of sum-and-product and MCB (p = .0000 in each comparison), where the p-values are based on a randomised Tukey Honestly Significant Difference test with B = 5000 trials.
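The randomised Tukey HSD test mentioned above can be illustrated with a minimal sketch. It assumes the evaluation data are arranged as a matrix of per-question scores with one column per system. That layout, the function name `randomised_tukey_hsd`, and its parameters are illustrative assumptions, not details taken from the experiments reported here.

```python
import numpy as np


def randomised_tukey_hsd(scores, n_trials=5000, seed=0):
    """Estimate pairwise p-values with a randomised Tukey HSD test.

    scores: array of shape (n_items, n_systems), one score per item and
            system (assumed layout, for illustration only).
    Returns an (n_systems, n_systems) matrix of estimated p-values.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n_systems = scores.shape[1]

    # Observed absolute difference in mean score for every system pair.
    means = scores.mean(axis=0)
    observed = np.abs(means[:, None] - means[None, :])

    counts = np.zeros((n_systems, n_systems))
    for _ in range(n_trials):
        # Shuffle the system labels independently within each item (row).
        permuted = rng.permuted(scores, axis=1)
        perm_means = permuted.mean(axis=0)
        # Largest difference between any two permuted system means.
        max_diff = perm_means.max() - perm_means.min()
        # Count trials whose largest difference reaches the observed one.
        counts += max_diff >= observed

    return counts / n_trials
```

Under this scheme each p-value is a count divided by B, so with B = 5000 the smallest non-zero estimate is 1/5000 = .0002. A value reported as .0000 therefore corresponds to no trial producing a largest difference at least as large as the observed one.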