This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task of non-native speakers. Conversational competence of a language learner can be assessed through the use of multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the usage of visual features, such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multi-modal features, we first constructed an online video interview dataset of 210 Japanese English-learners with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show the end-to-end approach with deep neural networks achieves a higher correlation with human scoring than one with handcrafted features. Modalities are effective in the order of lexical, acoustic, and visual features.