This paper addresses the online estimation of time-frequency masks, which enables mask-based beamforming to run online for robust automatic speech recognition (ASR). Two approaches to online mask estimation have been developed separately for this purpose. One is based on a deep neural network (DNN) and exploits the spectral features of the signal. The other is based on spatial clustering (SC) and exploits the spatial features of the signal. This paper proposes a new method that integrates the two approaches to further improve online mask estimation by combining their advantages. Experiments on the real data of the CHiME-3 multichannel noisy speech corpus show that the proposed method greatly outperforms the conventional approaches in terms of word error rate (WER).
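
To make the role of the time-frequency mask concrete, the following is a minimal sketch of how a mask can drive a beamformer. It is an illustration under stated assumptions, not the method proposed here: it processes a whole utterance in batch rather than online, it takes the mask as given (whether it comes from a DNN, from SC, or from a combination is left open), and it uses one common MVDR formulation based on mask-weighted spatial covariance matrices. The function name `mask_based_mvdr` and the parameters `ref_channel` and `eps` are hypothetical.

```python
import numpy as np


def mask_based_mvdr(stft, speech_mask, ref_channel=0, eps=1e-6):
    """Batch mask-based MVDR beamformer (illustrative sketch only).

    stft:        complex array, shape (channels, frames, freqs)
    speech_mask: real array in [0, 1], shape (frames, freqs)
    Returns the enhanced STFT, shape (frames, freqs).
    """
    n_ch = stft.shape[0]
    noise_mask = 1.0 - speech_mask

    # Rearrange to (freqs, channels, frames) so each frequency is one batch item.
    y = stft.transpose(2, 0, 1)

    def weighted_scm(mask):
        # Mask-weighted spatial covariance matrix for every frequency bin.
        w = mask.T[:, None, :]                       # (freqs, 1, frames)
        scm = (w * y) @ y.conj().transpose(0, 2, 1)  # (freqs, ch, ch)
        norm = np.maximum(mask.sum(axis=0), eps)     # (freqs,)
        return scm / norm[:, None, None]

    phi_s = weighted_scm(speech_mask)
    # Diagonal loading keeps the noise covariance invertible (an assumption
    # of this sketch, not something specified in the text above).
    phi_n = weighted_scm(noise_mask) + eps * np.eye(n_ch)

    # MVDR filter: w = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
    # where u selects the reference channel.
    num = np.linalg.solve(phi_n, phi_s)              # (freqs, ch, ch)
    trace = np.trace(num, axis1=1, axis2=2)          # (freqs,)
    w = num[:, :, ref_channel] / (trace[:, None] + eps)

    # Apply the filter: output(t, f) = w(f)^H y(t, f).
    return np.einsum("fc,fct->tf", w.conj(), y)
```

An online variant would replace the batch sums inside `weighted_scm` with recursive updates of the two covariance matrices as each frame and its mask arrive, which is why the quality of the online mask estimate directly limits the beamformer.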