To optimize kernel-based probabilistic models efficiently for problems such as sequential pattern recognition, we previously developed the kernel gradient matching pursuit method, an approximation technique for kernel-based classification. The conventional kernel gradient matching pursuit method approximates the optimal parameter vector by a linear combination of a small number of basis vectors. In this paper, we propose an improved kernel gradient matching pursuit method that imposes orthogonality constraints on the obtained set of basis vectors. We verified the efficiency of the proposed method in recognition experiments on handwritten image datasets and speech datasets. We realized scalable kernel optimization that accommodates various models, handles very high-dimensional features (> 100 K features), and supports large-scale datasets (> 10 M samples).
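
The general idea of approximating a vector by a few mutually orthogonal basis vectors can be illustrated with a minimal sketch. This is a generic greedy pursuit with Gram-Schmidt orthogonalization on explicit vectors, not the kernel gradient matching pursuit method proposed here; the function names, the candidate set, and the tolerance are hypothetical choices made only for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def greedy_orthogonal_pursuit(candidates, target, num_basis, tol=1e-12):
    """Approximate `target` with at most `num_basis` orthonormal vectors.

    Illustrative assumption: `candidates` is an explicit list of vectors.
    The method in the text works with kernel-based models instead.
    """
    residual = [float(x) for x in target]
    basis, coeffs = [], []
    for _ in range(num_basis):
        # Pick the candidate most correlated with the current residual.
        best = max(candidates, key=lambda c: abs(dot(c, residual)))
        # Gram-Schmidt step: remove components along the existing basis.
        v = [float(x) for x in best]
        for b in basis:
            proj = dot(b, v)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm < tol:
            # The best candidate adds no new direction, so stop early.
            break
        b_new = [vi / norm for vi in v]
        # Project the residual onto the new direction and subtract it.
        coeff = dot(b_new, residual)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b_new)]
        basis.append(b_new)
        coeffs.append(coeff)
    return basis, coeffs, residual


# Usage: approximate a 3-dimensional target from three candidate vectors.
example_candidates = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
example_target = [2, 1, 0.5]
example_basis, example_coeffs, example_residual = greedy_orthogonal_pursuit(
    example_candidates, example_target, num_basis=3
)
```

Because the basis vectors are kept orthonormal, each coefficient is a simple projection and earlier coefficients never need to be recomputed when a new basis vector is added, which is one common motivation for orthogonality in pursuit-style methods.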