A stand-in is a common technique for movies and TV programs in foreign languages. Current stand-in dubbing, which substitutes only the voice channel, produces an awkward mismatch with the speaker's mouth motion. Videophones with automatic speech translation are expected to come into wide use in the near future, and they will face the same problem unless the speaking face image is translated with synchronized lip motion. In this paper, we propose a method for tracking the motion of the face in video images, which is one of the key technologies for speaking-image translation. Most previous tracking algorithms aim to detect feature points on the face. However, these algorithms suffer from problems such as blurring of feature points between frames and occlusion of feature points when the head rotates. We propose a method that detects the translation and rotation of the head, given the three-dimensional shape of the face, by template matching with a 3D personal face wireframe model. Evaluation experiments were carried out against measured reference data of head motion. The proposed method achieves an average angle error of 0.48, confirming its effectiveness.
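To illustrate the core idea of estimating head rotation by matching a 3D face model against the image, the following is a minimal sketch, not the authors' implementation: it uses a simplified orthographic camera, a point cloud standing in for the wireframe model vertices, and a one-dimensional grid search over the yaw angle. All function names and parameters here are illustrative assumptions.

```python
import numpy as np

def rot_y(theta):
    """Rotation matrix about the vertical (y) axis; theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(vertices, theta):
    """Rotate 3D model vertices by yaw theta and project orthographically
    onto the image plane (drop the depth coordinate)."""
    return (vertices @ rot_y(theta).T)[:, :2]

def estimate_yaw(vertices, observed_2d, search=np.linspace(-0.5, 0.5, 201)):
    """Grid-search the yaw angle that minimizes the sum-of-squared
    matching error between projected model points and observed points."""
    errors = [np.sum((project(vertices, a) - observed_2d) ** 2)
              for a in search]
    return search[int(np.argmin(errors))]

# Synthetic check: generate model vertices, observe them at a known yaw,
# and recover that yaw by template matching.
model = np.random.default_rng(0).standard_normal((10, 3))
observed = project(model, 0.2)
estimated = estimate_yaw(model, observed)
```

In the actual method, the matching would be done against image intensities rendered through the textured wireframe model, and all six degrees of freedom (3D translation and rotation) would be searched rather than a single angle.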