Deep neural networks (DNNs) have greatly advanced data generation, and the quality of generated content has reached an impressive new level. As a result, manipulated content, especially facial manipulation, is a growing concern for the legitimacy of online information. Most current deep learning-based detection methods depend on local features sampled by convolutional kernels and lack global context. To address this problem, we propose a dual-path pipeline that combines a neural network based on Neural Ordinary Differential Equations (NODE) with a facial-feature-biased transformer, so that the two paths examine the visual content from different views. The transformer path links facial landmarks over long ranges. In addition, we adopt a self-ensemble based on attention-guided augmentation for more robust performance. Extensive experiments show that our system surpasses several commonly used approaches in video-level accuracy and AUC while offering better interpretability.
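
As background for the NODE path mentioned above, a Neural ODE treats the hidden state h(t) as the solution of dh/dt = f(h(t), t), where f is a learned network, and obtains output features by integrating from t0 to t1. The following is a minimal sketch of that idea only, not the architecture used in this work. The names `make_dynamics` and `odeint_euler`, the tanh dynamics, and the fixed-step forward Euler solver are all illustrative assumptions; practical systems typically use adaptive solvers and trained weights.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Dynamics = Callable[[Vector, float], Vector]


def make_dynamics(weights: Sequence[Sequence[float]], bias: Sequence[float]) -> Dynamics:
    """Return f(h, t) = tanh(W h + b), a hypothetical stand-in for a learned network."""
    def f(h: Vector, t: float) -> Vector:
        # t is unused here; it is kept so that f has the general signature f(h, t).
        return [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(weights, bias)
        ]
    return f


def odeint_euler(f: Dynamics, h0: Sequence[float],
                 t0: float = 0.0, t1: float = 1.0, steps: int = 100) -> Vector:
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step forward Euler."""
    h = list(h0)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        dh = f(h, t)
        h = [x + dt * d for x, d in zip(h, dh)]  # h <- h + dt * f(h, t)
        t += dt
    return h


# Example: evolve a 2-D hidden state under illustrative (untrained) weights.
f = make_dynamics([[0.0, 1.0], [-1.0, 0.0]], [0.0, 0.0])
features = odeint_euler(f, [1.0, 0.0])
```

In this view, a residual block is a single Euler step. Increasing `steps` refines the approximation of the continuous trajectory without adding parameters.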