The Vision Transformer (ViT), with its self-attention mechanism, has recently been introduced into the computer vision (CV) field and has achieved remarkable performance. However, directly applying ViT to hyperspectral image (HSI) classification is problematic because 1) ViT is a spatial-only self-attention model, whereas HSI contains rich spectral information; 2) ViT requires sufficient training samples, whereas labeled HSI samples are limited; 3) ViT does not learn local features well; and 4) ViT does not consider multi-scale features. Furthermore, methods that combine a convolutional neural network (CNN) with ViT generally incur a large computational burden. Hence, this paper designs a pure ViT-based model suited to HSI classification around the following points: 1) a spectral-only vision transformer with aggregation of all tokens; 2) a spatial-only local-global transformer; 3) cross-scale local-global feature fusion; and 4) a cooperative loss function that unifies the spectral and spatial features. As a result, the proposed method achieves classification performance on three public datasets that is competitive with other state-of-the-art methods.
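The all-token aggregation mentioned in point 1 can be sketched as follows; this is a minimal illustration, not the paper's actual implementation, and the function name, shapes, and mean-pooling choice are assumptions. Instead of relying on a single class token, every token embedding produced by the spectral-only transformer contributes to the final feature:

```python
import numpy as np

def aggregate_tokens(tokens: np.ndarray) -> np.ndarray:
    """Aggregate all transformer tokens into one feature vector.

    tokens: (num_tokens, dim) array of token embeddings from a
    spectral-only transformer encoder (hypothetical shapes).
    Returns a (dim,) vector, e.g. as input to a classification head.
    """
    # All-token mean pooling: every spectral token contributes,
    # rather than a single [CLS] token summarizing the sequence.
    return tokens.mean(axis=0)

# Toy example: 8 spectral tokens with 4-dimensional embeddings
toks = np.arange(32, dtype=float).reshape(8, 4)
feat = aggregate_tokens(toks)  # shape (4,)
```

Mean pooling is only one plausible aggregation; weighted or attention-based pooling over the tokens would fit the same interface.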