Massive datasets are often collected under non-IID distribution scenarios, leaving existing federated learning (FL) frameworks struggling with model accuracy and convergence. To achieve heterogeneity-aware collaborative training, the FL server aggregates gradients from different clients to ingest and transfer the common knowledge behind non-IID data, but statistical weighting during aggregation leads to information loss and bias. To address these issues, we propose a Gradient Memory-based Federated Learning (GradMFL) framework, which enables hierarchical knowledge transfer over non-IID data. In GradMFL, a data clustering method categorizes non-IID data into IID-like clusters according to similarity. Then, to enable beneficial knowledge transfer between hierarchical clusters, we present a multi-stage model training mechanism that uses gradient memory to constrain the update directions. Experiments on a set of classification tasks over benchmark datasets demonstrate strong accuracy and high efficiency.
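As a rough illustration of the gradient-memory idea, the sketch below projects a proposed update so that it does not conflict with gradients retained from earlier training stages, in the spirit of gradient episodic memory (A-GEM-style projection). This is a minimal sketch, not the paper's exact algorithm; the function name `project_gradient`, the `memory` buffer, and the non-negative inner-product constraint are all illustrative assumptions.

```python
import numpy as np

def project_gradient(g, memory):
    """Illustrative sketch: remove components of g that conflict with
    stored reference gradients (keeps each inner product non-negative).

    g      : 1-D array, proposed update direction for the current stage.
    memory : list of 1-D arrays, gradients retained from earlier stages.
    """
    g = g.copy()
    for g_ref in memory:
        dot = np.dot(g, g_ref)
        if dot < 0:  # conflicting direction: project out the conflicting part
            g -= dot / (np.dot(g_ref, g_ref) + 1e-12) * g_ref
    return g

# Toy usage: one stored gradient and a new, partially conflicting update.
memory = [np.array([1.0, 0.0])]
g_new = np.array([-1.0, 1.0])
g_safe = project_gradient(g_new, memory)
print(g_safe)  # [0. 1.] -- the component opposing the memory is removed
assert np.dot(g_safe, memory[0]) >= 0
```

The design intuition is that knowledge distilled in earlier stages (e.g., from a parent cluster) is preserved because later updates are never allowed to point directly against the stored gradients; how GradMFL populates and weights the memory is defined in the paper itself.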