A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know the number of topics (n) she needs to create in advance, so that she can start looking for (say) a query log large enough for sampling n good topics, and estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate its usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing to accommodate comparison of m(≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can design statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance betweeen n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.