More than ten state-of-the-art regional air quality models have been applied as part of the Air Quality Model Evaluation International Initiative (AQMEII). These models were run by twenty independent groups in Europe and North America. Standardised modelling outputs over a full year (2006) from each group have been shared on the web-distributed ENSEMBLE system, which allows each group to perform statistical and ensemble analyses. The estimated ground-level ozone mixing ratios from the models are collectively examined in an ensemble fashion and evaluated against a large set of observations from both continents. The scale of the exercise is unprecedented and offers a unique opportunity to investigate methodologies for generating skilful ensembles of regional air quality model outputs. Despite the remarkable progress of ensemble air quality modelling over the past decade, there are still outstanding questions regarding this technique. Among them: what is the best and most beneficial way to build an ensemble of members? And how should the optimum size of the ensemble be determined in order to capture data variability while keeping the error low? These questions are addressed here by examining the optimal ensemble size and the quality of the members. The analysis is based on systematic minimization of the model error and is important for performing diagnostic/probabilistic model evaluation. It is shown that the most commonly used multi-model approach, namely the average over all available members, can be outperformed by subsets of members optimally selected in terms of bias, error, and correlation. More importantly, this result does not strictly depend on the skill of the individual members, and the optimal subset may require the inclusion of members with low-ranking skill scores.
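The subset-selection idea can be illustrated with a minimal sketch. It assumes root-mean-square error as the single selection metric (the study itself considers bias, error, and correlation), uses an exhaustive search over all non-empty subsets, and runs on small synthetic series that are not AQMEII data. The function names `rmse`, `ensemble_mean`, and `best_subset` are hypothetical and chosen for this illustration only.

```python
from itertools import combinations
from math import sqrt


def rmse(pred, obs):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def ensemble_mean(members):
    """Pointwise average of a list of equal-length time series."""
    return [sum(vals) / len(vals) for vals in zip(*members)]


def best_subset(models, obs):
    """Search every non-empty subset of members and return (names, rmse)
    for the subset whose mean has the lowest RMSE against obs."""
    names = sorted(models)
    best = None
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            err = rmse(ensemble_mean([models[n] for n in combo]), obs)
            if best is None or err < best[1]:
                best = (combo, err)
    return best


# Synthetic, illustrative values only (assumption: not real ozone data).
obs = [40.0, 55.0, 60.0, 45.0]
models = {
    "A": [42.0, 57.0, 62.0, 47.0],  # small positive bias
    "B": [30.0, 45.0, 50.0, 35.0],  # large negative bias
    "C": [52.0, 67.0, 72.0, 57.0],  # large positive bias
    "D": [41.0, 50.0, 66.0, 43.0],  # unbiased on average but noisy
}

full_err = rmse(ensemble_mean(list(models.values())), obs)
subset, subset_err = best_subset(models, obs)
print("full ensemble RMSE:", round(full_err, 3))
print("best subset:", subset, "RMSE:", round(subset_err, 3))
```

In this toy case the selected subset consists of the two members with the largest individual errors, because their opposite biases nearly cancel in the mean. The exhaustive search visits 2^n - 1 subsets, so it is practical only for ensembles of modest size.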
A clustering methodology is applied to discern among members and to build a skilful ensemble based on model association and data clustering, which makes no use of a priori knowledge of model skill. Results show that, although the methodology needs further refinement, with an optimal selection of the cluster distance and association criteria this approach can be useful for model applications beyond those strictly related to model evaluation, such as air quality forecasting. © 2012 Elsevier Ltd. All rights reserved.