Simulations from chemical weather models are subject to uncertainties in the input data (e.g. emission inventory, initial and boundary conditions) as well as those intrinsic to the model (e.g. physical parameterization, chemical mechanism). Multi-model ensembles can improve forecast skill, provided that certain mathematical conditions are fulfilled. In this work, four ensemble methods were applied to two different datasets, and their performance was compared for ozone (O3), nitrogen dioxide (NO2) and particulate matter (PM10). Apart from the unconditional ensemble average, the methods rely either on assigning optimal weights to members or on constraining the ensemble to those members that meet certain conditions in the time or frequency domain. The two datasets were created for the first and second phases of the Air Quality Model Evaluation International Initiative (AQMEII). The methods are evaluated against ground-level observations collected from the EMEP (European Monitoring and Evaluation Programme) and AirBase databases. The goal of the study is to quantify the extent to which predictable signals can be extracted from an ensemble with skill superior to that of the single models and the ensemble mean. Verification statistics show that the deterministic models simulate O3 better than NO2 and PM10, which is linked to the different levels of complexity in the represented processes. The unconditional ensemble mean achieves higher skill than each station's best deterministic model at no more than 60 % of the sites, indicating that at the remaining sites the ensemble combines members with unbalanced skill differences and error dependence. Promoting the right amount of accuracy and diversity within the ensemble yields an average additional skill of up to 31 % compared to using the full ensemble in an unconditional way.
The skill improvements were higher for O3 and lower for PM10, associated with the extent of potential changes in the joint distribution of accuracy and diversity in the ensembles. The skill enhancement was greatest with the weighting scheme, but it required a longer training period to obtain representative weights than the sub-selecting schemes did. Further development of the method is discussed in the conclusion.
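The contrast between the unconditional ensemble mean and a weighted ensemble can be illustrated with a minimal sketch. It is not the weighting scheme used in this study, whose details are not given here: it assumes ordinary least-squares weights fitted on a training period, and the synthetic observations, member biases, noise levels and all function names are hypothetical choices made only for illustration.

```python
import numpy as np


def rmse(pred, obs):
    """Root-mean-square error between a prediction and observations."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


def ensemble_mean(members):
    """Unconditional ensemble mean; members has shape (n_models, n_times)."""
    return members.mean(axis=0)


def fit_weights(members, obs):
    """Least-squares weights (an assumed scheme, not the paper's)."""
    weights, *_ = np.linalg.lstsq(members.T, obs, rcond=None)
    return weights


def weighted_ensemble(members, weights):
    """Linear combination of the members with the given weights."""
    return weights @ members


# Synthetic hourly "observations" with a diurnal cycle (hypothetical data).
rng = np.random.default_rng(0)
hours = np.arange(480)
obs = 40.0 + 15.0 * np.sin(2.0 * np.pi * hours / 24.0)

# Three hypothetical members with different biases and noise levels.
biases = np.array([5.0, -3.0, 8.0])
noise_sd = np.array([4.0, 8.0, 12.0])
noise = rng.normal(size=(3, hours.size)) * noise_sd[:, None]
members = obs + biases[:, None] + noise

# Fit the weights on the first half and evaluate both halves.
train, test = slice(0, 240), slice(240, 480)
weights = fit_weights(members[:, train], obs[train])

for label, period in (("training", train), ("test", test)):
    mean_err = rmse(ensemble_mean(members[:, period]), obs[period])
    wgt_err = rmse(weighted_ensemble(members[:, period], weights), obs[period])
    print(f"{label}: mean RMSE = {mean_err:.2f}, weighted RMSE = {wgt_err:.2f}")
```

Over the training period the least-squares combination cannot do worse than the equal-weight mean, because equal weights are one of the candidate solutions. Whether that advantage carries over to the test period depends on how representative the training period is, which is the trade-off noted above. The sketch fits no intercept and does not constrain the weights to sum to one.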