Evaluating European performance on the world stage for one particular year seems a reasonably straightforward exercise. The question, after all, is relatively simple: “Did Europeans do well or badly in 2010?” However, devising a methodology to make a rigorous and consistent judgment across issues and over time is a tricky enterprise, fraught with unsatisfying trade-offs and inevitable simplifications. Before explaining the methodology used in this scorecard, we discuss some of the difficulties and dilemmas we faced while devising it. This discussion is meant to offer some perspective on the choices we made and to ensure full transparency about the results.
Among the many difficulties involved with evaluating Europe’s performance in its external relations, two stand out: the problematic definition of success in foreign policy; and the rigidity of the time frame used.
What is a good European foreign policy?
The nature of international politics is such that “success” and “failure” are not as easily defined as they would be in other public-policy areas. In particular, there is no quantitative tool that can capture performance in foreign policy as adequately as indicators do in economic or social policy (e.g. unemployment rate, crime rate, pollution levels, etc.). Diplomacy is more often about managing problems than fixing them, biding time, choosing the lesser of two evils, finding an exit strategy, saving face, etc. States often pursue multiple objectives, and their order of priority is often unclear or disputed. This, of course, is even truer in the case of Europe, in which two member states might have different views on what exact mix of objectives met during the year constitutes success in one policy area, even when they agree on common objectives.
This difficulty is compounded by the heterogeneous nature of foreign policy. Europeans expect their authorities to solve the Israeli-Palestinian conflict, to prevent the proliferation of nuclear weapons, to turn Bosnia and Herzegovina into a functioning state, to protect ships from pirates in the Gulf of Aden, to stabilise the eastern neighbourhood, to defend European values at the UN and speak up for human rights, to convince other countries to fight climate change, to open foreign markets for exporters, to impose European norms and standards on importers, and so on. “Success” is defined very differently in each case: it can be a matter of convincing other actors in a negotiation, building diplomatic coalitions, delivering humanitarian aid on the ground, imposing peace on a region torn by civil unrest, building a state, spreading global norms, etc. Moreover, Europe has very different abilities in each of these areas, not unlike the way that a student has different abilities in various subjects (e.g. mathematics, languages, physical education, etc.). This makes a unified grading system problematic by creating a dilemma between respecting the specificity of each “subject” on the one hand and ensuring that evaluations are comparable across the scorecard on the other.
Grading the rate of success of Europeans (the “outcome” score) relies on a comparison between the European objectives and the outcome for 2010. But the problem mentioned above resurfaces: who speaks for Europe? There is rarely a single entity to define what the European interest is – what priorities and trade-offs are desirable when conflicting objectives exist. Even where there is broad agreement on a policy, official texts will rarely present the real extent of European objectives, or will do so only in vague, consensual terms. Therefore, simply comparing stated objectives with results would have led to an incomplete assessment of performance. It was generally necessary for us to go further and spell out explicitly what the European objectives were in one particular domain in order to compare them to results – a difficult and eminently political exercise.
What’s more, the causal link between one specific set of European policies on the one hand and results on the other is problematic. European objectives can sometimes be met regardless of the European policy put in place to achieve them. For example, independent factors might have modified the context in which actors operate (e.g. forest fires in Russia, rather than EU influence, led to a different attitude of Moscow towards climate change), or other states might have helped to attain the objectives sought by Europeans (e.g. the United States in getting China to support sanctions against Iran). But the opposite can also be true: failure can happen even with the optimal policies in place (e.g. the US Congress decision to abandon cap-and-trade legislation in spite of best efforts by Europeans to convince them otherwise).
This problem of causal disjuncture between policy and result led us to make two choices for the scorecard. First, we do not try to sort out the reasons for European “success”, let alone try to offer a co-efficient of European agency or credit. While we always specify other factors that contributed to a positive outcome, we deem Europeans to be successful if their objectives were met. In other words, they are not penalised for having been helped by others. This is why we use the word “outcome” rather than “results” or “impact”, which imply a direct causality. Second, we clearly separate policy from results. The grade for each component reflects an equal balance between input (graded out of 10) and outcome (graded out of 10), so that the reader can better appreciate the problematic correlation between the two. (The policy grade, or input, is divided into two scores, each graded out of 5: “unity” and “resources”.) Very good policies and best efforts can meet outright failure (e.g. the failure to get the US Congress to move on climate change). However, the opposite situation rarely occurs: luck, it turns out, is not so prevalent in international affairs.
Still, giving as much weight to policy as to results is a delicate choice that has several implications. It means that Europeans can get a score of 8, 9 or even 10/20 by having a policy we consider optimal, but a score of 0/10 or 1/10 for “outcome”. In other words, Europeans get a reasonably good grade for simply having a coherent policy in place, even if this policy produces few results. The other implication is that similar grades can mean different things. For example, on visa liberalisation with Russia (component 15), Europeans got 4/5 for “unity” and 3/5 for “resources” but only 3/10 for “outcome” – a total of 10/20. This is the same score as for relations with the US on counter-terrorism and human rights (component 31), where Europeans got 3/5 for “unity” and 2/5 for “resources” but a significantly better score of 5/10 for “outcome”.
Beyond the question of merits and results lies the question of expectations. If the scorecard has to spell out what European objectives were, it also has to define the yardstick for success, in the absence of obvious or absolute reference points to assess the underlying level of difficulty – and hence the level of success – in each area. We relied on judgment, based in each case on an implicit alternative universe representing the optimal input and outcome, against which actual European performance was measured. But while it was based on extensive expertise, this approach was necessarily subjective. This is particularly the case because, while it had to be realistic, it also had to avoid either lowering ambitions excessively or demanding impossible results. As noted in the Preface, this is where the political and sometimes even subjective nature of the scorecard is greatest.
It should also be noted that the relative nature of our judgment and the question of expectations contain an even more political question, that of European leverage – and, this time, the difficulty concerns both the policy score (i.e. “unity” and “resources”) and the results score (i.e. “outcome”). We evaluated performance in the context of 2010, and tried to be politically realistic about European possibilities and about what resources could be mobilised in support of a particular policy. But some observers might object that with some extra will or leadership by the main actors, additional resources could have been mustered to increase European leverage, to the point of completely reconfiguring the political context of a particular issue. For example, on the Israeli-Palestinian conflict, some argue that Europe should take much more drastic and aggressive measures to reach its objectives: it could unilaterally recognise a Palestinian state at the United Nations and bilaterally, or cease its Association Agreement with Israel and impose other trade sanctions. Admitting such proposals as realistic would change the score for “resources” (which, compared to this standard, would become dismal for 2010), and might potentially have changed the “outcome” grade as well. Here again, we had to make judgment calls about the adequacy of resources in the current European foreign-policy debate as we see it. It remains, however, a political judgment.
When does the clock stop?
A second set of problems has to do with the time frame of the scorecard. Evaluating foreign-policy performance is difficult enough, but it becomes even more difficult when you only consider events that took place during one calendar year. It is well known that some past policies that have yielded remarkable results in the short term proved less effective, and sometimes even disastrous, in the long term – for example, western support for the mujahideen in Afghanistan in the 1980s. The cost of some policy decisions has gradually increased over time – for example, the admission of Cyprus as an EU member state in the absence of resolution of the Northern Cyprus problem. Since the scorecard is an annual exercise, this will inevitably become an issue, especially after policies and actions we now vaunt prove less compelling in a few years, and vice versa. To some extent, however, this is the same problem we face in evaluating success not in absolute terms but as a function of possibilities and difficulty. We do not pass definitive historical judgment but rather a contextualised judgment within the bounds of the year 2010.
However, even that caveat does not solve the second dilemma: the possible bias in favour of short-term, tangible results that could be observed during the year 2010, to the detriment of more profound and meaningful, if less spectacular, policies and outcomes. For example, visa conditionality in the Balkans is exerting a continuing positive pressure and having good results, although these results are not evident on the larger, more visible political scene. The problem is that the scorecard tends to register movement, and while a European programme that is already in place can be mentioned in the text, it will often come second to the sometimes ephemeral political battles that unfolded during the year. Thus, a limited but very visible political initiative towards a candidate country might eclipse the more important fact that the whole power relationship between Europe and this country is overdetermined by this candidacy. This bias is especially important when it comes to common security and foreign policy, since many aspects of the foreign relations of the EU take the form of long-term aid, development and rule of law programmes rather than short-term political initiatives. The scorecard tries to strike a balance between recognising the specificity, assets and successes of Europe as a different, new type of international power on the one hand, and considering Europe as a traditional great power, in the league of the US, China or Russia, on the other hand – a role it cannot escape in today’s world.
This dilemma explains why, even though we insist on tangible results for 2010 and hold Europe to demanding standards of efficiency, we still give credit to and make room for patient background work and positions of principle, even if they seemed to have had no impact in 2010. After all, it was easy to criticise Europe for its failure to persuade the US to close Guantánamo prison until President Obama finally ordered its closure in 2009. It would be inaccurate to claim that the constant political and moral pressure that Europeans exercised played no role, and yet impossible to point out exactly what role it played in Obama’s decision. Similarly, Europe’s ongoing support of the development of the Palestinian Authority as a more effective and less corrupt administration is the type of behind-the-scenes work that is not always visible but could be hugely important in the future.
This question of time frame leads to the larger question of “good” foreign policies. We cannot assess whether policies are “good” – only whether Europeans are united around them, whether they devote resources to them, and whether (or to what extent) they reach their various objectives. In a sense, therefore, our judgment remains technical. For example, we find Europe’s performance on Iran in 2010 to be better than on many other issues, but if Tehran suddenly acquires and uses a nuclear weapon in 2011, critics will point out that Europe’s policy was not forceful enough and that the good grades we gave now look overblown. Similarly, if a revolution leads to the overthrow of the mullahs, critics will point out the immorality of European foreign policy that focused on the nuclear programme and reinforced the hardliners, while a more conciliatory position might have hastened the downfall of the regime.
This problem of normative judgment leads to a more general question: how much shall we take into account things Europe is not doing? For example, should Europe get a bad grade because it was not present (in terms of either words or actions) in the China-Japan dispute of September 2010 about the Senkaku/Diaoyutai islands, where the future of world peace might be at stake? As discussed earlier, we have tried to strike a balance in the scorecard. On the one hand, we have graded existing policies and taken into account the specificity of EU foreign policy and what Europe actually is (i.e. long-term programmes and a certain vision of what the international system should be). On the other hand, we have graded according to “great power” norms, emphasising what Europe ultimately should be (e.g. an assertive power playing the multi-polar game).
The points above illustrate the difficulties and dilemmas involved in devising a methodology that can withstand criticism. This is why we call this project a scorecard rather than an index. Indices use hard quantitative data (e.g. UNDP’s Human Development Index; Brookings’ Iraq Index) or scores given by observers to qualitative data (e.g. Freedom House’s Freedom in the World or Freedom of the Press indices; Transparency International’s Corruption Perceptions Index), or a mixture of both (Institute for Economics and Peace’s Global Peace Index; Legatum Institute’s Prosperity Index). A scorecard, on the other hand, is transparent about the subjective nature of judgment and the heterogeneity of the material it grades, and is therefore a better tool for appraising foreign-policy performance. After all, the grades one gets in school are a function of the particular teacher doing the grading and are based on different criteria for each subject. However, this neither prevents the scorecard from being significant nor means that grades are purely arbitrary, especially when overall results are based on an average of a large number of exercises and as consistent a scale across the board and over time as is feasible.
The scorecard was developed in three phases. In the first phase (during the summer and autumn of 2010), experts for each of the six “issues” drew up the list of “sub-issues” and “components” – the discrete elements that the scorecard actually evaluates for 2010. This choice, obviously, was fundamental as it determined what we were assessing within each of the six “issues” and was therefore the subject of intense discussion. The experts also provided preliminary assessments of European performance (for the period running from January to September) in each “component”, based on their own knowledge and a range of interviews with officials and specialists. In particular, they identified European objectives – a key precondition for evaluating performance. The experts devised questions for member states in order to better understand the dynamics of each component. In the second phase (from November to December 2010), questionnaires on about 30 of the “components” on which the experts felt they needed additional information were sent to researchers in each of the 27 member states, who collected information from officials in their country and completed the questionnaires. This provided a much more granular image of European external relations on critical issues. In the third phase (January 2011), experts wrote the final assessments and the introductions for each issue. It was at this point that scores for each component were given. The scores and the assessments were then discussed with the scorecard team and shared with other experts and officials.
The scorecard uses three criteria to assess European foreign-policy performance: “unity” (“Were Europeans united?”), “resources” (“Did they try hard?”), and “outcome” (“Did they get what they wanted?”). The first two evaluate the intrinsic qualities of European policies and are graded out of 5; the third criterion evaluates whether these policies succeeded or failed, and is graded out of 10. The overall numerical score out of 20, which was converted into an alphabetical grade, therefore reflects an equal balance between input and outcome.
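The arithmetic behind a component score can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not a tool used to produce the scorecard: the class name ComponentScore, its field names and the assumption that every criterion starts at 0 are introduced here for the example. The two sets of figures are the ones quoted earlier for components 15 and 31.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScore:
    """Hypothetical container for the three criteria of one component."""

    unity: int      # graded out of 5 (lower bound of 0 is an assumption)
    resources: int  # graded out of 5 (lower bound of 0 is an assumption)
    outcome: int    # graded out of 10

    def __post_init__(self) -> None:
        # Reject values outside the ranges described in the text.
        limits = {"unity": 5, "resources": 5, "outcome": 10}
        for name, maximum in limits.items():
            value = getattr(self, name)
            if not 0 <= value <= maximum:
                raise ValueError(f"{name} must be between 0 and {maximum}")

    @property
    def policy(self) -> int:
        """Input score out of 10: unity plus resources."""
        return self.unity + self.resources

    @property
    def total(self) -> int:
        """Overall score out of 20: input plus outcome."""
        return self.policy + self.outcome


# Figures quoted in the text for components 15 and 31.
visa_russia = ComponentScore(unity=4, resources=3, outcome=3)
counter_terrorism_us = ComponentScore(unity=3, resources=2, outcome=5)

for label, score in [("component 15", visa_russia),
                     ("component 31", counter_terrorism_us)]:
    print(f"{label}: policy {score.policy}/10, "
          f"outcome {score.outcome}/10, total {score.total}/20")
```

Both components reach the same total while the split between policy and outcome differs, which is why identical scores out of 20 can describe quite different situations.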
In some cases, the scores for each of these three criteria are based on an average of several different elements of a “component”. For example, component 62, which evaluates European performance on Somalia, includes three disparate elements: the Atalanta naval mission; the training of Somali military personnel in Uganda; and financial support to the African Union peacekeeping mission AMISOM. Similarly, component 24, which evaluates relations with Russia on Afghanistan and Central Asia, has three elements: Afghanistan, Kyrgyzstan and security in Central Asia in general.
The key question on “unity” is: Do Europeans (that is, member states and EU institutions) agree on specific and substantial objectives or do they have a variety of different policies, with some adopting initiatives and taking stances that contradict the common policy?
Scores were awarded on the following basis:
The key question on “resources” is: Did Europeans (that is, member states and EU institutions) devote adequate resources (in terms of political capital and tangible resources such as money, loans, troops, training personnel and the like) to back up their objectives in 2010? In other words, was their policy substantial?
Scores were awarded on the following basis:
The key question on “outcome” is: To what extent have European objectives been met in 2010, regardless of whether Europeans (that is, member states and EU institutions) were responsible for that outcome?
Scores were awarded on the following basis:
Numerical scores and alphabetical grades
Scores for “unity”, “resources” and “outcome” were added and converted into grades in the following way:
Grades for issues and sub-issues
As indicated above, “components” are gathered in groups called “sub-issues”. The grade for a sub-issue simply results from the average of the grades for its components. Similarly, the grade for an issue such as crisis management or Relations with China simply results from the average of the grades for its sub-issues. This, of course, raises the question of the proper weight to grant to each component within a sub-issue, and to each sub-issue within an issue. For example, should the grade for China depend equally on the three sub-issues (Trade liberalisation and overall relationship; Human rights and governance; Cooperation with China on regional and global issues), or should one of them be granted more weight? Rather than engaging in a delicate exercise of weighting (for example, by giving co-efficients of importance to various components), we decided to build into the list a rough equality among components within a sub-issue and among sub-issues within an “issue”. It could be argued that some components and sub-issues have not been given their proper weight. However, such a judgment would be no less political than the grade given to that component.
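The unweighted averaging described above can be sketched as follows. The function names, the sub-issue labels and all the figures are hypothetical and serve only to illustrate the mechanics; they are not scores from the scorecard. The sketch also assumes that averaging is done on the numerical scores out of 20.

```python
from statistics import fmean
from typing import Mapping, Sequence


def sub_issue_score(component_totals: Sequence[float]) -> float:
    """Plain average of the component scores (each out of 20)."""
    return fmean(component_totals)


def issue_score(sub_issues: Mapping[str, Sequence[float]]) -> float:
    """Plain average of the sub-issue averages: every sub-issue counts
    equally, however many components it contains."""
    return fmean([sub_issue_score(totals) for totals in sub_issues.values()])


# Hypothetical component scores, invented for illustration only.
example_issue = {
    "sub-issue A": [10, 12, 14],
    "sub-issue B": [8],
    "sub-issue C": [11, 13],
}

by_sub_issue = issue_score(example_issue)
pooled = fmean([t for totals in example_issue.values() for t in totals])

print(f"average of sub-issue averages: {by_sub_issue:.2f}")
print(f"average of all components pooled: {pooled:.2f}")
```

The two figures differ because averaging sub-issues gives a sub-issue with a single component the same weight as one with three. That is one reason the composition of the list of components and sub-issues matters as much as any explicit weighting would.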
In the 2012 edition of the Scorecard, we attempted to explore the role played by individual member states in European foreign policy as well as evaluating European performance as a whole. However, we chose to add this second dimension of assessment in only a small number of components because in many cases – particularly those where member states have empowered the EU institutions to negotiate or otherwise act on their behalf – it would make little sense to compare and contrast the roles they played. In 2011 we therefore categorised member states on 30 of the 80 components of European foreign policy where they played a particularly significant positive or negative role.
In each of these 30 components – between 4 and 7 per chapter – we categorised some member states as “leaders” and others as “slackers”. Other member states were simply “supporters” of common and constructive policies that were in our view in the European interest – a kind of default category that can encompass many different attitudes, from active support to passive acquiescence. Clearly, categorising member states in this way is not an exact science. Like the grading of European performance as a whole, each categorisation of a member state involved a political judgement and should therefore not be considered definitive. In particular, it assumes a normative judgement on what constitutes a policy that is in the European interest. In addition, given the diverse nature of the components of European foreign policy in the Scorecard, what it means to be a “leader” or “slacker” varies in each case.
We identified member states as “leaders” when they either took initiative in a constructive way or acted in an exemplary way (for example by devoting disproportionate resources). In other words, it is possible for member states to “lead” either directly (that is, by forcing or persuading member states to take action) or indirectly (“leading by example”). Thus on the one hand we identified France and the UK as “leaders” on component 75 (The Libyan uprising) because they took initiative in pushing for military intervention and successfully persuaded the US and other actors to agree to impose a “no-fly zone”. On the other hand we identified 7 member states as “leaders” on component 74 (Development aid and global health) because they either maintained high levels of aid at a difficult time (for example Sweden and the UK) or even increased their aid budgets (Bulgaria, Finland and Germany) in 2011.
Conversely, we identified member states as “slackers” when they either impeded or blocked the development of policies that serve the European interest in order to pursue their own narrowly defined or short-term national interests or did not pull their weight (for example by failing to devote proportionate resources). In other words, it is also possible for member states to “slack” either directly (by preventing member states from taking action) or indirectly (by setting a bad example). Thus we identified Germany and Poland as “slackers” on component 75 (The Libyan uprising) because they not only opposed military intervention, thus eliminating the possibility of a CSDP mission, but also failed to devote resources commensurate with their size even after NATO took over command of the operation in April. We identified 11 member states as “slackers” on component 74 (Development aid and global health) because they either failed to increase low levels of aid (for example Italy) or cut their aid budgets in 2011.