"We're showing all our dirty laundry": A scientific crisis in psychology

Attempts to replicate studies often failed to reproduce the original results and left the field reeling.

By C.J. Robinson

Amy Cuddy in her power pose TED talk
Power posing: a seemingly simple solution for nerves. Image credit: PopTech via Flickr

A simple enough solution for public speaking nerves or a high-stakes job interview: look powerful. In a 2012 TED Talk, Amy Cuddy, a prominent social psychologist, spoke passionately about her research. Studies she had worked on demonstrated that simply posing like a superhero could alter not only a person's behavior but their physiology: decreased stress hormones, increased confidence and better life outcomes were all shown to stem from holding a stance for mere minutes.

"Don't fake it till you make it," Cuddy says in her talk. "Fake it till you become."

When researchers attempted to rerun the study using different subjects or larger samples, though, they did not observe the same effect. In fact, they were unable to find anything close to the magnitude of the original studies. They published their findings.

Power posing quickly became an example of psychology's "replication crisis." In the early 2010s, after decades of accumulated research, academics began sharing their attempts to build on previous papers, only to find they could not demonstrate the effects at all. Social psychology, with famed studies that had seeped into popular culture, was undergoing a seismic shift.

Replications often don't generate the same magnitude of effect

Replication by original and replication effect size, normalized

[Scatter plot: each point plots a study's original effect size (x-axis, 0 to 1) against its replication effect size (y-axis, -0.5 to 1), with a diagonal line marking where the two are equal; the power posing replication is highlighted near zero.]

Source: FORRT Replication Database

Here's a study that attempted to replicate the power posing hypothesis.

If the study had replicated successfully, it would sit at or near the diagonal line where the replication effect matches the original. Instead, power posing shows a near-zero effect in its replication compared to the original study.

Of over 500 verified studies, most show little to no effect in replication.

Anything below the line shows a smaller effect size than previously published.

Over 50 even showed the opposite effect from their original findings.

The full extent of the replication crisis is difficult to gauge. Researchers choose to replicate studies for many reasons, such as the simplicity of a study's design or how plausible its hypothesis seems, which can bias the kinds of studies that get replicated most often.

The Framework for Open and Reproducible Research Training, or FORRT, collects replication attempts of psychological studies in an open-source and independently verified database. Users can submit replication attempts found across the published literature, and the organization verifies the results in the name of "advancing research transparency, reproducibility, rigor, and ethics through pedagogical reform and meta-scientific research."

Of the 505 collected replications, over 60% did not find the statistically significant effects reported in the original studies. Others found results that were significant but ran in the opposite direction of the original effect.

More than 60% of attempted replications failed, with some even showing reverse effects

Replication studies by result

[Bar chart of replication studies grouped by result: success, failure, no effect and opposite effect.]

Source: FORRT Replication Database

Brian Nosek is a psychologist, professor and co-founder of the Center for Open Science who helped bring transparency to the field. In graduate school, he found that colleagues could not replicate seminal papers but had no incentive to share that with the larger field.

"We'd go to the bar at the conference, and other labs would say, 'we can't replicate that either,'" Nosek said. "Why aren't you publishing this?"

Flawed studies and hypotheses became popular for many reasons. First, journals preferred results with significant effects, which meant papers showing unsuccessful replications or disproven hypotheses often went unpublished. More extreme effects also draw more attention from readers, giving journals an incentive to favor larger or more exciting findings.

The focus on publishable results also led individual academics to skew their analyses toward data that showed a statistically significant result. Usually this does not mean actively manipulating or changing raw test data. Instead, by continuously filtering data into different subgroups such as age or gender, researchers increase the chance that some relationship looks significant when the difference is most likely due to chance. This practice of analyzing data until a significant result appears is often called 'p-hacking,' after the p-value used to judge statistical significance.
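To see why this inflates false positives, here is a small, illustrative simulation (not drawn from any of the studies discussed here): data with no real effect is split into arbitrary subgroups, and each subgroup is tested until something crosses the conventional p < 0.05 threshold.

```python
# Illustrative simulation of p-hacking: testing many subgroups of
# pure-noise data until one comparison happens to look "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 1000   # simulated studies with no true effect
n_subjects = 200       # participants per study
n_subgroups = 10       # arbitrary splits (age bands, gender, etc.)

false_positives = 0
for _ in range(n_experiments):
    outcome = rng.normal(size=n_subjects)            # no real effect at all
    treatment = rng.integers(0, 2, size=n_subjects)  # random "condition" labels
    subgroup = rng.integers(0, n_subgroups, size=n_subjects)

    # Test the treatment effect separately within every subgroup and
    # count the study as "significant" if any single test clears p < 0.05.
    for g in range(n_subgroups):
        mask = subgroup == g
        a = outcome[mask & (treatment == 1)]
        b = outcome[mask & (treatment == 0)]
        if len(a) > 1 and len(b) > 1:
            _, p = stats.ttest_ind(a, b)
            if p < 0.05:
                false_positives += 1
                break

print(f"'Significant' results in {false_positives / n_experiments:.0%} of "
      "studies, even though no effect exists.")
```

With ten subgroups per study, roughly two in five of these no-effect studies produce at least one "significant" comparison, far above the 5% the threshold is supposed to guarantee.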

Additional problems include sampling bias, which occurs when a study uses too few participants or participants who don't represent the broader population. Studies are often run exclusively on undergraduates at the researcher's own university, with the results then extrapolated to the general public.

Specific domains of psychology saw substantial shifts in citations and in the number of new publications, Nosek said. The fields most affected by the unsuccessful replications were ego depletion (exhausting someone's willpower to change their behavior), priming (showing or telling a person something in advance to change their reaction to other stimuli) and terror management (how knowledge of death alters cultural significance and personality), all of which fall under the umbrella of social psychology.

Of the subdomains listed in the FORRT database, social psychology fared worst, with only 22% of findings successfully replicated. Differential psychology, the most replicable field in the dataset, focuses on individual and group differences in behavior.

Social psychology was the subdomain with the largest percentage of failed replications

Percentage of successful replications by subdomain

Social psychology: 25%
General psychology: 32%
Marketing: 41%
Experimental philosophy: 74%
Differential psychology: 89%

Source: FORRT Replication Database

At the level of individual research papers, however, relatively little has changed, according to Nosek. The number of times other researchers cite a paper has not been shown to decline meaningfully after a failed replication.

Joining the FORRT database with citation counts from Google Scholar shows little difference between successful and failed replications: papers whose replications were all successful had a median of 205 citations, compared with 201 for papers with at least one failed replication.

This finding contradicts other papers. Researchers have previously found that overall citation counts are higher on average for papers that failed to replicate, most likely because of their more extreme findings.

Notwithstanding the debate around citation counts for individual papers, psychology as a whole has responded with advances in transparency and the creation of a metascience community. The Center for Open Science popularized Registered Reports, in which studies are accepted for publication on the strength of their methodology before the results are known. Nosek also pointed to increased reproducibility, which benefits accountability across the field.

"It has to happen at a scale broader than the individual by individual replication to have impact," Nosek said.

Methodology

The data was based on and downloaded from FORRT's replication database, found here. After compiling the data, I used each reference to gather total citation counts from Google Scholar, using Playwright to automatically search the list of references and BeautifulSoup to scrape the results. I also checked each title to ensure that the first entry in Google Scholar's results was the correct study from the database.
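As a rough sketch of that pipeline (the Scholar selector classes gs_ri and gs_rt, the helper name and the title-matching rule are illustrative assumptions, not the exact code used for this story):

```python
# Sketch of the scraping step: look up each reference on Google Scholar
# with Playwright, then parse the first result's citation count with
# BeautifulSoup. Selector classes reflect Scholar's markup at the time
# of writing and may change.
import re
import urllib.parse

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def citation_count(title: str, page) -> int | None:
    """Return the 'Cited by' count for the top Scholar hit, or None."""
    query = urllib.parse.quote(title)
    page.goto(f"https://scholar.google.com/scholar?q={query}")
    soup = BeautifulSoup(page.content(), "html.parser")

    first = soup.select_one("div.gs_ri")  # first search result
    if first is None:
        return None

    # Sanity check: make sure the top hit's title matches the reference.
    hit = first.select_one("h3.gs_rt")
    hit_title = hit.get_text(" ", strip=True) if hit else ""
    if title.lower()[:40] not in hit_title.lower():
        return None

    cited_by = first.find("a", string=re.compile(r"Cited by \d+"))
    if cited_by is None:
        return 0  # indexed but never cited
    return int(re.search(r"\d+", cited_by.get_text()).group())

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    print(citation_count("Power posing: brief nonverbal displays affect "
                         "neuroendocrine levels and risk tolerance", page))
    browser.close()
```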

To bypass Google's CAPTCHAs, I used an automated extension called NopeCHA, which solves the CAPTCHA challenges that prevent bots from accessing the site. I then calculated the median citation counts based on whether or not the replication was successful.
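That comparison reduces to a merge and a groupby; the file and column names below (forrt_replications.csv, original_title, outcome, citations) are placeholders for illustration rather than the actual FORRT field names.

```python
# Sketch of the final step: compare median Google Scholar citation counts
# for papers whose replications all succeeded versus papers with at least
# one failed replication. Column names are illustrative.
import pandas as pd

replications = pd.read_csv("forrt_replications.csv")
citations = pd.read_csv("scholar_citations.csv")  # output of the scraping step

merged = replications.merge(citations, on="original_title", how="inner")

# A paper counts as "failed" if at least one of its replications failed.
merged["any_failure"] = (
    merged.groupby("original_title")["outcome"]
          .transform(lambda s: (s != "success").any())
)

medians = (merged.drop_duplicates("original_title")
                 .groupby("any_failure")["citations"]
                 .median())
print(medians)  # e.g. False (all successes) -> 205, True (any failure) -> 201
```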