INFERENTIAL STATISTICS

February 9, 2024 Rebecca Sivyer

PART THREE: A BEGINNER’S GUIDE TO INFERENTIAL STATISTICS.

SPECIFICATION:

Students should demonstrate knowledge and understanding of inferential testing and be familiar with inferential tests.
Levels of measurement: nominal, ordinal and interval. Introduction to statistical testing: the sign test. When to use the sign test; calculation of the sign test.
Probability and significance: use of statistical tables and critical values in interpreting significance; Type I and Type II errors.
Factors affecting the choice of statistical test, including level of measurement and experimental design. When to use the following tests: Spearman’s rho, Pearson’s r, Wilcoxon, Mann-Whitney, related t-test, unrelated t-test and Chi-Squared test.

If you have followed my posts on data analysis & handling PART 1 and PART 2, then you should be familiar with the following:

Quantitative and Qualitative Data
Continuous and Discrete Data
Levels of Measurement: Nominal, Ordinal, Interval & Ratio Data
Descriptive Statistics: Measure of Central Tendency & Dispersion
Graphs and Tables.

If you are unfamiliar with the above terms, please investigate them before you carry on with this section, or you will find it difficult to fully understand the reasoning behind the decision-making process in inferential statistics.

KEY TERM:

The word INFERENTIAL is characterised by the conclusions reached based on evidence and reasoning.

INFERENTIAL STATISTICS

After descriptive statistics, researchers must perform an inferential test on their data.

Here are a few examples of inferential statistical tests selected for their relevance to introductory psychology courses. Remember, this list is not exhaustive; many additional tests are available and widely used in the field.

The Sign Test (for details about calculating the sign test, see: SIGN TEST)
Mann-Whitney U
Unrelated T Test
Wilcoxon Matched pairs
Related T Test
Spearman’s Rho
Pearson’s Product-Moment Correlation (Pearson’s r)
Chi-Square

WHAT INFERENTIAL TEST SHOULD YOU CHOOSE?

Before selecting an inferential statistical test for your research, it's crucial to understand seven key aspects of your data and research objectives. Without knowing these, a test cannot be selected.

These criteria may seem puzzling now, but there's no cause for concern, as each one will be thoroughly explored individually. By the end of the inferential statistics chapter, you'll understand these seven key points and their importance to choosing tests.

CRITERIA NEEDED TO CHOOSE AN INFERENTIAL TEST:

What is your level of significance, e.g., 0.01, 0.05, or 0.10 (1%, 5%, 10%)?
What is your level of measurement of your data (e.g., nominal, ordinal, interval, or ratio)?
Is the hypothesis directional (1-tailed) or non-directional (2-tailed)?
Do you need a test of difference, association or correlation?
If you need a test of difference, what design are you using: Independent Group, Matched Pairs or Repeated Measures?
What is the size of your sample? For example, does it have fewer than 30 participants?
Is your data parametric or non-parametric?

NOW, LET’S LOOK AT EACH OF THESE CRITERIA IN GREATER DETAIL…

PROBABILITY AND CHANCE

Probability is the branch of maths that calculates how likely the occurrence of an event is. The probability of an event, mathematically, is a number between 0 and 1. In short, 0 indicates the impossibility of the event occurring; for example, I could bet on the zero possibility of turning into a reptile by sunrise.

1 indicates the certainty of an event. I can’t think of many events that would occur with absolute certainty. Even death may not qualify: perhaps our memories could be transferred into a form of AI sometime in the future. I’ll go with this: I’m 100% certain the sun will rise tomorrow.

In reality, most events have probabilities between 0 and 1 because most events can’t be predicted with absolute certainty or uncertainty.

TAKEAWAYS

Probability is expressed as p.
Probability, or p, is expressed as a number between 0 and 1.
p = 0 means an event won’t happen, e.g., you can’t pick a joker if there are none in the pack.
p = 1 means that an event will happen, e.g., you pick a joker from a pack of jokers.
p = 0.50: getting heads (or tails) when flipping a coin.
p = 0.25: picking a heart card from a full deck of cards.
Scientists look at conditional probability: this is the probability of an event if something else occurs; for example, there is a chance anyone can develop hair loss, but this will increase if the person is male and above 40 years old (the conditions).
The reason probability is between 0 and 1 is the way that it is calculated.

To calculate the probability that a particular outcome will occur, the number of ways it can happen has to be divided by the number of possible outcomes:

Probability = (number of outcomes in which X happens) ÷ (total number of possible outcomes)

For example, a coin has two possible outcomes, heads and tails. The probability of getting a head is one divided by two = 0.5 because there are only two possible outcomes.
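The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `probability` is made up for this example, and the card example assumes a standard 52-card deck with 13 hearts and no jokers.

```python
from fractions import Fraction


def probability(favourable: int, possible: int) -> Fraction:
    """Outcomes in which X happens divided by all possible outcomes."""
    if possible <= 0 or not 0 <= favourable <= possible:
        raise ValueError("need possible > 0 and 0 <= favourable <= possible")
    return Fraction(favourable, possible)


heads = probability(1, 2)    # one head out of two sides of a coin
heart = probability(13, 52)  # 13 hearts in an assumed 52-card deck
joker = probability(0, 52)   # no jokers in the pack

print(float(heads), float(heart), float(joker))  # 0.5 0.25 0.0
```

Because the number of favourable outcomes can never be below zero or above the total, the answer always lands between 0 and 1.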

WHAT’S THE PROBABILITY?

Finding Mutual Love: 1 in 562 chance
Having Twins: 1 in 67 chance
Becoming a Millionaire: 1 in 55 chance
Winning the Lottery (Weekly Ticket): 1 in 100,000,000 chance
Involved in a Drunk Driving Crash (Lifetime): 2 in 3 chance
Going Blind After Laser Eye Surgery: 1 in 5,000,000 chance
Being Injured by a Toilet: 1 in 5,000 chance
Dying in a Motor Vehicle Crash: 1 in 102 chance
Dying in an Airplane Crash: 1 in 205,552 chance
Dying from a Hornet, Wasp, or Bee Sting: 1 in 54,093 chance
Being Hit by Lightning: 1 in 114,195 chance
Killed by Fireworks: 1 in 340,733 chance
Dying from a Shark Attack: 1 in 3,748,067 chance
Scoring a Hole-in-One in Golf: 1 in 5,000 chance
Becoming an Astronaut: 1 in 13,200,000 chance
Winning an Olympic Medal: 1 in 662,000 chance
Getting Stuck in a Lift (Per Ride): 1 in 100,000 chance (USA data)

HOW DOES PROBABILITY RELATE TO INFERENTIAL STATISTICS?

Consider discovering a bag of fifty-pound notes on the way to work or school. Such an event is likely a rare occurrence—a one-off instance. Similarly, researchers question whether they would obtain consistent results with similar populations.

For example:

SCENARIO 1

A psychologist creates the following hypothesis: “Participants in the jogging condition will rate photographs of the opposite sex higher than participants in the non-jogging condition.” Soon afterwards, the psychologist conducts the experiment and gets the following descriptive data.

Remember the jogging v non-jogging hypothesis from data handling and analysis (step 1)?

Descriptive data will summarise your raw data and make it easier to analyse. It will also tell you if your experiment or study has worked.

However, the research process extends beyond descriptive statistics. For instance, the findings presented in Table 1 are not guaranteed to happen again. The results could be by chance. Researchers need to know that other psychologists could obtain the same results when replicating the jogging study.

Here is another analogy to really illustrate the point. Suppose a coin is flipped twenty times, and the outcomes of heads and tails are recorded.

The results are as follows.

Q. What is the probability of getting 19 heads and one tail every time a coin is flipped 20 times?

A. The probability of getting 19 heads and one tail when a fair coin is flipped 20 times is extremely low. This outcome would typically be considered a chance or random result, and it would be very unlikely to happen again.
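To put a number on "extremely low", here is a small sketch that assumes the coin is fair and each flip is independent. The helper name `prob_heads` is invented for this illustration.

```python
from math import comb


def prob_heads(k: int, n: int) -> float:
    """Probability of exactly k heads in n flips of a fair coin."""
    # comb(n, k) counts the different orders in which k heads can appear;
    # each individual sequence of n flips has probability 1 / 2**n.
    return comb(n, k) / 2 ** n


exactly_19 = prob_heads(19, 20)
at_least_19 = prob_heads(19, 20) + prob_heads(20, 20)

print(exactly_19)   # about 0.000019, roughly 1 chance in 52,000
print(at_least_19)  # about 0.00002
```

The second figure (19 heads or more) is the kind of "this result or one more extreme" probability that inferential tests work with.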

Descriptive statistics alone cannot address these questions because they do not facilitate the calculation of the probability of a result recurring or never occurring again.

Without conducting an inferential test, it is impossible to calculate the probability of results occurring again.

WHY DO WE NEED INFERENTIAL TESTS?

This is where inferential statistics come in as they calculate the probability of a result occurring again or not; in other words, is the finding a chance result, or would it occur again and again? The ultimate goal of inferential statistics is to generalise findings from the sample in the study to similar populations. This is because samples are often biased, so it’s important to find out how likely it is that they accurately reflect what happens in the population.

Scientists call this aspect of the research process statistical significance, so if a phenomenon doesn’t occur by chance, it is deemed statistically significant. If it occurs by chance, it is deemed not statistically significant.

NB: Most scientists call it “significant” or “not significant” - the prefix “statistical” is used with less frequency now.

WHAT IS A SIGNIFICANCE LEVEL?

Most things in life require a certain amount of evidence before people try them, for example, taking the Covid vaccination. If any behaviour were deemed too risky, e.g., if 90% of people exploded when travelling on aeroplanes, then the likelihood of people flying would be zero.

We use this evidence-based system frequently in our daily lives. For example, our criminal justice system rests upon the notion that prosecutors must prove the defendant is guilty beyond a reasonable doubt. Judgements are very stringent and require strong evidence, e.g., at least 90% of the jury must agree to find a defendant guilty or not guilty. There are protocols for evidence because convicting innocent people is ethically wrong.

For science, the significance level is the jury because it tells the researcher the likelihood of obtaining a chance result. Scientific research will only be deemed credible if statisticians convince the world that their results can be applied to the population and are not random occurrences.

More specifically, the significance level is the percentage of chance a researcher is willing to allow. For example, if I am 99% certain the sun will rise tomorrow, I am leaving a 1% possibility to chance that it won’t. The percentage I am willing to concede, i.e. the 1%, is the significance level.

CONFIDENCE LEVEL: The level of certainty about something, e.g., I am 100% certain I am female. The confidence level in scientific research also assesses the probability that the results will be the same if research is repeated.
SIGNIFICANCE LEVEL (also known as alpha, α): It is the degree of uncertainty you are willing to concede about the results of your study, e.g., if nothing is 100% certain, what percentage of chance will you allow? More formally, the significance level is the maximum risk of rejecting a true null hypothesis you are willing to take, usually set at 5% but can also be 1% or 10%. It is compared with the p-value, which is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true.
Incidentally, confidence levels and significance levels go hand in hand, e.g., if I have a 95% confidence level that my dog is a thoroughbred Golden Retriever, then I must acknowledge there is a 5% chance (significance level) that he is not, e.g., he has DNA from other breeds.

It is important to know that the confidence level is usually set at 90% or higher, e.g., 95% and 99%. The corresponding significance levels would then be 10%, 5% and 1%.

TYPES OF SIGNIFICANCE LEVEL

Researchers could choose a zero per cent significance level. Still, it isn't likely that they would ever be able to make any prediction with that level of confidence, as it is equivalent to predicting 100% certainty/confidence. Theoretically, all phenomena have unpredictability; predicting any behaviour with 100% confidence would not be possible. One might be able to predict certain behaviours with near 100% conviction, such as whether a person would stick their hand into a bucket of spitting rattlesnakes. Still, there is always the possibility that one of the participants in your sample could be a bit unconventional, so you need to factor in chance.

Whatever significance level a psychologist chooses, the results will never be free of chance. So, if a psychologist cannot predict a 0% chance level, what can they predict? Not many women would take the contraceptive pill if the chance of getting pregnant were 50%. What percentage of uncertainty should the scientist allow, or in statistic talk, “what should the desired strength of evidence be against the null hypothesis?”

SIGNIFICANCE LEVEL PERCENTAGES

The 0% significance level means that there is a 0% chance that the results occurred randomly. And a 100% confidence level that they didn’t occur by chance. No researcher chooses this.
The 1% significance level means that there is a 1% chance that the results occurred randomly. And a 99% confidence level that they didn’t occur by chance. 1% is also expressed as 0.01 or 1 chance in 100.
The 5% significance level means that there is a 5% chance that the results occurred randomly. And a 95% confidence level that they didn’t occur by chance. 5% is also expressed as 0.05 or 1 chance in 20.
The 10% significance level means that there is a 10% chance that the results occurred randomly. And a 90% confidence level that they didn’t occur by chance. 10% is also expressed as 0.10 or 1 chance in 10.

TYPE ERRORS

Type I and II errors

Type I error: you conclude that 10 minutes of meditation reduces stress when it doesn’t.
Type II error: you conclude that 10 minutes of meditation doesn’t affect stress when it does.

TYPE I ERROR

If a researcher opts for a 10% significance level in a test, they aim to predict outcomes with 90% certainty and allow for a 10% chance of error. However, accepting an experimental hypothesis at this significance level implies only 90% certainty, which raises concerns about credibility. Would you board an aeroplane if 10 out of every 100 passengers wouldn't make it?

Why do some researchers choose a 10% significance level when it lacks integrity? Would the general public find such a high chance margin acceptable for evaluating a new drug? Some psychologists may choose this level because it's easier to prove. For instance, a researcher might explore a new area and conduct preliminary research to assess differences or associations between variables, such as investigating whether vaping causes vivid dreams.

However, when a researcher sets a 10% significance level, they're not aiming to prove results at the most stringent level, as this might overlook nuances in the data. Opting for a more lenient significance level makes the test more sensitive to true effects, including very subtle ones, such as vivid dreams in a percentage of vape smokers. Conversely, setting the bar too strictly may mean that genuine effects are missed. For example, if the significance level is set at 1%, the link between smoking and cancer might go undetected.

However, the 10% significance level is problematic because there is a 10% chance the researcher will end up rejecting their null hypothesis when they should not have, because the significance level was not stringent enough. In other words, if the significance level is set at 10% and 100 studies are conducted on an effect that does not really exist, about 10 of them would be expected to produce a significant result purely by chance. This is known as a type I error.

A type I error occurs in situations where a researcher will have concluded that the results are statistically significant when, in fact, they are not - this can also be referred to as a false positive.
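The idea that a lenient significance level lets more false positives through can be checked with a small simulation. This is a rough sketch under stated assumptions, not part of any exam specification: every simulated "study" draws 30 scores from a population where there is no real effect (mean 0, standard deviation 1), and a simple two-tailed z-test is used because it needs only the standard library. The function names are invented for this example.

```python
import math
import random


def two_tailed_p(sample, sigma=1.0):
    """Two-tailed p-value for a z-test of 'population mean = 0' (sigma known)."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(alpha, studies=10_000, n=30, seed=1):
    """Share of studies declared 'significant' although the null is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no real effect
        if two_tailed_p(sample) <= alpha:
            hits += 1  # a type I error: significant by chance alone
    return hits / studies


for alpha in (0.10, 0.05, 0.01):
    print(alpha, false_positive_rate(alpha))
```

Each printed rate should sit close to the significance level used, which is what "a 10% chance of a type I error" means in practice.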

In simpler terms, the same principle applies to everyday situations with low significance, like birth control pills. For instance, if the confidence level for not getting pregnant while on birth control pills is around 90% (or 99% if used perfectly), it means that out of 100 women, about ten could still get pregnant while using the pill. When contraception fails, it's obvious that a mistake occurred because a woman becomes pregnant. However, researchers won't realize they made a mistake until they redo the study and get different results or notice that their test value is significantly different from the standard values of 5%, 10%, or even 1% (but more on that later).

Here are some other examples of false positives:

Lateral flow tests: If a lateral flow test wrongly flags ten per cent of uninfected people, then ten per cent of test takers who do not have the condition will falsely show positive results. This could lead to individuals being mistakenly identified as positive for a certain condition or disease.

Legal Trials: In a legal trial, such as the trial of an accused criminal, a type I error would mean that the person is found guilty and is sent to jail despite actually being innocent. This can have severe consequences for the wrongly accused individual.
Western Blot test for Lyme disease: The Western Blot test detects tick-borne bacterial infections like Lyme disease. It has a false-positive prevalence of 4.8%, meaning that 4.8% of people test positive for Lyme disease when they don’t have it. This can lead to unnecessary treatments or anxiety for individuals who receive false positive results.

TYPE II ERROR

If a researcher chooses a 1% significance level test, they want to predict with 99% certainty and a 1% chance rate.
If an experimental hypothesis is accepted with a 1% significance level, it is a highly credible result as it means 99% certainty.

SCENARIO 2

Imagine you have been stuck on a mountain for three days. Miraculously, three rescuers appear on the evening of day 3.

Q. Which one do you choose?

I hope you opted for Ivan the nihilist despite his intimidating presence, since he would most likely rescue you. Similarly, the 1% significance level carries substantial credibility, so why don't all researchers opt for it? Would the public be content with anything less stringent? Who would prefer the 10% error margin akin to Mountain Mick's rescue operations?

Many psychologists steer away from the 1% significance level because it's not just exceptionally challenging to achieve but also overly stringent, making it difficult to detect nuanced differences in the data.

A good analogy for these points is dart players.

SCENARIO 3

Imagine you are the new team leader of a local darts team. There is an important league game next week, and you want to show off your leadership skills to the existing players. Unfortunately, five members can’t make the league game as they have been stuck on a mountain for three days in the French Alps. In desperation, you advertise for new players. You decide the criterion for membership is players who can get ninety-nine bull’s eyes in a row. One hundred people apply, but sadly, nobody passes this benchmark, and you fail to find new players and have to withdraw from the league. People assume you are a terrible team leader, so you resign with immediate effect and move to the Outer Hebrides.

The reality was that 25 applicants were professional dart players and thus extremely good. But the test was impossibly hard, even for them. You failed to find players because the bar was set way too high. This is known as a type II error. Similarly, if a researcher sets a 1% significance level, it may be too severe to prove. In other words, they accepted the null hypothesis because their criteria were impossibly high.

A type II error occurs when a researcher incorrectly accepts the null hypothesis, failing to recognize the validity of the experimental or alternative hypothesis. This mistake leads to a false negative outcome, where significant findings are dismissed as non-significant.

The challenge with type II errors is that they often go unnoticed until the study is replicated or if further analysis reveals that the observed data points are near the accepted critical values. For example, after the challenge, you realise that even though one player called Alex didn't hit the 99th bull's eye, his ability to hit 98 in a row was an extraordinary feat that almost no one else could achieve. This moment of realisation comes when you witness another competition where the standard is set more realistically, and you see Alex outperforming many others who couldn't come close to his previous mark of 98 bull's eyes.

In this context, recognizing the type II error comes from seeing Alex's performance in a different light and understanding that your original criterion was unrealistically high. The error becomes evident when you observe that Alex's skill level significantly surpasses what is typically expected, even though it fell just short of your original, overly stringent requirement.

Similarly, in statistical terms, you might realize a type II error has been made when subsequent analysis or additional studies show that the effect you deemed non-significant (due to not meeting a very low p-value threshold) has practical significance or when you observe that the results were very close to your significance cutoff, suggesting that a slight adjustment in your criteria (e.g., a more lenient significance level) could have led to a different conclusion.

Thus, recognizing a type II error often involves hindsight—seeing the results in a new context or with a broader perspective highlighting the practical significance of the initial dismissal.

This realisation highlights the delicate balance required when setting significance levels: they must detect true effects accurately without being overly restrictive.

FALSE NEGATIVES

A false negative arises when a test incorrectly indicates the absence of a condition or attribute that is present. For example, if you take a pregnancy test that shows you're not pregnant when you are, that's a false negative. Reasons for a false negative in this context might include conducting the test prematurely, diluting urine, a malfunctioning test kit, or reading the results too quickly. This issue isn't confined to medical or physiological tests, such as those for pregnancy or COVID-19, where a person might test negative despite being infected; it extends into various other domains:

In manufacturing quality control, a false negative would occur if a flawed product mistakenly passes inspection and is deemed safe or meets quality standards, potentially due to oversight or inadequate testing processes.
Within the justice system, a false negative happens if someone guilty is acquitted and deemed not guilty, perhaps because of insufficient evidence, witness testimony issues, or legal technicalities.

These examples highlight false negatives' broad impact and risk, underscoring the importance of accurate testing and decision-making processes across different fields.

THE FIVE PERCENT SIGNIFICANCE LEVEL

Type errors can occur at any significance level, illustrating the inherent risk in statistical testing. For example, Type I errors, which involve falsely rejecting a true null hypothesis, can still occur at a stringent 1% significance level. This is analogous to the scenario with contraceptive pills: even when taken correctly, with a 99% efficacy rate, there's still a 1% chance of pregnancy, highlighting the possibility of error despite high confidence levels.

The choice of significance level significantly influences the probability of encountering Type I or Type II errors. A more lenient significance level, like 0.10, raises the chance of a Type I error, leading researchers to assert significance in their results incorrectly. Conversely, stricter levels like 1% can predispose researchers to Type II errors by overlooking genuine effects.

This dynamic is why many psychologists prefer a 5% significance level, positioning it as a middle ground between being too lenient and strict. The 5% threshold is chosen to reduce the likelihood of both Type I and Type II errors, aiming for a balanced approach that mitigates the risk of making either error, thus enhancing the reliability and validity of research findings.
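The trade-off described above can be sketched numerically. The figures below are purely illustrative and rest on assumptions not found in this post: a one-tailed z-test, a real effect of half a standard deviation (`effect_size=0.5`) and 20 participants (`n=20`). The function name `error_rates` is hypothetical.

```python
from statistics import NormalDist


def error_rates(alpha, effect_size=0.5, n=20):
    """Type I and Type II error rates for a one-tailed z-test.

    effect_size is the assumed true difference in standard-deviation
    units; both it and n are made-up values for illustration.
    """
    z = NormalDist()  # standard normal distribution
    critical = z.inv_cdf(1 - alpha)  # cut-off fixed by the significance level
    beta = z.cdf(critical - effect_size * n ** 0.5)  # chance of missing the effect
    return alpha, beta


for alpha in (0.10, 0.05, 0.01):
    type_1, type_2 = error_rates(alpha)
    print(f"alpha={alpha:.2f}  Type I={type_1:.2f}  Type II={type_2:.2f}")
```

Under these assumptions the Type II error rate climbs from roughly 0.17 at the 10% level to roughly 0.54 at the 1% level, while the Type I rate falls from 0.10 to 0.01. Lowering one risk raises the other, which is why 5% is treated as a compromise.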

CHOOSING A SIGNIFICANCE LEVEL

Consequences of Increasing the Significance Level: Imagine you’re testing the strength of bin bags and choosing the 10 per cent significance level. You’ll use the test results to determine their strength. A false positive here leads you to endorse bin bags that are not stronger. The drawbacks of a type I error here are very low because poor-quality bin bags don’t generally cause harm. Suppose, however, that you decide to increase the evidence you need by changing the significance level to 1%. Because this change increases the required evidence, it makes your test less sensitive to detecting differences and increases the chance of type II errors. As a result, your bin bags never go on the market because they can't pass such a stringent test. To avoid going bankrupt, you opt for a five per cent significance level to balance the risk of making type I and II errors.

Consequences of Decreasing the Significance Level:

Conversely, imagine you’re testing the success of a new antidepressant. A type I error here is risky because people’s mental well-being is on the line! You want to be very confident that the antidepressant from one manufacturer is better than the other. In this case, you should increase the evidence required by changing the significance level to 1%. Because this change increases the required evidence, it makes your test less sensitive to detecting differences and decreases the chance of type I errors. To you, the trade-off is worth it as people could die. It’s all about the trade-off between sensitivity and false positives! The smaller the significance level (α), the more stringent the test, and the lower the risk of a false positive. Unless research is socially sensitive or threatens the safety of an individual, most researchers opt for a 5 per cent significance level because it strikes a balance between being stringent enough to provide reliable results and being practical enough to avoid missing potentially important findings.

FACTORS THAT DETERMINE THE CHOICE OF A SIGNIFICANCE LEVEL

Sample Size: A "too small" sample size could be, for example, studying the effect of a new educational method on math performance with only ten students. This is considered small because it may not accurately represent the population's variability, leading to higher uncertainty in the results. Researchers might choose a less stringent significance level (e.g., 10%) to avoid missing a potentially real effect due to this variability.
Estimated Size of the Variable Being Tested: This refers to the size of the expected change or effect. For example, if a new drug is expected to lower blood pressure slightly, the effect size is small. Researchers might choose a significance level that balances the need to detect small, meaningful changes without being overly strict, considering the practical significance of the findings.
Newness of Research: Emerging fields, like the study of gut health's impact on mental health, may have less existing research to build upon. In such cases, a 10% significance level might encourage exploration and discovery, acknowledging that early findings can guide future, more detailed research.
Non-directional (2-tailed) Hypotheses: If researchers are studying a new therapy's effect on depression without predicting whether it will increase or decrease symptoms, they're using a non-directional hypothesis. A 10% significance level might be applied here to remain open to detecting any significant effect, regardless of its direction.
Existing Research Support: Well-established findings, such as the decline in memory function with age, might lead researchers to use a more stringent 1% significance level in studies exploring this phenomenon further. This higher standard ensures that new findings truly add to the existing body of evidence.
Conflicting Evidence: When exploring theories with mixed evidence, such as the claim that serotonin-enhancing antidepressants are no better than placebos, a stricter 1% level could be employed to rigorously test these claims and provide clearer conclusions against the backdrop of debate.
Social Sensitivity: Studies on sensitive topics, like the distinction between biological sex and gender identity, may adopt a 1% significance level to ensure the results are robust and can withstand societal scrutiny or backlash, recognizing the potential for widespread impact.
Controversial Nature of Research: Research proposing that criminal behaviour is determined (thus questioning free will) or that ADHD is primarily caused by early environmental factors might also opt for a 1% level to ensure findings are solid enough to challenge established views or provoke thoughtful discussion.
Implications for Well-being (Safety): In the case of developing new drugs or vaccines, such as those for COVID-19, a 1% significance level is often chosen to minimize the risk of harm and legal repercussions, reflecting the high stakes of accurately determining efficacy and safety.
Minimizing Risk of Errors: The general preference for a 5% significance level in many studies strikes a balance, aiming to reduce the likelihood of mistakenly seeing an effect that isn't there (Type I error) or missing a real effect (Type II error).
Directional (1-tailed) Hypotheses: If a study hypothesizes that a specific intervention will improve (not decrease) test scores, reflecting a specific, predicted direction of effect, a 10% significance level might be considered sufficient to explore this targeted hypothesis.
Setting the Chance Bar Appropriately: Choosing a 1% significance level for examining the link between cannabis use and psychosis might be too stringent, potentially leading to underreporting of the effect and suggesting a lack of association that simplifies the nuanced reality of risk.

DEDUCTIVE REASONING

In many aspects of life, we often rely on our past experiences to set standards for evidence. For instance, if someone has bitten us before, we may avoid them. This process is known as inductive reasoning and was the primary approach to research before Karl Popper championed the deductive (hypothetico-deductive) approach.

However, scientists operate differently, using deductive reasoning where evidence must be prospective and established before the actual study occurs. This means researchers need to predict the level of certainty expected for the evidence they seek. Consequently, the significance level is predetermined by the researcher.

But like any bet, making predictions carries the risk of failing or, more specifically, “not being able to reject your null hypothesis”. Remember, the result of a hypothesis test depends on whether the null hypothesis is rejected.

LASTLY, REMEMBER THE FOLLOWING STATEMENTS ARE ALL EQUIVALENT TO EACH OTHER.

I have only used the 5% level as an example, but any alpha level can be used.

The finding is significant at the 0.05 level.
The confidence level is 95%
The Type I error rate is 0.05.
The alpha level is 0.05.
α = 0.05.
There is a 1 in 20 chance of obtaining this result (or one more extreme).
The area of the region of rejection is 0.05.
The p‐value is 0.05.
p = 0.05.

POSSIBLE EXAM QUESTIONS

What level of significance is accepted as standard in psychological research? (1 mark)
Define a type I error. (2 marks)
Exam hint: Full marks can be achieved for this question by stating that the null is rejected and the experimental hypothesis accepted when, in fact, results are due to chance and are most likely to happen when the level of significance has been set too leniently.
A psychologist found that their results were significant at p<0.05. What does ‘the results were significant at p<0.05’ mean? (2 marks)
Explain the difference between a type 1 and a type 2 error. (4 marks)
Explain the difference between a calculated value and a critical value. (3 marks)
Explain what is meant by the phrase “not statistically significant at the 10% level.” (2 marks)
How does the set significance level affect the chance of researchers getting a type 1 or type 2 error? (4 marks)
Explain what is meant by “p ≤ 0.05”. (2 marks)

ANSWERS

The standard significance level accepted in psychological research is typically 0.05 or 5%.
A type I error occurs when the null hypothesis is incorrectly rejected and the experimental hypothesis is accepted, even though the results are due to chance. This error is more likely to happen when the significance level is too lenient.
When a psychologist's results are significant at p<0.05, there is less than a 5% probability that the observed results occurred by chance alone.
Type I error (false positive) occurs when the null hypothesis is wrongly rejected. In contrast, Type II error (false negative) occurs when the null hypothesis is incorrectly accepted when it should have been rejected based on the data.
A calculated value is the outcome of a statistical test based on the sample data. In contrast, a critical value is a threshold value obtained from statistical tables or formulas used to determine whether to reject the null hypothesis.
If something is statistically significant, it is unlikely to have occurred by chance, so you can reject your null hypothesis and accept your experimental/alternative hypothesis. If a result is not significant at the 10% level, the probability that it occurred by chance is greater than ten per cent, so the null hypothesis cannot be rejected.
If psychologists choose a ten per cent level of significance, then they have a greater chance of making type one errors. This is because there is up to a ten per cent probability that results could be due to chance. Therefore, psychologists may accept their experimental hypothesis when they should have rejected it (they won’t know this until they replicate the study). If psychologists go to the other extreme and choose a one per cent significance level, they will have a greater chance of making type two errors. This is when the significance level is set too stringently, and psychologists reject their experimental hypothesis when they should have accepted it (again, not known until replication). Therefore, psychologists usually choose five per cent, as it balances the risk of making type 1 and type 2 errors.
“p ≤ 0.05” means that the researcher set her significance level at 5%. This means that if the null hypothesis is rejected, the researcher can only be 95% certain her results did not occur by chance.
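The trade-off between type 1 and type 2 errors can be worked out exactly for a simple case. The sketch below is an illustration under stated assumptions, not a standard procedure: it uses a one-tailed sign test on 20 participants and assumes, hypothetically, that the real effect makes 70% of participants improve. The function names are made up for this example.

```python
from math import comb

def upper_tail(n, c, p):
    """P(X >= c) when X is the number of plus signs out of n, each with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def error_rates(n, alpha, p_true):
    # Smallest number of plus signs whose chance probability (null: p = 0.5) is <= alpha
    c = next(c for c in range(n + 2) if upper_tail(n, c, 0.5) <= alpha)
    type_1 = upper_tail(n, c, 0.5)         # rejecting a true null hypothesis
    type_2 = 1 - upper_tail(n, c, p_true)  # missing an effect that really exists
    return c, type_1, type_2

# Assumed scenario: 20 participants, true chance of improving = 0.7 (hypothetical)
results = [error_rates(20, alpha, 0.7) for alpha in (0.10, 0.05, 0.01)]
for alpha, (c, t1, t2) in zip((0.10, 0.05, 0.01), results):
    print(f"alpha={alpha:.2f}: need {c}+ plus signs, type 1 = {t1:.3f}, type 2 = {t2:.3f}")
```

As the significance level becomes stricter, the type 1 rate falls and the type 2 rate rises, which is the pattern described in the answer above.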

TESTS OF DIFFERENCE, CORRELATION AND ASSOCIATION?

THE NEXT STEP IN DECIDING WHICH INFERENTIAL TEST TO USE IS WHETHER YOU NEED A TEST OF DIFFERENCE, ASSOCIATION OR CORRELATION.

TESTS OF DIFFERENCE

Tests of difference are for all experiments:

Laboratory
Field
Quasi
Natural

Tests of difference are for research that looks for a difference between conditions (levels of the independent variable, or IV) or between groups of participants.

To determine if a test of difference is necessary, consider the following questions:

Are participants engaged in distinct activities or conditions? For instance, are there at least two conditions (IVs) to which participants are randomly allocated, as seen in laboratory and field experiments? Does the research aim to discern disparities in outcomes across these conditions or groups?
Are two groups of unrelated participants engaged in one activity or condition? For example, different sets of participants (IVs) experience one condition.

Examples of Quasi tests of difference include:

The difference in conformity scores between religious and non-religious participants.
The difference in IQ scores between males and females.

TESTS OF ASSOCIATION

Tests of association are needed for non-experimental research, e.g., observations, content analysis, thematic analysis, interviews, questionnaire surveys, and case studies, but only if the data are nominal.
Association tests are needed when researchers are looking for an association between variables.
Tests of association are never used for correlations or experiments.
Tests of association are needed when researchers are simply counting the frequency at which a behaviour occurs, e.g., tallies about a discrete set of categories. For example, you would use one if you wanted to know whether boys or girls play on main roads most frequently.

EXAMPLE ONE: Naturalistic observation of children playing outside to determine who plays on main roads most frequently, boys or girls

Nominal data: gaps between the behavioural categories are non-mathematical and cannot be ordered.
Test of association: Looking for an association between variables
Counting the frequencies of certain behaviours

EXAMPLE TWO: CONTENT ANALYSIS REASONS FOR SMOKING INITIATION IN TEENS.

Nominal data: gaps between the behavioural categories are non-mathematical and cannot be ordered.
Test of association: Looking for an association between variables
Counting the frequencies of certain behaviours

OTHER EXAMPLES OF TESTS OF ASSOCIATION: CONTINGENCY TABLES WITH TALLIES

EXAMPLE THREE

NATURALISTIC OBSERVATION OF A CHILD’S BEHAVIOUR IN CLASSES.

NOMINAL DATA: GAPS BETWEEN THE BEHAVIOURAL CATEGORIES ARE NON-MATHEMATICAL AND CANNOT BE ORDERED.

TEST OF ASSOCIATION:

LOOKING FOR AN ASSOCIATION BETWEEN VARIABLES
COUNTING THE FREQUENCIES OF CERTAIN BEHAVIOURS
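A test of association on tallies can be sketched in a few lines of Python. The tallies below are invented for the main-road example, the function name is hypothetical, and the calculation is the basic chi-squared formula (observed minus expected, squared, divided by expected) without any continuity correction.

```python
def chi_squared(observed):
    """Chi-squared statistic and degrees of freedom for a table of tallies (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected tally if there were no association between the variables
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Hypothetical tallies: rows = boys, girls; columns = played on main road, did not
tallies = [[30, 20],
           [15, 35]]
stat, df = chi_squared(tallies)

CRITICAL_05_DF1 = 3.84  # table value for df = 1 at the 0.05 level (two-tailed)
print(f"chi-squared = {stat:.2f}, df = {df}")
print("significant" if stat >= CRITICAL_05_DF1 else "not significant")
```

With chi-squared, the calculated value must be equal to or greater than the critical value to be significant.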

TESTS OF CORRELATION

When conducting non-experimental research to examine the relationship or link between two variables, such as the correlation between temperature and ice lolly purchases, correlations are the appropriate statistical tool.
Correlations exclusively assess the relationship between two variables, with one variable plotted on the X-axis and the other on the Y-axis, resulting in 1x1 designs.
Correlations require data that can be placed in order along a scale (ordinal, interval or ratio); they cannot be applied to nominal (categorical) data.
In many cases, correlations involve data collected from the same individuals, leading to inferential tests such as Spearman’s rank and Pearson’s product, where each participant (a single N) contributes a pair of scores, one for each variable.
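The two correlation coefficients can be sketched in plain Python. The temperature and ice lolly figures are invented, and the function names are hypothetical; here Spearman’s rho is calculated as Pearson’s r applied to the ranks, with tied scores sharing the mean of their ranks.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: one pair of scores per day
temperature = [14, 18, 21, 24, 27, 30]   # degrees Celsius
lollies = [5, 9, 8, 15, 21, 40]          # ice lollies sold

print(f"Pearson's r = {pearson_r(temperature, lollies):.3f}")
print(f"Spearman's rho = {spearman_rho(temperature, lollies):.3f}")
```

Both coefficients run from -1 to +1; the calculated coefficient is then compared with the critical value for the number of pairs (N).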

HOW TO TELL IF YOU NEED A CORRELATION

WHICH OF THE HYPOTHESES BELOW IS A CORRELATION?

Male participants aged 15-24 will smoke less than male participants aged 25-34.

Older men smoke more than younger men.

WHAT KIND OF RESEARCH METHOD DOES EACH STUDY NEED?

The first hypothesis, "Male participants aged 15-24 will smoke less than male participants aged 25-34", suggests a comparison between different age groups rather than a direct relationship between age and smoking behaviour. This scenario is better suited for a quasi-experimental design rather than a correlation analysis.

The second hypothesis, "Older men smoke more than younger men", implies a correlation analysis where age is treated as a continuous variable.

Identify the research method:

Since both hypotheses involve age, it can be unclear which type of study each one needs. When age is treated as a discrete variable, boundaries are defined and the age groups are treated as categories, making a quasi-experimental design more suitable. Conversely, a correlation is more appropriate when age is treated as a continuous variable, since age is then viewed as a continuum. This distinction helps you select the appropriate research design based on how age is conceptualised in the study.

Also consider the number of variables:

Quasi-experiments typically involve three variables: the independent variable (e.g., age groups such as those aged 15-24 and those aged 25-34), the dependent variable (e.g., smoking behaviour), and potentially a third variable. Here, the age groups are the independent variable and smoking behaviour is the dependent variable, so this aligns with a quasi-experimental design.
On the other hand, correlation designs typically involve examining the relationship between two continuous variables. When age is defined as a continuous variable (e.g., with no discrete categories such as age 15-24 or 25-34), it fits the criteria for a correlation analysis. As a good rule of thumb, correlations involve two variables, while quasi-experimental designs involve a minimum of three.

Beware of intraclass correlations:

An intraclass correlation (ICC) is a statistical method to assess the degree of agreement or similarity among observations made on the same subjects, groups, or clusters. It measures the consistency or reliability of measurements made within the same group.
For example, consider a study comparing the IQ scores of older and younger siblings within the same families. In this scenario, each family represents a cluster, and the IQ scores of siblings within each family are compared. The ICC would assess the extent to which IQ scores within the same family are similar or correlated.
The confusion between ICC and quasi-experimental designs arises because both involve comparing groups of individuals. However, the key distinction lies in the relationship between participants.
In quasi-experimental designs, participants are typically unrelated individuals assigned to different groups or conditions based on pre-existing characteristics or criteria. For example, comparing the IQ scores of individuals from different age groups or educational backgrounds would constitute a quasi-experimental design.
On the other hand, in ICC studies, participants are related or clustered in some way, such as family members, students within the same classroom, or patients within the same healthcare facility. The focus is on examining the agreement or similarity of measurements within these related groups.

Age can be a continuous variable or a discrete variable. If age is a continuous variable, then the study will be a correlation, for example, “The older you are, the more you smoke.” In correlations, age must be plotted as a continuous scale on either the X or Y axis.

But if age is expressed as a discrete variable (in categories, as nominal data), then it will be a quasi-experimental design, for example, “Male participants aged 15-24 will smoke less than male participants aged 25-34.”

Lastly, unless it is an intraclass correlation, both sets of data will come from the same person, so if the data come from two sets of people, chances are it is a quasi-experiment. AQA, for example, does not usually give intraclass correlations as examples in questions.

INTRACLASS CORRELATIONS AND CONCORDANCE RATES.

Intraclass correlations can sometimes be mistaken for quasi-experiments. Students often get confused if research compares two groups of people.

The trick to deciding between quasi and intraclass correlations is to investigate the relationship between the two groups. If the two groups are related/connected in some way, e.g., siblings, spouses, friends, colleagues, observers etc. then you should use an intraclass correlation.

If the two groups have no relationship/connection to each other and are in fact strangers, then it is a quasi-design, because you cannot pair up scores from participants who are not connected to each other.

EXAMPLES OF INTRA-CLASS-CORRELATIONS

WHAT INFERENTIAL TEST SHOULD YOU CHOOSE?

You should now know how to work out the following 5 things about your data and research to select the right inferential test.

CRITERIA NEEDED TO CHOOSE AN INFERENTIAL TEST:

1. Do you understand the level of significance, e.g., 0.01, 0.05, or 0.10 (1%, 5%, 10%)?
2. Do you understand what level of measurement means (e.g., nominal, ordinal, interval or ratio)?
3. Do you understand the differences between directional (1-tailed) and non-directional (2-tailed) hypotheses?
4. Do you know how to distinguish between difference, association and correlation tests?
5. If it’s a test of difference, do you know how to choose the appropriate design, e.g., Independent Groups, Matched Pairs or Repeated Measures?

If you ticked yes for each of the above criteria, try answering the questions below. These questions will test how well you can apply the concepts behind inferential testing.
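The decision process can also be written out as a small lookup. This is only a sketch: the function and table names are hypothetical, it follows the usual A-level decision table for the tests named in the specification, and it assumes ratio data are treated in the same way as interval data.

```python
# (unrelated design, related design) for tests of difference
DIFFERENCE_TESTS = {
    "nominal": ("Chi-squared", "Sign test"),
    "ordinal": ("Mann-Whitney U", "Wilcoxon"),
    "interval": ("Unrelated t-test", "Related t-test"),
}
CORRELATION_TESTS = {"ordinal": "Spearman's rho", "interval": "Pearson's r"}
RELATED_DESIGNS = {"repeated measures", "matched pairs"}

def choose_test(purpose, level, design=None):
    """Suggest an inferential test from the purpose, level of measurement and design."""
    level = "interval" if level == "ratio" else level  # assumption: ratio handled as interval
    if level not in DIFFERENCE_TESTS:
        raise ValueError(f"unknown level of measurement: {level}")
    if purpose == "difference":
        if design is None:
            raise ValueError("a test of difference needs an experimental design")
        unrelated, related = DIFFERENCE_TESTS[level]
        return related if design in RELATED_DESIGNS else unrelated
    if purpose == "association":
        if level != "nominal":
            raise ValueError("tests of association are for nominal data")
        return "Chi-squared"
    if purpose == "correlation":
        if level == "nominal":
            raise ValueError("correlations cannot use nominal data")
        return CORRELATION_TESTS[level]
    raise ValueError(f"unknown purpose: {purpose}")

print(choose_test("difference", "ordinal", "matched pairs"))
print(choose_test("correlation", "interval"))
```

Significance level and hypothesis direction do not change which test you pick; they decide which critical value you look up afterwards.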

QUESTIONS ON RESEARCH

It's now time to find out if you can apply what you have learned to the following research design questions.

If you can answer the following questions, you should have no problems choosing an inferential test.

Is the research an experiment or a non-experiment?
What is the research method?
If there is a design needed, what is it?
What level of measurement is the data?
What is the IV, variables or co-variables?
What is the DV (if applicable)
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)?
Test of difference, association or correlation?
What inferential test is needed?
Write an aim
Write a hypothesis.
Suggest a suitable significance level.

RESEARCH SCENARIOS

Researchers want to replicate a study on the difference between male and female estimates of stopping distances.
In Ainsworth’s Strange Situation, observers had behavioural categories to look for and ticked a category each time it was observed.
Researchers believe that siblings’ aggression levels will have a similar relationship. Siblings are each given an attitude scale to rank their aggression, e.g., “On a scale of 1-10, how aggressive are you?”
Researchers analyse “Lonely Heart” advertisements to investigate sexual selection theory. They hypothesise that male participants will advertise “status” more frequently than female participants and that female participants will advertise ‘looks” more frequently than male participants.
In a company, disabled and able-bodied participants were asked to indicate on a scale of 1-7 how much they felt in control of their working environment.
Identical twins are split into condition A, which sets a puzzle they can solve, or condition B, which sets an unsolvable puzzle. After thirty minutes in either condition, their stress level is measured with an attitude test, such as: on a scale of 1-5, how stressed are you? It is thought that participants who could solve the puzzle would be less stressed.
One group of participants are given an IQ test and then asked to take multivitamins for a month, and then their IQ is measured again. It is thought that participants who take vitamins will be more intelligent.
Researchers want to find out if there is a difference between the happiness ratings of academic and non-academic pupils.
Researchers are investigating the effect of age on memory. They wanted to see if those aged 20-40 have poorer memories than those aged 40-60, so they administered a digit span memory test to both groups.
It is hypothesised that listening to music with aggressive lyrics increases the heart rate. Participants listen to either music with aggressive lyrics or non-aggressive lyrics whilst their heart rate is measured.
It is hypothesised that caffeine causes memory problems. Scores on a memory test are taken before and after taking caffeine pills.
It is hypothesised that high levels of testosterone increase risk-taking. As finger length indicates how much testosterone a male has been exposed to, the ratio between the second and fourth digits of male participants' fingers was measured. This was then measured against scores on a scale measuring risk-taking behaviour.
How many units of alcohol per week are consumed by males and females?
Pictures of married couples were taken. Female Participants are asked to rate the attractiveness of the male spouse, and male Participants are asked to rate the attractiveness of the female spouse. It was thought that couples would have similar levels of attractiveness.
Female and male participants are asked to choose which female body shape they prefer, sizes 6, 8, 10, 12, 14, 16, 18, or 20. The sizes were exact.
Participants from Western and non-Western societies are asked to complete the Social Readjustment Rating Scale (SRRS - Holmes & Rahe) and calculate their scores.
School students were observed choosing snacks during breaks. The snack choices were either apples or crisps. The next morning in school assembly, the same students were given the nutritional value of apples vs. crisps. They were then observed again to see whether they chose between apples or packets of crisps at break time.
A researcher wanted to see if older siblings were more intelligent than younger siblings—all siblings filled in an IQ test.
Children who lived in homes without gardens and children who lived in homes with private gardens were observed to see whether they chose to play outside in the street or stay at home.
Participants are put in either a jogging or a non-jogging condition and then asked to rate pictures of the opposite sex on a scale of 1-10. This is new research.
Research suggests that the antioxidants in foods such as blueberries can reduce age-related declines in cognitive functioning. To test this, a researcher selected 25 adults and administered a cognitive function test to each participant. The participants then drink a blueberry supplement daily for four months before they are tested again.
To examine the connection between alcohol consumption and birth weight, a researcher selects a sample of 20 pregnant rats and mixes alcohol with their food for two weeks before the pups are born. Another group of 20 pregnant rats was compared.
To examine how texting affects driving skills, a researcher uses orange traffic cones to set up a driving circuit in a parking lot. A group of students is then tested on the circuit, once while receiving and sending text messages and once without texting and receiving messages. The researcher records the number of cones hit while driving each circuit for each student.
The more people exercise, the lower their blood pressure.
A statistics instructor thinks that doing homework improves scores on exams. To test this hypothesis, she randomly assigns students to two groups. One group must work on the homework until all problems are correct, while homework is optional for the second group. Exam grades are compared between the two groups at the end of the semester.
Researchers are investigating the effect of age on memory. They want to see if older children have poorer memories than younger children, so they administer a memory test to both groups.

ANSWERS

Researchers want to replicate a study on the difference between male and female estimates of stopping distances.

Is the research an experiment or a non-experiment? Experiment.
What is the research method? Between groups quasi-experiment.
If there is a design needed, what is it? N/A
What level of measurement is the data? Ratio
What is the IV, variables or co-variables? Gender (M+F)
What is the DV (if applicable) Stopping distance
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed), because it is a replication and the previous research indicates the expected direction.
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To replicate a study investigating the difference between male and female estimates of stopping distances.
Hypothesis: There will be a difference in stopping distance estimates between males and females
Suggest a suitable significance level.

2. In Ainsworth’s Strange Situation, observers had behavioural categories to look for and ticked a category each time it was observed.

Is the research an experiment or a non-experiment? Non-experiment.
What is the research method? Structured observation
If there is a design needed, what is it? N/A
What level of measurement is the data? Nominal – categories/frequency of occurrence.
What is the IV, variables or co-variables? Variables: attachment type (e.g., secure, insecure, disorganised etc.)
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed).
Test of difference, association or correlation? Test of association
What inferential test is needed? Chi-square
Aim: To observe and categorize the behaviour of infants in various attachment scenarios.
Hypothesis: Infants will exhibit different behaviours in different attachment scenarios
Suggest a suitable significance level.

3. Researchers believe that siblings’ aggression levels will have a similar relationship. Siblings are each given an attitude scale to rank their aggression, e.g., “On a scale of 1-10, how aggressive are you?”

Is the research an experiment or a non-experiment? Non-experiment.
What is the research method? It is an intra-class correlation as it measures relationships between the results of inter-connected pairs of participants.
If there is a design needed, what is it? N/A
What level of measurement is the data? Ordinal – use of a scale.
What is the IV, variables or co-variables? Variables: (1x1) 1st sibling’s aggression score and 2nd sibling’s aggression score.
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed).
Test of difference, association or correlation? Test of correlation.
What inferential test is needed? Spearman’s Rho
Aim: To explore the relationship between the self-reported aggression levels of siblings.
Hypothesis: There will be a positive correlation between the aggression scores of siblings.
Suggest a suitable significance level.

4. Researchers analyse “Lonely Heart” advertisements to investigate sexual selection theory. They hypothesise that male participants will advertise “status” more frequently than female participants and that female participants will advertise ‘looks” more frequently than male participants.

Is the research an experiment or a non-experiment? Non-experiment.
What is the research method? Content Analysis
If there is a design needed, what is it? N/A
What level of measurement is the data? Nominal data
What is the IV, variables or co-variables? Variables: Gender (M+F)
What is the DV (if applicable) looks and status
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed).
Test of difference, association or correlation? Test of association.
What inferential test is needed? Chi-Square
Aim: To investigate sexual selection theory by analysing "Lonely Heart" advertisements.
Hypothesis: Male advertisers will mention "status" more frequently than female advertisers and female advertisers will mention "looks" more frequently than male advertisers.
Suggest a suitable significance level.

5. In a company, disabled and able-bodied participants were asked to indicate on a scale of 1-7 how much they felt in control of their working environment. New research

Is the research an experiment or a non-experiment? Experiment.
What is the research method? Between groups quasi-experiment.
If there is a design needed, what is it? N/A
What level of measurement is the data? Ordinal data
What is the IV, variables or co-variables? Disabled and able-bodied participants.
What is the DV (if applicable)? How much they felt in control of their working environment, operationalised as a score on a 1-7 attitude scale.
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed).
Test of difference, association or correlation? Test of difference
What inferential test is needed? Mann-Whitney U test (the data are ordinal and the two groups are unrelated).
Aim: To explore the perceived control over the working environment among disabled and able-bodied employees within a corporate setting.
Hypothesis: There will be a difference in perceived control over the working environment between disabled and able-bodied employees.
Suggest a suitable significance level.

6. Identical twins are split into conditions: either condition A, where they are set a puzzle they can solve or condition B, where they are set an unsolvable puzzle. After thirty minutes in either condition, their stress level is measured with an attitude test, such as: on a scale of 1-5, how stressed are you? It is thought that participants who could solve the puzzle would be less stressed

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment.
If there is a design needed, what is it? Matched pairs design (each identical twin is matched with their co-twin in the other condition).
What level of measurement are the data? Ordinal data.
What is the IV, variables or co-variables? A puzzle participants can or cannot solve.
What is the DV (if applicable) Stress, operationalised as a score on an attitude scale.
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed).
Test of difference, association or correlation? A test of difference.
What inferential test is needed? Wilcoxon matched pairs.
Aim: To assess the impact of puzzle solvability on stress levels in identical twins.
Hypothesis: Identical twins given solvable puzzles will report lower stress levels compared to those given unsolvable puzzles
Suggest a suitable significance level.

7. One group of participants are given an IQ test and then asked to take multivitamins for a month, and then their IQ is measured again. It is thought that participants who take vitamins will be more intelligent.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Repeated measures
What level of measurement is the data? Interval data
What is the IV, variables or co-variables? IV: Multi-Vitamins being taken or not being taken.
What is the DV (if applicable) DV: Intelligence operationalised as IQ
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Related T-test
Aim: To examine the effect of a month-long multivitamin regimen on IQ scores.
Hypothesis: Participants taking multivitamins for a month will show an increase in IQ scores.
Suggest a suitable significance level.

8. Researchers want to find out if there is a difference between the happiness ratings of academic and non-academic pupils.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Between-group quasi-experiment
If there is a design needed, what is it? N/A
What level of measurement is the data? Ordinal data because the gaps between happiness scores are not mathematical.
What is the IV, variables or co-variables? IV: type of pupil (academic or non-academic)
What is the DV (if applicable) Happiness ratings
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? A test of difference
What inferential test is needed? Mann-Whitney U test.
Aim: To compare happiness ratings between academic and non-academic pupils.
Hypothesis: There will be a significant difference in happiness ratings between academic and non-academic pupils.
Suggest a suitable significance level.

9. Researchers are investigating the effect of age on memory. They want to see if those aged 20-40 have poorer memories than those aged 40-60, so they administer a digit span memory test to both groups.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Between groups quasi-experiment
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval data
What is the IV, variables or co-variables? IV: Age (old v young)
What is the DV (if applicable) DV Results from the memory test
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To investigate the effect of age on memory.
Hypothesis: Individuals aged 20-40 will have poorer memory test scores than those aged 40-60.
Suggest a suitable significance level.

10. It is hypothesised that listening to music with aggressive lyrics increases heart rate. Participants listen to either music with aggressive lyrics or non-aggressive lyrics whilst their heart rate is measured.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Independent groups
What level of measurement is the data? Interval/ratio data
What is the IV, variables or co-variables? IV: aggressive music & non-aggressive music
What is the DV (if applicable)? DV: heart rate
What type of hypothesis is it: Directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To determine how listening to music with aggressive lyrics affects heart rate.
Hypothesis: Music with aggressive lyrics will significantly increase the heart rate of listeners.
Suggest a suitable significance level.

11. It is hypothesised that caffeine causes memory problems. Scores on a memory test are taken before and after taking caffeine pills.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Repeated measures
What level of measurement is the data? Ratio because the data will be an actual score of how many answers are correct and could have a true zero.
What is the IV, variables or co-variables? IV: caffeine pills or no caffeine pills
What is the DV (if applicable)? DV: memory operationalised as a memory test
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional hypothesis (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Related T-test.
Aim: To explore the impact of caffeine consumption on memory performance.
Hypothesis: Caffeine consumption will lead to a decrease in memory test scores.
Suggest a suitable significance level.

12. It is hypothesised that high levels of testosterone increase risk-taking. As finger length indicates how much testosterone a male has been exposed to, the ratio between the second and fourth digits of male participants' fingers was measured. This was then measured against scores on a scale measuring risk-taking behaviour.

Is the research an experiment or a non-experiment? Non-experiment
What is the research method? Correlation
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval or ratio because the data will be an actual quantity and could have a true zero.
What is the IV, variables or co-variables? Variables: 1x1 Finger ratio and risk-taking behaviour, operationalised as risk taking questionnaire.
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional hypothesis (1-tailed): Positive
Test of difference, association or correlation? Test of correlation
What inferential test is needed? Pearson’s product
Aim: To examine the relationship between testosterone levels and risk-taking behaviour.
Hypothesis: Higher testosterone levels, as indicated by finger length ratio, will correlate with increased risk-taking behaviour.
Suggest a suitable significance level.

13. How many units of alcohol per week are consumed by males and females?

Is the research an experiment or a non-experiment? Experiment
What is the research method? Between-group quasi-experiment.
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval or ratio because the data will be an actual quantity and could have a true zero.
What is the IV, variables or co-variables? gender (M+F)
What is the DV (if applicable) Amount drunk
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? Test of difference
What inferential test is needed? Unrelated T-test.
Aim: To quantify weekly alcohol consumption among males and females.
Hypothesis: There will be a significant difference in the amount of alcohol consumed weekly between males and females.
Suggest a suitable significance level.

14. Pictures of married couples were taken. Female Participants are asked to rate the attractiveness of the male spouse, and male Participants are asked to rate the attractiveness of the female spouse. It was thought that couples would have similar levels of attractiveness.

Is the research an experiment or a non-experiment? Non-experiment
What is the research method? Correlation
If there is a design needed, what is it? N/A
What level of measurement is the data? Ordinal
What is the IV, variables or co-variables? variables: Male partner attractiveness and female partner attractiveness
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Positive correlation = Directional (1-tailed)
Test of difference, association or correlation? Test of correlation
What inferential test is needed? Spearman’s Rho
Aim: To assess perceived attractiveness levels within married couples.
Hypothesis: Spouses within married couples will have similar levels of perceived attractiveness.
Suggest a suitable significance level.

15. Female and male participants are asked to choose which female body shape they prefer, sizes 6, 8, 10, 12, 14, 16, 18, or 20. The sizes were exact.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Between-group quasi-experiment
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval or ratio because the data will be exact sizes.
What is the IV, variables or co-variables? Gender (M+F)
What is the DV (if applicable) Preferred body shape, operationalised as choosing a shape from a series of pictures
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To investigate preferences for female body shapes among participants.
Hypothesis: There will be a gender difference in the preferred female body shapes.
Suggest a suitable significance level.

16. Participants from Western and non-Western societies are asked to complete the Social Readjustment Rating Scale (SRRS - Holmes & Rahe) and calculate their scores.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Between-group quasi-experiment
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval – use of a scale.
What is the IV, variables or co-variables? Western or non-western participants
What is the DV (if applicable) DV stress operationalised by taking the SRRS
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To compare stress levels measured by the SRRS between participants from Western and non-Western societies.
Hypothesis: There will be a significant difference in stress levels between participants from Western and non-Western societies.
Suggest a suitable significance level.

17. School students were observed choosing snacks during break. The snack choices were either apples or crisps. The next morning in school assembly, the same students were given the nutritional value of apples and crisps. They were then observed again to see whether they chose apples or packets of crisps at break time.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Field experiment
If there is a design needed, what is it? Repeated Measures design
What level of measurement is the data? Nominal data because the variables, apples and crisps, cannot be ordered, and there is no mathematical gap between them.
What is the IV, variables or co-variables? Before giving nutritional advice and after
What is the DV (if applicable)? The choice of apples versus crisps
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? A test of difference
What inferential test is needed? The sign test.
Aim: To determine the effect of nutritional information on students' snack choices.
Hypothesis: Students will choose healthier snacks after receiving nutritional information.
Suggest a suitable significance level.

18. A researcher wanted to see if older siblings were more intelligent than younger siblings. All siblings completed an IQ test.

Is the research an experiment or a non-experiment? Non-Experiment
What is the research method? Correlation
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval data, as you can’t have a zero IQ
What is the IV, variables or co-variables? Birth order between siblings and their IQs.
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Negative correlation = directional (1-tailed)
Test of difference, association or correlation? Test of correlation
What inferential test is needed? Pearson’s product.
Aim: To compare IQ levels between older and younger siblings.
Hypothesis: Older siblings will have higher IQ scores than their younger siblings.
Suggest a suitable significance level.

19. Children who lived in homes without gardens and children who lived in homes with private gardens were observed to see whether they chose to play outside in the street or stay at home.

Is the research an experiment or a non-experiment? Non-experiment
What is the research method? Naturalistic covert observation
If there is a design needed, what is it? N/A
What level of measurement is the data? Nominal
What is the IV, variables or co-variables? Both sets of variables are nominal.
There are two sets of two variables here:
Homes with & without gardens.
Playing outside or staying at home.
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? Test of association.
What inferential test is needed? Chi-square.
Aim: To explore how a garden at home influences children's preference for playing outside.
Hypothesis: Children with access to a private garden are more likely to choose to play outside than those without.
Suggest a suitable significance level.

20. Participants are put in either a jogging or a non-jogging condition and then asked to rate pictures of the opposite sex on a scale of 1-10.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Independent group design
What level of measurement is the data? Ordinal data as ratings can be ordered, but gaps are not mathematical
What is the IV, variables or co-variables? IV: Jogging or not jogging
What is the DV (if applicable) DV: Results of rating photographs of the opposite sex
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? non-directional (2-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Mann-Whitney U test
Aim: To examine the impact of jogging on the perceived attractiveness of the opposite sex.
Hypothesis: Participants in a jogging condition will rate pictures of the opposite sex higher than those in a non-jogging condition.
Suggest a suitable significance level.

21. Research suggests that the antioxidants in foods such as blueberries can reduce age-related declines in cognitive functioning. To test this, a researcher selected 25 adults and administered a cognitive function test to each participant. The participants then drank a blueberry supplement daily for four months before they were tested again.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Lab experiment
If there is a design needed, what is it? Repeated measures
What level of measurement is the data? Interval (can you get zero on cognitive functioning?).
What is the IV, variables or co-variables? IV: before and after blueberry supplement
What is the DV (if applicable) DV: Result on cognitive function test.
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Related T-test
Aim: To assess the impact of blueberry supplement intake on cognitive function.
Hypothesis: Daily consumption of a blueberry supplement over four months will improve cognitive function.
Suggest a suitable significance level.

22. To examine the connection between alcohol consumption and birth weight, a researcher selects a sample of 20 pregnant rats and mixes alcohol with their food for two weeks before the pups are born. Another group of 20 pregnant rats, given no alcohol, is used for comparison.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Lab Experiment
If there is a design needed, what is it? Independent groups
What level of measurement is the data? Interval
What is the IV, variables or co-variables? IV: Alcohol mixed in female rat food for two weeks or not
What is the DV (if applicable)? DV: birth weight of subsequent female rat’s offspring
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of Difference
What inferential test is needed? Unrelated T-test
Aim: To investigate the effects of maternal alcohol consumption on the birth weight of rat pups.
Hypothesis: Rat pups born to mothers consuming alcohol will have a lower birth weight compared to those born to control mothers.
Suggest a suitable significance level.

23. To examine how texting affects driving skills, a researcher uses orange traffic cones to set up a driving circuit in a parking lot. A group of students is then tested on the circuit, once while receiving and sending text messages and once without receiving or sending messages. For each student, the researcher records the number of cones hit while driving each circuit.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Repeated measures
What level of measurement is the data? Ratio data
What is the IV, variables or co-variables? Receiving and sending text messages or not receiving and sending text messages
What is the DV (if applicable) Number of cones hit while driving each circuit
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tailed)
Test of difference, association or correlation? A test of difference.
What inferential test is needed? Related t test.
Aim: To evaluate how texting affects driving skills, as measured by the number of cones hit in a driving circuit.
Hypothesis: Drivers receiving and sending text messages will hit more cones than when not texting.
Suggest a suitable significance level.

24. The more people exercise, the lower their blood pressure.

Is the research an experiment or a non-experiment? Non-experiment
What is the research method? Correlation
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval/Ratio
What is the IV, variables or co-variables? Exercise frequency and blood pressure.
What is the DV (if applicable) N/A
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Negative correlation = directional (1-tailed)
Test of difference, association or correlation? Correlation.
What inferential test is needed? Pearson’s product.
Aim: To examine the relationship between exercise frequency and blood pressure levels.
Hypothesis: Increased exercise frequency will be associated with lower blood pressure levels.
Suggest a suitable significance level.

25. A statistics instructor thinks that doing homework improves scores on exams. To test this hypothesis, she randomly assigns students to two groups. One group must work on the homework until all problems are correct, while homework is optional for the second group. Exam grades are compared between the two groups at the end of the semester.

Is the research an experiment or a non-experiment? Experiment
What is the research method? Laboratory experiment
If there is a design needed, what is it? Independent groups
What level of measurement is the data? Ratio data, as you can get a zero in an exam.
What is the IV, variables or co-variables? IV: Having to work on homework until all problems are correct, or having the option of whether to do the homework
What is the DV (if applicable)? DV: End-of-term examination marks.
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Directional (1-tailed)
Test of difference, association or correlation? Test of difference
What inferential test is needed? Unrelated T-test.
Aim: To determine the effect of mandatory homework completion on final exam scores.
Hypothesis: Students required to complete all homework problems correctly will have higher final exam scores than those for whom homework is optional.
Suggest a suitable significance level.

26. Researchers are investigating the effect of age on memory. They want to see if older children have poorer memories than younger students, so they administer a memory test to both groups.

Is the research an experiment or a non-experiment? Non-experiment
What is the research method? Correlation
If there is a design needed, what is it? N/A
What level of measurement is the data? Interval data
What is the IV, variables or co-variables? IV: Age and results of a memory test
What is the DV (if applicable) DV Results from the memory test
What type of hypothesis is it: directional (1-tailed) or non-directional (2-tailed)? Non-directional (2-tail)
Test of difference, association or correlation? Test of correlation
What inferential test is needed? Pearson’s product.
Aim: To investigate memory performance differences between older children and younger students.
Hypothesis: Older children will exhibit poorer memory performance compared to younger students.
Suggest a suitable significance level.

CHOOSING AN INFERENTIAL TEST

You can now choose an inferential test and determine if the results are statistically significant. It might be helpful to think of inferential statistics as “chance calculating statistics”.

STEP 1:

What is the level of measurement of your data (e.g. nominal, ordinal, interval or ratio)?
Do you need a test of difference, association or correlation?
If you need a test of difference, what design are you using: Independent Group, Matched Pairs or Repeated Measures?
Is your data parametric or non-parametric?
Choose an inferential test.
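
The questions in Step 1 can be sketched as a small decision function. This is only an illustration in Python, not an official flowchart. The function name choose_test and its arguments are made up for this example. It assumes that interval or ratio data meets the parametric criteria, and that Chi-Square is used for nominal data collected from independent groups.

```python
# Hypothetical helper: maps the Step 1 answers to the name of a test.
RELATED_DESIGNS = {"repeated measures", "matched pairs"}


def choose_test(test_type, level, design=None):
    """Suggest an inferential test from the type of test, level of data and design."""
    test_type = test_type.lower()
    level = level.lower()

    if test_type == "association":
        # Tests of association work on nominal (category) data.
        return "Chi-Square"

    if test_type == "correlation":
        # Assumption: interval/ratio data is treated as parametric here.
        return "Pearson's r" if level in ("interval", "ratio") else "Spearman's rho"

    if test_type == "difference":
        if design is None:
            raise ValueError("A test of difference needs an experimental design")
        related = design.lower() in RELATED_DESIGNS
        if level == "nominal":
            # Assumption: unrelated nominal data is handled with Chi-Square.
            return "Sign test" if related else "Chi-Square"
        if level == "ordinal":
            return "Wilcoxon" if related else "Mann-Whitney U"
        return "Related t-test" if related else "Unrelated t-test"

    raise ValueError("test_type must be difference, association or correlation")


# Example: the jogging study (ordinal ratings, independent groups).
print(choose_test("difference", "ordinal", "independent groups"))
```

If your interval data turns out not to be parametric (see Step 4), you would drop down to the ordinal option instead.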

OBSERVED/CALCULATED VALUES:

STEP 2:

Conduct your inferential test.

Get the results of your inferential test.

It's crucial to recognise that in statistics, the outcome of an inferential test is not simply called "the result". Statisticians call it the "observed value" or the "calculated value". The term "calculated value" is straightforward: the result is obtained through a calculation. The term "observed value" is harder to remember, so I found it helpful to link the two with the phrase "I need to observe my calculated value". Both terms refer to the same thing, i.e., the actual result of your inferential test.

The observed or calculated value is written with a statistical test symbol before the observed/calculated value (result) so researchers know what test has been used.

For example:

Sign Test = S
Mann-Whitney U = U
Unrelated T Test = t
Wilcoxon Matched pairs = T
Related T Test = t
Spearman’s Rho = rho
Pearson’s Product = r
Chi-Square = χ²

The statistical test symbol and observed/calculated value would look something like this: U = 80

HOW DO YOU GET AN OBSERVED/CALCULATED VALUE?

Official definitions:

Observed Value – The number produced after a statistical test's various steps and calculations have been carried out.
Critical Value – A cut-off value, found in a table, that marks the boundary of the region where the test statistic is unlikely to lie if the null hypothesis is true.

To obtain an observed/calculated value in statistical analysis, you perform the specific inferential statistical test relevant to your research question. This involves collecting data, applying the chosen test's formula, and calculating the value based on your data. This calculated value is the "observed" or "calculated" result of the test, indicating the outcome of your statistical analysis.

You won't be tasked with calculating an observed value in many exams—and for good reason. The process of computing inferential statistics is complex and often relegated to software programs like Minitab or SPSS. The essence of psychology isn't rooted in mastering statistical equations; rather, it's about interpreting the meanings behind calculations and results.

Understanding the significance of your observed/calculated value is crucial in exams and research. This involves discerning whether your findings are merely coincidental or statistically significant. To do this, you compare your calculated value to a critical value—a benchmark number found in probability tables, which are readily available and pre-calculated for your convenience. But how do you determine which critical value to use? Before we delve into that, let's clarify what critical values entail.

CRITICAL TABLE VALUES

In hypothesis testing, a probability table acts as a critical tool, offering us a way to gauge whether the outcomes of our statistical tests—the calculated values—happen merely by chance. Embedded within this table are "critical values," pre-determined numbers that function as key markers or thresholds.

Calculated Value: This is the outcome of your statistical analysis, the numeric result derived from applying your test to your collected data.

Critical Value in a Probability Table: To assess the significance of your calculated value, you reference a critical value that aligns with your research hypothesis and statistical test. This critical value acts as a benchmark, determining the boundary between statistically significant results (implying a low likelihood of occurring by chance) and those that are not. It's like a marker in the sand that says, "Cross this line, and your findings are too unusual to be just chance."

The Comparison: When your calculated value crosses the critical value threshold, it signals that the phenomenon under investigation is unlikely to have occurred by chance. Whether "crossing" means being equal to or above the critical value, or equal to or below it, depends on the test you are using. Crossing the threshold allows you to reject the null hypothesis, which assumes no real effect or difference. It implies that the effect observed in your study is significant enough to be acknowledged as a real occurrence rather than a result of random chance.

Essentially, the critical value from the probability table is a tool for assessing how exceptional your calculated value is. If your results cross this predefined limit, it's a strong indicator that what you've found holds statistical significance, reinforcing the notion that the effect or difference you're exploring exists in the population under study.

To choose the right probability table and critical table value, you need to complete steps 3 and 4.

STEP 3

Note: All tests need STEP 1 & STEP 2 above.

Critical values are found in tables that differ according to the nature of your study, whether your hypothesis is one-tailed or two-tailed, the significance level you're using, and how many participants are in your study. To pinpoint the precise table and critical value applicable to your research, it's essential first to ascertain these particular elements of your study. Below are the options you need to consider:

Study Focus: Define whether your research is exploratory, confirming an existing theory, or testing a new intervention.
Hypothesis Direction: Decide if your hypothesis is one-tailed (predicting a specific direction of effect) or two-tailed (predicting an effect without specifying the direction).
Significance Level: Which significance level did you choose? Significance levels are commonly set at 0.05. The significance level also determines the type of critical value you choose.
Participant Count: Know the total number of participants, as different tests may be required for different sample sizes. In critical value tables, “N” stands for number of participants.

You can only choose one option from the following three choices:

When the table indicates a single "N", the study utilizes a single group of participants across all conditions, as seen in repeated measures designs and correlational studies. Notably, in matched pairs designs, participants are paired so that for analytical purposes, they're considered to represent one collective group. The single "N" therefore signifies that the data is derived from related groups, which are treated as a singular entity for the analysis. This principle also applies to interclass correlational designs, where data is typically collected from one coherent set of participants. Inferential tests designed for this type of data include correlation tests like Spearman’s rho and Pearson’s product-moment correlation for assessing relationships, and tests such as the Wilcoxon matched pairs test, related T-tests, and Sign tests for analyzing repeated measures and matched pairs designs. The selection of a specific test is guided by the data's level of measurement, ensuring the most appropriate analytical approach is applied.

When a table displays "N1 & N2," it signifies that the participant groups in the experimental setup are independent of each other, as seen in independent group designs or quasi-experimental designs where the groups themselves are the independent variables (IVs). In this context, "N1" represents one group, while "N2" denotes the other group, highlighting their separation and lack of relation. Inferential tests suitable for analyzing data between these distinct groups include the Mann-Whitney U test for ordinal or non-normally distributed data and the unrelated T-test for normally distributed data. The selection between these tests is determined by the level of measurement used in the research, ensuring the statistical analysis is appropriately matched to the nature of the data.
Degrees of freedom (df) are applied in statistical tests involving nominal data, typically outside experimental designs. Unlike in experiments and correlational studies where df might relate to participant numbers, in analyses of nominal data, degrees of freedom reflect the number of independent variables or categories under examination. For instance, in observational studies or content analyses where categories are pivotal, df helps determine the test's capacity to estimate variability within the data accurately. A Chi-Square test is used with df. Lastly, degrees of freedom are worked out with the following calculation: (the number of rows minus 1) multiplied by (the number of columns minus 1).
In simpler terms, degrees of freedom (df) is a statistical concept used to understand the flexibility or constraints in our data when we're analysing certain types of information, like categories or groups.
Imagine you're working with data with different categories, like colours of cars or types of fruits. Degrees of freedom tell us how many categories we can freely choose without being restricted by other factors.
For example, let's say we're looking at the colours of cars on the street, and we have three colours: red, blue, and green. If we know the total number of cars and how many there are of two of the colours, we can work out how many there are of the third colour, because the total constrains it. That's degrees of freedom at work: it tells us the amount of wiggle room we have within our data.
In practical terms, degrees of freedom help statisticians find the right critical value and judge how reliable the results are. They're like guardrails that keep our analysis on track, ensuring that we're making accurate conclusions based on the data we have.
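
The degrees of freedom calculation described above can be written as a short Python sketch. The function names are made up for this illustration. The first function follows the rows-and-columns rule given above; the second covers the single-row case from the car colours example, where df is the number of categories minus one.

```python
def chi_square_df(rows, columns):
    """Degrees of freedom for a table: (rows - 1) x (columns - 1)."""
    if rows < 2 or columns < 2:
        raise ValueError("A contingency table needs at least 2 rows and 2 columns")
    return (rows - 1) * (columns - 1)


def categories_df(categories):
    """Degrees of freedom for one row of categories: categories - 1."""
    if categories < 2:
        raise ValueError("At least 2 categories are needed")
    return categories - 1


# Garden study: 2 rows (garden / no garden) x 2 columns (outside / at home).
print(chi_square_df(2, 2))   # (2 - 1) x (2 - 1) = 1

# Car colours: red, blue and green. Two counts are free, the third is fixed.
print(categories_df(3))      # 3 - 1 = 2
```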

QUESTIONS:

What is meant by the term critical value?

How are degrees of freedom calculated?
For each test, find out whether df, N, or N1 and N2 is needed:

Sign Test
Mann-Whitney
Unrelated T Test
Wilcoxon Matched pairs
Related T Test
Spearman’s Rho
Pearson’s Product
Chi-Square

STEP 4

Is your data parametric or non-parametric?

In parametric data, the pattern typically follows a specific mathematical distribution, such as a normal (bell-shaped) distribution. This means the data points are evenly spread around a central value, forming a symmetrical curve when plotted on a graph. Parametric data has certain properties, like a mean (average) and standard deviation, which are useful for statistical analysis.

On the other hand, non-parametric data doesn't follow a particular mathematical distribution. It may have irregular patterns, outliers, or skewed distributions that don't fit neatly into standard statistical models. Non-parametric data is often analyzed using methods that don't rely on specific distributional assumptions, making them more flexible for different data types.

To determine if data is parametric, consider the following criteria:

Nature of the Data: Parametric data typically consists of continuous variables (e.g., height, weight, temperature) that can be measured on a scale. Categorical or ordinal data (e.g., gender, Likert scale responses) are usually non-parametric.
Measurement Scale: Parametric tests are suitable for interval or ratio scale data, where the intervals between values are equal and meaningful. Non-parametric tests are more appropriate for nominal or ordinal scale data, where the values represent categories or rankings without consistent intervals.
Distribution: Parametric data often follows a normal distribution, meaning the data points are symmetrically distributed around the mean. To assess whether the data is bell-shaped, visually inspect it using histograms or quantile-quantile plots (QQ plots).
Sample Size: Parametric tests are more robust with larger sample sizes. If the sample size is small (usually less than 30), opt for non-parametric tests to avoid parametric assumptions about the population distribution.
Homogeneity of Variance: Parametric tests assume that the variance (spread) of the data is consistent across groups or conditions. Statistical tests like Levene's test can be used to check for homogeneity of variance. If the variances are significantly different, it may indicate non-parametric data. "Homogeneity of variance" refers to the assumption that the variability, or spread, of scores within each group or condition being compared is approximately equal across all groups. In simpler terms, it means that the data's variation is similar across different groups or conditions.
For example, if you compare test scores between two different teaching methods (Group A and Group B), homogeneity of variance suggests that the spread of scores in Group A is roughly the same as in Group B.
Homogeneity of variance is an important assumption for many parametric statistical tests, such as the t-test and analysis of variance (ANOVA). Violations of this assumption can lead to inaccurate results and affect the validity of the statistical analysis.
Researchers typically test for homogeneity of variance using statistical tests like Levene's test. Suppose the test indicates that the variances are significantly different between groups. In that case, it suggests that the assumption of homogeneity of variance has been violated, and alternative statistical approaches may be necessary.
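
To make the idea behind Levene's test a little more concrete, here is a rough Python sketch of how its test statistic (W) can be worked out by hand. It is an illustration only: the function name levene_w is made up, it uses the version based on deviations from each group's mean (some software uses the median instead), and it does not produce a p-value. In practice the statistic would be compared with a critical value from an F table with (number of groups minus 1) and (total scores minus number of groups) degrees of freedom, or simply left to software.

```python
from statistics import mean


def levene_w(*groups):
    """Levene's W, using absolute deviations from each group's mean."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of scores

    # Step 1: how far is each score from its own group's mean?
    deviations = []
    for g in groups:
        centre = mean(g)
        deviations.append([abs(x - centre) for x in g])

    # Step 2: compare the average deviation of each group with the overall average.
    group_means = [mean(d) for d in deviations]
    grand_mean = mean([v for d in deviations for v in d])

    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for d, m in zip(deviations, group_means) for v in d)

    # Note: this fails if every score sits at the same distance from its mean.
    return (n_total - k) / (k - 1) * between / within


# Made-up scores: Group A and Group B have the same spread, Group C does not.
group_a = [1, 2, 3, 4, 5]
group_b = [11, 12, 13, 14, 15]
group_c = [0, 10, 20, 30, 40]

print(levene_w(group_a, group_b))   # same spread, so W is 0
print(levene_w(group_a, group_c))   # different spread, so W is larger
```

The larger W is, the stronger the evidence that the groups do not share the same variance.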

FINALLY!!!

Once all the correct information is applied, researchers can compare their observed value with a critical value.

In certain tests, such as the Mann-Whitney U and Wilcoxon matched pairs, the observed value must be smaller than the critical value to achieve statistical significance, indicating that the results are not due to chance.
Conversely, in other inferential tests like Spearman’s Rho and Chi-Square, the observed value needs to exceed the critical value to be considered statistically significant, demonstrating that the findings are not random.
As a general guideline, the sign test, Wilcoxon and Mann-Whitney U require the calculated value to be equal to or below the critical value, while the t-tests, the correlation tests and Chi-Square require the calculated value to equal or exceed the critical value. These rules are usually printed with the critical value tables, so there is no need to memorise them. But see below for the rule for each test.

INFERENTIAL TESTS FOR TEST OF DIFFERENCE

Sign test: The observed/calculated value is ‘S’. If the observed/calculated value of ‘S’ is equal to or less than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For repeated measures and matched pairs designs with nominal data.
Wilcoxon Matched Pairs Signed Rank test: The observed/calculated value is ‘T’. If the observed/calculated value of ‘T’ is equal to or less than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For repeated measures and matched pairs designs with ordinal data.
Mann-Whitney U test: The observed/calculated value is ‘U’. If the observed/calculated value of ‘U’ is equal to or less than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For independent group designs with ordinal data.
Related T-test: The observed/calculated value is ‘t’. If the observed/calculated value of ‘t’ is equal to or more than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For repeated measures and matched pairs designs with interval data.
Unrelated T-test: The observed/calculated value is ‘t’. If the observed/calculated value of ‘t’ is equal to or more than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For independent group designs with interval data.

INFERENTIAL TESTS FOR CORRELATIONS

Spearman’s Rho: The observed/calculated value is rho. If the observed/calculated value of ‘rho’ is equal to or more than the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For links between variables when data is ordinal.
Pearson’s product: The observed/calculated value is r. If the observed/calculated value of ‘r’ equals or exceeds the critical/table value, you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For links between variables when data is interval.

INFERENTIAL TESTS FOR TESTS OF ASSOCIATION

Chi-square: The observed/calculated value is χ². If the observed/calculated value of ‘χ²’ is equal to or more than the critical/table value, then you can reject your null hypothesis and accept your experimental hypothesis. Your result is significant. For frequencies or categories when data is nominal.
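
The comparison rules can be summarised in a short Python sketch. This is an illustration, not part of any exam board material: the function name is_significant is made up, and it assumes the usual convention that the sign test, Wilcoxon and Mann-Whitney U need a calculated value equal to or less than the critical value, while the t-tests, the correlation tests and Chi-Square need a calculated value equal to or more than it.

```python
# Tests where the calculated value must be EQUAL TO OR LESS THAN the critical value.
LOWER_IS_SIGNIFICANT = {"sign test", "wilcoxon", "mann-whitney u"}

# Tests where the calculated value must be EQUAL TO OR MORE THAN the critical value.
HIGHER_IS_SIGNIFICANT = {
    "related t-test",
    "unrelated t-test",
    "spearman's rho",
    "pearson's r",
    "chi-square",
}


def is_significant(test, calculated, critical):
    """Return True if the calculated value is significant for the named test."""
    name = test.lower()
    if name in LOWER_IS_SIGNIFICANT:
        return calculated <= critical
    if name in HIGHER_IS_SIGNIFICANT:
        # The sign of t, rho or r is ignored here, so for a one-tailed
        # hypothesis you must check the direction of the result separately.
        return abs(calculated) >= critical
    raise ValueError(f"Unknown test: {test}")


# Values from the Wilcoxon examination example in this guide:
# calculated T = 53, critical value = 60.
print(is_significant("Wilcoxon", 53, 60))   # True, because 53 is less than 60
```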

LET’S LOOK AT AN EXAMPLE EXAMINATION QUESTION

“A psychologist was interested in the effects of a restricted diet on memory functioning, and he expected memory to become impaired. The psychologist hypothesised that participants’ scores on a memory test are lower after a restricted diet than before a restricted diet. He gave the volunteers a memory test when they first arrived in the research unit and a similar test at the end of the four weeks. He recorded the memory scores on both tests and analysed them using the Wilcoxon signed ranks test.” The test was out of 100

The psychologist set the significance level at 5%.

The calculated value was T = 53.
N= 20

Q1: State whether the hypothesis for this study is directional or non-directional. (1 mark)

Q2: Using Table 1, state whether the psychologist’s result was significant. (1 mark)

Q3: Explain your answer. (2 marks).

Q4: Name a statistical test appropriate for this investigation and give three reasons why it was appropriate to use this statistical test. (4 marks)

ANSWERS

Q1. Directional/one-tailed: the psychologist predicted the direction of the difference, i.e., that participants’ memory scores would be lower after the restricted diet than before it. One mark for the correct answer: directional (one-tailed is acceptable).

Q2. Yes, the psychologist’s result was significant. One mark for correctly stating that the result is significant.

Q3. The critical value of T for N = 20 for a one-tailed test where p ≤ 0.05 is 60. As the observed/calculated value of T (53) is less than the critical/table value, the likelihood of my results occurring by chance is 5% or less (p ≤ 0.05). Therefore, I can reject my null hypothesis and accept my experimental hypothesis. (Two marks)

Two further marks for an explanation: the calculated value of T = 53 is less than the critical value of 60 where N = 20 and p ≤ 0.05 for a one-tailed test.
If the candidate states that the result is not significant, no marks can be awarded.

Q4. A Wilcoxon matched pairs test was chosen because a test of difference was needed, the experimental design was repeated measures, and the level of data was treated as ordinal: scores on a memory test can be placed in rank order, but the gaps between scores cannot be assumed to be equal.

To score full marks on a question like Q4 above, it's essential to cover the following key elements:

The name of the test
The type of test, such as difference, association, or correlation
The experimental design, if applicable
The level of measurement
The rationale for the chosen level of measurement
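
The elements above (type of test, experimental design and level of measurement) are enough to pick a test. The sketch below is one hedged way to express that choice in Python. The table follows the decision chart commonly taught at this level, which is an assumption of this example, and the names `TEST_TABLE` and `choose_test` are illustrative.

```python
# Decision table as commonly taught at this level (an assumption of this sketch).
# Keys: (type of test, design, level of measurement).
TEST_TABLE = {
    ("difference", "unrelated", "nominal"): "Chi-squared",
    ("difference", "unrelated", "ordinal"): "Mann-Whitney",
    ("difference", "unrelated", "interval"): "Unrelated t-test",
    ("difference", "related", "nominal"): "Sign test",
    ("difference", "related", "ordinal"): "Wilcoxon",
    ("difference", "related", "interval"): "Related t-test",
    ("correlation", None, "ordinal"): "Spearman's rho",
    ("correlation", None, "interval"): "Pearson's r",
    ("association", None, "nominal"): "Chi-squared",
}

# Repeated measures and matched pairs are both "related" designs.
DESIGNS = {
    "independent groups": "unrelated",
    "repeated measures": "related",
    "matched pairs": "related",
}


def choose_test(test_type, level, design=None):
    """Look up a test from the type of test, level of data and design."""
    key = (test_type, DESIGNS.get(design), level)
    if key not in TEST_TABLE:
        raise ValueError(f"no test in the table for {key}")
    return TEST_TABLE[key]


print(choose_test("difference", "interval", "independent groups"))  # Unrelated t-test
print(choose_test("correlation", "ordinal"))                        # Spearman's rho
```

Writing the three reasons as the lookup key mirrors what the examiner wants to see: each part of the key is one of the reasons you give for your choice.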

CRITICAL TABLE VALUES

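FOR SPEARMAN’S RHO, PEARSON’S R, CHI-SQUARED AND THE T-TESTS TO BE STATISTICALLY SIGNIFICANT, THE CALCULATED VALUE MUST BE EQUAL TO OR GREATER THAN THE CRITICAL VALUE. FOR THE SIGN TEST, WILCOXON AND MANN-WHITNEY, IT MUST BE EQUAL TO OR LESS THAN THE CRITICAL VALUE.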

QUESTIONS ON THE OBSERVED AND CRITICAL VALUE

Using your tables of critical values, decide whether the alternative/experimental hypothesis should be accepted or rejected, and explain why. For the first one, complete the gaps to understand how to write up your answers.

You can use the critical values tables above.

Rho = 0.410 for a one-tailed test with a sample size of 20.
Rho = 0.5 for a two-tailed test with a sample size of 10.
Chi-square (χ²) = 3.24 for a two-tailed test with a 2x2 contingency table.
Chi-square (χ²) = 5.00 for a one-tailed test with a 3x2 contingency table.
U = 16 for a one-tailed test with 9 participants in one group and 8 in the other group.
U = 76 for a two-tailed test with 30 participants equally split between two conditions.
T = 54 for a one-tailed test with 25 participants.
T = 105 for a two-tailed test with 20 participants.
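
Chi-square tables are indexed by degrees of freedom rather than by table size, so the two chi-square questions above need converting before you can look up a critical value. A minimal sketch in Python, using the formula df = (rows – 1) × (columns – 1); the function name is illustrative:

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom for a contingency table: (rows - 1) x (columns - 1)."""
    if rows < 2 or columns < 2:
        raise ValueError("a contingency table needs at least 2 rows and 2 columns")
    return (rows - 1) * (columns - 1)


print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(3, 2))  # 2
```

So a 2x2 table is looked up on the df = 1 row and a 3x2 table on the df = 2 row.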

Below is how to set out the inferential results section in a research paper:

The critical value of T for N = 20 for a one-tailed test where p ≤ 0.05 is 60. As the observed value of T (29.5) is less than the critical/table value, the likelihood of my results occurring by chance is 5% or less (p ≤ 0.05). Therefore, I can reject my null hypothesis and accept my experimental hypothesis.

FOR THE FOLLOWING QUESTIONS, PLEASE FILL IN THE FOLLOWING TEMPLATE.

At the ------ level of significance, the critical/table value for a ------ tailed test, when ------ = ------, is ------. Since the observed value of ------ is ------, which is ------ than the critical/table value, the ------ hypothesis can be ------ and the ------ hypothesis can be ------.

Wilcoxon Matched Pairs, n = 12, directional hypothesis at p ≤ 0.05, T = 27
Mann Whitney U test, non-directional hypothesis at p ≤ 0.10, N1 = 17 and N2 = 15, U = 54
Chi-Squared, non-directional hypothesis at p ≤ 0.05, df = 10, χ² = 22.42
Spearman’s Rho, n = 25, non-directional hypothesis at p ≤ 0.10, r = 0.511
Mann Whitney U test, directional hypothesis at p ≤ 0.05, N1 = 16 and N2 = 19, U = 97
Chi-Squared, non-directional hypothesis at p ≤ 0.10, df = 36, χ² = 50
Spearman’s Rho, n = 11, directional hypothesis at p ≤ 0.05, r = 0.421
Mann Whitney U test, directional hypothesis at p ≤ 0.05, N1 = 20 and N2 = 20, U = 136
Chi-Squared, non-directional hypothesis at p ≤ 0.10, df = 27, χ² = 45.78
Spearman’s Rho, n = 19, directional hypothesis at p ≤ 0.05, r = 3.9
Mann Whitney U test, non-directional hypothesis at p ≤ 0.10, N1 = 12 and N2 = 13, U = 39
Chi-Squared, non-directional hypothesis at p ≤ 0.05, df = 14, χ² = 18.17
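
The template can also be thought of as a sentence with slots to fill. The sketch below is a hedged illustration in Python, not a required method: the function name `write_up` and its parameters are invented for this example, and it assumes you have already looked up the critical value yourself. The example call reuses the worked write-up values given earlier (T = 29.5, N = 20, critical value 60).

```python
def write_up(level, tails, n_label, n_value, critical,
             statistic, observed, lower_is_significant):
    """Fill in the write-up template and return it as one string."""
    # Sign test, Wilcoxon, Mann-Whitney: observed must be <= critical.
    # Spearman's rho, Pearson's r, chi-squared, t-tests: observed must be >= critical.
    if lower_is_significant:
        significant = observed <= critical
    else:
        significant = observed >= critical

    if observed < critical:
        comparison = "less than"
    elif observed > critical:
        comparison = "more than"
    else:
        comparison = "equal to"

    experimental = "accepted" if significant else "rejected"
    null = "rejected" if significant else "accepted"

    return (
        f"At the {level} level of significance, the critical/table value "
        f"for a {tails}-tailed test, when {n_label} = {n_value}, is {critical}. "
        f"Since the observed value of {statistic} is {observed}, which is "
        f"{comparison} the critical/table value, the experimental hypothesis "
        f"can be {experimental} and the null hypothesis can be {null}."
    )


print(write_up("5%", "one", "N", 20, 60, "T", 29.5, lower_is_significant=True))
```

The `lower_is_significant` flag is the part to get right for each question: set it to True for Wilcoxon and Mann-Whitney, and to False for Spearman's rho and chi-squared.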

“A researcher was interested in finding out if watching TV affected creativity. The researcher gave 50 participants a creativity test and asked them to give up TV for six weeks. The participants were then tested again on another creativity test.” The researcher used a statistical test to determine whether there was a significant difference between the TV-watching and non-TV-watching conditions.

Name a statistical test that is appropriate for this investigation and give three reasons why it was appropriate to use this statistical test (4 marks)

ANSWERS

What is meant by the term critical value? A critical value is a threshold determined from a statistical table that defines the boundary beyond which the observed value of a test statistic is considered statistically significant. It helps researchers determine whether to reject or accept the null hypothesis in hypothesis testing.
How are degrees of freedom calculated? df = (rows – 1) × (columns – 1)
For each test, determine if (df) or N or N1 and N2 are needed.
Sign Test = N
Mann-Whitney = N1 & N2
Unrelated T Test = N1 & N2
Wilcoxon Matched pairs = N
Related T Test = N
Spearman’s Rho = N
Pearson’s Product = N
Chi-Square = df
Rho = 0.410 for a one-tailed test where N = 20: Statistically significant
Rho = 0.5 for a two-tailed test where N = 10: Not statistically significant
χ² = 3.24 for a two-tailed test with a 2x2 contingency table: Not statistically significant
χ² = 5.00 for a one-tailed test with a 3x2 contingency table: Statistically significant
U = 16 for a one-tailed test with 9 PPs in one group and 8 in the other group: Statistically significant
U = 76 for a two-tailed test with 30 PPs split equally between the 2 conditions: Not statistically significant
T = 54 for a one-tailed test with 25 PPs: Statistically significant
T = 105 for a two-tailed test with 20 PPs: Not statistically significant
Wilcoxon Matched Pairs, n = 12, directional hypothesis at p ≤ 0.05, T = 27
At the 5% level of significance, the critical/table value for a 1-tailed test, when N = 12, is 17. Since the observed/calculated value of T is 27, which is more than the critical value, the experimental hypothesis can be rejected, and the null hypothesis can be accepted.
Mann Whitney U test, non-directional hypothesis at p ≤ 0.10, N1 = 17 and N2 = 15, U = 54
At the 10% level of significance, the critical/table value for a 2-tailed test, when N1 = 17 and N2 = 15, is 83. Since the observed/calculated value of U is 54, which is less than the critical value, the experimental hypothesis can be accepted, and the null hypothesis can be rejected.
Chi-Squared, non-directional hypothesis at p ≤ 0.05, df = 10, χ² = 22.42
At the 5% level of significance, the critical/table value for a 2-tailed test, when df = 10, is 18.31. Since the observed/calculated value of χ² is 22.42, which is more than the critical value, the alternative hypothesis can be accepted, and the null hypothesis can be rejected.
Spearman’s Rho, n = 25, non-directional hypothesis at p ≤ 0.10, r = 0.511
At the 10% level of significance, the critical/table value for a 2-tailed test, when N = 25, is 0.337. Since the observed/calculated value of r is 0.511, which is more than the critical value, the alternative hypothesis can be accepted, and the null hypothesis can be rejected.
Mann Whitney U test, directional hypothesis at p ≤ 0.05, N1 = 16 and N2 = 19, U = 97
At the 5% level of significance, the critical/table value for a 1-tailed test, when N1 = 16 and N2 = 19, is 101. Since the observed/calculated value of U is 97, which is less than the critical value, the experimental hypothesis can be accepted, and the null hypothesis can be rejected.
Chi-Squared, non-directional hypothesis at p ≤ 0.10, df = 36, χ² = 50
At the 10% level of significance, the critical/table value for a 2-tailed test, when df = 36, is 47.21. Since the observed/calculated value of χ² is 50, which is more than the critical value, the alternative hypothesis can be accepted, and the null hypothesis can be rejected.
Spearman’s Rho, n = 11, directional hypothesis at p ≤ 0.05, r = 0.421
At the 5% level of significance, the critical/table value for a 1-tailed test, when N = 11, is 0.536. Since the observed/calculated value of r is 0.421, which is less than the critical value, the alternative hypothesis can be rejected, and the null hypothesis can be accepted.
Mann Whitney U test, directional hypothesis at p ≤ 0.05, N1 = 20 and N2 = 20, U = 136
At the 5% level of significance, the critical/table value for a 1-tailed test, when N1 = 20 and N2 = 20, is 138. Since the observed/calculated value of U is 136, which is less than the critical value, the experimental hypothesis can be accepted, and the null hypothesis can be rejected.
Chi-Squared, non-directional hypothesis at p ≤ 0.10, df = 27, χ² = 45.78
At the 10% level of significance, the critical/table value for a 2-tailed test, when df = 27, is 36.74. Since the observed/calculated value of χ² is 45.78, which is more than the critical value, the alternative hypothesis can be accepted, and the null hypothesis can be rejected.
Spearman’s Rho, n = 19, directional hypothesis at p ≤ 0.05, r = 3.9
At the 5% level of significance, the critical/table value for a 1-tailed test, when N = 19, is 0.391. Since the observed/calculated value of r is 3.9, which is more than the critical value, the alternative hypothesis can be accepted, and the null hypothesis can be rejected.
Mann Whitney U test, non-directional hypothesis at p ≤ 0.10, N1 = 12 and N2 = 13, U = 39
At the 10% level of significance, the critical/table value for a 2-tailed test, when N1 = 12 and N2 = 13, is 47. Since the observed/calculated value of U is 39, which is less than the critical value, the experimental hypothesis can be accepted, and the null hypothesis can be rejected.
Chi-Squared, non-directional hypothesis at p ≤ 0.05, df = 14, χ² = 18.17
At the 5% level of significance, the critical/table value for a 2-tailed test, when df = 14, is 23.68. Since the observed/calculated value of χ² is 18.17, which is less than the critical value, the alternative hypothesis can be rejected, and the null hypothesis can be accepted.
Below is how to write it in a report/research results section.
The critical value of T for N = 20 for a one-tailed test where p ≤ 0.05 is 60. As the observed value of T (29.5) is less than the critical/table value, the likelihood of my results occurring by chance is 5% or less (p ≤ 0.05). Therefore, I can reject my null hypothesis and accept my experimental hypothesis.
Name a statistical test that is appropriate for this investigation and give three reasons why it was appropriate to use this statistical test (4 marks)

The researcher should have chosen a Wilcoxon matched pairs test because a test of difference was needed. One mark.
The level of measurement was at least ordinal (e.g., scores on a creativity test can be placed in rank order, but the gaps between scores cannot be assumed to be equal). Two marks.
The design was repeated measures. One mark.