2.4 Relative Position of Data

Learning Objectives

  1. To learn the concept of the relative position of an element of a data set.
  2. To learn the meaning of each of two measures, the percentile rank and the z-score, of the relative position of a measurement and how to compute each one.
  3. To learn the meaning of the three quartiles associated to a data set and how to compute them.
  4. To learn the meaning of the five-number summary of a data set, how to construct the box plot associated to it, and how to interpret the box plot.

When you take an exam, what is often as important as your actual score on the exam is the way your score compares to other students’ performance. If you made a 70 but the average score (whether the mean, median, or mode) was 85, you did relatively poorly. If you made a 70 but the average score was only 55 then you did relatively well. In general, the significance of one observed value in a data set strongly depends on how that value compares to the other observed values in a data set. Therefore we wish to attach to each observed value a number that measures its relative position.

Percentiles and Quartiles

Anyone who has taken a national standardized test is familiar with the idea of being given both a score on the exam and a “percentile ranking” of that score. You may be told that your score was 625 and that it is the 85th percentile. The first number tells how you actually did on the exam; the second says that 85% of the scores on the exam were less than or equal to your score, 625.

Definition

Given an observed value x in a data set, x is the Pth percentile of the data if the percentage of the data that are less than or equal to x is P. The number P is the percentile rank of x.

Example 13

What percentile is the value 1.39 in the data set of ten GPAs considered in Note 2.12 "Example 3" in Section 2.2 "Measures of Central Location"? What percentile is the value 3.33?

Solution:

The data written in increasing order are

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  4.00

The only data value that is less than or equal to 1.39 is 1.39 itself. Since 1 is 1∕10 = .10 or 10% of 10, the value 1.39 is the 10th percentile. Eight data values are less than or equal to 3.33. Since 8 is 8∕10 = .80 or 80% of 10, the value 3.33 is the 80th percentile.
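The counting in this solution can be sketched in a few lines of Python. The function name `percentile_rank` is our own choice for illustration, not a library routine; it simply applies the definition above (the percentage of observations less than or equal to x).

```python
def percentile_rank(data, x):
    """Percentage of the observations in `data` that are less than or equal to x."""
    count = sum(1 for value in data if value <= x)
    return 100 * count / len(data)

# The ten GPAs from this example (order does not matter for counting).
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

print(percentile_rank(gpas, 1.39))  # 10.0
print(percentile_rank(gpas, 3.33))  # 80.0
```

Note that the data need not be sorted first; sorting only makes the count easier to do by hand.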

The Pth percentile cuts the data set in two so that approximately P% of the data lie below it and (100 − P)% of the data lie above it. In particular, the three percentiles that cut the data into fourths, as shown in Figure 2.12 "Data Division by Quartiles", are called the quartiles. The following simple computational definition of the three quartiles works well in practice.

Figure 2.12 Data Division by Quartiles

Definition

For any data set:

  1. The second quartile Q2 of the data set is its median.
  2. Define two subsets:
    1. the lower set: all observations that are strictly less than Q2;
    2. the upper set: all observations that are strictly greater than Q2.
  3. The first quartile Q1 of the data set is the median of the lower set.
  4. The third quartile Q3 of the data set is the median of the upper set.

Example 14

Find the quartiles of the data set of GPAs of Note 2.12 "Example 3" in Section 2.2 "Measures of Central Location".

Solution:

As in the previous example we first list the data in numerical order:

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  4.00

This data set has n = 10 observations. Since 10 is an even number, the median is the mean of the two middle observations: $\tilde{x} = (2.53 + 2.71)/2 = 2.62$. Thus the second quartile is Q2 = 2.62. The lower and upper subsets are

Lower: L = {1.39, 1.76, 1.90, 2.12, 2.53}
Upper: U = {2.71, 3.00, 3.33, 3.71, 4.00}

Each has an odd number of elements, so the median of each is its middle observation. Thus the first quartile is Q1=1.90, the median of L, and the third quartile is Q3=3.33, the median of U.

Example 15

Adjoin the observation 3.88 to the data set of the previous example and find the quartiles of the new set of data.

Solution:

As in the previous example we first list the data in numerical order:

1.39  1.76  1.90  2.12  2.53  2.71  3.00  3.33  3.71  3.88  4.00

This data set has 11 observations. The second quartile is its median, the middle value 2.71. Thus Q2=2.71. The lower and upper subsets are now

Lower: L = {1.39, 1.76, 1.90, 2.12, 2.53}
Upper: U = {3.00, 3.33, 3.71, 3.88, 4.00}

The lower set L has median the middle value 1.90, so Q1=1.90. The upper set has median the middle value 3.71, so Q3=3.71.
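The computational definition of the quartiles translates almost directly into code. The following is a minimal sketch, assuming Python's standard `statistics.median`; the function name `quartiles` is ours. It follows this section's definition (lower and upper sets strictly below and above Q2), which can differ from the quartile conventions used by spreadsheet or statistics packages.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the lower-set/upper-set definition of this section."""
    q2 = median(data)
    lower = [v for v in data if v < q2]  # observations strictly less than Q2
    upper = [v for v in data if v > q2]  # observations strictly greater than Q2
    return median(lower), q2, median(upper)

# Ten GPAs (even number of observations), then the same data with 3.88 adjoined.
gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
print(tuple(round(q, 2) for q in quartiles(gpas)))           # (1.9, 2.62, 3.33)
print(tuple(round(q, 2) for q in quartiles(gpas + [3.88])))  # (1.9, 2.71, 3.71)
```

The sketch does not guard against degenerate data: if every observation equals the median, the lower and upper sets are empty and `median` raises an error.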

In addition to the three quartiles, the two extreme values, the minimum xmin and the maximum xmax, are also useful in describing the entire data set. Together these five numbers are called the five-number summary of the data set:

{xmin,Q1,Q2,Q3,xmax}

The five-number summary is used to construct a box plot as in Figure 2.13 "The Box Plot". Each of the five numbers is represented by a vertical line segment, a box is formed using the line segments at Q1 and Q3 as its two vertical sides, and two horizontal line segments are extended from the vertical segments marking Q1 and Q3 to the adjacent extreme values. (The two horizontal line segments are referred to as “whiskers,” and the diagram is sometimes called a “box and whisker plot.”) We caution the reader that there are other types of box plots that differ somewhat from the ones we are constructing, although all are based on the three quartiles.

Figure 2.13 The Box Plot

Note that the distance from Q1 to Q3 is the length of the interval over which the middle half of the data range. Thus it has the following special name.

Definition

The interquartile range (IQR) is the quantity

$$\mathrm{IQR} = Q_3 - Q_1$$

Example 16

Construct a box plot and find the IQR for the data in Note 2.44 "Example 14".

Solution:

From our work in Note 2.44 "Example 14" we know that the five-number summary is

xmin = 1.39    Q1 = 1.90    Q2 = 2.62    Q3 = 3.33    xmax = 4.00

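The box plot (figure not reproduced here) has a box extending from Q1 = 1.90 to Q3 = 3.33, a vertical segment inside the box at Q2 = 2.62, and whiskers reaching out to xmin = 1.39 and xmax = 4.00.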

The interquartile range is $\mathrm{IQR} = 3.33 - 1.90 = 1.43$.
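As a rough check of these numbers, the five-number summary and the IQR can be computed together. This is only a sketch under the quartile definition used in this section; the helper name `five_number_summary` is hypothetical, and plotting libraries generally use other quartile conventions, so their box plots may differ slightly.

```python
from statistics import median

def five_number_summary(data):
    """Return (xmin, Q1, Q2, Q3, xmax) with quartiles defined as in this section."""
    q2 = median(data)
    q1 = median([v for v in data if v < q2])  # median of the lower set
    q3 = median([v for v in data if v > q2])  # median of the upper set
    return min(data), q1, q2, q3, max(data)

gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
xmin, q1, q2, q3, xmax = five_number_summary(gpas)
iqr = q3 - q1  # length of the interval covered by the middle half of the data

print(xmin, q1, round(q2, 2), q3, xmax)  # 1.39 1.9 2.62 3.33 4.0
print(round(iqr, 2))                     # 1.43
```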

z-scores

Another way to locate a particular observation x in a data set is to compute its distance from the mean in units of standard deviation.

Definition

The z-score of an observation x is the number z given by the computational formula

$$z = \frac{x - \bar{x}}{s} \quad \text{or} \quad z = \frac{x - \mu}{\sigma}$$

according to whether the data set is a sample or is the entire population.

The formulas in the definition allow us to compute the z-score when x is known. If the z-score is known then x can be recovered using the corresponding inverse formulas

$$x = \bar{x} + sz \quad \text{or} \quad x = \mu + \sigma z$$

The z-score indicates how many standard deviations an individual observation x is from the center of the data set, its mean. If z is negative then x is below average. If z is 0 then x is equal to the average. If z is positive then x is above average. See Figure 2.14.

Figure 2.14 x-Scale versus z-Score

Example 17

Find the z-scores for all ten observations in the GPA sample data in Note 2.12 "Example 3" in Section 2.2 "Measures of Central Location".

1.90  3.00  2.53  3.71  2.12  1.76  2.71  1.39  4.00  3.33

Solution:

For these data $\bar{x} = 2.645$ and s = 0.8674. The first observation x = 1.9 in the data set has z-score

$$z = \frac{x - \bar{x}}{s} = \frac{1.9 - 2.645}{0.8674} = -0.8589$$

which means that x = 1.90 is 0.8589 standard deviations below the sample mean. The second observation x = 3.00 has z-score

$$z = \frac{x - \bar{x}}{s} = \frac{3.00 - 2.645}{0.8674} = 0.4093$$

which means that x = 3.00 is 0.4093 standard deviations above the sample mean. Repeating the process for the remaining observations gives the full set of z-scores

−0.86  0.41  −0.13  1.23  −0.61  −1.02  0.07  −1.45  1.56  0.79
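The same computation can be sketched in Python, assuming the standard library's `statistics.mean` and `statistics.stdev` (the latter is the sample standard deviation, which is what s denotes here). The variable names are ours.

```python
from statistics import mean, stdev

# GPA sample in its original order.
gpas = [1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33]

x_bar = mean(gpas)  # sample mean
s = stdev(gpas)     # sample standard deviation (divides by n - 1)

# z-score of each observation: distance from the mean in units of s.
z_scores = [(x - x_bar) / s for x in gpas]

print(round(x_bar, 3), round(s, 4))  # 2.645 0.8674
print([round(z, 2) for z in z_scores])
# [-0.86, 0.41, -0.13, 1.23, -0.61, -1.02, 0.07, -1.45, 1.56, 0.79]
```

For a data set that is an entire population, `statistics.pstdev` would play the role of σ instead.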

Example 18

Suppose the mean and standard deviation of the GPAs of all currently registered students at a college are μ = 2.70 and σ = 0.50. The z-scores of the GPAs of two students, Antonio and Beatrice, are z = −0.62 and z = 1.28, respectively. What are their GPAs?

Solution:

Using the second formula right after the definition of z-scores we compute the GPAs as

Antonio: $x = \mu + z\sigma = 2.70 + (-0.62)(0.50) = 2.39$

Beatrice: $x = \mu + z\sigma = 2.70 + (1.28)(0.50) = 3.34$
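The inverse formula is a one-line computation. The sketch below uses a hypothetical helper `value_from_z` and takes Antonio's z-score to be −0.62, the value consistent with the computed GPA of 2.39.

```python
def value_from_z(z, center, spread):
    """Recover an observation from its z-score: x = center + z * spread."""
    return center + z * spread

mu, sigma = 2.70, 0.50  # population mean and standard deviation of the GPAs

antonio = value_from_z(-0.62, mu, sigma)
beatrice = value_from_z(1.28, mu, sigma)

print(round(antonio, 2), round(beatrice, 2))  # 2.39 3.34
```

The same function serves for a sample by passing the sample mean and sample standard deviation as `center` and `spread`.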

Key Takeaways

  • The percentile rank and z-score of a measurement indicate its relative position with regard to the other measurements in a data set.
  • The three quartiles divide a data set into fourths.
  • The five-number summary and its associated box plot summarize the location and distribution of the data.

Exercises

    Basic

  1. Consider the data set

    69  92  68  77  80  93  75  76  82  100  70  85  88  85  96  53  70  70  82  85
    1. Find the percentile rank of 82.
    2. Find the percentile rank of 68.
  2. Consider the data set

    8.5  8.2  7.0  7.0  4.9  9.6  8.5  8.8  8.5  8.7  6.5  8.2  7.6  1.5  9.3  8.0  7.7  2.9  9.2  6.9
    1. Find the percentile rank of 6.5.
    2. Find the percentile rank of 7.7.
  3. Consider the data set represented by the ordered stem and leaf diagram

    10 | 0 0
     9 | 1 1 1 1 2 3
     8 | 0 1 1 2 2 3 4 5 7 8 8 9
     7 | 0 0 0 1 1 2 4 4 5 6 6 6 7 7 7 8 8 9
     6 | 0 1 2 2 2 3 4 4 5 7 7 7 7 8 8
     5 | 0 2 3 3 4 4 6 7 7 8 9
     4 | 2 5 6 8 8
     3 | 9 9
    1. Find the percentile rank of the grade 75.
    2. Find the percentile rank of the grade 57.
  4. Is the 90th percentile of a data set always equal to 90%? Why or why not?

  5. The 29th percentile in a large data set is 5.

    1. Approximately what percentage of the observations are less than 5?
    2. Approximately what percentage of the observations are greater than 5?
  6. The 54th percentile in a large data set is 98.6.

    1. Approximately what percentage of the observations are less than 98.6?
    2. Approximately what percentage of the observations are greater than 98.6?
  7. In a large data set the 29th percentile is 5 and the 79th percentile is 10. Approximately what percentage of observations lie between 5 and 10?

  8. In a large data set the 40th percentile is 125 and the 82nd percentile is 158. Approximately what percentage of observations lie between 125 and 158?

  9. Find the five-number summary and the IQR and sketch the box plot for the sample represented by the stem and leaf diagram in Figure 2.2 "Ordered Stem and Leaf Diagram".

  10. Find the five-number summary and the IQR and sketch the box plot for the sample explicitly displayed in Note 2.20 "Example 7" in Section 2.2 "Measures of Central Location".

  11. Find the five-number summary and the IQR and sketch the box plot for the sample represented by the data frequency table

    x | 1  2  5  8  9
    f | 5  2  3  6  4
  12. Find the five-number summary and the IQR and sketch the box plot for the sample represented by the data frequency table

    x | −5  −3  −2  −1   0   1   3   4   5
    f |  2   1   3   2   4   1   1   2   1
  13. Find the z-score of each measurement in the following sample data set.

    −5  6  2  −1  0
  14. Find the z-score of each measurement in the following sample data set.

    1.6  5.2  2.8  3.7  4.0
  15. The sample with data frequency table

    x | 1  2  7
    f | 1  2  1

    has mean $\bar{x} = 3$ and standard deviation s ≈ 2.71. Find the z-score for every value in the sample.

  16. The sample with data frequency table

    x | −1  0  1  4
    f |  1  1  3  1

    has mean $\bar{x} = 1$ and standard deviation s ≈ 1.67. Find the z-score for every value in the sample.

  17. For the population

    0  0  2  2

    compute each of the following.

    1. The population mean μ.
    2. The population variance $\sigma^2$.
    3. The population standard deviation σ.
    4. The z-score for every value in the population data set.
  18. For the population

    0.5  2.1  4.4  1.0

    compute each of the following.

    1. The population mean μ.
    2. The population variance $\sigma^2$.
    3. The population standard deviation σ.
    4. The z-score for every value in the population data set.
  19. A measurement x in a sample with mean $\bar{x} = 10$ and standard deviation s = 3 has z-score z = 2. Find x.

  20. A measurement x in a sample with mean $\bar{x} = 10$ and standard deviation s = 3 has z-score z = −1. Find x.

  21. A measurement x in a population with mean μ = 2.3 and standard deviation σ = 1.3 has z-score z = 2. Find x.

  22. A measurement x in a population with mean μ = 2.3 and standard deviation σ = 1.3 has z-score z = −1.2. Find x.

    Applications

  1. The weekly sales for the last 20 weeks in a kitchen appliance store for an electric automatic rice cooker are

    20  15  14  14  18  15  19  12  13  9  15  17  16  16  18  19  15  15  16  15
    1. Find the percentile rank of 15.
    2. If the sample accurately reflects the population, then what percentage of weeks would an inventory of 15 rice cookers be adequate?
  2. The table shows the number of vehicles owned in a survey of 52 households.

    x | 0   1   2   3   4   5   6   7
    f | 2  12  15  11   6   3   1   2
    1. Find the percentile rank of 2.
    2. If the sample accurately reflects the population, then what percentage of households have at most two vehicles?
  3. For two months Cordelia records her daily commute time to work each day to the nearest minute and obtains the following data:

    x | 26  27  28  29  30  31  32
    f |  3   4  16  12   6   2   1

    Cordelia is supposed to be at work at 8:00 a.m. but refuses to leave her house before 7:30 a.m.

    1. Find the percentile rank of 30, the time she has to get to work.
    2. Assuming that the sample accurately reflects the population of all of Cordelia’s commute times, use your answer to part (a) to predict the proportion of the work days she is late for work.
  4. The mean score on a standardized grammar exam is 49.6; the standard deviation is 1.35. Dromio is told that the z-score of his exam score is −1.19.

    1. Is Dromio’s score above average or below average?
    2. What was Dromio’s actual score on the exam?
  5. A random sample of 49 invoices for repairs at an automotive body shop is taken. The data are arrayed in the stem and leaf diagram shown. (Stems are thousands of dollars, leaves are hundreds, so that for example the largest observation is 3,800.)

    3 | 5 6 8
    3 | 0 0 1 1 2 4
    2 | 5 6 6 7 7 8 8 9 9
    2 | 0 0 0 0 1 2 2 4
    1 | 5 5 5 6 6 7 7 7 8 8 9
    1 | 0 0 1 3 4 4 4
    0 | 5 6 8 8
    0 | 4

    For these data, $\sum x = 101{,}100$ and $\sum x^2 = 244{,}830{,}000$.

    1. Find the z-score of the repair that cost $1,100.
    2. Find the z-score of the repairs that cost $2,700.
  6. The stem and leaf diagram shows the time in seconds that callers to a telephone-order center were on hold before their call was taken.

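    0 | 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4
    0 | 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 9 9
    1 | 0 0 1 1 1 1 2 2 2 2 4 4
    1 | 5 6 6 8 9
    2 | 2 4
    2 | 5
    3 | 0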
    1. Find the quartiles.
    2. Give the five-number summary of the data.
    3. Find the range and the IQR.

    Additional Exercises

  1. Consider the data set represented by the ordered stem and leaf diagram

    10 | 0 0
     9 | 1 1 1 1 2 3
     8 | 0 1 1 2 2 3 4 5 7 8 8 9
     7 | 0 0 0 1 1 2 4 4 5 6 6 6 7 7 7 8 8 9
     6 | 0 1 2 2 2 3 4 4 5 7 7 7 7 8 8
     5 | 0 2 3 3 4 4 6 7 7 8 9
     4 | 2 5 6 8 8
     3 | 9 9
    1. Find the three quartiles.
    2. Give the five-number summary of the data.
    3. Find the range and the IQR.
  2. For the following stem and leaf diagram the units on the stems are thousands and the units on the leaves are hundreds, so that for example the largest observation is 3,800.

    3 | 5 6 8
    3 | 0 0 1 1 2 4
    2 | 5 6 6 7 7 8 8 9 9
    2 | 0 0 0 0 1 2 2 4
    1 | 5 5 5 6 6 7 7 7 8 8 9
    1 | 0 0 1 3 4 4 4
    0 | 5 6 8 8
    0 | 4
    1. Find the percentile rank of 800.
    2. Find the percentile rank of 3,200.
  3. Find the five-number summary for the following sample data.

    x | 26  27  28  29  30  31  32
    f |  3   4  16  12   6   2   1
  4. Find the five-number summary for the following sample data.

    x |   1    2    3    4    5    6    7    8    9   10
    f | 384  208   98   56   28   12    8    2    3    1
  5. For the following stem and leaf diagram the units on the stems are thousands and the units on the leaves are hundreds, so that for example the largest observation is 3,800.

    3 | 5 6 8
    3 | 0 0 1 1 2 4
    2 | 5 6 6 7 7 8 8 9 9
    2 | 0 0 0 0 1 2 2 4
    1 | 5 5 5 6 6 7 7 7 8 8 9
    1 | 0 0 1 3 4 4 4
    0 | 5 6 8 8
    0 | 4
    1. Find the three quartiles.
    2. Find the IQR.
    3. Give the five-number summary of the data.
  6. Determine whether the following statement is true. “In any data set, if an observation x1 is greater than another observation x2, then the z-score of x1 is greater than the z-score of x2.”

  7. Emilia and Ferdinand took the same freshman chemistry course, Emilia in the fall, Ferdinand in the spring. Emilia made an 83 on the common final exam that she took, on which the mean was 76 and the standard deviation 8. Ferdinand made a 79 on the common final exam that he took, which was more difficult, since the mean was 65 and the standard deviation 12. The one who has a higher z-score did relatively better. Was it Emilia or Ferdinand?

  8. Refer to the previous exercise. On the final exam in the same course the following semester, the mean is 68 and the standard deviation is 9. What grade on the exam matches Emilia’s performance? Ferdinand’s?

  9. Rosencrantz and Guildenstern are on a weight-reducing diet. Rosencrantz, who weighs 178 lb, belongs to an age and body-type group for which the mean weight is 145 lb and the standard deviation is 15 lb. Guildenstern, who weighs 204 lb, belongs to an age and body-type group for which the mean weight is 165 lb and the standard deviation is 20 lb. Assuming z-scores are good measures for comparison in this context, who is more overweight for his age and body type?

    Large Data Set Exercises

  1. Large Data Set 1 lists the SAT scores and GPAs of 1,000 students.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data1.xls

    1. Compute the three quartiles and the interquartile range of the 1,000 SAT scores.
    2. Compute the three quartiles and the interquartile range of the 1,000 GPAs.
  2. Large Data Set 10 records the scores of 72 students on a statistics exam.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data10.xls

    1. Compute the five-number summary of the data.
    2. Describe in words the performance of the class on the exam in the light of the result in part (a).
  3. Large Data Sets 3 and 3A list the heights of 174 customers entering a shoe store.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data3.xls

    http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls

    1. Compute the five-number summary of the heights, without regard to gender.
    2. Compute the five-number summary of the heights of the men in the sample.
    3. Compute the five-number summary of the heights of the women in the sample.
  4. Large Data Sets 7, 7A, and 7B list the survival times in days of 140 laboratory mice with thymic leukemia from onset to death.

    http://www.gone.2012books.lardbucket.org/sites/all/files/data7.xls

    http://www.gone.2012books.lardbucket.org/sites/all/files/data7A.xls

    http://www.gone.2012books.lardbucket.org/sites/all/files/data7B.xls

    1. Compute the three quartiles and the interquartile range of the survival times for all mice, without regard to gender.
    2. Compute the three quartiles and the interquartile range of the survival times for the 65 male mice (separately recorded in Large Data Set 7A).
    3. Compute the three quartiles and the interquartile range of the survival times for the 75 female mice (separately recorded in Large Data Set 7B).

Answers

    Basic

  1.
    1. 60.
    2. 10.
  3.
    1. 59.
    2. 23.
  5.
    1. 29.
    2. 71.
  7. 50%.

  9. xmin=25, Q1=70, Q2=77.5, Q3=90, xmax=100, IQR=20

  11. xmin=1, Q1=1.5, Q2=6.5, Q3=8, xmax=9, IQR=6.5

  13. −1.3, 1.39, 0.4, −0.35, −0.11.

  15. z = −0.74 for x = 1, z = −0.37 for x = 2, z = 1.48 for x = 7.

  17.
    1. 1.
    2. 1.
    3. 1.
    4. z = −1 for x = 0, z = 1 for x = 2.
  19. 16.

  21. 4.9.

    Applications

  1.
    1. 55.
    2. 55.
  3.
    1. 93.
    2. 0.07.
  5.
    1. −1.11.
    2. 0.73.

    Additional Exercises

  1.
    1. Q1=59, Q2=70, Q3=81.
    2. xmin=39, Q1=59, Q2=70, Q3=81, xmax=100.
    3. R = 61, IQR=22.
  3. xmin=26, Q1=28, Q2=28, Q3=29, xmax=32.

  5.
    1. Q1=1450, Q2=2000, Q3=2800.
    2. IQR=1350.
    3. xmin=400, Q1=1450, Q2=2000, Q3=2800, xmax=3800.
  7. Emilia: z = 0.875, Ferdinand: z ≈ 1.17.

  9. Rosencrantz: z = 2.2, Guildenstern: z = 1.95. Rosencrantz is more overweight for his age and body type.

    Large Data Set Exercises

  2.
    1. xmin=15, Q1=51, Q2=67, Q3=82, and xmax=97.
    2. The data set appears to be skewed to the left.
  4.
    1. Q1=440, Q2=552.5, Q3=661, and IQR=221.
    2. Q1=641, Q2=667, Q3=700, and IQR=59.
    3. Q1=407, Q2=448, Q3=504, and IQR=97.