Joseph Malkevitch: Teaching Via Contexts: Statistics and Facilities Location

Facility Location and Statistics

Prepared by:

Joseph Malkevitch
Department of Mathematics
York College (CUNY)
Jamaica, NY 11451

Email: malkevitch@york.cuny.edu (for additions, suggestions, and corrections)

The purpose of these notes is to show some parallels and connections between the problem of finding a typical value for a data set and finding a "central" place to locate a public or commercial facility (e.g. medical center; fast food restaurant).

Given the set of data (for convenience arranged in ascending order):

Data Set 1:

1000, 3000, 4000, 5000, 1000000

What number is "typical" of this set of data?

The notion of using a single number to represent a set of data values is very useful, and the word "average" is often used to describe what the number tells us. What is the average weight for the children in a class? What is the average height of the children? What is the average income of a group of people? What is a typical pants size for 17 year olds?

In statistics, the fancy phrase for "average" is "measure of central tendency." This name is very suggestive: we seek a way to measure which number is in the "middle." However, there are different senses in which this can be done. This is an important lesson for students in lower grades. Not only do measures of central tendency offer students in lower grades the opportunity to practice arithmetic in a more meaningful context, but they also raise a variety of important lessons about the way mathematics interacts with the world around us (e.g. mathematical modeling). Cats can be skinned in different ways.

a. Mean

The mean for a collection of data is the most familiar average. We add all the data up and divide by the number of measurements.

For Data Set 1 above we have:

(1000 + 3000 + 4000 + 5000 + 1000000)/5 = 1013000/5 = 202600.

We divide by 5 because there are 5 measurements. Note that in this example the original data are all integers and the mean is an integer. However, it is common for the mean to not be an integer, even when the original data values are integers. This means that the mean need not be a legal value that can result in the context where the original data was collected.
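The arithmetic for Data Set 1 can be checked with a few lines of Python. This is only a minimal sketch; the names `data_set_1` and `mean` are chosen here for illustration and are not part of the original notes.

```python
# Data Set 1 from the notes.
data_set_1 = [1000, 3000, 4000, 5000, 1000000]


def mean(values):
    """Add up all the values and divide by the number of measurements."""
    return sum(values) / len(values)


print(mean(data_set_1))  # 202600.0
```

Python's `/` always produces a floating-point result, which is a reminder that the mean need not be an integer even when every data value is.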

b. Median

The idea is to find a number such that half of the measurements are above this number and half are below it. Arrange the data in increasing order. If there are an odd number of measurements we take the number in the middle. If there are an even number of data we take the two numbers in the middle and compute their mean.

For the data above, where there is an odd number of data, the middle measurement is 4000, and this would be the median.

Here is an example where there is an even number of data values, say, the grades that a student got on a series of 6 quizzes:

Data Set 2:

72, 84, 86, 87, 92, 92.

The two numbers in the middle are 86 and 87. Taking their mean we get (86 + 87)/2 = 86.5 as the median of the original set. Notice that the median of a collection of integers need not be an integer. The median of a set of numbers may not be a possible outcome for the data in the original context.
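The odd and even cases of the median rule can be sketched in Python as follows. The function name `median` is an illustrative choice, and the function assumes a non-empty list of numbers.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle
    values when the number of data values is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1000, 3000, 4000, 5000, 1000000]))  # 4000  (Data Set 1, odd count)
print(median([72, 84, 86, 87, 92, 92]))           # 86.5  (Data Set 2, even count)
```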

c. Mode

The mode for a collection of data is the data value which occurs most frequently. If all the numbers appear only once we say there is no mode; if different numbers occur equally often, each of these is a mode value. Thus, the mode may not exist and if it does, it may not be unique. If the mode exists it must be a value attained by the original measurements.

For Data Set 1 there is no mode, while the mode for Data Set 2 is 92.

Data Set 3:

12, 13, 20, 20, 34, 50, 50, and 70.

The modes are 20 and 50.
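A short Python sketch of this convention for the mode is given below. The function name `modes` is an assumption made here, and the function expects a non-empty list. It follows the convention of these notes that a data set in which every value occurs only once has no mode, so it returns an empty list in that case.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; empty when every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # convention used in these notes: no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1000, 3000, 4000, 5000, 1000000]))  # []        (Data Set 1)
print(modes([72, 84, 86, 87, 92, 92]))           # [92]      (Data Set 2)
print(modes([12, 13, 20, 20, 34, 50, 50, 70]))   # [20, 50]  (Data Set 3)
```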

d. Mid-range value

The mid-range value is the number which is halfway (mean) between the largest and smallest values in a data set. (The range for a set of data is the difference between the largest and smallest values of the data. See the discussion below for more information about the range.)

For Data Set 1, the largest value is 1000000 and the smallest is 1000, so the mid-range value is (1000 + 1000000)/2 = 500500.
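As a quick check, the mid-range value of Data Set 1 can be computed in Python. The function name `midrange` is chosen here for illustration.

```python
def midrange(values):
    """Halfway point (mean) between the smallest and largest data values."""
    return (min(values) + max(values)) / 2


print(midrange([1000, 3000, 4000, 5000, 1000000]))  # 500500.0
```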

It is uncommon for these different measures of central tendency to give the same value for a given data set. How can one choose between them? There are various pros and cons of using these numbers as a measure of central value. As a simple example, if a store orders lots of the mean size for shoes for next month, based on the sizes that it sold this month, even if this number is an integer it may not be a wise choice. A better choice might be the mode size for the previous month.

The mean is often not a good measure of central tendency, especially for income data. Thus, if Data Set 1 above represents incomes, the mean income might give someone the impression that these individuals were rich, because the mean is over $200,000. However, the median value of $4000 gives the better picture, that half of the people got more than $4000 and half less than $4000, while obscuring the fact that one person had a high income.

When discussing these different measures of central tendency there are interesting computational questions that can be mentioned. This enables one to discuss the difference between the concerns of mathematicians and computer scientists. Even if one knows how to solve a problem in a conceptually simple framework, the amount of computation may make what one has in mind impractical. To compute the mean of a collection of 1,000,000 numbers does not require that the numbers be sorted. However, the straightforward way to find the median of the same set is to sort the numbers first. For a large data set, in higher grades, one can point out that to sort a data set of size n by pairwise comparisons, the number of "fundamental operations" (comparisons) has to be on the order of n log n. (More sophisticated selection algorithms can find the median without a full sort, but they are harder to describe.) While finding the median in this way requires sorting, finding the midrange value only requires finding the smallest and largest numbers in the set. This allows one to talk about the concept of the maximum and minimum for a set of numbers.
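The contrast in computational effort can be made concrete with a small Python sketch. The function names here are illustrative assumptions. The first function makes a single pass through the data, with no sorting, and yields both the mean and the midrange value; the second takes the simple route to the median by sorting first.

```python
def mean_and_midrange_one_pass(values):
    """One pass, no sorting: track the running total, minimum and maximum."""
    total = 0
    smallest = largest = values[0]
    for v in values:
        total += v
        if v < smallest:
            smallest = v
        if v > largest:
            largest = v
    return total / len(values), (smallest + largest) / 2


def median_by_sorting(values):
    """Simple approach: sort first (about n log n comparisons), then look
    at the middle of the sorted list."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


unsorted_data = [5000, 1000000, 1000, 4000, 3000]  # Data Set 1, shuffled
print(mean_and_midrange_one_pass(unsorted_data))   # (202600.0, 500500.0)
print(median_by_sorting(unsorted_data))            # 4000
```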

It is not surprising that trying to replace a complex data set with a single number, a measure of the data's central tendency, would not always successfully capture what is going on. However, when a measure of central tendency is combined with another measure, that of dispersion or spread of data, it is remarkable how much information one gets. (The standard pair of numbers chosen is the mean and the standard deviation.)

The best known measures of dispersion are the range, the mean deviation, and the standard deviation. Some of these measures of the spread for data will be illustrated with this very small (unsorted) data set:

Data Set 4

-3, 5, 7, 3.

Note that the mean for this data is 3.

The range is the simplest measure of dispersion to describe. It is the difference between the largest and smallest data values.

For Data Set 4 above, the largest (maximum) element is 7 and the smallest (minimum) is -3, thus, the range is 7 - (-3) = 10.

The mean deviation is the sum of the absolute values of the differences between the data values and the mean, divided by the number of measurements in the data set. One might ask why one does not just compute the sum of the numbers (data value - mean).

If we do this for Data Set 4 we get:

(-3 - 3) + (5 - 3) + (7 - 3) + (3 - 3) = -6 + 2 + 4 + 0 = 0.

In fact, it is a general theorem that for any set of numbers the differences (data value - mean) add up to 0. Thus, this approach to finding a measure of spread of the data is not that valuable. The next best thing is to compute the absolute value of each data value minus the mean, and take the mean of these numbers. This is the number known as the mean deviation.
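The theorem takes only one line to verify. In the sketch below the data values are written as x_1, ..., x_n and their mean as \bar{x}; this notation is introduced here and does not appear elsewhere in these notes.

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]
```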

For our example (Data Set 4) we get:

(|-3 - 3| + |5 - 3| + |7 - 3| + |3 - 3|)/4 = (6 + 2 + 4 + 0)/4 = 12/4 = 3.

Thus, for this data set the mean deviation is 3. Though conceptually simple, the mean deviation does not have as many nice mathematical properties as the standard deviation.
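The range and the mean deviation of Data Set 4 can be checked with the following Python sketch; the function names are illustrative choices. The four absolute deviations from the mean are 6, 2, 4 and 0, which total 12, and dividing by the 4 measurements gives 3.

```python
def data_range(values):
    """Difference between the largest and smallest data values."""
    return max(values) - min(values)


def mean_deviation(values):
    """Mean of the absolute differences between each value and the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)


data_set_4 = [-3, 5, 7, 3]
print(data_range(data_set_4))      # 10
print(mean_deviation(data_set_4))  # 3.0
```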

Now we will turn to the question of where to locate a facility which we would like to be centrally located, and how this relates to finding a measure of central tendency for data. The concept of finding a value which is in "the middle" for a set of data is closely related to the issue of finding a physical point which is a good site for a mobile library truck, a medical center, or an ice-cream truck. In operations research, the branch of mathematics devoted to helping individuals, companies and governments operate more efficiently, this circle of situations is known as facility location problems.

Suppose we have houses located at the coordinates of Data Set 5, which lie along a line:

Data Set 5:

-4, 4, 10, 15, 60

We want to know the best place to locate a mobile library facility in a "central location." (For convenience later, think of house A as being at -4, B at 4, C at 10, D at 15 and E at 60.)

We will consider two different measures of optimality:

a. Minimizing the maximum distance of any house from the mobile library

(Here the perspective is that no person will have to travel especially far to get to the facility.)

b. Minimizing the sum of the distances of the houses from the mobile library.

(Although some people may have to travel far to get to the facility a "balance" is taken for how far the "typical" person will have to travel.)

Note that if we minimize the sum of the distances of the houses we will minimize the mean distance involved because for a given problem, the number of data stays fixed. Thus, minimizing the sum of the distances minimizes the mean distance as well.

Although I will not provide proofs here, the facts are as follows:

a. Given coordinates for locations along a line, the point located at the midrange value for these coordinates treated as a data set will minimize the maximum distance to the locations.

b. Given the coordinates for locations along a line, the point located at the median for these coordinates treated as a data set will minimize the sum (hence, mean) of the distances to the locations.

Here are the calculations which show that these "theorems" are true for Data Set 5.

The range of the data is 60 - (-4) = 64. Half of this is 32. Thus, the midrange value is at -4 + 32 = 28. (Note that 60 - 32 is also 28, and that if we take (-4 + 60)/2 we also get 28 as the location of the midrange value.) No one must travel more than 32 units to get from home to a facility located at 28.

We can also compute the median and the mean for this data. Since there are an odd number of numbers in the data set you can verify that 10 is the median and that the mean is 17 (since (-4 + 4 + 10 + 15 + 60)/ 5 = 85/5 = 17).

Note that if one locates the mobile library at either 10 or 17 then some people will have to travel more than 32 units to get there. (From the house at 60 it would be 50 units to the mobile library if it were at 10; from the house at 60 it would be 43 units to the mobile library if it were put at 17. This shows that placing the mobile library at 28 is an improvement with respect to minimizing the maximum distance.)
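The maximum-distance comparison for Data Set 5 can be reproduced with a short Python sketch. The names `houses` and `max_distance` are illustrative, and the brute-force search over integer sites at the end is only a sanity check for this one data set, not a proof of the general fact.

```python
houses = [-4, 4, 10, 15, 60]  # Data Set 5


def max_distance(site, locations):
    """Farthest that anyone has to travel to reach the given site."""
    return max(abs(site - h) for h in locations)


for name, site in [("midrange", 28), ("median", 10), ("mean", 17)]:
    print(name, site, max_distance(site, houses))
# midrange 28 32
# median 10 50
# mean 17 43

# Sanity check: among integer sites from -4 to 60, which one is best?
best_site = min(range(-4, 61), key=lambda s: max_distance(s, houses))
print(best_site)  # 28
```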

Now, let us see how locating the mobile library at the midrange value, the median and the mean compare with respect to the sum of the distances to the five house locations.

The following table will help us see what is going on. The entries in a row show the distance from a particular house to the mobile library if it is located at the three different locations for the mobile library, the midrange value, the median, and the mean.

	midrange = 28	median = 10	mean = 17
A = -4	32	14	21
B = 4	24	6	13
C = 10	18	0	7
D = 15	13	5	2
E = 60	32	50	43
Sum	119	75	86

Which location will minimize the sum of the distances? The answer is the median. Note that one house must go 50 units to reach the median (which is worse than what would have been the case for the midrange value), but the total distance is quite a bit smaller if the mobile library is at the median. Similarly, the mean does worse than the median in terms of the sum of the distances to get there from all of the houses (86 versus 75), though it does better than the midrange value (119).
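The column sums in the table above can be recomputed with the following Python sketch. The dictionary and function names are chosen here for illustration.

```python
houses = {"A": -4, "B": 4, "C": 10, "D": 15, "E": 60}  # Data Set 5
sites = {"midrange": 28, "median": 10, "mean": 17}


def total_distance(site, locations):
    """Sum of the distances from every house to the given site."""
    return sum(abs(site - h) for h in locations)


totals = {name: total_distance(site, houses.values())
          for name, site in sites.items()}
print(totals)  # {'midrange': 119, 'median': 75, 'mean': 86}
```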

What happens if there is an even number of houses? Will the median still be the place to put the mobile library to minimize the sum of the distances? It turns out the median will still be optimal from the sum-of-distances point of view. However, something interesting happens here. The solution is not unique! It turns out that if there is an even number of locations, arranged in order, then any location Z between the two middle values will minimize the sum of the distances from Z to the houses. Thus, if there is an even number of house locations, and the mean falls somewhere within the interval between the two middle points, then the mean will minimize the sum of the distances. Because of this, if you are doing this topic in an inquiry-based mode, it is best to start with a problem with an odd number of locations. Once students realize that locating the mobile library at the mean will not in this case minimize the sum of the distances, then when one moves on to a problem involving an even number of houses, they are less likely to be confused if both the median and the mean give the same total distance, and will be able to discover for themselves that it is the non-uniqueness of the solution in this case that allows the mean to do as well as the median.
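To see the non-uniqueness in a concrete case, here is a Python sketch using a hypothetical four-house example (Data Set 5 with house E removed; this example is not in the original notes). The two middle values are 4 and 10, the median is 7, and the mean is 25/4 = 6.25, which happens to fall between the two middle values.

```python
houses = [-4, 4, 10, 15]  # hypothetical: Data Set 5 without house E


def total_distance(site, locations):
    """Sum of the distances from every house to the given site."""
    return sum(abs(site - h) for h in locations)


for site in [3, 4, 6.25, 7, 10, 11]:
    print(site, total_distance(site, houses))
# 3 27
# 4 25
# 6.25 25.0
# 7 25
# 10 25
# 11 27
```

Every site from 4 to 10, including the median 7 and the mean 6.25, gives the same total of 25, while sites outside that interval do worse.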

In addition to the mathematical techniques reviewed and used in this example of context-based learning, there are some important mathematical modeling activities that can be used. In the discussions above we have mentioned the location of a mobile library facility so that a particular goal, minimizing the maximum distance or minimizing the sum of the distances, is achieved. However, it is very valuable to ask students to find other situations similar to the one used in the specific context discussed. Thus, a student might notice that similar problems arise in locating a medical center, a hot dog stand on the beach at Coney Island, a firehouse or a police station. As part of this process it is important to get students to be able to compare and contrast two different situations. What similarities do locating a firehouse and locating a police station have, and what differences? Carrying out exercises of this kind shows that mathematics is not a sterile abstract subject cut off from other aspects of students' lives. Mathematical techniques and models are not only tools for students interested in mathematics itself, science, and engineering, but also for the humanities and social sciences. They are tools for all seasons!

Acknowledgment

This work was supported in part by the Teacher Academy of York College. Specific funding was provided by: FIPSE (46274-07 01) and the Fund for PS (72042-07 01) to the Teacher Academy of CUNY.