I have have been critical of our new Austin MLS System called MLXChange since it was first placed into production in Austin in November 2007. It produces what I believe to be incorrect data. I was contacted yesterday by a product manager from MLXChange because they saw my last blog article showcasing some of the bad data. In that article, I mentioned that the price per square foot is not calculated correctly by the system.
The product manager was very nice and I enjoyed our conversation. She wanted to explain to me how the average price per square foot numbers are calculated and that, in fact, the numbers are calculated correctly. She is right on a certain level. That is, if one understands and follows the formula being used by MLXChange, it can be argued that their number is a mathematically correct.
But I maintain that the method being used by MLXChange to calculate average price per square foot is not the correct formula. I’m not smart enough to articulate it in a scientific or mathematical argument, but my intuition and instinct tells me I’m right. I’ll do my best to explain below and I hope a smart reader can chime in and offer a mathematical or statistical point of view and explanation of why two different methods, each of which seems correct independently on face value, produce different results.
So let’s look at an example that illustrates the MLXChange formula versus the method I believe most of us would use, and the method I believe makes the most sense. I’ve created a small sample set of data, shown in the chart below.
Square Feet

Sold Price

Sold $ per SQFT


House 1 
1200

$150,000

$125.00

House 2 
1300

$110,000

$84.62

House 3 
1400

$95,000

$67.86

House 4 
1500

$120,000

$80.00

House 5 
1600

$85,000

$53.13

MLXChange Avg 
1400

$112,000

$82.12

My Average 
1400

$112,000

$80.00

What is the average sold price per square foot of the 5 homes shown above?
MLXChange computes the Average Price per Square Foot by adding together all of the individual price per square foot numbers (the five psf numbers in the far right column) and dividing by the number of sales (five). This produces a result of $82.12 per square foot as the average price per square foot of the sold homes in our sample set.
I compute the average price per square foot by taking the average sqft size of all homes (1400) and dividing it into the average sold price of all homes ($112,000). My result is $80 per square foot. That’s what you would see on a stats chart that I manually produce and post regularly on this blog.
The two methods, in this particular example, produce results that are 2.65% different. 2.65% is not an insignificant amount when pricing a home. It’s a bigger gap than the List/Sold price difference on most sold homes in Austin. It’s a $5,300 difference on a 2500 sqft home at our example psf rates. Would you like to sell your home for $5,300 too little? Would you like to pay $5,300 too much, based on your Realtor’s CMA, produced by MLXChange?
Let’s dig deeper into why I think the MLXChange method is flawed.
First of all, what really bothers me as a Realtor and a follower of the Austin Real Estate Market is that the MLXChange computation method produces a result that is visually and mathematically disconnected from the relationship between the two other numbers that the result is suppose to represent.
In other words, a casual observer of the above chart, were it to show only the $82.12 per square foot, would not be able to easily understand how $82.12 relates to the average sold price of $112,000 or the average sqft size of 1400. As I told the MLXChange rep, it’s a “voodoo” number. A mystery result which can’t be reproduced or validated with the numbers provided on a summary stats report. Sure, we can add together these 5 numbers in my example, when when running a quarterly comparison report with 5,000 sales, comparing the current year to the previous year, we can’t manually validate the data. We just look at the number and say “that ain’t right”.
When I learned math, a Numerator (Avg Sales Price) divided by a Denominator (Avg Size SQFT) equals a number than can be operated in reverse with either of the other two numbers to produce the third. So, an average price per square figure necessarily says to me that “$80 represents the answer derived from dividing $112,000 by 1400”.
I reject and do not accept that $82.12 can represent $112,000 divided by 1400. It simply doesn’t compute.
Why do the different methods produce different results in the first place?
I don’t know for certain why the MLXChange method produces a result that can vary significantly from my method, but I suspect it may have something to do with the combination of rounding and averages in general. Maybe there is some invisible embedded calculus involved. Sometimes there is little if any variance at all between the result produced by each method, but the results are rarely if ever the same across the two methods. In homogeneous data sets with a narrow band of price and size, the numbers remain fairly close. More diverse ares, such as Central Austin, where smaller homes can sell for more than bigger homes, and vice versa, the variance can look similar to what is shown in my example. Certainly, when evaluating data across the entire Austin market, a tremendous amount of diversity exists in the individual per square foot sold amounts.
Like I said, this where my ability to articulate the problem falls apart from a scientific or mathematical standpoint. I won’t argue that if we add the numbers in the right column and then divide by 5, that the answer is $82.12. But 82.12 is not the result of dividing 1400 into $112,000, and that is what the result is supposed to be representing, so it’s wrong in that context.
Why is the Other Method Better?
My method is better because it’s more static and less susceptible to unexplainable swings. No matter what the variance is in sold prices, square footage sizes of homes, or how dishomogeneous the neighborhoods from where the data is drawn, the answer is always the same relative to the numbers it represents. The relationship between the results can be visually understood and validated with simple math. In other words, Avg Sold Price divided by Average Square Footage always equals a constant result that can be reverse operated and validated with the other two numbers. A casual observer of a stats report or CMA would not have questions as to where that psf number came from or think that it might be wrong.
With the MLXChange computation method, the number produced can take wild swings relative to it’s relationship to the other two numbers. In other words, it’s not wrong in a consistent, predictable or observable way. Sometimes the number is way off from my simple method, sometimes it’s very close. Sometimes it’s way higher, sometimes it’s way lower. I find this maddening. Let’s look at an example of what I mean.
Let’s change ONLY the square foot size of TWO of the sales in our chart. I’m simply going to swap the first and fifth square foot number such that the average size and average sold price doesn’t change at all. The fields that do change as a result of this are shaded in grey so you can see what I’m doing.
Square Feet

Sold Price

Sold $ per SQFT


House 1 
1600

$150,000

$93.75

House 2 
1300

$110,000

$84.62

House 3 
1400

$95,000

$67.86

House 4 
1500

$120,000

$80.00

House 5 
1200

$85,000

$70.83

MLXChange Avg 
1400

$112,000

$79.41

My Average 
1400

$112,000

$80.00

Note the the average square foot and average sales price remain unchanged at 1400 and $112,000. So does my Average Sold price per square foot at $80. But look at the average price per square foot of the MLXChange formula computation. Before, it was 2.65% higher than my computation at $82,12. Now it’s 0.74% lower at $79.41. That’s a spread of 3.39% according the the MLXChange formula for calculating average price per square foot. How can the same average data produce such varying results?
Why does it Matter?
Often, our CMA analysis that we print out and use in determining listing or offer prices will include only a handful of comparable sold listings, such as in my sample set above.
In our market, correct or not, agents rely heavily on average price per square foot in setting or defending home prices and values. So do appraisers. As illustrated in my example above, minor changes in the data set can produce inexplicable swings in the price per square foot numbers in MLXChange. These numbers print at the bottom of our CMA reports, which we use to advise buyers and sellers. We are supposed to be informed and empowered by the numbers we see, not confused.
So, depending on whether an agent plucks out “Comparable Sold Property A” versus “Comparable Sold Property B”, to include in a group of 4 or 5 other comps, the CMA could suggest to Realtor A and Seller A that House A should be priced over or under the true market value.
And to Realtor B and Buyer B who are considering an offer on House A, their CMA may suggest a price per square foot 2 or 3 percent different (could be either higher or lower) than the CMA being relied upon by the other side of the deal.
With the more simple method, even if the set of comps are slightly different in size and square footage, the per square foot amount will be less susceptible to variance and both sides of the deal can be measuring apples to apples.
I’m not comfortable with an MLS tool that leaves me scratching my head as to how one of my most important measures of home value is calculated. I don’t know from CMA to CMA when these swings can occur, when they apply, whether they are high or low from how I would do it, or whether the per square foot calculation is an accurate reflection of true market value. So I have resorted to verifying with a calculator.
Nice huh? MLXChange, our “state of the art” MLS Software, better than anything else that could have been purchased for us by our MLS, can’t be relied upon to evaluate the market value of a home. Yes, that chaps me.
So, to MLXChange I ask:
1) Who determined that this computation method is the superior and/or correct method to use for real estate sales analysis?
2) Can you explain why the variance occur and how they benefit us as Realtors?
3) Did our Board of Realtors instruct the computations to be done this way, or did some programmer just think that it would make sense to produce an average of the actual individual per square foot amounts?
4) Why is it better to have a number on a stats report or CMA that doesn’t correlate with the corresponding companion numbers and who told you that we have the time or desire to explain this to buyers and sellers when looking at our CMA reports?
5) Why can’t the price per square foot simply be calculated as I’ve suggested and what bad outcome would result if it were?
Steve,
I can understand your frustration, but the MLX computation is correct. Statistically, the average is the sum of the data set divided by the number of components. The Avg PPSF is the average of the prices per square foot, ie the data set is the PPSF. This will always (unless by chance) produce a different number than the division of the averages of price and SQFT. They are different beasts. I can’t mathematically show the reason in this format, but they are different, and the MLX version is correct. The problem you have, and it is valid, is that the software lays the numbers out next to each other. Any reasonable person would like to see numbers add up, but they don’t. It’s a problem of reporting format, not of how the value is computed. Furthermore, I think this is the reason that median is used more commonly to report real estate statistics. Median is the value at which half of the data points are greater and half are lesser. It has it’s own statistical quirks that must be dealt with but will generally fluctuate less wildly and for this reason is preferred by the real estate community.
Frankly, PPSF should not be used to appraise houses, especially in a nonhomogeneous area like central Austin where the land is a large part of the value of the home….unless the comps used are very close comparables. It works well in suburban areas that have similar construction and amenities.
Out of your 5 questions to MLX, only #4 is one I would want to know the answer to, because it does look odd to have numbers that don’t correlate. And while the world would not come to an end if the calculation were done as you suggest, it would not be the accurate Average Price per Square Foot.
Hi Phil,
Thanks. I’ve already had a mathematician friend of mine go over this with me on the phone, and he said basically what you’re saying, but had a hard time putting it in a nutshell that’s easy to regurgitate, other than “an average of averages isn’t ever going to be statistically correct”.
So, what I’m learning is, the correct way is complicated and hard to understand (not the formula but the reason for the seemingly random variances). My way is technically “wrong” but probably close enough for reporting purposes and makes more sense for reasons I covered in the article.
Steve
The nerdy reason for this is the different types of averages. Arithmetic mean is what we usually mean when we say average, which is what mlxchange is producing. But arithmetic mean is not meant to be used for averaging rates (like dollars per sqft) for exactly the reason you showed. Averaging rates is what harmonic mean (en.wikipedia.org/wiki/Harmonic_mean) is for. Excel has a handy harmean(…) function which does that kind of average.
Anyway, that’s just a fancy way of saying that MLXchange is right that they are giving you the average, but you are right that average isn’t what you or other realtors want đź™‚
Peter,
Thanks. So, maybe I’m not crazy after all.
The article at the wikipedia link you provided is very interesting and explains the issue well.
I really like, near the end of the article, where the author says:
“The arithmetic mean is often incorrectly used in places calling for the harmonic mean.”
That’s what I wasn’t smart enough to know or articulate in this blog article, but I sensed there was something weird happening with the numbers.
And if I understand the article correctly, real estate stats such as we use would be an example, as you stated, where the harmonic mean is the appropriate average to use, not the Arithmetic Mean being produced by MLXChange.
Thanks for the very informative link and your helpful comment.
Steve
Steve,
Exactly. For getting the average of rates (blah per blah), the harmonic mean is the “mathematically correct” way to go. However, if they did that in mlxchange, I’m sure they’d have even more realtors calling and asking them where they pulled that voodoo number from đź™‚
Love the blog!
So we’re looking at the average of averages? That is rediculous and misleading to the public. This really needs to be changed.
> if they did that in mlxchange, Iâ€™m sure theyâ€™d have even more realtors calling and asking them where they pulled that voodoo number from
I honestly don’t think most Realtors would notice the difference. Only the ones who really pay attention to stats and like to crunch the numbers in different ways.
Sam,
> So weâ€™re looking at the average of averages?
if you click on Peter’s link to the Wikipedia article about Harmonic Mean, it explains it. I also liked the “Arithmetic Mean” link in that article which further explained how Arithmetic Mean (what MLXChange uses, can be skewed by outliers while Harmonic Means remains more constant (as my examples showed).
Here is a quote from the AR article:
“The arithmetic mean is greatly influenced by outliers…A classic example is average income. The arithmetic mean may be misinterpreted to imply that most people’s incomes are higher than is in fact the case…For instance, reporting the “average” net worth in Medina, Washington as the arithmetic mean of all annual net worths would yield a surprisingly high number because of Bill Gates.”
This may be why most of the MLXChange produced price psf are higher than the ones I come up with, though they are sometimes lower as well. I think it depends on how many “outliers”, or sales amounts that are way above or below the meat of the pack there are.
I’m learning more about math than I need to know. I do know I want our MLS software to report the stats by whatever method would be deemed appropriate for than type of computation, and it appears they are not.
Steve
As the discussion above is getting at, there is no good rationale for the comparison when you limit your calculations to just square foot and sales price. Your data clearly has an outlier or two — remove that outlier and the variance goes down. While the harmonic mean would limit the influence of such outliers, a better method is to exclude such outliers. Where the agent should add value is by selection of true comparables. This is harder in older and central neighborhoods than outer suburbs where there is a lot more similarity in the houses.
At some level, such selection of comparables and knowledge of local conditions will always be the core value add of a realtor.
Just to give some context, we’ve been trying for years in construction to improve estimating with computer methods — and while we can count bricks better than ever and even systematize some of the art, there will always be room for judgement in assessing total costs.
Overall, I’d say that the averages that you report for market areas are broadly illustrative, but otherwise I would very suspicious of any CMA where the fundamental characteristics — lot size, house size, bedrooms and baths, location, schools, finish quality, soils, landscaping, community amenities aren’t considered. I suspect a limited comparison with true similarities will give a straightforward price for square foot. And when the calculation methods show some divergence, that is where intangibles based on judgement come in — attractiveness, market momentum, etc.
Steve,
Iâ€™ve decided that you shouldnâ€™t be alone in having your head hurt from this response and that I would post to your blog (despite the fact that it is long and tedious).
The problem that you raised has nothing to do with harmonic means and harmonic means do not help you or MLXChange (which I will show below after Iâ€™ve addressed what is truly the issue).
MLXChange gives you the arithmetic mean of the rates (the $/sq.ft.) for a set of houses. What you generated is, in effect, the weighted average of the prices per sq ft of housing. Hopefully this will become clear below.
Letâ€™s say that a 4000 sq ft house is sold at $100/sq ft, a 2000 sq ft house is sold at $80/sq ft, and a 2000 sq ft house is sold at 60/sq ft. When you ask for the average price/sq ft, for what are you truly asking? If you are asking, â€śwhat is the price per sq ft of the average houseâ€ť, then MLXChange, when it says $80/sq ft, is giving the correct figure (because you are asking for the average across houses treated equally as one unit each). But if you are asking, â€śwhat per sq ft does housing cost in Austin (or in some particular neighborhood)â€ť, then, with my example above, you have to account for the fact that much more house was sold at $100/sq ft than at either of the lesser rates, which is accomplished by taking the weighted average. If you take the total price of these three houses and divide by the total sq footage you get a rate of $85/sq ft, which makes sense. Because much more house was sold at $100/sq ft than at either of the lower prices, the weighted average is pushed closer to $100/sq ft. (Although it is complicated, you can calculate weighted averages directly by working with the proportions of the total sq footage sold at different ratesâ€”and this better fits with the concept of a â€śweightedâ€ť average.) So what is it that you want to know, how the price/sq ft compares among houses (treated as discrete units regardless of their relative size), which is the figure that MLXChange gives you, or how much housing costs per sq ft, which must account for the relative size of units sold at different prices?
Harmonic mean is irrelevant. I donâ€™t think that I need to do anything more to prove that than to note that the harmonic mean in the scenario above is $76.6, which canâ€™t possibly be a useful figure. Nevertheless, I will say more: The harmonic mean will always be closer to the lower rate. This is because whatever exactly is the work being done at this rate (traveling along a road or paying a price) it will always take longer at the slower rate, so it is always the case that more time is spent at the lower rate, and thus that the harmonic mean will be closer to the lower rate. For example, if you go up hill at 10mph and down at 30mph your average speed is not 20mph. Since you spent much more time at 10mph than at 30mph, your average is closer to 10 than to 30 (the average rate is actually 15â€”since Âľ of the time is at the lower speed, the rate is Âľ closer to the lower speed). This issue, however, is intrinsic to rates and the fact that lower rates always take longer than higher rates. So harmonic mean is about how to relate rates (on the presumption, in effect, that they are applied over the same total work/distance/price).
Even more of a digression on harmonic mean: To get the figure of $76.6 you would need to do the following: Pick a fixed price, say, $100,000. How large would three houses have to be at that price to have their prices per sq ft equal $100, $80, and $60 respectively? The answer is 1000 for the first, 1250 for the second, and 1666.66 for the third. Now, if you add up those square footages (3916.66), and divide that into the total money spent ($300,000), you will get the figure of $76.6/sq ft. So the harmonic mean is only relevant when you hold constant the amount of work to be done, be it the distance traveled or, in this case, the price to be paid for each house. The figure came out closer to $60 than $100 because, in our example, more of the sq footage was purchased at $60/sq ft that at $100/sq ft. This form of mean is completely irrelevant within your industry because it tells you only something intrinsic to the relationship of the rates, not anything about houses in the community.
This is a fundamentally different thing than a weighted average. Note that in my mph example the distance was the same in both cases and all we were relating was the rates (exactly the sort of case in which harmonic mean is relevant as shown in the wikipedia link and in my paragraph immediately above). The issue would have been rather different if we had traveled different distances at those different rates (and this is exactly what is addressed by a weighted average). Square footage of houses is analogous to distance in the mph example. We need to account for the fact that the rate is applied over different distances (sq footages). Obviously, this isnâ€™t addressed by harmonic means because there is no way to put different distances into the formula (which only addresses the rates themselves).
Well, I hope that this was more interesting than it was tedious.
Don
I wouldn’t spend too much time trying to make these numbers meaningful. By definition these numbers are quite meaningless anyway as they are averages of apples and oranges. For example, if I am interested in a small condo then the amount that the 6th bedroom (around same size as my entire condo) is worth in a mansion style property is completely irrelevant. Add to that the huge errors in measuring square footage and it is a fairly pointless endeavor. Not only is all real estate very local but it encompasses many submarkets. I will also add that since these figures do not include FSBO data it means that they are artificially low.
Hi Bill,
> you said: there is no good rationale for the comparison when you limit your calculations to just square foot and sales price.
Correct. We don’t just limit to sqft when setting a price, but it is something that is always looked at and given high weight.
But you’re right that in older neighborhoods such as Travis Heights or in Hyde Park where a lot will cost $250K, and many smaller homes are tear downs, we evaluate value in a different way when deciding on list price or offer price.
Nevertheless, when we see that the average price per square foot in MLS area 1B is around $280 psf, the land price is necessarily embedded in and reflected by the high price per square foot. It is a very reliable measure when the flaws are understood and other factors are properly noted.
From a practical standpoint, right or wrong, price per sqft is the language agents, buyers and sellers use when we are negotiating and determining value. On overpriced listings we often call the listing agent to ask which comps they used and how they set the value. If price per square foot can’t be validated by comparable sales of similar homes, or explained in some other way, we can’t justify to a buyer that they should set the new high water mark on price per square foot in the neighborhood. If the last 5 comparable sale in a neighborhood have been at $100 per square foot, and the seller wants to price the home at $110, I don’t need to do further analysis to know that the home won’t sell for $100. It just won’t, and average price per square foot accurately informs me of that.
In other words, it would be stupid not to look at price per square foot and compare it with that of other sales.
Square footage is the best and fastest measure of a starting value in homogeneous areas when looking at similarly sized homes, built by the same builder, in similar condition, on the same size/value lots, etc. Many, many Austin neighborhoods fit that criteria.
We do in fact see narrow ranges in homogeneous areas. That’s why it’s a lot easier to price a home correctly in bland subdivisions. On most CMA reports I will through out the high and low sale if they depart too much from the others, but we also know that in a rising market, we need to allow for upward movement in pricing.
In the broader sense of looking at ppsf, we don’t do it simply to price one home, we do it to look at trends, to compare last year to this year, to see which areas have rising values year over year and which areas are staying flat.
And along with ppsf we look at average sold price and median sold price, inventory levels, Active/Pending ratios, number of recent sales compared to number of active/pending, etc.
So, though I’m focused on price per square foot in this particular discussion, I don’t want to give the impression that it’s the only thing that matters or the only thing we look at.
Don: Thanks for your explanation. If I might add my own analogy:
Let’s say I have a bucket of sea shells. I know there are 1,000 shells in the bucket . The full bucket of 1,000 shells weighs 101 lbs and the empty bucket weighs 1 lb. So 1,000 shells weight 100 pounds. 100/1000 = 0.1.
Assume that there isn’t time to weigh and log each individual shell. You just have the total count and the weight. Is it a false statement to say that the average weight of each shell is 0.10 pounds? If so, what is the mathematical term used to describe what 0.10 represents? I thought, maybe incorrectly, it was “Harmonic Mean”.
And if I had 4 additional buckets with different amounts of shells (houses), and I wanted to know which bucket had the heaviest shells (higher ppsf) and which had the lightest (lower ppsf), wouldn’t the nonArithmetic Mean method provide useful and meaningful data so long as it’s being used consistently and applied equally?
The real question still remains, why is Arithmetic Mean the best and most proper method of measuring average price per square foot in real estate data given that it is more subject to being skewed by small variances in the data set?
I think about it like a car with extremely touchy steering. The touchy and responsive steering may technically be “better” or more “accurate”, even technically superior in every way, but does it really serve my driving needs as well as a less touchy steering wheel would?
In looking at price per square foot averages, the Arithmetic Mean may be technically more accurate, but does that level of accuracy, and “touchiness”, best serve the real estate needs of agents, buyers and sellers as we evaluate sales data, pricing and trends?
Steve
How will you determine your April sales stats. Your way or MLX? Do you know when you’ll have them out. Home sales in our neighborhood seem to have been very very slow. Curious to see if the #’s. I feel that sales could have dropped in April from March.
Thanks
Hi Jay,
I’ll have to do it the same way we always do – the nonMLX way. I’m still having conversations with MLSChange about this. It looks like their sticking with their way so once I know that’s a done decision, I’ll have to decide if I want to change how I do my own stats.
Normally I put the stats out after the 15th because it takes that long for most of the solds from the month before to be marked “sold” by the listing agents. We’re suppose to do it within 10 days after a sale.
By the 10th we’d probably have most of them, but even when I post them after the 20th I notice the numbers still change later.
Taking a quick sneak peak at the data as it stands now, the number of sales stayed about the same and the average sold price went up from $242K to $249K. That may change of course, and that’s the MLXChange averages. đź™‚
Steve
Steve
I think the mlsxchange way makes more sense and Don’s explanation is spot on. Your method weights the impact of larger houses more. The impact of a 5000 sq ft house is more than a 1000 sq ft house, but the unit that you are really looking at is on a per house basis.
The two numbers represent slightly different meanings. The MLS exchange way is the average $/sq ft of a house. Your method is the average $/sq ft of the area.
Using the first method you would be able to determine the standard deviation to see what percentage of houses fit into each range. For example, if the average is 100/sq ft and the std deviation is 10, then I could say that 66% of the houses are between $90$110/sq ft.
Going to your analogy, if I put all the shells from all buckets into the same bucket, weighed it and divided it, I have calculated the average weight/shell of shells. If I the weight/shell of each bucket and average those, I am calculating the average weight/shell of a bucket (i.e. looking at the differences between buckets vs looking at the differences between shells).