Years ago I was working on device pricing for a mobile carrier. We used good old spreadsheets for the task. When you manage a portfolio of gadgets with a fair amount of technical specifications, it is easy to get lost comparing one with another. This is why I decided not to rely on my memory and put Excel to work — I made it suggest which devices are similar. The result is an example of an array formula that packs together several others.
I will first go through the data and explain the logic behind my approach to the comparison. If you want, you can download the example and skip straight to the formula explanation.
The file attached will be used as we go:
Running shoes — data overview
To show how the solution works I need some data. I assembled a list of running shoes that I already own (and use as intended) or I can find in outlets near me. I got their properties from RunRepeat:
- Some characteristics (e.g. weight, drop) are purely technical;
- Others, like breathability and cushioning, are subjective;
- Where data or scoring was missing, I checked reviews or drew on my experience of trying the shoes in person.
Smartphones have a finite set of technical features to put against each other. Moreover, most customers actually compare only a few of them. Shoes, on the other hand, being something you wear, are a bit more tricky. The feeling of them on your feet can vary widely. This is why I decided to list only the major data points and skip some details. The suggestions produced will guide me on what to go out and try. As a result, I am likely to save time by not having to visit every shop.
There are some important characteristics, like price and colour availability, that surely make a lot of difference. However, they are subject to current circumstances. I assume that whoever puts the list together has already filtered out the options that are unsuitable in terms of cost or style.
Lastly, I intentionally threw in a mix of shoes that cover several extremes. This will make it easier to show the differences between them and how similar items are being put together.
Notes on the methodology
As with every comparison, a certain degree of subjectivity is involved. Once you define what and how much is important to you, a similarity score can be calculated. It will put all subjects against each other to arrive at the best pairs. There are some remarks to be made:
- Internal correlations between features are not considered. In the case of running shoes, higher cushioning is very likely to go hand in hand with a thicker sole unit, as rubber and foam technologies (and the laws of physics, for that matter) do not differ enough to allow otherwise. Still, cushioning and thickness are scored separately.
- The most similar sneaker is returned as an overall match. When it comes to a particular property, however, the two shoes can still be very different. What is more, the kicks that come first overall might not be the most similar on any single comparison.
- I am not aiming for a particular scale when computing the score (e.g. 100 points). Of course, the mathematics behind it can be fine-tuned for that, but since I do not intend to communicate the score, there is no point in putting in the effort. After all, that number is merely a tool to rank the similarity.
First look and cleaning the data
As I already mentioned, I did not list the most granular data available. However, as array formulas can take a toll on the CPU, it is good practice to first consider which fields we are going to use. This is why, on the First look sheet, I initially check the records and trim them.
Arch type and Arch support are important properties. Yet, looking through the rows, I notice how little variability there is in them. This goes to show that unless we look for a specific type of shoe, chances are we will get one with neutral support suited to a high arch. The parameters would be useful to some buyers, but I will ignore them in the scoring.
We can say the same about Pronation — with only 2 of the shoes not being neutral, we can skip this field. So far I haven’t noticed much difference whether I run in shoes correcting overpronation or not, so I don’t assign much value here. Others will need such a shoe, and for them this field might significantly affect the score.
Furthermore, there are 4 columns describing which use fits the sneakers best. If we look at the flags, however, it is clear that they come down to whether the shoe is a daily trainer or race day gear. Because of this, we can safely collapse the data into one column (Competition) bearing a Y/N flag.
*In real life I would simply ignore the redundant fields. To make this post easier to follow, I will delete them and work on a separate sheet — Finding similar items.
The points matter
There are several types of ranking involved. For each of them, we try to give more points when a pair of shoes is close in terms of a characteristic. Alternatively, we can apply a penalty when the kicks differ.
Regardless of whether we give or take away points, their amount represents the importance we attach to the corresponding shoe characteristic.
Measures on a nominal scale
- Brand — We tend to stick with a brand, and products of the same vendor can be expected to have similar qualities. In the case of running shoes, Asics will use Gel and Adidas will pump Boost, for example. This is why I assign 0.2 points when the same brand is present.
- Terrain — Sneakers made for road and for trail have a different build and feel. This is so pronounced that I am willing to give an entire point towards the similarity score when we have a match.
- Strike pattern — No matter how much work we put into our running technique, most of us remain heel strikers. In my experience, however, there is not much difference between the Heel and Midfoot categories. I’m setting the rule at 0.5 points if the shoes are labelled the same.
- Competition — Racing flats are usually lighter and less cushioned. Unlike daily trainers, they will be kept for speed runs. I’m assigning 1 pt to the rule, as I lack the experience of ever trying competition shoes. This might be an overshoot (I will comment on it in the Next steps section).
- Waterproofing — This comes down to the presence of a membrane which, to me, sets the sneakers in a different category. 1 point should be enough to set the waterproof shoes apart.
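To make the nominal rules concrete, here is a minimal Python sketch of the same idea. This is an illustration, not the Excel formula itself: the field names and the two example shoes are hypothetical, while the point values are the ones listed above.

```python
# Illustrative sketch of the nominal-scale rules (not the actual Excel formula).
# Field names and the two example shoes are made up; the point values mirror
# the ones described in the post.

NOMINAL_POINTS = {
    "brand": 0.2,
    "terrain": 1.0,
    "strike": 0.5,
    "competition": 1.0,
    "waterproof": 1.0,
}

def nominal_score(reference, candidate):
    """Award points for every nominal property where the labels match."""
    return sum(points
               for field, points in NOMINAL_POINTS.items()
               if reference[field] == candidate[field])

trail_gtx = {"brand": "Asics", "terrain": "Trail", "strike": "Heel",
             "competition": "N", "waterproof": "Y"}
road_racer = {"brand": "Adidas", "terrain": "Road", "strike": "Heel",
              "competition": "Y", "waterproof": "N"}

print(nominal_score(trail_gtx, road_racer))  # only the strike pattern matches -> 0.5
```

In the spreadsheet the same comparison runs as one array operation per column; here each property check simply adds its points when the labels coincide.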
Measures on a ratio scale
- Weight — I penalize ‑0.1 pt for every 10 full grams of difference in weight. I am not that preoccupied with weight.
- Heel & Forefoot height — I considered ignoring these measures and relying on the drop only. Yet, thicker shoes feel different from ones with a slim sole unit. Therefore, I assign a very modest ‑0.01 pt/mm penalty.
- Drop — This turned out to make quite a difference for me. Consequently, I apply a rather heavy penalty of ‑0.1 pt per mm of difference.
Measures on an ordinal scale
A 7‑step scale (very low — very high) is used.
- Breathability — dependent on the context. Sneakers layered with Gore-Tex (or another membrane) don’t breathe as much as mesh-only kicks. However, a shoe can still be quite breathable for an insulated option. I rely on the Waterproof property to split the two groups and then subtract 0.2 pt for every level of difference.
- Cushioning — Similarly to how waterproofing affects breathability, being a competition shoe or not affects cushioning. While watching reviews I noticed how race day kicks are praised for their padding, even though racing flats generally carry less foam than daily trainers. This is why I rely on the Competition flag to keep the two groups apart and then subtract 0.2 pt per bucket of cushioning difference.
It only looks complicated. I promise.
This is an elaborate INDEX/MATCH. What we index is the column with shoe names. The appropriate row comes from matching the highest similarity score within an array that holds the scores of all shoes. The best score itself is found with the =LARGE() function.
As always, though, working through a formula from the inside out is best, so let’s do that.
Constructing the scoring array
We can measure every property of a shoe against the same feature of the other sneakers, attaching the scores outlined above in the process. Summing the resulting points for every shoe produces a scoring array. That is what we achieve with this block of the formula:
A handful of things to note:
- As I work in a table, structured referencing is used.
- For all score input cells, I have created a named range so it is easier to know what figures are referenced.
- All sub-formulas are wrapped in =ROUND(), as this keeps floating-point noise out of the scores and ensures the solution works reliably.
I have faced a problem with array formulas (and sometimes regular ones) that usually doesn’t matter, but this time it was important. Occasionally, Excel turns the result of a simple mathematical operation on integers into a decimal with many digits after the dot. Usually we look at results that are rounded (at least for display purposes), so this would not be a problem. In the context of this task, however, even the slightest difference counts. This is the familiar floating-point precision behaviour, and I am sure it is well documented somewhere, but I didn’t have the time to explore it.
Let’s check what happens inside the block:
For nominal values — marked with (1)
Applicable to all columns that hold nominal values. The formula loops through the records and awards the assigned score whenever the same label is present. For example, if we are at the row for the Asics Gel Sonoma 3 GTX (which is waterproof), this part
would give 1 point to the other shoes with membrane and nothing to the rest of them.
For ratio scales — marked with (2)
Pretty much the same happens when we deal with data on a ratio measurement scale. However, instead of simply giving points or not, we multiply the award/penalty by the number of levels of difference.
As Brooks Ghost 12 has one of the highest drops (12 mm), employing the formula
ABS([@[Drop (mm)]]-[Drop (mm)])*Drop_Points
would take the absolute value of the drop difference in mm and multiply it by the ‑0.1 pt assigned. The Altra Escalante 2.0 has no heel-to-toe offset and will bear a 12*-0.1=-1.2 pt toll. The Mizuno Wave Rider 23, on the other hand, has the same 12 mm offset and will not be penalized, which helps it achieve a better similarity score.
In the case of weight, I add another layer of complication. I consider 10 g to be the step at which a difference in mass can be sensed. Therefore, I quantize the weight difference with the =QUOTIENT() function.
QUOTIENT(ABS([@[Weight (g)]]-[Weight (g)]);10)*Weight_Points
If I didn’t insist on 10 full grams of difference, I could skip the extra step and simply scale the scoring rule down tenfold. The result wouldn’t be precisely the same, but it would still work. It comes down to your own preferences.
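The two ratio-scale penalties can be sketched in Python as well. Excel’s QUOTIENT(ABS(diff); 10) corresponds to integer division (//), and round() stands in for the =ROUND() wrapping mentioned earlier. The drop figures come from the post; the weight figures are made up for illustration.

```python
# Sketch of the ratio-scale penalties, using the point values from the post.
# QUOTIENT(ABS(diff); 10) in Excel corresponds to // here; round() mirrors
# the =ROUND() wrapping that keeps floating-point noise out of the scores.

WEIGHT_POINTS = -0.1   # per 10 full grams of difference
DROP_POINTS = -0.1     # per mm of drop difference

def weight_penalty(ref_g, cand_g):
    # Only full 10 g steps count, just like QUOTIENT(ABS(...); 10)
    return round(abs(ref_g - cand_g) // 10 * WEIGHT_POINTS, 6)

def drop_penalty(ref_mm, cand_mm):
    return round(abs(ref_mm - cand_mm) * DROP_POINTS, 6)

# Brooks Ghost 12 (12 mm drop) vs Altra Escalante 2.0 (0 mm drop):
print(drop_penalty(12, 0))       # -> -1.2
# Hypothetical weights: an 18 g gap counts as one full 10 g step:
print(weight_penalty(287, 269))  # -> -0.1
```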
For ordinal scales — marked with (3)
The two properties measured on an ordinal scale require an extra step as well. As Excel can’t process natural language, we need to tell it how to handle the levels of difference between the options. There are various ways to do this, and I have opted for the simplest of them. On the Ordinal Scale sheet I placed all categories in their respective order. Then I =MATCH()-ed the shoe data against the options to get the respective position of each sneaker. The difference in positions times the assigned score is what we use for the scoring:
As an alternative, we could use =CHOOSE(), for example. However, I don’t think it would add any benefit.
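The same position-lookup idea can be sketched in Python, with list.index() playing the role of =MATCH(). Note that the intermediate level names below are my assumption; only the “very low” and “very high” endpoints come from the post.

```python
# Sketch of the ordinal-scale scoring. The 7-step scale is laid out in order;
# list.index() corresponds to Excel's =MATCH(). The intermediate level names
# are assumed for illustration.

LEVELS = ["very low", "low", "moderately low", "moderate",
          "moderately high", "high", "very high"]
LEVEL_POINTS = -0.2  # per level of difference

def ordinal_penalty(ref_level, cand_level):
    steps = abs(LEVELS.index(ref_level) - LEVELS.index(cand_level))
    return round(steps * LEVEL_POINTS, 6)

print(ordinal_penalty("very high", "moderate"))  # 3 levels apart -> -0.6
```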
To fine-tune the results — marked with (4)
After we have summed all the scores from the segments above, we can attach a couple of riders that refine the result.
- The first condition requires that the shoe name for which we make the comparison does not match the one in the full array. This is needed because we run the formula on all rows, and comparing a shoe with itself would certainly place it very high in the ranking.
- The last line is a control that prevents shoes I already own from showing up as similar propositions. Admittedly, this does not strictly follow the spirit of the task. Yet, a person tends to stick with similar sneakers, so the shoes I own would cluster together and not allow the ones I don’t own to pop up.
In both cases, we can see that checking the condition returns TRUE/FALSE:
*([Shoe Name]<>[@[Shoe Name]])
Multiplying the score from the other checks by TRUE will leave it the same. However, the product of FALSE and a decimal number would be FALSE too — a property that we will need in the next step.
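A Python analogue of these riders: rows failing a condition are marked so the ranking step can skip them, mirroring how the FALSE entries of the scoring array are passed over. All names and scores here are made up.

```python
# Python analogue of the two riders: rows that fail a condition are knocked
# out, so the ranking step skips them (just as =LARGE() ignores the FALSE
# entries of the scoring array). Names and scores are toy data.

scores = {"Shoe A": 2.3, "Shoe B": 1.7, "Shoe C": 3.1, "Shoe D": 2.3}
owned = {"Shoe C"}
reference = "Shoe A"

# Knock out the reference shoe itself and everything already owned
masked = {name: (score if name != reference and name not in owned else None)
          for name, score in scores.items()}

valid = {name: s for name, s in masked.items() if s is not None}
best = max(valid, key=valid.get)
print(best)  # Shoe D ("Shoe C" is owned, "Shoe A" is the reference)
```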
Taking it to the end
If in the full formula we replace the entire scoring array with a placeholder, it will fold into something much more manageable:
This is easily digestible:
- (A) — The =LARGE() function will look through the scoring array, ignore the FALSE values and return the biggest (or second biggest, etc.) value.
- (B) — It will be then =MATCH()-ed against the very same scoring array it was extracted from.
Knowing the position of the best performing shoe, we obtain its name by indexing the column with names.
*As always — array formulas are entered with the [Ctrl]+[Shift]+[Enter] combo.
**Notice that there is no error handling around the formula. I leave it this way because if something is not working, I want to know about it immediately. Wrapping it in =IFERROR() would mask the issue.
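The folded LARGE → MATCH → INDEX chain can be sketched in Python too. This is a rough equivalent on toy data, not the actual formula: heapq.nlargest stands in for =LARGE() and list.index for =MATCH().

```python
# Rough Python equivalent of the folded formula: LARGE -> MATCH -> INDEX.
# heapq.nlargest stands in for =LARGE(), list.index for =MATCH();
# names and scores are toy data.

import heapq

names = ["Shoe A", "Shoe B", "Shoe C", "Shoe D"]
scoring_array = [2.3, 3.1, False, 1.7]  # False = a row knocked out by the riders

valid = [s for s in scoring_array if s is not False]  # LARGE skips the FALSEs

def nth_most_similar(n):
    target = heapq.nlargest(n, valid)[-1]   # (A) the n-th biggest score
    position = scoring_array.index(target)  # (B) MATCH it against the array
    return names[position]                  # INDEX the column with names

print(nth_most_similar(1))  # Shoe B
print(nth_most_similar(2))  # Shoe A
```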
Expanding the calculations
On sheet Single Shoe Expanded we can see how the actions described above work for a single pair of kicks. If we select a shoe, the spreadsheet would draw all scores for other sneakers vs the chosen one:
Then the numbers are summed, resulting in the scoring array in the Total Similarity Score column. The green rows and prefixes show which shoes rank the highest.
All the calculations on this sheet are performed for every row of data on the Finding similar items sheet. This is what makes array formulas elegant, but it also puts a hefty amount of pressure on the processor. In fact, the array formula has to calculate the entire scoring array twice (once for =LARGE() and once for =MATCH()), which costs even more CPU cycles.
Taking a look at the results
Checking the result, I am mostly satisfied:
- The Adidas UltraBoost 20 dominates when compared to my warm weather sneakers. This is in line with my expectations, as the previous editions were shoes I had considered buying. Moreover, the Brooks Ghost 12 and the Mizuno Wave Rider 23 had both caught my attention when I was composing the list.
- For shoes that I don’t own, the propositions seem reasonable as well. Race day shoes or waterproof ones are likely to be put against each other, for example. Also, pairs of similar shoes (e.g. the Asics Gel Sonoma 4 GTX and the Salomon Speedcross Vario 2 GTX) tend to be suggested against each other. This is to be expected given the length of the list.
At the same time, some shortcomings are evident as well:
- The Vibram FiveFingers V‑Run is rather unique. In fact, I would say it is nothing like the other shoes. Nevertheless, the formula holds no such prejudice and will make suggestions anyway. These might not be useful and could even be misleading.
- Seeing the Asics Gel Kinsei 6 listed as one of the options most similar to the Asics Gel Sonoma 4 GTX raises a red flag. Not only are there other waterproof shoes, but one of them, the Asics Gel FujiTrabuco 7 GTX, is from the same manufacturer. This needs further investigation.
But… We have another problem
If we select the Brooks Caldera 4 on the Single Shoe Expanded sheet, we will notice something odd:
Both the Brooks Ghost 12 and the Adidas UltraBoost 20 have the same score. Either of them could therefore end up as the 3rd most similar pair to the Caldera.
Even worse, due to how =MATCH() resolves ties (it always returns the first occurrence), if the 2nd-place Adidas Adizero Boston 8 were not there, the same shoe would have been suggested twice.
Adding a tiebreaker rule
One can think of different ways to overcome the issue. What I did can be seen on the sheet Finding similar items Tiebreak. It is the simplest possible approach and adds the following to the scoring array block:
I take the row number of every pair and multiply it by a factor small enough that the product of the two never exceeds the smallest scoring increment. This creates a tiny drift that sets apart sneakers with otherwise identical scores.
*I am obviously not making this perfectly future proof: rows past 1000 might significantly affect the scoring. However, I don’t expect the list to go that far, and using a hard-coded input, bad practice as it is, is not much of a problem here.
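The tiebreak idea can be sketched in Python. The 0.0001 factor below is an assumption, chosen to keep the drift under the smallest scoring increment (0.01 pt) for any list shorter than 100 rows; the scores are made up.

```python
# Sketch of the tiebreaker: every row number times a tiny factor is added to
# the row's total, so otherwise identical scores can no longer collide.
# The 0.0001 factor is an assumed value, kept below the smallest scoring
# increment (0.01 pt) for row numbers under 100.

TIE_FACTOR = 0.0001

def with_tiebreak(scores):
    return [round(s + row * TIE_FACTOR, 6)
            for row, s in enumerate(scores, start=1)]

tied = [1.2, 0.8, 1.2]      # rows 1 and 3 share the same total
print(with_tiebreak(tied))  # [1.2001, 0.8002, 1.2003]: the tie is broken
```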
Next steps
At this point we have a working solution in Excel that suggests similar items based on their properties. Yet, such tools are seldom made for one-off use, and the work should carry on.
- Feedback — It is important to compare the formula output with the real-world result. In the case of running shoes, I will be minding the results next time I am shopping.
- Adjustments — If there is room for improvement, use it. This might require adjustments to the scoring, an extra characteristic to be included, or something to be dropped.
- Consider going from passive comparison to active ranking. Chances are you are not interested in which items are most similar but in which are better. In that case, we must change the scoring system. For example, lighter shoes are better than heavy ones, so instead of penalizing the weight difference, we can award points when grams have been shaved off.
- Make it more efficient. I still don’t use Office 365, but the =LET() function was recently introduced there. With it, you can calculate the entire scoring array block only once and make the whole formula more readable.
Whatever it is, you have the starting point. Go make it better.