Finding similar items in Excel based on their properties

Years ago I was work­ing on the device pric­ing for a mobile car­ri­er. We used good old spread­sheets for the task. When you man­age a port­fo­lio of gad­gets with a fair amount of tech­ni­cal spec­i­fi­ca­tions, it is easy to get lost try­ing to com­pare one with the oth­er. This is why I decid­ed not to be reliant on my mem­o­ry and put Excel to work — I made it sug­gest which devices are sim­i­lar. This is an exam­ple of an array for­mu­la that packs togeth­er sev­er­al others.


I will first go through the data and explain my log­ic behind how I approach the com­par­i­son. If you want, you can down­load the exam­ple and skip to the for­mu­la expla­na­tion.


The file attached will be used as we go:


Running shoes — data overview

To show how the solu­tion works I need some data. I assem­bled a list of run­ning shoes that I already own (and use as intend­ed) or I can find in out­lets near me. I got their prop­er­ties from Run­Re­peat:

  • Some char­ac­ter­is­tics (eg. weight, drop) are pure­ly technical;
  • Oth­ers, like breatha­bil­i­ty and cush­ion­ing, are subjective.
  • Where data/scoring was miss­ing I checked reviews or used my knowl­edge of try­ing the shoes in person.

Smart­phones have a finite set of tech­ni­cal fea­tures to put against each oth­er. More­over, most cus­tomers actu­al­ly com­pare a few of them. Shoes, on the oth­er hand, being some­thing your wear, are a bit more tricky. The feel­ing of them on your feet can vary wide­ly. This is why I decid­ed to list only major data points and skip some details. The sug­ges­tion pro­duced will guide me about what to go out and try. As a result, I am like­ly to save time by not hav­ing to vis­it all shops.

There are some impor­tant char­ac­ter­is­tics, like price and colour avail­abil­i­ty, that sure­ly make a lot of dif­fer­ence. How­ev­er, they are sub­ject to cur­rent cir­cum­stances. I assume that when putting the list togeth­er one has already fil­tered out the options that will not be suit­able with regard to cost or style.

Last­ly, I inten­tion­al­ly threw in a mix of shoes that cov­er sev­er­al extremes. This will make it eas­i­er to show the dif­fer­ences between them and how sim­i­lar items are being put together.

Notes on the methodology

As with every com­par­i­son, a cer­tain degree of sub­jec­tiv­i­ty is involved. Once you define what and how much is impor­tant to you, a sim­i­lar­i­ty score can be cal­cu­lat­ed. It will put all sub­jects against each oth­er to arrive at the best pairs. There are some remarks to be made:

  • Inter­nal cor­re­la­tions between fea­tures are not con­sid­ered. In the case of run­ning shoes, high­er cush­ion­ing is very like­ly to be relat­ed to a thick­er sole unit. Rubber/foam tech­nolo­gies (and the laws of physics alto­geth­er) are not dif­fer­ent enough to allow big dif­fer­ences. How­ev­er, cush­ion­ing and thick­ness are scored separately.
  • The most sim­i­lar sneak­er is returned (as an over­all). Still, when it comes to a par­tic­u­lar prop­er­ty, shoes can be very dif­fer­ent. Not only this but the kicks that come in the first place might not be the most sim­i­lar with regard to any par­tic­u­lar comparison.
  • I am not aim­ing for a cer­tain scale when com­put­ing the score (eg. 100 points). Of course, the math­e­mat­ics behind can be fine-tuned for that but since I do not intend to com­mu­ni­cate the score, it makes no sense to put the effort. After all, that num­ber is mere­ly a tool to rank the similarity.
First look and cleaning the data

As I already men­tioned, I did not list the most gran­u­lar data avail­able. How­ev­er, as array for­mu­las can put a toll on the CPU, it is good prac­tice to first con­sid­er what fields we are going to use. This is why on the First look sheet I ini­tial­ly check the records and trim them.

Arch support and arch type are having very little variability.

Arch type and Arch sup­port are impor­tant prop­er­ties. Yet, look­ing through the rows I notice how lit­tle vari­abil­i­ty there is in them. This comes to show that unless we look for a spe­cif­ic type of shoe, chances are we will get one with neu­tral, high-arch sup­port. The para­me­ters would be use­ful for some­one but I will ignore them in the scoring.

Pronation is seldom different from neutral.

We can say the same about Prona­tion — with only 2 of the shoes not neu­tral we can skip this field. So far I haven’t noticed much dif­fer­ence whether I run with shoes cor­rect­ing the over­prona­tion or not. Thus, I don’t assign much val­ue here. Oth­ers will be in need of one such shoe so this is some­thing that might sig­nif­i­cant­ly affect the score.

Detailed breakdown of distances is not needed as running shoes would fall into the daily trainers or race day category.

Fur­ther­more, there are 4 columns to describe what use fits best to the sneak­ers. If we look at the flags, how­ev­er, it is clear that those come down to whether the shoe is a dai­ly train­er or race day gear. Because of this, we can safe­ly col­lapse the data into one col­umn (Com­pe­ti­tion) that will bear Y/N flag.

*In the real-life I would ignore the redun­dant fields. In order to make this post eas­i­er to fol­low, I will delete them and work on a sep­a­rate sheet — Find­ing sim­i­lar items.

The points matter

There are sev­er­al types of rank­ing involved. For each of them, we try to give more points when a pair of shoes are close to each oth­er in terms of char­ac­ter­is­tics. Alter­na­tive­ly, we can involve a penal­ty if kicks are different.

Regard­less of giv­ing or tak­ing away points, their amount rep­re­sents the impor­tance we give to the cor­re­spond­ing shoe characteristic.

Arbitrary points awarded.
We will be award­ing points for all fields we are left with.
Click to see how and why I’m set­ting the points.
Measures on a nominal scale
  • Brand — We tend to stick with a brand and prod­ucts of the same ven­dor can be expect­ed to have sim­i­lar qual­i­ties. In the case of run­ning shoes, Asics will use Gel and Adi­das will pump Boost for exam­ple. This is why I assign 0,2 points if the same brand is present.
  • Ter­rain — Sneak­ers made for road and trail have dif­fer­ent build and feel. This is so pro­nounced that I am will­ing to give an entire point towards the sim­i­lar­i­ty score when we have a match.
  • Strike pat­tern — No mat­ter how much work we put into our run­ning tech­nique, most of us remain heel strik­ers. In my expe­ri­ence, how­ev­er, there is not so much dif­fer­ence between the Heel & Mid­foot cat­e­gories. I’m set­ting the rule at 0,5 points if shoes are labelled the same.
  • Com­pe­ti­tion — Rac­ing flats are usu­al­ly lighter and less cush­ioned. Unlike the dai­ly train­ers, they will be kept for speed runs. I’m assign­ing 1 pt to the rule as I lack the expe­ri­ence of ever try­ing com­pe­ti­tion shoes. Might be an over­shot (I will com­ment in the Next steps section).
  • Water­proof­ing — This is relat­ed to the pres­ence of mem­brane which, to me, sets the sneak­ers in dif­fer­ent cat­e­gories. 1 point should be enough to set apart the water­proof shoes.
Measures on a ratio scale
  • Weight — I penal­ize ‑0,1 pt for every 10 full grams of dif­fer­ence in the weight. I am not that pre­oc­cu­pied with weight. 
  • Heel & Fore­foot height — I was con­sid­er­ing to ignore those mea­sures and rely on the drop only. Yet, thick­er shoes feel dif­fer­ent than ones with slim sole unit. There­fore, I assign a very mod­est ‑0,01 pt/mm penalty.
  • Drop — This turned out to be quite a dif­fer­ence for me. Con­se­quent­ly, I put a rather heavy penal­ty of ‑0,1 pt per mm difference.
Measures on an interval scale

A 7‑step scale (very low — very high) is used.

  • Breatha­bil­i­ty — depen­dent on the con­text. Sneak­ers lay­ered with Gore-Tex (or anoth­er mem­brane) don’t breath as much as mesh-only kicks. How­ev­er, a shoe can be very well breath­able for an insu­lat­ed option. I am rely­ing on the Water­proof prop­er­ty to split the groups. Then, I will sub­tract 0,2 pt for every lev­el of difference.
  • Cush­ion­ing — Sim­i­lar­ly to how water­proof­ing affects breatha­bil­i­ty, being a com­pe­ti­tion shoe or not will affect cush­ion­ing. While watch­ing reviews I noticed how race day kicks are praised for their padding. It was so despite rac­ing flats hav­ing in gen­er­al less foam than the dai­ly run­ners. This is why I will rely on the Com­pe­ti­tion flag to keep the two groups apart and then with­draw ‑0,2 pt per cush­ion­ing buck­et difference.
Don’t panic!

It only looks com­pli­cat­ed. I promise.

Formula for finding similar items.

This is an elab­o­rat­ed INDEX/MATCH. What we index is the col­umn with shoe names. Get­ting the appro­pri­ate row is match­ing the high­est sim­i­lar­i­ty score among an array with all of them. The best score is found by the =LARGE() function.

How­ev­er, work­ing the for­mu­la inside-out always works best so let’s do it.

Constructing the scoring array

We can mea­sure every prop­er­ty of a shoe against the same fea­ture of the oth­er sneak­ers. In the process, we can attach the scores out­lined above. Sum­ming the result­ing points for every shoe would pro­duce a scor­ing array. That is what we achieve with this block of the formula:

Scoring array block of the formula.

A hand­ful of things to note:

  • As I work in a table, struc­tured ref­er­enc­ing is used.
  • For all score input cells, I have cre­at­ed a named range so it is eas­i­er to know what fig­ures are referenced.
  • All sub-for­mu­las are brack­et­ed in =ROUND() as this ensures the solu­tion would work fine.
Click here if you care why I had to use =ROUND().

I have faced a prob­lem with array for­mu­las (and some­times reg­u­lar ones) that usu­al­ly does­n’t real­ly mat­ter but this time it was of impor­tance. Occa­sion­al­ly Excel would turn the result of a sim­ple math­e­mat­i­cal oper­a­tion with inte­gers into a dec­i­mal with many dig­its after the dot. Usu­al­ly, we look at results that are round­ed (at least for dis­play pur­pos­es) so this would not be a prob­lem. How­ev­er, in the con­text of the task we tack­le, even the slight­est dif­fer­ence counts. I am sure this sit­u­a­tion is well doc­u­ment­ed some­where but I did­n’t have the time to explorer.

Lets check what hap­pens inside the block:

For nominal values — marked with (1)

Applic­a­ble to all columns that have nom­i­nal val­ues in them. The for­mu­la would loop through the records and take the scor­ing assigned if the same label is present. For exam­ple, if we are at the row for Asics Gel Sono­ma 3 GTX (which is water­proof), this part

IF([@Waterproofing]=[Water­proof­ing];Waterproffing_Points;0)

would give 1 point to the oth­er shoes with mem­brane and noth­ing to the rest of them.

For ratio scales — marked with (2)

Pret­ty much the same hap­pens when we deal with data on a ratio mea­sure­ment scale. How­ev­er, instead of giv­ing points or not, we mul­ti­ply the award/penalty times the lev­els of difference.

As Brooks Ghost 12 has one of the high­est drops (12 mm), employ­ing the formula

ABS([@[Drop (mm)]]-[Drop (mm)])*Drop_Points

would take the mod­u­lus of the drop dif­fer­ence in mm and mul­ti­ply it with the ‑0.1 pt assigned. The Altra Escalante 2.0 has no heel-to-toes off­set and will bear 12*-0.1=-1.2 pt toll. Nev­er­the­less, the Mizuno Wave Rid­er 23 has the same 12 mm off­set and will not be penal­ized, hence, help­ing it achieve bet­ter sim­i­lar­i­ty score.

In the case of weight, I add anoth­er lay­er of com­pli­ca­tion. I con­sid­er 10 g to be the step at which dif­fer­ence in mass can be sensed. There­fore, I quan­tize the weight inequal­i­ty with the =QUOTIENT() function.

QUOTIENT(ABS([@[Weight (g)]]-[Weight (g)]);10)*Weight_Points

If I did­n’t care there to be 10 full grams of dif­fer­ence, I could pass on the extra step and adjust the scor­ing rule 10 times. It was­n’t going to be pre­cise­ly the same but it was still going to work. It is up to your own preferences.

In the case of ordinal scale — marked with (3)

The two prop­er­ties mea­sured on an ordi­nal scale require extra step as well. As Excel can’t process nat­ur­al lan­guage, we need to direct it on how to han­dle the lev­els of dif­fer­ence between the options. There are var­i­ous ways to do this and I have opt­ed for the sim­plest of them. On sheet Ordi­nal Scale I placed all cat­e­gories in their respec­tive order. Then I =MATCH()-ed shoe data with the options to have the respec­tive posi­tion for each sneak­er. The dif­fer­ence in place­ments times the assigned score would be what we use for the scoring:

ABS(MATCH([@Breathability];Ordinal_Scale;0)-MATCH([Breatha­bil­i­ty];Ordinal_Scale;0))*Breathability_Points

As an alter­na­tive, we can use =CHOOSE() for exam­ple. How­ev­er, I don’t think it will add any benefit.

To fine-tune the results — marked with (4)

After we have summed all the scores from the seg­ments above, we can attach cou­ple of raiders that will pre­cise the result.

  • The first con­di­tion is the shoe name for which we make the com­par­i­son to not match the one in the full array. This is set because we run the for­mu­la on all rows. Com­par­ing one shoe with itself will cer­tain­ly place it very high in the ranking.
  • The last line is a con­trol that will not allow shoes that I already own to show up as sim­i­lar propo­si­tions. Admit­ted­ly, this is not fol­low­ing strict­ly the spir­it of the task. Yet, it makes sense for some­one to stick with sim­i­lar sneak­ers. This will clus­ter them togeth­er and will not allow the oth­ers, not owned, to pop-up.

In both cas­es we can notice that the check­ing the con­di­tion would return TRUE/FALSE:

*([Shoe Name]<>[@[Shoe Name]])
*([Owned]<>“Y”)

Mul­ti­ply­ing the score from the oth­er checks with TRUE will leave it the same. How­ev­er, the prod­uct of FALSE and dec­i­mal num­ber would be FALSE too — a prop­er­ty that we will need in the next step.

Taking it to the end

If in the full for­mu­la we replace the entire scor­ing array with a place­hold­er, it will fold into some­thing much more manageable:

Indexing the scoring array - finding the highest score with the LARGE function.
=INDEX([Shoe Name];MATCH(LARGE((SCORING_ARRAY;1);SCORING_ARRAY;0);1)

This is eas­i­ly digestible:

  • (A) — The =LARGE() func­tion will look through the scor­ing array, ignore the FALSE val­ues and return the biggest (or sec­ond biggest, etc.) value.
  • (B) — It will be then =MATCH()-ed against the very same scor­ing array it was extract­ed from.

Know­ing the posi­tion of the best per­form­ing shoe, we obtain its name by index­ing the col­umn with names.

*As always — array for­mu­las are entered with the [Ctrl]+[Shift]+[Enter] combo.

**Notice how there is no error han­dling around the for­mu­la. I leave it this way because if some­thing is not work­ing fine, I want to know it imme­di­ate­ly. Brack­et­ing it with =IFERROR() would mask the issue.

Expanding the calculations

On sheet Sin­gle Shoe Expand­ed we can see how the actions described above work for a sin­gle pair of kicks. If we select a shoe, the spread­sheet would draw all scores for oth­er sneak­ers vs the cho­sen one:

Expanding the array formula's calculations to a spreadsheet.

Then the num­bers will be summed and result into the scor­ing array in the Total Sim­i­lar­i­ty Score col­umn. The green rows and pre­fix­es show which shoes are placed the highest.

All the cal­cu­la­tions on this sheet are per­formed for every row of data on the Find­ing sim­i­lar items sheet. This is what makes array for­mu­las ele­gant but puts a hefty amount of pres­sure on the proces­sor. In fact — the array for­mu­la has to cal­cu­late the entire scor­ing array twice so that results in even more CPU cycles.

Taking a look at the results
Finding similar items - looking at the result.
Results are most­ly satisfying

Check­ing the result, I am most­ly satisfied:

  • The Adi­das Ultra­Boost 20 dom­i­nates when com­par­ing to my warm weath­er sneak­ers. This is in line with my expec­ta­tions. The pre­vi­ous edi­tions were shoes I have con­sid­ered buy­ing.
    More­over, the Brooks Ghost 12 and Mizuno Wave Rid­er 23 both got my atten­tion when com­pos­ing the list.
  • For shoes that I don’t own, propo­si­tions seem to be OK as well. Race day shoes or water­proofed ones are like­ly to be put against each oth­er, for exam­ple. Also, pairs of sim­i­lar shoes (eg. the Asics Gel Sono­ma 4 GTX and the Salomon Speed­cross Vario 2 GTX) tend to be sug­gest­ed against each oth­er. This sit­u­a­tion can be expect­ed giv­en the length of the list.

At the same time some short­com­ings are being evi­dent as well:

  • The Vibram FiveFin­gers V‑Run is rather unique. In fact, I would say that it is noth­ing like the oth­er shoes. Nev­er­the­less, the for­mu­la does not have prej­u­dice and will make sug­ges­tions. These might not be regard­ed as use­ful or even misinforming.
  • See­ing the Asics Gel Kin­sei 6 list­ed as one of the options most sim­i­lar to the Asics Gel Sono­ma 4 GTX is rais­ing a red flag. Not only there are oth­er water­proof shoes but one of them, Asics Gel Fuji­Tra­bu­co 7 GTX, is from the same man­u­fac­tur­er. This needs fur­ther investigation.
But… We have another problem

If we select the Brooks Caldera 4 on the Sin­gle Shoe Expand­ed sheet, we will notice some­thing odd:

Two pairs of shoes are ranked at the same place.

Both the Brooks Ghost 12 and the Adi­das Ultra­Boost 20 are hav­ing the same score. Oth­er­wise, it will set any of them as the 3rd most sim­i­lar pair to the Caldera.

Even worse, due to how the =INDEX() works, if the 2nd place Adi­das Adize­ro Boston 8 was not there, the same shoe was going to be sug­gest­ed two times.

Adding a tiebreaker rule

One can think of dif­fer­ent ways to over­come the issue. What I did can be seen on the sheet Find­ing sim­i­lar items Tiebreak. It is the sim­plest pos­si­ble approach and adds the fol­low­ing to the scor­ing array block:

+ROW([Shoe Name])*0,00001)

I am tak­ing the row of every pair and mul­ti­ply it by a fac­tor that is small enough that the prod­uct of the two is no big­ger than the small­est scor­ing incre­ment. This cre­ates a tiny drift that will set apart sneak­ers with oth­er­wise same scoring.

*I am obvi­ous­ly not mak­ing this per­fect­ly future proof. Rows past 1000 might sig­nif­i­cant­ly affect the scor­ing. How­ev­er, I don’t expect to go this far and using hard input, despite being bad, is not much of a prob­lem here.

Next steps

At this point we have a work­ing solu­tion in Excel that would sug­gest sim­i­lar items based on their prop­er­ties. Yet, such tools are sel­dom made for one-off use and work should car­ry on.

  • Feed­back — It is impor­tant to com­pare the for­mu­la out­put with the real-world result. In the case of run­ning shoes, I will be mind­ing the results next time I am shopping.
  • Adjust­ments — If there is room for improve­ment, use it. This might require adjust­ments to the scor­ing; extra char­ac­ter­is­tic to be includ­ed; or some­thing to be dropped.
  • Con­sid­er going from pas­sive com­par­i­son to active rank­ing. Chances are you are not inter­est­ed in what items are most sim­i­lar but which are bet­ter. In this case, we must make changes to the scor­ing sys­tem. For exam­ple, shoes are bet­ter light than heavy. Instead of penal­iz­ing the weight dif­fer­ence, we can award points when grams have been shaved off.
  • Make it more effi­cient. I still don’t use Office 365 but recent­ly the =LET() func­tion was intro­duced to it. With the help of it, you can cal­cu­late the entire scor­ing array block only once and make the entire for­mu­la more readable.

What­ev­er it is, you have the start­ing point. Go make it better.