Extracting unique items from list of entries

Before all — Microsoft made life much eas­i­er by intro­duc­ing the =UNIQUE() func­tion in Excel 365. It is a pow­er­ful tool with one short­com­ing — it is not back­wards com­pat­i­ble with old­er ver­sions. This means that even as of the moment of writ­ing (Feb 2020) there is a siz­able share of non-upgrad­ed user who can’t take advan­tage of the func­tion­al­i­ty — not only the for­mu­la is absent when they start typ­ing, but it will also resolve in error if a file with it is sent to them.

This is why the issue is approached old-school by rely­ing on for­mu­las avail­able in Office 2007 and lat­er. Fur­ther­more, stick­ing with the com­pli­cat­ed for­mu­la has some flex­i­bil­i­ty ben­e­fit that will be demon­strat­ed as well.


The fol­low­ing file is used for the pur­pos­es of the tutorial:


One of the fea­tures of the split­ting tool is the option to list the routes tak­en as described by their points of orig­i­na­tion and des­ti­na­tion (sep­a­rate­ly). Total length in km will be present as well for lat­er use. This is an input on the sheet Routes.

Since users need be able to use drop-down lists on sheet Cal­cu­la­tor to choose start­ing and end­ing points of the jour­ney, cou­ple of for­mu­las must be present:

  • The first goes through the list of orig­i­na­tion points and lists the unique val­ues among them. This way the user will not see dupli­cate val­ues in the drop-down menu.
  • The sec­ond one accounts for what has been cho­sen as start­ing place and goes through the des­ti­na­tions’ list extract­ing the unique items relat­ed to the trip ori­gin only.

For bet­ter read­abil­i­ty of the for­mu­las the range of cells where routes are logged (Sheet Routes) are con­vert­ed into a table named Routes­List. This step is not nec­es­sary but makes the work a bit more future proof. If all the rows in the table have been filled-in and add new entry is made on the line just beneath it, the table will auto­mat­i­cal­ly expand. The way tables are ref­er­enced (as demon­strat­ed lat­er) means that we no for­mu­las will need be updat­ed to accom­mo­date the extra lines of data.

The gen­er­al for­mu­la would be exam­ined inside out as it is being solved by Excel.

=IFERROR(LOOKUP(2;1/((COUNTIF($E$3:E3;RoutesList[From])=0)*(RoutesList[From]<>“”));RoutesList[From]);”-”)

At first the COUNTIF($E$3:E3;RoutesList[From])=0) is studied.

Marked with (1) is the short list of unique items. This is a run­ning ref­er­ence so as the for­mu­la is expand­ed it will per­form its func­tion against an increas­ing range of cells.

Marked with (2) is the routes’ table From entries. Unlike the pre­vi­ous range, this one is tak­en as a whole regard­less of how far the for­mu­la has been extended.

On every instance when the func­tion is run the the two argu­ments com­bined would return inte­ger val­ue for every row with trip orig­i­na­tion points. This will vary between 0, if the record is not present in the unique list yet, and 1 — if the item has already been listed.

Click to see how it looks

Then at the point marked with (3) the pro­duced stream of results is com­pared with the val­ue of 0. This will return TRUE/FALSE val­ues for every 0/1.

Click to see how it looks

Next in the big for­mu­la, by exe­cut­ing the (RoutesList[From]<>“”) part, a test is per­formed whether the records in the From field of the long list are not emp­ty. This is made so even if not all table rows are filled-in, the blanks will not mak­ing it to the short list.

As the ref­er­ence to the col­umn sug­gests — on every instance when we run the full for­mu­la all records are ref­er­enced. This is a log­i­cal test and will return TRUE/FALSE value.

Click to see how it looks

As the brack­ets guide, next the two streams of TRUE/FALSE val­ues (described above) are being mul­ti­plied. For each row in the long list, a new val­ue will be produced.

Mul­ti­pli­ca­tion of log­i­cal val­ues is not exact­ly straight­for­ward. Nat­u­ral­ly, if all of the argu­ments are TRUE, then the out­come is TRUE as well. How­ev­er, Excel would not resolve the TRUE*TRUE prod­uct as TRUE but as a numer­ic val­ue of 1 instead. If any of the fac­tors is FALSE (eg. TRUE*FALSE or FALSE*FALSE) then the result will be 0. Excel assumes that a log­i­cal test that is TRUE rep­re­sents an inte­ger val­ue of 1; and FALSE — 0. Mul­ti­ply­ing any num­ber of those would return 1 if all argu­ments are equal to 1; but will be equal to 0 if any of the argu­ments is 0. 

1st argu­ment2nd argu­mentMul­ti­pli­ca­tionResult
TRUETRUETRUE*TRUE or 1*1=1
TRUEFALSETRUE*FALSE or 1*0=0
FALSEFALSEFALSE*FALSE or 0*0=0
Click to see how it works out with the exam­ple

Next is the divi­sion of 1 by the 1/0 stream obtained in the pre­vi­ous step. 

Hav­ing 0 in the denom­i­na­tor is clear­ly wrong math­e­mat­i­cal­ly and for all instances this is tried Excel will return #DIV/0! error. Usu­al­ly this will be great incon­ve­nience but in this case works in our advantage.

Click to see how the divi­sion trans­lates as results

This might be obvi­ous but divid­ing 1 by 1 or 0 will always result into 1 or error (#DIV/0!). No oth­er results are possible.

At this point the =LOOKUP() func­tion comes into play. 

The entire expres­sion of

LOOKUP(2;1/((COUNTIF($E$3:E3;RoutesList[From])=0)*(RoutesList[From]<>“”));RoutesList[From])

has been turned into

LOOKUP(2;Stream of 1s and #DIV/0! errors;RoutesList[From])

The high­light­ed stream is essen­tial­ly a prop­er­ty of the items in the From col­umn of the table (RoutesList[From]) and tells if

  • the item is already record­ed in the short list or is a blank val­ue — #DIV/0! val­ue. This also means it will not be con­sid­ered by the =LOOKUP().
  • the item is not yet short­list­ed — val­ue of 1.

The expert move is putting 2 as the first argu­ment of the func­tion. Why not input 1 if no oth­er val­ues can appear? In the doc­u­men­ta­tion on office.com, the fol­low­ing note can be found:

Impor­tant: The val­ues in lookup_vector [that would be the 1/#DIV/0! stream] must be placed in ascend­ing order: …, ‑2, ‑1, 0, 1, 2, …, A‑Z, FALSE, TRUE; oth­er­wise, LOOKUP might not return the cor­rect val­ue. Upper­case and low­er­case text are equivalent. 

Look­ing-up for 1 has the poten­tial to mess-up the work­ings of the for­mu­la if the appro­pri­ate set of records is found in the lookup_vector (the entries in the From col­umn in the long list). Using 2 as an argu­ment is cir­cum­vent­ing the issue of the data not pro­duc­ing sort­ed stream of 1/#DIV/0! errors.

What is achieved is that on every oth­er row with the extract­ing for­mu­la the lat­est non-list­ed & non-blank val­ue from the long list is being populated.

If the =LOOKUP() func­tion can’t find any unique val­ue to list, it will resolve in #N/A error. That is why =IFERROR() enclos­es the =LOOKUP() and swaps any error with “-” (the val­ue with­in quotes can be removed or replaces with what­ev­er suits your purposes).

At this point extract­ing the unique jour­ney start­ing points is achieved. In the next post it will be demon­strate how the short list can be dynam­i­cal­ly con­vert­ed into a drop-down list from which the users can pick their choice (on the sheet Cal­cu­la­tor). The cell where this is done is a named range From_User_Choice.

When extract­ing the unique des­ti­na­tions, the select­ed ori­gin val­ue will be need­ed — despite the ref­er­ence being per­fect­ly good as a reg­u­lar one (e.g. =Calculator!$B$2), nam­ing it makes it much eas­i­er to under­stand what kind of infor­ma­tion is obtained. This helps not only when mak­ing screen­shots for an Excel blog but also when the file is opened months lat­er and one has no mem­o­ry of what was put in B2.


Obtain­ing the unique des­ti­na­tions is the next task.

This uti­lizes the same for­mu­la that has been used so far with just cou­ple of changes:

  • As expect­ed — the columns referred before are now shift­ed to the right
  • There is one extra log­i­cal check inside the body of the =LOOKUP() function.

The addi­tion of *(RoutesList[From]=From_User_Choice) is a call back to the TRUE/FALSE results and their mul­ti­pli­ca­tion. For every entry of the To col­umn in the long list, the cor­re­spond­ing val­ue in the From col­umn is test­ed against the val­ue cho­sen by the user as start­ing point (on the sheet Cal­cu­la­tor). This way anoth­er lay­er is intro­duced in the mul­ti­pli­ca­tion of the log­i­cal tests’ results.

If we assume that Meet­ing point 2 has been select­ed, the new (3rd) check will return FALSE for all rows not hav­ing it the From col­umn. Regard­less of how the oth­er two checks have turned out for the val­ues in the To column:

  • If the par­tic­u­lar val­ue is already in the short­list or not;
  • and if the val­ue is blank or not;

in case FALSE is returned in the third check, it will be treat­ed as 0 when mul­ti­ply­ing and in the work­ings of the for­mu­la, this means no new record for the short list.

Stack­ing more rules by mul­ti­ply­ing the log­i­cal tests’ out­comes grants the flex­i­bil­i­ty of being able to pre­cise­ly pin­point the items that need be short­list­ed. In the con­text of the exam­ple used, there might be more than one route between the same start­ing and end­ing points — mean­ing dif­fer­ent dis­tance trav­eled. Intro­duc­ing col­umn that car­ries such data can be eas­i­ly accom­mo­dat­ed into the fil­ter­ing rules.


Notes:

  • The solu­tion does not sort alpha­bet­i­cal­ly the results on any step. This makes it more appro­pri­ate to use when there will be rel­a­tive­ly small num­ber of unique values.
  • The above expec­ta­tion is relat­ed to the num­ber of rows in which the extract­ing for­mu­la will be present. In the­o­ry every row in the long list might have unique val­ue so the lack of dupli­cates will ren­der the unique list being the same length. How­ev­er, if such length is not uti­lized, it will be a drag on the spread­sheet per­for­mance. Arbi­trary deci­sion needs to be made so under­stand­ing the nature of the infor­ma­tion is essen­tial. Prop­er siz­ing helps efficiency.
Series Nav­i­ga­tion« (Overview) The gas mon­ey split­ting toolDynam­ic length named range for drop-down list »