What is the best way to lookup specific data in a set of lines or columns in VBA - Episode II

We all usually do for (and sometimes for each) loops, and by doing so we largely underutilize some very useful features of Excel, especially the lookup worksheet functions which are very good at fetching things.

This article is a follow-up to the previous similar article on the best way to find a value in a range or an array in Excel with VBA. We all usually do for (and sometimes for each) loops, and by doing so we largely underutilize some very useful features of Excel, especially the lookup worksheet functions which are very good at fetching things. The amazing comments and replies I got on Linkedin for this article pushed me to continue the previous article and try some of the solutions that were offered by my peer VBA developers. I honestly had no idea what the result would be, and I didn’t even know beforehand that some of these approaches were possible.

Let me re-describe the dataset and search scenario we have been looking at in the previous article. The A:B columns in Sheet2 link a capital city to its country. In Sheet3, column A contains a list of capital cities and we want to allocate a value (a country) to each of these capital cities. To make the timing differences easier to see, the search will be repeated 21 times.

Using a more precise timing function

We are getting serious here. We can’t use anymore the simple Time() VBA function that we previously used, as the precision is to the second. And trust me our journey to optimization today will bring us way below one second! What does a biologist do when he needs to looks at very small things? He takes a microscope! We need to get our own microscope.

Let me introduce for those who don’t know it yet the GetTickCount function from the Kernel32 library. Note that this only works well on 32-bit systems. But a quick search on Google should show you a way to make it work on 64-bit systems!

Just add this line at the very beginning of the module:

Private Declare Function GetTickCount Lib "kernel32" () As Long

GetTickCount() works similarly to the Time() function, except it counts roughly the number of milliseconds elapsed since your system has started (with a max of 49 days).

The common VBA search methods we previously tried

I re-timed the methods I used in the previous article, so that we can precisely compare with the new approaches I will be describing today. If you haven’t done it yet, I encourage you to go through this other article. Note that the run times are different (shorter) than in the previous article. This is linked to my current system. For some reason the for loops are much more competitive all of a sudden.

If working on the range object:

For loop: 10 seconds

For each loop: 5 seconds

Find method: 4 seconds

Vlookup function: 2 seconds

If working on a VBA array:

For loop: 4 seconds

For each loop: 5 seconds

Vlookup function: 7 seconds (so the vlookup function on a VBA array is not always faster than a for loop…)

Using the features of the collection VBA object to lookup values

First we need to build our collection with a string key:

Public DataCollection As Collection
Sub MAKE_COLLECTION()
Set DataCollection = New Collection
For Each c In Sheets("Sheet2").Range("A1:A500")
    If c.Value <> "" Then
        DataCollection.Add c.Offset(0, 1).Value, Key:=c.Value
    End If
Next
End Sub

 

Now let us use the collection approach which is another rather mainstream one.

Sub test_collection()
On Error Resume Next
t = GetTickCount()
For i = 0 To 20
    For Each c In Sheets("Sheet3").Range("A1:A500")
        c.Offset(0, 1).Value = DataCollection(c.Value)
    Next
Next
Debug.Print (GetTickCount() - t)
End Sub

Run time: 780 milliseconds

Wow, looks like we already crushed the competition! Hold-on to your seats because we are only becoming serious now.

Using the range array version of lookup functions

Now I will try the solution proposed by Peter B. using named formulas, named ranges, and the lookup function. This approach has some prerequisites here:

Sub test_Peter()
t = GetTickCount()
For i = 0 To 20
    [Etat] = [trouver.Etat]
    'trouver.Etat is a named formula
    '= IF(LEN(Capitale), LOOKUP(Capitale,data), "" )
    'Capitale is a named range with the search values
    'data is a named range of two columns Capital | Country 
    'Etat is named range where found values are placed
Next
Debug.Print (GetTickCount() - t)
End Sub

Peter solution’s run time: 202 milliseconds

Ok, wow that was kick-ass fast!

Let us break down this solution in a more common expression. Evaluate simply computes the named formula. Using the brackets is similar to using the range method. For example here: [Etat] = Range("Etat"). The LOOKUP function is similar to vLookup and hLookup, do have a look at the specifications of the Lookup function.

So rewriting this solution in a way that is comparable to other solutions, we have:

Sub test_Peter2()
t = GetTickCount()
Set SearchList = Sheets("Sheet3").Range("A1:A500")
Set DataSet = Sheets("Sheet2").Range("A1:B500"
For i = 0 To 20
    Sheets("Sheet3").Range("B1:B500") = WorksheetFunction.VLookup(SearchList, DataSet, 2, 1)
Next
Debug.Print (GetTickCount() - t)
End Sub

Run time: 260 milliseconds

Note that Peter’s solution is still slightly faster. I suppose that using named ranges instead of actual range coordinates is a little faster to execute.

There are two very interesting take-away here:

  • Using array lookup functions is very efficient on large datasets,
  • Using “binary” lookup instead of “linear” lookup is more efficient on large datasets.

I expect that most people are puzzled, just like I was when I did some search for this article. Let us have a look at what these sentences actually mean.

The array version of the lookup function in Excel VBA

Few people know it (and I included the pre-article version of me here), but the lookup, vlookup and hlookup can also be used on arrays instead of single values. Don’t misunderstand I am not talking about VBA arrays here!

In Excel you can use vlookup in two ways:

  • Single cell function: VLOOKUP(OneCell, SearchRange, ColumnNumber, 0)
  • Matrix function: {VLOOKUP(SeveralCells, SearchRange, ColumnNumber, 0)}

The binary lookup algorithm

The Binary search was recommended as a comment to my previous article by Michael B., and I have to admit I didn't know what it really was before.

The binary lookup algorithm is a fancy name for what an Excel user would call a non-exact match with the lookup functions. Remember that last parameter in VLOOKUP that you nearly always set to FALSE (or 0)? The 0 stands for the exact match algorithm while the 1 stands for, well, non-exact match.

  • Exact Match (Linear search): VLOOKUP(OneCell, SearchRange, ColumnNumber, 0)
  • Non-Exact Match (binary search): VLOOKUP(OneCell, SearchRange, ColumnNumber, 1)

What is the point of a function that doesn’t match the value you are looking for you may wonder? It actually matches with the value that is closest to the value you are looking for.

  • The main drawback of this non-exact lookup is that the searched dataset (specifically the column containing the searched values) has to be sorted in ascending order.
  • The main advantage is that the non-exact match is much faster to run on very large datasets.

How can it be faster? Instead of looking at all the values linearly, the algorithm will look at the value in the middle of the dataset and compare it to the value you are looking for. For example if middle value > searched value, then it means that the value we are looking for is in the first half of the dataset, and the other half can be discarded, and so on until the “middle value” is equal to the searched value or the dataset can’t be split further. Now you see why the dataset has to be sorted!

Using the binary lookup worksheet functions in VBA

Now we will dig further into this Binary search analysis. Because the search times are really short and the resolution (even that of the millisecond timer) isn’t that good, we need to increase the number of search and reduce the other elements impacting on the search time like worksheet interactions.

Now let us think for a second about what the Binary search algorithm we just described really means. Obviously on short search datasets, the performance difference between the Linear and Binary search is tight. But when the size of the search dataset increases, the binary search algorithm clearly dominates the Linear search algorithm. After only 5 iterations, the Binary search only has to search the remaining $0.5^5 = 3.1%$ of the whole dataset, while the linear search is sadly only 5 rows down the full list!

We can even try and generalize this a little: we want to find the number I of iterations required to reduce the number of items left to search to 1 (finding the result), let there be X records to search through.

$ X . 0.5^I = 1 $ => $ 0.5^I = 1/X $ => $ e^{I . ln(0.5)} = 1/X $ => $ I . ln(0.5) = ln(1/X) $=> $ I = {ln(1/X)}/ {\ln(0.5)} $

Running this calculus for 1,000,000 rows in the dataset, it should only take no more than 20 iterations to find the result, so up to 50,000x faster than the linear search! How insane is that?

To convince us, let us make a little experiment. I have created a silly dataset with 150k rows with integers in the first column sorted in ascending order. A list of 10k items (that exist in the dataset) will be searched in the dataset. The dataset will be progressively enlarged from 10k to 150k lines.

Search dataset

Search list

The search dataset will be gradually enlarged from 10k rows to 150k. The dataset look like this:

Key

Value

1

xxxxx1

2

xxxxx2

3

xxxxx3

4

xxxxx4

5

xxxxx5

6

xxxxx6

 

 

Find the values associated to a list of 10k integers ranging from 1 to the maximum key number in the currently searched dataset.

Search these keys:

146945

126305

39168

63374

55492

25553

The search will be performed with both the linear and binary algorithms.

At the end we compare both search results to be absolutely certain the binary lookup algorithm didn’t “cheat”.

Here comes the code:

Sub gradualIncreaseBinSearch()
Dim seed As Integer
Dim SearchArea As Range, LookedUpValues As Range
seed = 1
Dim ArrayOfLookedUpValues(1 To 10000, 1 To 1) As Long
Dim ArrayOfFoundValuesLin(1 To 10000, 1 To 1) As String
Dim ArrayOfFoundValuesBin(1 To 10000, 1 To 1) As String
For i = 1 To 15
    Set SearchArea = Sheets("Sheet5").Range("A2:B" & i & "0000")
    ' Set the list of values to search
    For j = 1 To UBound(ArrayOfLookedUpValues, 1)
        ArrayOfLookedUpValues(j, 1) = Rnd(seed) * (WorksheetFunction.Max(Application.WorksheetFunction.Index(SearchArea, 0, 1)))
    Next
    Debug.Print SearchArea.Address
   
    Sheets("Sheet5").Range("i2:i2001") = ArrayOfLookedUpValues
    Set LookedUpValues = Sheets("Sheet5").Range("i2:i2001")
    On Error Resume Next
    'Binary search
    t = GetTickCount()
    For k = 1 To UBound(ArrayOfLookedUpValues, 1)
        ArrayOfFoundValuesBin(k, 1) = WorksheetFunction.VLookup(ArrayOfLookedUpValues(k, 1), SearchArea, 2, 1)
    'LookedUpValues.Offset(0, 1) = WorksheetFunction.VLookup(ArrayOfLookedUpValues, SearchArea, 2, 0)
    Next
    binTime = (GetTickCount() - t)
    t = GetTickCount()
    For k = 1 To UBound(ArrayOfLookedUpValues, 1)
        ArrayOfFoundValuesLin(k, 1) = WorksheetFunction.VLookup(ArrayOfLookedUpValues(k, 1), SearchArea, 2, 0)
    'LookedUpValues.Offset(0, 1) = WorksheetFunction.VLookup(ArrayOfLookedUpValues, SearchArea, 2, 1)
    Next
    LinTime = (GetTickCount() - t)
    Sheets("Sheet6").Cells(i + 1, 1) = i
    Sheets("Sheet6").Cells(i + 1, 2) = binTime
    Sheets("Sheet6").Cells(i + 1, 3) = LinTime
   
    'check identical found values
    Sheets("Sheet6").Cells(i + 1, 4) = "Identical values"
    For j = 1 To UBound(ArrayOfLookedUpValues, 1)
        If ArrayOfFoundValuesLin(j, 1) <> ArrayOfFoundValuesBin(j, 1) Then
            Sheets("Sheet6").Cells(i + 1, 4) = "NOT Identical values"
            Exit For
        End If
    Next
    DoEvents
Next
End Sub

Now behold the resulting graph:

What do we see? The Binary search is nearly not impacted by the length of the dataset, while the Linear search is (linearly) impacted by the length of the dataset.

 

References

  • On get tick count : https://msdn.microsoft.com/en-us/library/windows/desktop/ms724408%28v=vs.85%29.aspx

The Linkedin groups:

  • Excel Developers: https://www.linkedin.com/groups/58704
  • Excel and VBA users: https://www.linkedin.com/groups/82527
  • Special thanks to Michael B., Peter B., Richard D., Hans H. and the many others who contributed. I haven’t been able to test all the solutions (dictionary, autofilter…) but I will try to remedy this as soon as I have some time!
Matthieu Liatard
Remember this!
Se souvenir de cela
Play!
Tester mes connaissances
Close

Bonjour :)
Avec Question-Player, apprenez de nouvelles choses et n'oubliez plus rien. Jamais. Et en plus c'est gratuit.

Pour nous aider, vous pouvez créer un compte:

Login / Créer un compte
Notez cet article :
1
2
3
4
5

Up next

Articles similaires