
Search and Sort

After reading this chapter, the reader should know the basics of search and sort algorithms, how to reason about them, and general guidelines for choosing the right search and/or sort algorithm for the problems that might arise in a coding interview. This chapter builds on ideas presented in earlier chapters.

While there will be some new concepts introduced in this chapter, my goal here is to help you build intuition about search and sort as concepts. A great many problems can be solved with search and sort, and understanding both conceptually will make you a more skilled problem solver in general.

I was a programmer for a long time before I came across Jon Bentley’s excellent 1986 book Programming Pearls in a thread on Reddit. It was the first book that got me to see computer science as a way of thinking, rather than just a set of tools. This was a real eye-opener for me, and put everything I’d learned about programming into a very new context — one that led straight to the writing of this very book.

This chapter is heavily influenced by Bentley’s book, and I hope I can inspire someone even just a fraction of the way in which Bentley has inspired me. If inspiration isn’t your thing, and you’re more about just the facts, man, then this section will still be useful to you, as it serves as a kind of “capstone” to all of the previous chapters. It attempts to tie them together in a way that makes them less abstract and more useful, and gives you a framework in which to consider your problem-solving skills, both in interviews and in your work as a programmer.

Search and sort are two very elemental concepts in computer science, and they are used in a wide variety of problems. This chapter will walk through two examples of where they might be found in the wild, and show the advantages and disadvantages of the various search and sort algorithms you might have at your disposal. Hopefully this chapter also serves as a decent foundation for reasoning through the problems you’re likely to encounter in interviews.

You may have heard the (mostly true) story that modern computer science was invented by Alan Turing while he was trying to break the German Enigma code during World War II. Turing wasn’t using “computers” in the way we think of them today, but he was engaged in a methodical search for a key to break the German code. He saw that the search could be greatly simplified by automating the process of iteratively trying different keys until the right one was found, and so he built a machine that did exactly that. Check out “The Imitation Game” for a dramatized version of this story; it’s one of the few movies about computer science that doesn’t fictionalize computers to the point of absurdity. (“Sneakers” is another good one. Email me for the rest of the list.)

Since then, search has remained a fundamental concept in computer science, which should be no surprise to any regular computer user.

What does “search” mean?

Just to be crystal clear about what is meant by “search,” search specifically means finding a specific item in a collection of items. The item could be a number or a string or some other data type, but we are searching for one definite thing in a collection of those things.

How is search used?

For a long time most people’s only experience with search was through word processing software, like Microsoft Word, where you could search for all instances of a word or phrase in a document. More advanced uses could look for files in the file system, or maybe for something like specific emails in an email inbox.

Once the world wide web came along, search became a very important concept, as there was now literally a world of information to search through. I remember reading an early description of the web that described it as “a vast library with no librarian, and all the books are on the floor.” There were some interesting attempts at early search engines, like Lycos and AltaVista. Not long afterwards, however, a little upstart company named “Google” came along and added the concept of “PageRank” to search, which was a way of ranking search results by their relevance to the query.

Why is search important?

While we tend to think of search as what happens when you hit “Ctrl-F” or type a search term into your browser bar, search is used to solve a wide variety of problems in computer science. While that might seem obvious, understanding search is not just about knowing how to find something but about knowing the best way to find something. You don’t want to use a hammer when you need a screwdriver (and vice versa), so knowing how to think about search in an effective manner is important. You might not directly be asked a search question in an interview, but you will likely be asked a question that will require you to reason about search in some way.

The rest of this section will present a “case study” of search, based on the complicated idea of a simple search. It will blend in many ideas from earlier chapters with the intent of building a solid framework for thinking about search in general.

Let’s start with the basic idea: finding a word in a document. You are applying for a job at a data science company, and one of their products is based around the idea of fast document search. The good news is, there’s only one document! And it only has one sentence in it! How can you search this document? Well, you just sit down and read the sentence, right?

I’m not trying to be glib. Unless it’s a very long sentence, that’s going to be optimal. That’s starting with the simplest and most effective solution. But then, of course, the document will grow, and so, yes, you will need to find a more effective method for searching it.

Just to set one more expectation before we get into the details: we’re going to be agnostic about where the data is stored. Maybe it’s in a database, or a text file, or on a web page, or in an email, or on a spreadsheet…etc. For the purposes of this example, we’re not going to worry about that.

We’re also going to be agnostic about the environment in which the search is being performed. Maybe it’s a document in a GUI, maybe it’s a web page, maybe it’s a command line interface. We’re not going to worry about that either. We’re just going to focus on the baseline algorithmic “mechanics” of search itself.

Now the client has expanded the document to include multiple sentences, and you still need to find that specific word. Given the scale of text to be searched has grown very little, you can probably at this point just work with a simple linear search. All this means is going through the document word by word and checking each word to see if it’s a match.

``` python
def linear_search(document, word):
    for w in document.split():
        if w == word:
            return True
    return False
```

This will work just fine for a small document, and by “small” I mean a document of several hundred or even a few thousand words. Notice also that you’re just looking for a match. You’re not looking for all of the instances of the word, just whether or not it’s in the document.

What is the time complexity of this search? Take a moment to think about it before reading on…. The time complexity of this search is O(n), where n is the number of words in the document. Of course, the search only takes n steps if the word is at the very end of the document (or isn’t there at all). If the word is the first word of the document, only one comparison is needed. But O(n) describes the worst case, and so we use that as the time complexity for this search. As has been mentioned repeatedly in this book, that means that while linear search might be great for a small document, it will not scale well to larger documents.

What is the space complexity of this search? There hasn’t been much discussion of space complexity in this book, but it absolutely is a useful concept to understand. Space complexity is the amount of memory that needs to be allocated in order to perform the search. Space matters a lot less than it used to, because computer memory has come way, way down in cost since the earliest days of computing, but that doesn’t mean it doesn’t matter at all.

On a desktop or cloud server you’re probably going to have all of the memory you could ever need. On mobile or embedded devices, however, memory is still at a premium, and likely will be for quite some time. What you don’t want to do is suddenly run out of memory because you haven’t accounted for the space complexity of your search algorithm.

Space complexity notation is similar to time complexity notation, and is also expressed in terms of Big O notation. For linear search no additional memory is needed, so the space complexity is O(1). As we progress through the examples in this chapter we absolutely will see cases where the space complexity needs to be taken into account. If you are asked about the space complexity of your algorithm in a programming interview, you’re likely interviewing for work on specialized hardware (like the chipset in a mobile phone or a toaster), and you should be prepared to discuss it.

So now, of course, the client has come back to you with the document expanded to include several million words, and you need to find a better way to search it.

We discussed binary search in chapter 4 in the context of searching through a sorted array. Binary search is a very efficient way to search through a sorted array, and it has a time complexity of O(log n). We’re making a bit of a jump in data structures here, since a document is not an array, nor will the words in a document be sorted, nor are you likely to gain much of an advantage by sorting them first. But binary search is the next most complicated form of search after linear search, so let’s imagine that the client has moved from documents to lists of words, for whatever reason.

As mentioned in Chapter 4, binary search works by dividing the search space in half on each pass through the data. The split is then reasoned about in terms of the element where the data is split. If you’re looking for a word that begins with the letter “c,” and you split the list of words in half at the letter “m,” then you know the word you’re looking for is in the a-m half of the list. You might then split that half at the letter “g,” and then again at the letter “d,” finally arriving at the letter “c” and finding the word you’re looking for.

``` python
def binary_search(words, word):
    low = 0
    high = len(words) - 1

    while low <= high:
        mid = (low + high) // 2
        if words[mid] == word:
            return True
        elif words[mid] < word:
            low = mid + 1
        else:
            high = mid - 1

    return False
```

The tradeoff between linear and binary search is hopefully clear: binary search is faster, but requires that the data be sorted first. Does it make sense to sort the data in a document before searching it? Probably not. Breaking the data down into an alphabetical list of words might give a small advantage, but trying to keep the sorted list in sync with the original document — especially one that is constantly changing — would kill any advantage you might initially gain from sorting it in the first place. But for data that is already sorted, or can easily be sorted, binary search is a great option.

The time complexity of binary search is O(log n), which should not be a surprise because, as mentioned in Chapter 9, whenever you’re splitting the search space in half repeatedly you’re likely looking at a logarithmic time complexity. This is a huge improvement over linear search in and of itself, but again, the form of the data matters.

The space complexity of binary search is the same as linear search, O(1), again because no additional memory is needed to perform the search.

Jump search is another algorithm that works on sorted arrays. It attempts to split the difference between linear and binary search by jumping ahead block by block through the array and then, once it has found the block that could contain the target, performing a linear search within that block. The blocks are usually of some fixed size, based either on the size of the array or on the natural partitioning of the data. Jump search is slower than binary search but faster than linear search, and is useful when sorted data is naturally delineated into blocks. It might be useful, for instance, in searching through data that is broken down into “pages” or “chapters.” Sorted linked lists are another example of data in blocks. It’s not a commonly used search algorithm, but it’s worth thinking about in terms of building intuition about search algorithms in general.

If the data is not already logically divided into blocks, one easy way to create block sizes is by using the square root of the size of the entire array. In the following example we’ll use the square root of the size of the array to determine the block size, and then perform a linear search within that block. This makes the time complexity of jump search O(√n), which, again, is better than linear search but not as good as binary search. It really comes down to whether or not sorting the data further is worth the time you’ll save.

``` python
import math

def jump_search(arr, x):
    n = len(arr)
    step = int(math.sqrt(n))  # Block size
    prev = 0

    # Jump ahead block by block until we find the block that could hold x
    while arr[min(step, n) - 1] < x:
        prev = step
        step += int(math.sqrt(n))
        if prev >= n:
            return False

    # Linear search within the identified block
    while arr[prev] < x:
        prev += 1
        if prev == min(step, n):
            return False

    return arr[prev] == x
```

### Tokenization / Normalization / Filtering

Tokenization, normalization, and filtering are all techniques that can be used to improve search performance, especially when dealing with text data. All three techniques are about converting existing data into a more manageable form so that it can be searched more efficiently.

#### Tokenization

Tokenization is a process of breaking up data into its most elemental pieces, or "tokens." In the case of text data that usually means individual words. You’re already familiar with the concept of tokenization from the basic math problems you did in elementary school.

23 + 41 = 64

One way to look at this problem is by saying it is made up of three tokens and two operators. The tokens are "23", "41", and "64", and the operators are "+" and "=". You understand that the operators are used to manipulate the tokens, and that the tokens are the data being manipulated. It’s the same case with tokenizing digital data: you are breaking it down into its most elemental chunks. Looking at the above example, even though the number 23 is made up of two digits, one representing "twenty" and the other representing "three," there is no confusion about the number 23 as a token. You will not confuse it for 2+3, or 32, or 2.3, or 2^3 or any other such nonsense. It’s the number 23 — a single token that represents a specific value that is ready to be manipulated by the operators.
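
Here’s a minimal sketch of what tokenizing text might look like in Python. The regular expression is a simplifying assumption of my own; real tokenizers handle hyphens, contractions, and Unicode far more carefully:

``` python
import re

def tokenize(text):
    """Break text into its elemental word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokenize("The quick brown fox, age 23, jumps!")
# ['The', 'quick', 'brown', 'fox', 'age', '23', 'jumps']
```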

#### Normalization

Normalization is a process of standardizing data to remove ambiguities and inconsistencies. One common example that you still see programmers struggle with is the difference between the order of numbers and the order of letters in a string. I worked on a recent project where the files were ordered by the names of the numbers they represented: "one", "two", "three", and so on. The problem there is that the order of the letters in a string is not the same as the order of the numbers they represent. So "one" comes before "two" in this scheme, but "three" also comes before "two." Sorted as strings, the number "20" comes after "2" but before "30." What about the name "Cooper" and the word "cooper"? Which comes first?

English grammar has its own set of rules, mathematics another, and computer science yet a third, but at the end of the day the answer is, of course, that "it depends." It depends on the context in which the data is being used.

What normalization does is standardize data to remove any such questions. For documents this commonly means converting all text to lowercase and removing punctuation. It might also involve combining words that stem from the same root, such as "run", "running", and "ran", into a single token, "run". With this process programmers can ensure that data is in a consistent format, which makes it easier to search and compare.
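
A minimal normalization sketch, building on the tokenizer above. True stemming ("running" → "run") usually comes from a dedicated library, so here we only lowercase:

``` python
def normalize(tokens):
    """Lowercase every token so 'Cooper' and 'cooper' compare as equal."""
    return [token.lower() for token in tokens]

normalize(["Cooper", "cooper", "Running"])
# ['cooper', 'cooper', 'running']
```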

#### Filtering

Filtering is a process of removing unnecessary or irrelevant data from a dataset. Usually this at least means removing "noise" from the data. For instance, I went to middle school back in the days when they taught typing classes on actual typewriters, and we were taught, after ending a sentence, to hit the space bar twice before starting the next. The only reason this was done was to make the text more readable. In modern typography, however, this is simply "noise." Web browsers and even some text editors will simply collapse multiple spaces into a single space. (Try typesetting an HTML document using the space bar and you’ll quickly see exactly what I mean.)

When it comes to data, filtering is about removing any data that is not likely to be useful in terms of a search. This might include the removal of words that are too common to be useful, such as "the", "and", "is." (Apologies to the rock band The The.)

There might be more specific filtering that is done based on context, like removing common nouns or verbs that are not relevant to the search. In addition to extra spaces, there might be other "invisible" characters in the document, such as line breaks or tabs, that should not be considered in a search. Filtering removes all of these, which, again, makes the data easier to search.
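
And a minimal filtering sketch to round out the pipeline. The stop-word list here is a tiny stand-in I made up for the much larger lists real systems use:

``` python
STOP_WORDS = {"the", "and", "is", "a", "an", "of"}

def filter_tokens(tokens):
    """Drop stop words and empty tokens that add noise to a search."""
    return [token for token in tokens if token and token not in STOP_WORDS]

filter_tokens(normalize(tokenize("The fox and the hound")))
# ['fox', 'hound']
```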

### Hash-based search

The earlier search algorithms are great if the data benefits from first being sorted. But what if you have a large amount of unsorted data to search through? Sorting is not the only way to improve search performance. Sometimes all it takes is organizing things a little bit differently. This is where hash-based search comes in.

Hash-based search is a way of organizing data so that it can be searched very quickly, without the need for sorting it first. Hashing, as was covered in Chapter 8, is analogous to the "Jump/Block Search" covered in the previous section. The difference is that while with Jump/Block Search the data is already mapped in some logical fashion, hash-based search creates a mapping between the data and a hash table.

This is done by taking the data and applying a hash function to it, which produces a hash value for each piece of data. The hash value can then be used to quickly look up the data in a hash table. This process is similar to an index in a book, where the index — usually found at the back, in alphabetical order — allows readers to quickly find the page number for a specific topic. Hashing is usually applied to a document after it has been tokenized, normalized, and filtered, as described in the previous section.

Hashing is a very efficient way to search through large datasets, provided the up-front expense of hashing them can be afforded. The time complexity of hash-based search is O(1) on average, which is awfully hard to beat, but of course there’s a tradeoff. You have to first be able to index the available data.

Does this sound absolutely crazy? Fantastical? Unpossible? Well it’s pretty much how Google search works. Google sends out "bots" to crawl the web, and the bots index the data they find. When you go to search for a web page on Google it looks up documents that match your search terms in its index, and then quickly returns the results to you. Some of this becomes complicated when you add in things like sponsored search and AI results, but that’s the basic idea.

The rest of this section will focus on some of the more advanced and nuanced ideas behind search, and how they can be applied to the kinds of problems computer programmers regularly face. Hopefully this will help you develop your own ideas about how to apply some of the more abstract knowledge around search to solving real-world problems. If you are asked in an interview a question along the lines of "how would you solve this problem?" this chapter should provide a strong basis for how you answer that question.

But first, we need data! I’ve created this Python Fruits-of-the-World dataset generator that you can use to follow along with the examples. The fruit names are real but the rest of the data is generated at random, so don’t be surprised if you find black strawberries or yellow grapes. Keep in mind that any data will do. If fruit doesn’t feel right, then please feel free to substitute with data of your own.

``` python
import random

def generate_fruit_data(num_fruits=100):
    # Generate a diverse database of fruits from around the world.

    fruit_names = [
        # Common fruits
        "Apple", "Banana", "Orange", "Grape", "Strawberry", "Blueberry", "Raspberry",
        "Blackberry", "Cherry", "Peach", "Pear", "Plum", "Apricot", "Watermelon",
        "Cantaloupe", "Honeydew", "Pineapple", "Mango", "Papaya", "Kiwi",

        # Citrus varieties
        "Lemon", "Lime", "Grapefruit", "Tangerine", "Clementine", "Blood Orange",
        "Yuzu", "Bergamot", "Pomelo", "Key Lime",

        # Tropical fruits
        "Coconut", "Passion Fruit", "Dragon Fruit", "Star Fruit", "Guava",
        "Lychee", "Rambutan", "Durian", "Jackfruit", "Breadfruit", "Soursop",
        "Cherimoya", "Ackee", "Tamarind", "Plantain",

        # Berries and small fruits
        "Elderberry", "Gooseberry", "Currant", "Cranberry", "Lingonberry",
        "Cloudberry", "Mulberry", "Boysenberry", "Loganberry", "Huckleberry",

        # Stone fruits
        "Nectarine", "Damson", "Greengage", "Mirabelle", "Pluot", "Aprium",

        # Exotic and regional fruits
        "Persimmon", "Pomegranate", "Fig", "Date", "Olive", "Avocado",
        "Tomato", "Cucumber", "Eggplant", "Pepper", "Squash", "Pumpkin",
        "Acai", "Goji Berry", "Physalis", "Tomatillo", "Cactus Pear",
        "Horned Melon", "African Cucumber", "Buddha's Hand", "Finger Lime",

        # More exotic varieties
        "Salak", "Longan", "Langsat", "Mangosteen", "Rambai", "Pulasan",
        "Santol", "Mamey Sapote", "Sapodilla", "Sugar Apple", "Custard Apple",
        "Sweetsop", "Miracle Fruit", "Jabuticaba", "Pitanga", "Camu Camu",
        "Acerola", "Cashew Apple", "Marula", "Baobab Fruit", "Kei Apple",

        # Additional varieties to reach 100
        "Quince", "Medlar", "Serviceberry", "Pawpaw", "Maypop", "Muscadine",
        "Chokeberry", "Beautyberry", "Sumac Berry", "Rose Hip", "Hawthorn",
        "Sea Buckthorn", "Cornelian Cherry", "Jujube", "Loquat", "Feijoa"
    ]

    colors = ["Red", "Orange", "Yellow", "Green", "Blue", "Purple", "Pink", "Brown", "White", "Black"]

    regions = [
        "North America", "South America", "Europe", "Asia", "Africa",
        "Australia", "Mediterranean", "Caribbean", "Middle East", "Pacific Islands"
    ]

    textures = ["Smooth", "Fuzzy", "Bumpy", "Spiky", "Rough", "Waxy", "Soft", "Hard"]

    seasons = ["Spring", "Summer", "Fall", "Winter", "Year-round"]

    fruits = []
    for i in range(min(num_fruits, len(fruit_names))):
        fruit = {
            "id": i + 1,
            "name": fruit_names[i],
            "color": random.choice(colors),
            "region": random.choice(regions),
            "price_per_lb": round(random.uniform(0.99, 12.99), 2),  # $0.99 to $12.99
            "sweetness": random.randint(1, 10),  # 1-10 scale
            "texture": random.choice(textures),
            "season": random.choice(seasons),
            "vitamin_c": random.randint(5, 200),  # mg per 100g
            "size": random.choice(["Small", "Medium", "Large", "Extra Large"])
        }
        fruits.append(fruit)

    return fruits

# Usage
fruits = generate_fruit_data(100)
print(f"Generated {len(fruits)} fruits from around the world!")
print(fruits[:5])  # Print first 5 fruits for brevity
```

Data management is a subject unto itself, and we’re not going to get deep into it here. This is a flat data table (not a database, data shard, data anything but flat data) containing 100 fruits, all of which contain ten data attributes: id, name, color, region, price_per_lb, sweetness, texture, season, vitamin_c, and size. The id is unique for each entry, allowing each piece of data to have its own specific identifier. Again, other than the names, all of the data is generated at random, so please do not calibrate your Vitamin C intake based on what this dataset tells you, or get on a plane expecting to find some exotic spiky pink pear variant during winter in Cameroon.

Now that the data has been generated we have to index it to make it searchable. Here’s how a hash table can be used to accomplish this next step:

``` python
def create_searchable_fruit_index(fruits):

    name_index = {}
    color_index = {}
    region_index = {}

    for fruit in fruits:
        # Index by name
        name_index[fruit['name'].lower()] = fruit

        # Index by color
        color = fruit['color'].lower()
        if color not in color_index:
            color_index[color] = []
        color_index[color].append(fruit)

        # Index by region
        region = fruit['region'].lower()
        if region not in region_index:
            region_index[region] = []
        region_index[region].append(fruit)

    return name_index, color_index, region_index
```

All this code is doing is creating three separate indices: one that allows for search by name, one for search by color, and one for search by region. Each of these indices (indexes? indicum?) is a Python dictionary that maps the search term to the appropriate fruit data.

You may have noticed the copious use of the `lower()` method, which in Python converts a string to lowercase. This falls right in line with the idea of "normalization" discussed earlier. It’s a good way to make sure that all instances of data conform, as closely as possible, to the same format.

    name_index = {
        'apple': {'id': 1, 'name': 'Apple', 'color': 'Red', 'region': 'North America', ...},
        'banana': {'id': 2, 'name': 'Banana', 'color': 'Yellow', 'region': 'South America', ...},
        'orange': {'id': 3, 'name': 'Orange', 'color': 'Orange', 'region': 'Mediterranean', ...},
        'grape': {'id': 4, 'name': 'Grape', 'color': 'Purple', 'region': 'Europe', ...},
        # ... etc
    }

    color_index = {
        'red': [{'id': 1, 'name': 'Apple', 'color': 'Red', 'region': 'North America', ...}],
        'yellow': [{'id': 2, 'name': 'Banana', 'color': 'Yellow', 'region': 'South America', ...}],
        'orange': [{'id': 3, 'name': 'Orange', 'color': 'Orange', 'region': 'Mediterranean', ...}],
        # ... etc
    }

    region_index = {
        'north america': [{'id': 1, 'name': 'Apple', 'color': 'Red', 'region': 'North America', ...}],
        'south america': [{'id': 2, 'name': 'Banana', 'color': 'Yellow', 'region': 'South America', ...}],
        'mediterranean': [{'id': 3, 'name': 'Orange', 'color': 'Orange', 'region': 'Mediterranean', ...}],
        # ... etc
    }

Notice the data relationships are a little different between each of the groups although, again, we’re not going to get into data theory. Just try to see the relationships in exactly the straightforward manner in which they’re intended.

What’s "hash" about this is that all of the data is stored in Python dictionaries, which automatically covert string values into hash values. This allows for much faster lookups, as will become more evident as we proceed through the examples. This is particularly a Python feature, and I mentioned in Chapter 8 that a Python dictionary is a kind of hash table. If you’re using other programming languages the same basic concept applies but the implementation will be different. In Javascript, for instance, objects can only use strings as keys and so no hashing occurs. In order to gain the benefits of hashing in Javascript, one solution is to use a Javascript `Map()`. Make sure you are aware of the implementation differences before heading in to an interview!

> **So How Fast Is It?**
>
> This book has focused solely on Big-O notation for speed comparison, because Big-O is an objective measure of speed that discounts differences between things like system architecture, operating systems, chip sets, internet speed, etc., etc. But even if you have a great imagination matched only by your sense of relative proportions, the difference between O(n) and O(1) might not seem particularly earth-shattering without some real numbers behind it.
>
> So here is code that will return the difference, in the universally understood unit of "Earth seconds," between searching un-indexed and indexed data. After you’ve got your fruit generator set up and running, go ahead and try it!
>
>     import time
>
>     def search_without_index(fruits, search_term, search_type="name"):
>         """Linear search through all fruits - O(n) time complexity"""
>         results = []
>         for fruit in fruits:
>             if search_type == "name" and search_term.lower() in fruit['name'].lower():
>                 results.append(fruit)
>             elif search_type == "color" and search_term.lower() == fruit['color'].lower():
>                 results.append(fruit)
>             elif search_type == "region" and search_term.lower() == fruit['region'].lower():
>                 results.append(fruit)
>         return results
>
>     def search_with_index(search_term, search_type, name_index, color_index, region_index):
>         """Hash-based search using indexes - O(1) time complexity"""
>         if search_type == "name":
>             result = name_index.get(search_term.lower())
>             return [result] if result else []
>         elif search_type == "color":
>             return color_index.get(search_term.lower(), [])
>         elif search_type == "region":
>             return region_index.get(search_term.lower(), [])
>         return []
>
>
>     def create_searchable_fruit_index(fruits):
>         """Create multiple indexes for different search types"""
>         name_index = {}
>         color_index = {}
>         region_index = {}
>
>         for fruit in fruits:
>             # Index by name
>             name_index[fruit['name'].lower()] = fruit
>
>             # Index by color
>             color = fruit['color'].lower()
>             if color not in color_index:
>                 color_index[color] = []
>             color_index[color].append(fruit)
>
>             # Index by region
>             region = fruit['region'].lower()
>             if region not in region_index:
>                 region_index[region] = []
>             region_index[region].append(fruit)
>
>         return name_index, color_index, region_index
>
>     def demonstrate_search_performance():
>         """Show the dramatic difference between indexed and non-indexed search"""
>
>         # Generate test data
>         fruits = generate_fruit_data(100)
>         name_index, color_index, region_index = create_searchable_fruit_index(fruits)
>
>         # Test searches
>         test_searches = [
>             ("Apple", "name"),
>             ("Red", "color"),
>             ("Asia", "region")
>         ]
>
>         print("Search Performance Comparison")
>         print("=" * 50)
>
>         for search_term, search_type in test_searches:
>             print(f"\nSearching for {search_type}: '{search_term}'")
>
>             # Linear search timing
>             start_time = time.time()
>             linear_results = search_without_index(fruits, search_term, search_type)
>             linear_time = time.time() - start_time
>
>             # Indexed search timing
>             start_time = time.time()
>             indexed_results = search_with_index(search_term, search_type, name_index, color_index, region_index)
>             indexed_time = time.time() - start_time
>
>             # Results
>             print(f"  Linear search: {len(linear_results)} results in {linear_time:.6f} seconds")
>             print(f"  Indexed search: {len(indexed_results)} results in {indexed_time:.6f} seconds")
>
>             if indexed_time > 0:
>                 speedup = linear_time / indexed_time
>                 print(f"  Speed improvement: {speedup:.1f}x faster")
>             else:
>                 print(f"  Speed improvement: Too fast to measure!")
>
>     demonstrate_search_performance()

### Inverted Index Search

Now that data can be converted into hash tables, we’re ready to explore the more advanced concept of an inverted index. An inverted index is an index that’s inverted (obvs) in the sense that the data is matched to search terms, instead of search terms to data. In other words, it’s an index that maps each term to the items that contain it, rather than the other way around. If we’re searching for all fruits that are red, we can do that quickly because the inverted index contains an entry listing all of the fruits that are red. It’s not necessary to go through every single piece of data one-by-one at O(n) speed to pull out all of the red ones. That work was already done when the index was created.
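
To make this concrete, here is one minimal way such an index might be built over the fruit data. This sketch also defines the `create_comprehensive_inverted_index` helper that the ranked-search example later in this section assumes; a fuller version would index more attributes than just name, color, and region:

``` python
def create_comprehensive_inverted_index(fruits):
    """Map each normalized search term to the list of fruits containing it."""
    inverted_index = {}
    for fruit in fruits:
        # Index a few representative attributes; a real system would index more
        terms = {fruit['name'].lower(), fruit['color'].lower(), fruit['region'].lower()}
        for term in terms:
            inverted_index.setdefault(term, []).append(fruit)
    return inverted_index

# All red fruits, in one O(1) lookup:
# red_fruits = create_comprehensive_inverted_index(fruits).get('red', [])
```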

Again, this is exactly how Google search works — or how it works in the over-simplified world of book examples. Google’s special insight was not only to index pages in order to search them, but also to add "relevance" rankings. If you do a Google search for a word, Google has not only indexed all of the pages its bot crawlers have found that contain that word, but it has also ranked them in terms of relevance. "Relevance" is determined by a number of factors, including how many times the word appears on a page, its importance to the surrounding text, how many other pages with similar terms link to it, how often people looking for the term click on that page, and so on. That might seem like an obvious thing to do now, but at the time it was a brilliant insight that revolutionized the way in which the web was searched.

### Trie-based Search

Ok, the basic search is working, it’s fast, the fruits of the world are available to all at a moment’s notice. Let’s look deeper into the world of search to see what other problems it can solve. Chapter 9 briefly touched upon the idea of tries, a tree-based data structure that could break strings down into smaller chunks. What this allows for in the search world is auto-complete features.

Each node in a trie represents a character in a string, and a path from the root to a leaf node represents a complete string. This allows for "prefix search," where all strings that start with a given prefix can quickly be found.

<div class="formalpara">

<div class="title">

Apache Solr

    The ideas in this chapter are mostly meant to provide a context against which to solidify readers' knowledge of search and sort, but there are actually jobs that involve expertise in searching and sorting.

    One widely used open source search platform is Apache Solr, which is built on top of the Apache Lucene search library.
    Solr is a powerful tool for enterprise-level search applications, and it comes with a wide range of very useful features.
    I once worked for a client that had put all of their marketing materials into a "data lake," and Solr search coupled with GraphQL allowed us to grab marketing materials in real time based on a given set of search terms.
    We bolted on the user interface using React and created a very powerful tool for creating custom web experiences.

    If search is really, truly touching you in the deepest parts of your soul, look up "Java Solr Developer" jobs and see if you can find a position that will allow you to work with it.

For instance, if you’re a fruit poet and you’re looking for all fruits that begin with the letter "b," you can use a trie-based search to quickly find them without having to search through the entire dataset. The basic methods for creating tries from data were covered in Chapter 9, so go back and have a look there if you need a refresher and to see an explanation of the `TrieNode` class that is used to build the trie.

``` python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class FruitTrie:
    def __init__(self):
        self.root = TrieNode()
        self.fruit_suggestions = {}  # Maps complete words to fruit objects

    def insert_fruit(self, fruit):
        """Insert a fruit into the trie for autocomplete"""
        name = fruit['name'].lower()
        self.fruit_suggestions[name] = fruit

        # Insert into trie
        current = self.root
        for char in name:
            if char not in current.children:
                current.children[char] = TrieNode()
            current = current.children[char]
        current.is_end_of_word = True

    def get_autocomplete_suggestions(self, prefix):
        """Get fruit suggestions for autocomplete"""
        prefix_lower = prefix.lower()

        # Navigate to prefix node
        current = self.root
        for char in prefix_lower:
            if char not in current.children:
                return []
            current = current.children[char]

        # Collect all complete words from this prefix
        suggestions = []
        self._collect_suggestions(current, prefix_lower, suggestions)

        # Return corresponding fruit objects
        return [self.fruit_suggestions[word] for word in suggestions if word in self.fruit_suggestions]

    def _collect_suggestions(self, node, prefix, suggestions):
        if node.is_end_of_word:
            suggestions.append(prefix)

        for char, child_node in node.children.items():
            self._collect_suggestions(child_node, prefix + char, suggestions)

def autocomplete_demo():
    """Show exactly how the trie builds suggestions step by step"""

    # Create a small, controlled dataset for clear demonstration
    sample_fruits = [
        {"name": "Apple", "color": "Red", "region": "North America"},
        {"name": "Apricot", "color": "Orange", "region": "Europe"},
        {"name": "Banana", "color": "Yellow", "region": "South America"},
        {"name": "Cherry", "color": "Red", "region": "Europe"},
        {"name": "Coconut", "color": "Brown", "region": "Pacific Islands"}
    ]

    fruit_trie = FruitTrie()

    print("Building Trie with Sample Fruits:")
    print("=" * 40)

    for fruit in sample_fruits:
        fruit_trie.insert_fruit(fruit)
        print(f"Added: {fruit['name']}")

    print("\nTesting Autocomplete:")
    print("=" * 40)

    # Test various prefixes
    test_cases = [
        ("a", "Should find Apple and Apricot"),
        ("ap", "Should find Apple and Apricot"),
        ("app", "Should find only Apple"),
        ("c", "Should find Cherry and Coconut"),
        ("co", "Should find only Coconut"),
        ("z", "Should find nothing")
    ]

    for prefix, explanation in test_cases:
        suggestions = fruit_trie.get_autocomplete_suggestions(prefix)
        names = [fruit['name'] for fruit in suggestions]
        print(f"'{prefix}' → {names} ({explanation})")

autocomplete_demo()
```

Keep in mind this is an educational, fits-in-a-book example, so consider it a starting place for your own further explorations. Try integrating it with the code that generates all of the fruit data, or consider making a GUI using Python/Flask, or migrate it to a language of your choice. I really do encourage you to play with it to get a feel for search.

Remember that the real power of our fruit data is the way in which it can be searched. Perhaps there’s a client who is a fruit photographer who wants to do an exposé on all of the purple fruits grown in South America. Or maybe a fruit chef who wants to create a recipe that uses all of the fruits that are yellow and in season in the Mediterranean.

By combining inverted indexes with trie-based search, multi-term and ranked searches become possible. Color and season, or region and size, or texture and vitamin C content can all be combined to break our fruit data down into tasty, useful morsels.

What do you think such a result might look like? Give it some thought before reading on. Here’s an initial idea of how that might happen:

``` python
def multi_term_search(fruits, search_terms):
    results = []
    for fruit in fruits:
        if all(term.lower() in fruit['name'].lower()
               or term.lower() == fruit['color'].lower()
               or term.lower() == fruit['region'].lower()
               for term in search_terms):
            results.append(fruit)
    return results
```

Ok, not a bad start. We’ll go through all the fruits and find the ones that match all the terms. Should be great, especially since we’ve already done the indexing, right?

That was a gotcha question. While it solves the problem, this is not at all a good start. First of all, it’s a linear search, using a chain of conditions. We’re right back to O(n) time. Second of all, we’re not using the inverted index we worked to construct. Most interviewers will not try to trick you like this, but in the fog of being interviewed and the exhaustion that comes with nervous energy, it’s easy to un-optimize your code. Be careful of this.

Here’s a much more carefully considered example:

``` python
def multi_term_search_with_inverted_index(search_terms, inverted_index):
    """Use inverted index for fast multi-term search"""
    if not search_terms:
        return []

    # Get fruits matching the first term
    first_term = search_terms[0].lower()
    if first_term not in inverted_index:
        return []

    # Start with all fruits matching the first term, keyed by id
    matches = {fruit['id']: fruit for fruit in inverted_index[first_term]}

    # For each additional term, keep only the fruits that also match it
    for term in search_terms[1:]:
        term_lower = term.lower()
        if term_lower not in inverted_index:
            return []  # No fruits match this term

        term_fruit_ids = set(fruit['id'] for fruit in inverted_index[term_lower])
        matches = {fid: fruit for fid, fruit in matches.items()
                   if fid in term_fruit_ids}

    # Return the surviving fruit objects
    return list(matches.values())
```

Now we’re cooking with fruit! This modified code uses the inverted index to quickly find fruits that match all of the search terms. It starts with the first term, finds all fruits that match it, and then intersects that set with the sets of fruits that match each subsequent term. This is much faster than the previous linear search, especially as the number of search terms increases.

But still, we’re only using boolean search. Something matches the criteria or it doesn’t. Fruits don’t pick sides and we shouldn’t either, so how can the search be improved to account for "close" matches? It’s as simple as adding a weight, as discussed in the graphs chapter.

``` python
def ranked_multi_term_search(search_terms, inverted_index, all_fruits):
    """Ranked search that scores and orders results by relevance"""

    fruit_scores = {}  # fruit_id -> relevance score

    for term in search_terms:
        term_lower = term.lower()
        if term_lower in inverted_index:
            for fruit in inverted_index[term_lower]:
                fruit_id = fruit['id']

                # Score based on multiple factors
                score = 0

                # Exact name match gets highest score
                if term_lower == fruit['name'].lower():
                    score += 10
                elif term_lower in fruit['name'].lower():
                    score += 5

                # Color match gets medium score
                if term_lower == fruit['color'].lower():
                    score += 3

                # Region match gets lower score
                if term_lower == fruit['region'].lower():
                    score += 2

                # Accumulate scores for this fruit
                if fruit_id not in fruit_scores:
                    fruit_scores[fruit_id] = 0
                fruit_scores[fruit_id] += score

    # Sort by score (highest first) and return top results
    sorted_results = sorted(fruit_scores.items(), key=lambda x: x[1], reverse=True)

    # Convert back to fruit objects
    fruit_lookup = {fruit['id']: fruit for fruit in all_fruits}
    return [(fruit_lookup[fruit_id], score) for fruit_id, score in sorted_results[:10]]

# Usage example
def demonstrate_ranked_search():
    fruits = generate_fruit_data(100)
    inverted_index = create_comprehensive_inverted_index(fruits)

    search_terms = ["red", "apple", "sweet"]
    results = ranked_multi_term_search(search_terms, inverted_index, fruits)

    print("Ranked Search Results:")
    print("=" * 40)
    for i, (fruit, score) in enumerate(results, 1):
        print(f"{i}. {fruit['name']} ({fruit['color']}, {fruit['region']}) - Score: {score}")def ranked_multi_term_search(search_terms, inverted_index, all_fruits):
    """Ranked search that scores and orders results by relevance"""

    fruit_scores = {}  # fruit_id -> relevance score

    for term in search_terms:
        term_lower = term.lower()
        if term_lower in inverted_index:
            for fruit in inverted_index[term_lower]:
                fruit_id = fruit['id']

                # Score based on multiple factors
                score = 0

                # Exact name match gets highest score
                if term_lower == fruit['name'].lower():
                    score += 10
                elif term_lower in fruit['name'].lower():
                    score += 5

                # Color match gets medium score
                if term_lower == fruit['color'].lower():
                    score += 3

                # Region match gets lower score
                if term_lower == fruit['region'].lower():
                    score += 2

                # Accumulate scores for this fruit
                if fruit_id not in fruit_scores:
                    fruit_scores[fruit_id] = 0
                fruit_scores[fruit_id] += score

    # Sort by score (highest first) and return top results
    sorted_results = sorted(fruit_scores.items(), key=lambda x: x[1], reverse=True)

    # Convert back to fruit objects
    fruit_lookup = {fruit['id']: fruit for fruit in all_fruits}
    return [(fruit_lookup[fruit_id], score) for fruit_id, score in sorted_results[:10]]

# Usage example
def demonstrate_ranked_search():
    fruits = generate_fruit_data(100)
    inverted_index = create_comprehensive_inverted_index(fruits)

    search_terms = ["red", "apple", "sweet"]
    results = ranked_multi_term_search(search_terms, inverted_index, fruits)

    print("Ranked Search Results:")
    print("=" * 40)
    for i, (fruit, score) in enumerate(results, 1):
        print(f"{i}. {fruit['name']} ({fruit['color']}, {fruit['region']}) - Score: {score}")

Fuzzy Search and Typo Tolerance

In real-world applications, users make typos. A search for "Abnana" should probably return "Banana" results. Fuzzy search handles this by measuring how close two strings are instead of demanding an exact match, most commonly using an "edit distance" such as the Levenshtein distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into the other.
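
Here’s a minimal sketch of the idea, using a classic dynamic-programming implementation of Levenshtein distance. Production systems typically use specialized indexes rather than scanning every name like this:

``` python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_search_fruits(fruits, query, max_distance=2):
    """Return fruits whose names are within max_distance edits of the query."""
    q = query.lower()
    return [fruit for fruit in fruits
            if levenshtein(q, fruit['name'].lower()) <= max_distance]

# "abnana" is two edits away from "banana", so the typo still finds the match:
# fuzzy_search_fruits(fruits, "Abnana")
```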

Performance at Scale

Caching and Optimization

User Experience Considerations

Sort

Now that we’ve bitten off that large chunk of search, we’re ready to move on to sort. We’ll consider sort both on its own and as it relates to search, with the goal, as always, of building an intuitive understanding that can be used to solve problems in interviews.

What does “sort” mean?

Sort means putting things in some kind of order, usually either ascending or descending. The order can be determined by many things, but usually involves some kind of comparable piece of information. For instance, a list of sports players can be sorted in a variety of ways given a specific criterion. The players could be sorted by name, age, a given statistic, eye color, distance from home to stadium, pet type, number of children, favorite candy, etc. When we talk about sort we’re talking about putting things in order to create a context for comparison.
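
As a quick sketch of that idea, Python’s built-in `sorted()` makes the criterion explicit through its `key` parameter (the `players` list here is a made-up example):

``` python
players = [
    {"name": "Ava", "age": 31, "home_runs": 44},
    {"name": "Ben", "age": 24, "home_runs": 12},
    {"name": "Cal", "age": 28, "home_runs": 30},
]

by_name = sorted(players, key=lambda p: p["name"])   # alphabetical
by_age = sorted(players, key=lambda p: p["age"])     # youngest first
by_power = sorted(players, key=lambda p: p["home_runs"], reverse=True)
```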

Things Used to Come With Instructions

Not so long ago, computers actually came with instruction manuals. There was a time when not everyone just grew up knowing how to use a mouse. In fact, games like Minesweeper and Solitaire were included with Windows for that exact purpose — teaching people how to use a mouse.

The earliest versions of the Java Development Kit (JDK) that came out in the mid 1990’s used to come with an applet called “SortDemo.” Running this demo would allow the user to choose between different sorting algorithms, such as bubble sort, insertion sort, and quick sort, and then watch as the applet sorted a list of numbers in real time.

In the days before AI and YouTube, Java was a great way to get started with programming. It was readily accessible and could run pretty much anywhere. Applets are now sadly a thing of the past, but the legacy of the SortDemo lives on. Check out "algs4.jar," Princeton University’s library of Java programs aimed at teaching algorithms. It can be found at https://algs4.cs.princeton.edu/code/.

How is sort used?

Sort can be used either simply to put information in order on its own, or in service of another process. Some search algorithms benefit from the information first being sorted, as was discussed in the last section on search.

Why sort is important to understand

Like search, sort isn’t one specific thing but a range of solutions to various problems. Understanding sort is important because it allows you to choose the right algorithm for the right problem, and to understand the tradeoffs involved in each choice.

Building Intuition: Reasoning about Sort

The basic truth about sort is a simple question: can a set of data be organized in a way that makes it easier to reason about? If the answer is yes, then it’s time to sort. As with search, there is a tradeoff between speed and size, as well as a difference between sorting small sets of data and larger ones.

This section will teach the basic concepts, but it’s important to download and play with the code, try it out on different sets of data, run it against a timing algorithm, and really understand the key differences between different approaches.

Types of Sort Part I, Basic Sorting

During the 1970s and 1980s, when computer science was preparing for the days ahead when computers could actually do all of the things computer scientists dreamed about, a lot of thought was put into the most effective ways to sort. There are a lot of different sorting algorithms out there, some very general and some hyper-specialized.

I’m pretty sure an entire book could be written about sort alone. Since this is of course not that book, I’m going to start with an overview of some of the more commonly used and easy-to-understand versions of sort. Again, and as with search, I invite you to play with these examples, use them with your own data, and gain a real understanding of the differences. Also again, and as with search, your job interview coding test might not contain a question that is specifically about sort, but the reasoning skill you will gain from understanding sort might help you solve the problem.

The algorithms selected in this chapter are meant to cover an array of sorting strategies, culminating in merge sort, which draws together many ideas presented elsewhere in this book. If there’s one thing you take away from this chapter, make it an understanding of merge sort.

Insertion and bubble sort were covered in Chapter 4 in the context of arrays. But sort is data structure agnostic, to some extent. Whether the data is stored in a flat file, array, object, dictionary or database, these sorting methods can all be applied.

Insertion sort

In Chapter 4 we covered bubble sort and insertion sort in that order. In this chapter I’m going to reverse the order to make a slightly different point. Let’s start with insertion sort this time because it’s fairly intuitive. The idea behind insertion sort is that you move things around until they’re sorted.

Let’s start with the intuition: say you collect sneakers. You want to come up with a way to organize your sneakers in your closet. You do this by putting your favorite sneaker on the left side of the closet because it "feels" like it goes first. Then, one-by-one, you put each remaining pair of sneakers into the line. If it feels like it goes second, you put it to the right of the first pair. Or maybe you don’t like it so much, and so you put it at the end of the line of sneakers.

Every sneaker that comes in has to fit somewhere in this schema. Nothing will go to the left of your first sneaker, because that’s your favorite. But everything else will go between the pair of sneakers that’s a little bit better, and a little bit worse. By the time you get to the bottom of the pile your sneakers are now sorted in a line from best to worst. (You also could have done this with any other criterion including color, price, drip, height, etc.)

Going back to our fruit example, let’s look at how insertion sort can be used to sort all of the fruits alphabetically:

``` python
def insertion_sort_catalog(fruits):
    """Sort fruits alphabetically"""
    for i in range(1, len(fruits)):
        current_fruit = fruits[i]
        j = i - 1
        # Find the right spot and insert
        while j >= 0 and fruits[j]['name'] > current_fruit['name']:
            fruits[j + 1] = fruits[j]
            j -= 1
        fruits[j + 1] = current_fruit
```

This is a good, intuitive way to build a sort, and probably what most of us would do if we were asked to come up with our own sorting algorithm. The idea is to go right down the line and stick each fruit exactly where it belongs. You should notice, either from looking at the code or from remembering Chapter 4, that this comes with an O(n^2) time complexity, because you have a loop nested inside a loop. Still, if you have just a little bit of data, you could do a lot worse than to use insertion sort. Which brings us to…

Bubble sort

Bubble sort definitely looks like an improvement over insertion sort. In bubble sort you go right down the line and compare every item with the item right next to it. If they’re out of order, you swap them. Quite the optimization, eh?

``` python
def bubble_sort_prices(fruits):
    """Sort fruit by price"""
    n = len(fruits)
    for i in range(n):
        for j in range(0, n - i - 1):
            # Compare every adjacent pair
            if fruits[j]['price_per_lb'] > fruits[j + 1]['price_per_lb']:
                fruits[j], fruits[j + 1] = fruits[j + 1], fruits[j]
```

Bubble sort is less code (by two lines) and easier to comprehend and remember. Sure it’s O(n^2), but that’s the same as insertion sort, right? You’re probably used to my style by now, so when I ask if something’s right, you’ve probably figured out the answer is no, it’s not right. Well, this is no exception!

Even though bubble sort is a slightly more “obvious” solution, and even though the time complexity is indeed O(n^2) for both, bubble sort can actually require quite a few more comparisons.

The reason why is that insertion sort goes over each item once, puts it where it belongs, and moves on to the next item. Bubble sort compares everything over and over again — including things that have already been sorted! That might be fine for apples and bananas, which begin with letters near the beginning of the English alphabet, but if you had an employee who checked whether everything was already sorted before filing away the raspberries and soursop, you would quickly suggest that one of the two of you was in the wrong line of work.

Again, both algorithms are O(n^2). Asymptotic notation is a great place to start in algorithm comparison, but remember that it usually describes the limit of the algorithm, and sometimes that’s just not the whole story. "Run down to the farthest edge of the property and I’ll give you a dollar" is a far more appealing proposition to a New York City studio apartment occupant than it is to a Wisconsin farm dweller. In Wisconsin you could be running for miles to get that dollar, while in New York you might not even have to get off your couch.

Selection sort

Selection sort works in a simple fashion. Take the smallest item and put it at the front of the array/queue/stack. Then select the next smallest item and put it next. Continue doing this until all of the items are sorted. It works much like insertion sort, with one significant difference. Instead of doing all of the comparing and shifting that insertion sort does, selection sort maintains a part of the array that it already considers to be sorted.

It’s kind of like opening birthday presents. Most people, given a stack of presents, open them in one of two ways: biggest one first, or biggest one last. Either way, once a present is opened, it does not get returned to the candidate pile. It’s been opened, it’s no longer under consideration.

def selection_sort_fruits(fruits):
    """Sort fruits by name using selection sort"""
    n = len(fruits)
    for i in range(n):
        # Find the minimum element in remaining unsorted array
        min_index = i
        for j in range(i + 1, n):
            if fruits[j]['name'] < fruits[min_index]['name']:
                min_index = j
        # Swap the found minimum element with the first element
        fruits[i], fruits[min_index] = fruits[min_index], fruits[i]

With selection sort a “min_index” is added. The “min_index” acts as a marker separating the sorted portion from the unsorted portion. Each pass scans the unsorted side for the smallest remaining item, swaps it to the front of the unsorted portion, and moves the marker forward by one. By the time this algorithm is finished, the items are sorted from smallest to largest. The runtime is O(n^2), just like bubble sort and insertion sort, and unlike insertion sort it stays that way: selection sort makes the same number of comparisons no matter how the list starts out, even if it’s already sorted. Its consolation prize is that it performs at most n swaps, which matters when moving items is expensive. (If the list is already mostly sorted, it’s insertion sort that can run in nearly O(n) time.)
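If you’d rather see that difference than take my word for it, here’s a quick experiment (the helper names are my own) that counts comparisons on an already-sorted list:

def insertion_sort_comparisons(items):
    """Count comparisons made by insertion sort."""
    items = list(items)
    count = 0
    for i in range(1, len(items)):
        current = items[i]
        j = i - 1
        while j >= 0:
            count += 1
            if items[j] > current:
                items[j + 1] = items[j]
                j -= 1
            else:
                break
        items[j + 1] = current
    return count

def selection_sort_comparisons(items):
    """Count comparisons made by selection sort."""
    items = list(items)
    count = 0
    n = len(items)
    for i in range(n):
        min_index = i
        for j in range(i + 1, n):
            count += 1
            if items[j] < items[min_index]:
                min_index = j
        items[i], items[min_index] = items[min_index], items[i]
    return count

already_sorted = list(range(100))
print(insertion_sort_comparisons(already_sorted))   # 99
print(selection_sort_comparisons(already_sorted))   # 4950

Insertion sort breezes through the sorted list in 99 comparisons; selection sort grinds out 4,950 no matter what.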

Types of Sort Part II, Advanced Sorting

If you’ve read this far and stopped, you would actually have a pretty good algorithmic basis for sorting that would serve you pretty well in an interview. But if there’s been a lingering doubt at the back of your mind, “O(n^2) is not good enough! There must be something better,” you’re absolutely right.

There are many sorting algorithms that run in better than O(n^2), and they all use the same basic idea: divide and conquer. Let’s pause for a moment to consider what that means.

There are three steps to divide and conquer:

Divide the sort space into smaller parts

Sort those smaller parts

Combine the smaller parts back into the original sort space

Et voila, you have a sorted list. This is the basic idea behind quick sort and merge sort, arguably the two most widely used sorting algorithms in the computer science world.

Quick sort

In the examples so far we’ve been choosing very specific use cases to justify the types of sort used. But the fruit stand has grown, and we can’t code our feelings any more; we need a more general-purpose algorithm that can sort by more criteria. We need to sort fruit by price, calorie count, color, sweetness index, all of it. This is where quick sort comes in.

The way quick sort works is by starting with the “divide” step. Quick sort selects a “pivot” point in the data, and then organizes the data into things that are lower than the pivot and things that are higher. The pivot can come from anywhere in the data, but since the data isn’t already sorted (right?), it’s just as useful to simply choose the first (or last) item in the list. Quick sort doesn’t get hung up on finding the “perfect” pivot; it just needs somewhere to start. Once the pivot is selected, the data is split into the two parts mentioned earlier: lower than the pivot and higher than the pivot.

<Diagram of pivot splitting TK>

Now for the “conquer” step: remember back in the last chapter when we talked about the recursive “base case”? Quick sort’s base case is sorted data. If you have an array with exactly one piece of data in it, it must be sorted, right? This is where quick sort finds its base case. More specifically, if quick sort selects a pivot but doesn’t find anything to split to the left or right of it, then what’s left must be sorted. Take a moment to sit with that. It broke my brain a bit the first time I heard it.

<Diagram of base case TK>

With all of the pieces broken down into their smallest, sorted bits, quick sort has hit the base case and conquered! Now it unwinds the recursive stack to “combine” the data back together like a reverse-Godzilla, leaving nothing but order in its wake.

def quick_sort_fruits(fruits, key='name'):
    # Base case
    if len(fruits) <= 1:
        return fruits

    # DIVIDE: Choose pivot and partition
    pivot, less_than, greater_than = partition_fruits(fruits, key)

    # CONQUER: Recursively sort partitions
    sorted_less = quick_sort_fruits(less_than, key)
    sorted_greater = quick_sort_fruits(greater_than, key)

    # COMBINE: Merge results
    return combine_partitions(sorted_less, pivot, sorted_greater)

def partition_fruits(fruits, key):
    """
    Divide step: partition fruits around a pivot

    Returns:
        tuple: (pivot_fruit, fruits_less_than_pivot, fruits_greater_than_pivot)
    """
    pivot = fruits[0]
    pivot_value = pivot[key]

    less_than = []
    greater_than = []

    for fruit in fruits[1:]:
        if fruit[key] < pivot_value:
            less_than.append(fruit)
        else:
            greater_than.append(fruit)

    return pivot, less_than, greater_than

def combine_partitions(sorted_less, pivot, sorted_greater):
    """
    Combine step: merge sorted partitions back together
    """
    return sorted_less + [pivot] + sorted_greater

# Usage example
def demonstrate_quicksort():
    sample_fruits = [
        {"name": "Mango", "price_per_lb": 3.99, "sweetness": 8},
        {"name": "Apple", "price_per_lb": 2.49, "sweetness": 6},
        {"name": "Banana", "price_per_lb": 1.29, "sweetness": 7},
        {"name": "Cherry", "price_per_lb": 5.99, "sweetness": 5}
    ]

    print("Original fruits:")
    for fruit in sample_fruits:
        print(f"  {fruit['name']}: ${fruit['price_per_lb']}")

    # Sort by price
    sorted_by_price = quick_sort_fruits(sample_fruits, 'price_per_lb')

    print("\nSorted by price:")
    for fruit in sorted_by_price:
        print(f"  {fruit['name']}: ${fruit['price_per_lb']}")


demonstrate_quicksort()

Run the code and prepare to be amazed! No loops in the main function, for one thing. As always, I ask you to pause for a moment, look over the code, and consider the run time. Quick sort breaks things in half repeatedly. This should remind you of a tree, as discussed in Chapter 9. Usually when you’re able to break a problem in half repeatedly, the run time becomes logarithmic, and indeed quick sort can run as fast as O(n log n). Pretty cool, right?

I mentioned earlier that the first time I saw quick sort it broke my brain, and that wasn’t hyperbole. You might find yourself looking at this algorithm thinking “but what actually happens?” That’s the beauty of the base case. When a recursive algorithm hits the base case it knows it doesn’t have anywhere else to go, nothing else to recurse on. It has effectively hit bottom. So then it begins to “unwind,” returning back up through the calls in the reverse of the order they were made, like the example of the TV news report in Chapter 11. That’s where the “sort” happens, and that’s the genius of quick sort.
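If you’d like to watch that unwind with your own eyes, here’s a throwaway sketch (plain numbers instead of fruit, and entirely my own naming) that prints each divide on the way down and each combine on the way back up:

def quick_sort_traced(items, depth=0):
    """Quick sort on numbers, printing the divide and combine steps."""
    indent = "    " * depth
    # Base case: zero or one item has nothing left to recurse on
    if len(items) <= 1:
        print(f"{indent}base case: {items}")
        return items
    pivot = items[0]
    less = [x for x in items[1:] if x < pivot]
    greater = [x for x in items[1:] if x >= pivot]
    print(f"{indent}divide:  {less} | {pivot} | {greater}")
    result = (quick_sort_traced(less, depth + 1)
              + [pivot]
              + quick_sort_traced(greater, depth + 1))
    print(f"{indent}combine: {result}")
    return result

quick_sort_traced([4, 2, 7, 1, 3])

The indentation tracks recursion depth, so you can see the calls stack up on the way down and then resolve from the bottom back out.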

But wait, up there I said quick sort “can” run as fast as O(n log n) and not “does” run at that speed. Let’s review some language from Chapter 2. Even though algorithms are most often discussed in terms of “Big O,” keep in mind there are other benchmarks. Big O talks about the “upper bound” of the algorithm, or the worst case. To talk about the “lower bound” or best case, we used Big Omega (Ω). For a “tight bound,” which for the purposes of this book can be read as “average case,” the notation is Big Theta (Θ).

While quick sort usually runs in O(n log n) time, that’s predicated on the idea that choosing the first item in the array makes a reasonable pivot. The problem is, it might not. If the pivot is usually an average value, as you should expect if the data is fairly random, then everything works out fine. The trouble comes when the pivot is repeatedly the worst possible value that could be chosen, which happens when the data is already mostly sorted. If the item chosen as the pivot is always either the smallest or the largest element, each partition peels off just one item, the recursion becomes a linear chain instead of a tree, and quick sort runs in our old friend O(n^2) time, the same as all of the other sort algorithms we’ve considered.

There are ways to tune your quick sort so this doesn’t happen. You can pick three elements at random and use their median as the pivot, for example. Of course, the more work you spend choosing the pivot, the less quick the algorithm; it’s a balancing act.
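Here’s a minimal sketch of that median-of-three idea (the function name is mine, and it assumes the list has at least three items):

import random

def median_of_three_pivot_index(fruits, key):
    """Return the index of the median of three randomly chosen items."""
    # Pick three distinct positions at random
    candidates = random.sample(range(len(fruits)), 3)
    # Order the three positions by the values they point at;
    # the middle one is the median
    candidates.sort(key=lambda idx: fruits[idx][key])
    return candidates[1]

You could swap the chosen element to the front of the list before calling partition_fruits, and the rest of quick sort proceeds unchanged.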

The takeaway: while quick sort is O(n^2) in the worst case, it is also Θ(n log n). With reasonably random data it will run in n log n time, but there’s no guarantee. Is there an algorithm that can dispense with the maybe and always run in O(n log n)? Read on, dear reader, and find out.

Merge sort

If you were brave enough to weather that section break you have at last arrived at merge sort, the holy grail, the magic temple at the top of the mountain. Conceptually, anyway. Quick sort might be a bit more widely used and a bit more of a general purpose algorithm, but merge sort adds a few conceptual details that will really help cement an understanding of sort algorithms.

Merge sort is primarily different from quick sort in that while quick sort does the hard work up front in the divide step, merge sort does the hard work in the conquer step. Merge sort does the “divide” step by always splitting a list exactly in half, with no regard to anything other than finding the middle. The “conquer” part is the same as quick sort’s: dividing the lists down until you come to a lone item, which is, by virtue of its very existence, sorted.

<Image of Merge Sort “divide” TK>

It’s the “combine” step where the merge sort magic happens. Consider two sorted lists, list A and list B. Your goal is to merge them into one combined list, from smallest to largest. Merge sort starts at the first item in each list, chooses the smaller of the two, and appends it to the combined list. Then it advances to the next item in the list it chose from and compares again. It does this item by item until one list runs out, at which point everything remaining in the other list is already in order and can be appended wholesale, leaving the combined list fully sorted.

<Image of Merge Sort “combine” TK>

Here is an example of merge sort in action:

def merge_sort_fruits(fruits, key='name'):
    """Sort fruits using merge sort algorithm"""
    # Base case
    if len(fruits) <= 1:
        return fruits

    # DIVIDE: Always split exactly in half
    mid = len(fruits) // 2
    left_half = merge_sort_fruits(fruits[:mid], key)
    right_half = merge_sort_fruits(fruits[mid:], key)

    # COMBINE: Merge the sorted halves
    return merge_sorted_halves(left_half, right_half, key)

def merge_sorted_halves(left, right, key):
    """
    Combine step: merge two sorted fruit lists into one sorted list

    Args:
        left: First sorted list of fruits
        right: Second sorted list of fruits
        key: The attribute to sort by

    Returns:
        Combined sorted list of fruits
    """
    merged_result = []
    i = j = 0

    # Compare first fruits from each sorted list and merge
    while i < len(left) and j < len(right):
        if left[i][key] <= right[j][key]:
            merged_result.append(left[i])
            i += 1
        else:
            merged_result.append(right[j])
            j += 1

    # Add any remaining fruits from left list
    while i < len(left):
        merged_result.append(left[i])
        i += 1

    # Add any remaining fruits from right list
    while j < len(right):
        merged_result.append(right[j])
        j += 1

    return merged_result

# Usage example
def demonstrate_mergesort():
    """Demonstrate merge sort with sample fruit data"""
    sample_fruits = [
        {"name": "Mango", "price_per_lb": 3.99, "sweetness": 8},
        {"name": "Apple", "price_per_lb": 2.49, "sweetness": 6},
        {"name": "Banana", "price_per_lb": 1.29, "sweetness": 7},
        {"name": "Cherry", "price_per_lb": 5.99, "sweetness": 5}
    ]

    print("Original fruits:")
    for fruit in sample_fruits:
        print(f"  {fruit['name']}: ${fruit['price_per_lb']}")

    # Sort by price using merge sort
    sorted_by_price = merge_sort_fruits(sample_fruits, 'price_per_lb')

    print("\nSorted by price (merge sort):")
    for fruit in sorted_by_price:
        print(f"  {fruit['name']}: ${fruit['price_per_lb']}")

demonstrate_mergesort()

Cool! What’s the catch? The catch is that merge sort requires additional space, at least the length of the list, because the merge step builds a new list the same size as the input. Quick sort, by comparison, is usually implemented to sort the items in place (the simplified version shown earlier copies into new lists for clarity, but the classic version doesn’t), and so skips that extra space. No big deal if you’re dealing with just a few (hundred) items on a system with plenty of memory, but add a few (hundred thousand) more on a system with limited memory and merge sort might not be the best idea.
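One last practical note: outside of an interview you will rarely hand-roll any of these. Python’s built-in sorted() uses Timsort, a hybrid of merge sort and insertion sort that gets the O(n log n) worst case of the former and the nearly O(n) already-sorted best case of the latter:

fruits = [
    {"name": "Mango", "price_per_lb": 3.99},
    {"name": "Apple", "price_per_lb": 2.49},
    {"name": "Banana", "price_per_lb": 1.29},
]

# sorted() returns a new, sorted list and leaves the original alone
by_price = sorted(fruits, key=lambda f: f['price_per_lb'])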

Summary of Sort Algorithms

Just like that, you have come to the end of the second part of the book! If you’ve made it this far, you’re in very good shape for a job interview. Consider sitting with this section for a little while, or going back through Section I to see if some of the ideas you might have glossed over the first time make more sense now. You might even see them in a different context!

Example Questions

Anagrams

Partitioning

Optimization