home

Algorithms To Live By

Optimal Stopping

Do not stop too early, do not stop too late.

Look and Leap

  1. Look: Set a predetermined amount of candidates or time. Only gather data.
  2. Leap: Pick the first candidate that outshines all the candidates in look phase.

The Secretary Problem

Problem Definition

Optimal Stopping Simulation Using Core Python - 3 Secretaries - 1,000,000 runs

Do not hire the 1st candidate. Hire the 2nd candidate if they are better compared to 1st, otherwise hire 3rd.

import random

random.seed(42)

iteration_count = 1000000
right_choice_made = 0

for i in range(iteration_count):
    assigned_nums = []

    while len(assigned_nums) != 3:
        rand_ordinal = random.randint(0, 2)
        if rand_ordinal in assigned_nums:
            continue
        assigned_nums.append(rand_ordinal)

    if assigned_nums[2] > assigned_nums[1]:
        right_choice_made += 1

print(right_choice_made / iteration_count) # 0.500469

Optimal Stopping Simulation Using Pandas - 100 Secretaries - 1,000,000 runs

Do not hire any candidate from the first 37 percentile, store the value of best in this phase. Starting from 38 percentile, hire the first candidate encountered where the candidate is better compared to best observed in first 37 percentile. Following this strategy will lead to hiring the best candidate 37% of the time, the best you can have.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

numOfSecretaries = 100
numOfTries = 1000000
optimal_stopping_location = int(numOfSecretaries * 0.37)

secretaries = pd.DataFrame(rng.choice(np.arange(0, 100000), size=[numOfSecretaries, numOfTries]))
training = secretaries[0:optimal_stopping_location]
threshold = training.max()

decision = secretaries[optimal_stopping_location:].reset_index(drop=True)

df = decision[decision > threshold]
df = df.idxmax() == df.apply(pd.Series.first_valid_index).values
print(np.sum(df) / numOfTries) # 0.370514

Walkthrough

An explanation of what is going on in the above implementation with a smaller set of data: 15 candidates, 5 runs.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

numOfSecretaries = 15
numOfTries = 5
optimal_stopping_location = int(numOfSecretaries * 0.37)

secretaries = pd.DataFrame(rng.choice(np.arange(0, 100), size=[numOfSecretaries, numOfTries]))
# Initial DataFrame representing secretary points.
print(secretaries)
#      0   1   2   3   4
# 0   51  92  14  71  60
# 1   20  82  86  74  74
# 2   87  99  23   2  21
# 3   52   1  87  29  37
# 4    1  63  59  20  32
# 5   75  57  21  88  48
# 6   90  58  41  91  59
# 7   79  14  61  61  46
# 8   61  50  54  63   2
# 9   50   6  20  72  38
# 10  17   3  88  59  13
# 11   8  89  52   1  83
# 12  91  59  70  43   7
# 13  46  34  77  80  35
# 14  49   3   1   5  53

training = secretaries[0:optimal_stopping_location]
# DataFrame we will be using to adjust our threshold value.
print(training)
#     0   1   2   3   4
# 0  51  92  14  71  60
# 1  20  82  86  74  74
# 2  87  99  23   2  21
# 3  52   1  87  29  37
# 4   1  63  59  20  32

threshold = training.max()
# Our threshold value for each run.
print(threshold)
# 0    87
# 1    99
# 2    87
# 3    74
# 4    74
# dtype: int64

decision = secretaries[optimal_stopping_location:].reset_index(drop=True)
# DataFrame where we will be picking from. Remember: Pick the first value greater than threshold.
print(decision)
#     0   1   2   3   4
# 0  75  57  21  88  48
# 1  90  58  41  91  59
# 2  79  14  61  61  46
# 3  61  50  54  63   2
# 4  50   6  20  72  38
# 5  17   3  88  59  13
# 6   8  89  52   1  83
# 7  91  59  70  43   7
# 8  46  34  77  80  35
# 9  49   3   1   5  53

# Anything below here is simply me trying to figure out the first value greater than threshold.
# masked is a DataFrame where values lower than threshold are NaN
masked = decision[decision > threshold] # type: pd.DataFrame
print(masked)
#       0   1     2     3     4
# 0   NaN NaN   NaN  88.0   NaN
# 1  90.0 NaN   NaN  91.0   NaN
# 2   NaN NaN   NaN   NaN   NaN
# 3   NaN NaN   NaN   NaN   NaN
# 4   NaN NaN   NaN   NaN   NaN
# 5   NaN NaN  88.0   NaN   NaN
# 6   NaN NaN   NaN   NaN  83.0
# 7  91.0 NaN   NaN   NaN   NaN
# 8   NaN NaN   NaN  80.0   NaN
# 9   NaN NaN   NaN   NaN   NaN

# Now that we have `masked`, we will actually be picking the first !NaN value.
# For example for 0th run we will be picking 90, since that is the first value greater than our threshold.
# However, in this case, we are not actually picking the best candidate we can..
# There is a better candidate at index 7 with a value of 91! Tough luck..

# Basically the first index that is actually a value..
index_of_candidates_we_picked = masked.apply(pd.Series.first_valid_index).values
print(index_of_candidates_we_picked)
# [ 1. nan  5.  0.  6.]

# index of the actual best candidate..
idxmax = masked.idxmax()
print(idxmax)
# 0    7.0
# 1    NaN
# 2    5.0
# 3    1.0
# 4    6.0
# dtype: float64

# Have we made the right choice?
have_me_made_the_right_choice = (idxmax == index_of_candidates_we_picked)
print(have_me_made_the_right_choice)
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True

# Finding the number of times we made the best choice possible at this point is very easy..
number_of_right_choices = np.sum(have_me_made_the_right_choice)
print(number_of_right_choices)
# 2

# And so is finding the percentage..
print(number_of_right_choices / numOfTries)
# 0.4

Explore/Exploit

Although I have read this chapter a few times, I was not able to make sense of it. This is most likely due to me lacking the related and required mathematical skills for the topics presented here. Anyone out there who would like help me with this chapter, please do contact me, I will be grateful.

Sorting

How to Match Socks from a Laundry Bag?

Repeat the following until no socks left in the bag:

  1. Pick 1st random sock from the bag.
  2. Pick 2nd random sock from the bag.
  3. If match Remove both and goto 1.
  4. Else Toss 2nd sock back in and goto 2.

With just 10 different pair of socks, following this method will take on average 19 pulls merely to complete the first pair.

Finding Average Pulls Required

import random
from operator import add


def get_pair_of_socks(num_of_socks):
    return random.sample(range(num_of_socks), num_of_socks)


def index_to_pull_sock_from(bag_of_socks: list):
    return random.randint(a=0, b=len(bag_of_socks) - 1)


def attempt_counts_matching_socks(num_of_socks_to_consider):
    # Keep attempt counts in this list.
    attempt_counts = []

    # Generate pairs of random socks.
    socks = get_pair_of_socks(num_of_socks_to_consider) + get_pair_of_socks(num_of_socks_to_consider)

    while len(socks) != 0:
        # Pick one pair from the bag..
        first_pair = socks.pop(index_to_pull_sock_from(socks))

        # Pick a second pair..
        random_pick = index_to_pull_sock_from(socks)
        second_pair = socks[random_pick]

        # We did an attempt..
        attempt_count = 1

        # If they matched, perfect. We will never enter this block.
        # Otherwise loop until you do find the match..
        while second_pair != first_pair:
            # Increment the attempt_count whenever you loop..
            attempt_count = attempt_count + 1
            random_pick = index_to_pull_sock_from(socks)
            second_pair = socks[random_pick]

        # Remove the second matching pair from the bag..
        socks.pop(random_pick)

        # Keep the number of attempts it took you to find the second pair..
        attempt_counts.append(attempt_count)

    return attempt_counts


num_of_iterations = 1000
pair_of_socks = 10

# Initalise a list full of zeros of length `pair_of_socks`
attempt_counts_so_far = [0] * pair_of_socks

for _ in range(num_of_iterations):
    # Get attempt counts for 1 iteration..
    attempt_counts_single_iteration = attempt_counts_matching_socks(pair_of_socks)

    # Add the attempt counts aligned by index. 
    # We will be dividing by the total number of iterations later for averages.
    attempt_counts_so_far = list(map(add, attempt_counts_so_far, attempt_counts_single_iteration))

average_takes = list(map(lambda x: x / num_of_iterations, attempt_counts_so_far))
print(average_takes)
# [18.205, 16.967, 14.659, 12.82, 11.686, 9.444, 7.238, 4.854, 2.984, 1.0]

Matching vs Sorting

But is matching socks from a laundry bag really identical to (or a good real life analogy of) sorting?
Obviously you can not sort your socks but imagine there were numbers between 0 to 19 in the bag.

How would matching socks be identical to sorting?

Big-O Notation

Big-O Notation for Sorting

Bubble Sort

Time complexity of Bubble Sort is O(n²). In other words, as the size of the list that is being sorted increases by a multiple of 2, time complexity increases by n² = 4.

Bubble Sort Implementation in Python

Note how comparison count increases roughly by 4 (6, 30, 132) as the length of the lists increase by 2 (3, 6, 12).

def bubble_sort(list_to_sort):
    comparison_count = 0
    unsorted = True
    while unsorted:
        unsorted = False
        for i in range(len(list_to_sort) - 1):
            comparison_count = comparison_count + 1
            if list_to_sort[i] > list_to_sort[i + 1]:
                unsorted = True
                list_to_sort[i + 1], list_to_sort[i] = list_to_sort[i], list_to_sort[i + 1]

    return list_to_sort, comparison_count

print(bubble_sort([3, 2, 1]))
# ([1, 2, 3], 6)

print(bubble_sort([6, 5, 4, 3, 2, 1]))
# ([1, 2, 3, 4, 5, 6], 30)

print(bubble_sort([12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]))
# ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 132)

Insertion Sort

Insertion Sort Implementation in Python

def insertion_sort(list_to_sort):
    i = 1
    while i < len(list_to_sort):
        temp = list_to_sort[i]
        j = i
        while j !=  0 and list_to_sort[j - 1] > temp:
            list_to_sort[j] = list_to_sort[j - 1]
            j = j - 1
        list_to_sort[j] = temp
        i = i + 1
    return list_to_sort

print(insertion_sort([5, 4, 3, 2, 1]))
# [1, 2, 3, 4, 5]

Merge Sort

Merge Sort performance is O(n log n).

Merge Sort is as important in the history of sorting as sorting in the history of computing.

Merge Sort Implementation in Python

def sort_merge_step(list_01: list, list_02: list = None):
    if list_02 is None:
        return list_01

    merge_sorted = []
    i = 0
    j = 0

    while i < len(list_01) and j < len(list_02):
        if list_01[i] <= list_02[j]:
            merge_sorted.append(list_01[i])
            i = i + 1
        else:
            merge_sorted.append(list_02[j])
            j = j + 1

    while i < len(list_01):
        merge_sorted.append(list_01[i])
        i = i + 1

    while j < len(list_02):
        merge_sorted.append(list_02[j])
        j = j + 1

    return merge_sorted


def sort_merge(a_list: list):
    if len(a_list) < 2:
        return a_list

    mid = int(len(a_list) / 2)

    left_half = sort_merge(a_list[0: mid])
    right_half = sort_merge(a_list[mid:])

    return sort_merge_step(left_half, right_half)

print(sort_merge([5, 1, 3, 8, 11, 2]))
# [1, 2, 3, 5, 8, 11]

Quotes / Statements

Whether it's finding the largest or the smallest, the most common or the rarest, tallying, indexing, flagging duplicates, or just plain looking for the thing you want, they all generally begin under the hood with a sort.

I do not agree with this statement, since either finding the largest or the smallest, the most common or the rarest can easily be done without sorting. For finding the largest or the smallest, sorting may be useful, but it is definetly not useful at all for the most common or the rarest.

With sorting, size is a recipe for disaster: perversely, as a sort grows larger, the unit cost of sorting, instead of falling, rises.

Well, apparently!

Sorting five shelves of books will take not five times as long as sorting a single shelf, but twenty-five times as long.

I think what is meant is "Sorting a shelf five times longer will take twenty-five times longer."

References

Caching

The idea of keeping around pieces of information that you refer to frequently.

The goal of cache management is to minimize the number of times you can not find what you are looking for in the cache. Not being able to find what you are looking for in the cache is named as a page fault or a cache miss.

Cache Eviction

Cache eviction is the process of deciding what to remove from the cache when it is capacity is full but a new item needs to be cached.

The optimal cache eviction policy is to evict the item we will need again the longest from now.

Cache Eviction Policies

Quotes

A big book is a big nuisance.

Callimachus

I am not sure how this quote is related to caching really. It reminds me the following quotes, which I also like:

A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away.

Antoine de Saint-Exupery

It also reminds me a quote from The Information: A History, a Theory, a Flood, which I can not exactly remember but goes something like..

Too much information is just as bad as no information.

Randomness

A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random bits.

Source

Sampling

Buffon's needle

Following block of Python code is sampling the value of π by simulating dropping needles as explained in Buffon's needle.

import math
import random

iteration_count = 10000000
crossed = 0

needle_length = 1.0
gap_length = 2.0


for i in range(iteration_count):
    drop_point = random.uniform(-1, 1)
    # π / 2 = 157079632679
    drop_degree_rad = random.randint(0, 157079632679) / 100000000000

    tip = (math.sin(drop_degree_rad) * needle_length / 2) + drop_point
    bottom = drop_point - (math.sin(drop_degree_rad) * needle_length / 2)

    if math.fabs(tip) >= 1 or math.fabs(bottom) >= 1:
        crossed += 1

print(2 / crossed / iteration_count / 2)  # 3.140268574610112 Very close!

Prime Numbers

Sieve of Eratosthenes

Sieve of Eratosthenes Implementation - Version 1

# Find primes until this integer.. Modify to your taste..
upper_bound = 1000

# Pass a list of integers and a prime number, 
# will return you a list where numbers divisible by the prime you passed are removed.
def remove_non_primes(list_of_ints, prime_to_check_against):
    multiplier = 1
    while prime_to_check_against * multiplier < list_of_ints[-1]:
        multiplier += 1
        # Should be quite fast since list is sorted.
        if prime_to_check_against * multiplier in list_of_ints:
            list_of_ints.remove(prime_to_check_against * multiplier)
    return list_of_ints


ints = list(range(2, upper_bound)) # Generate initial list of integers

# Initial prime for bootstrap
prime = 2 
# Keep track of iteration count, so we know the next prime to work with in each iteration.
iteration_count = 0 

while True:
    # Use length of list to determine we removed everything we could by 
    # checking this length against returning lists length.
    original_list_length = len(ints)
    
    # Do the magic, remove all non primes found by parameter prime.
    ints = remove_non_primes(ints, prime)
    
    # List was not modified.. Nothing to remove.
    if original_list_length == len(ints): 
        break
    
    iteration_count += 1
    prime = ints[iteration_count]

print(ints)

Sieve of Eratosthenes Implementation - Version 2

def primes(up_to):
    if up_to < 2:
        return [False] * up_to

    # 0 indexed array hence the +1 so index is aligned with the integer value.
    prime_indices = [True] * (up_to + 1)

    # Special cases 0 and 1.. Not primes..
    prime_indices[0] = False
    prime_indices[1] = False

    for i in range(2, len(prime_indices)):
        if i * i >= len(prime_indices):
            break
        mul = i
        val = i * mul
        while val < len(prime_indices):
            prime_indices[mul * i] = False
            mul = mul + 1
            val = i * mul

    return prime_indices

Networking

The term connection has a wide variety of meanings. It can refer to a physical or logical path between two entities, it can refer to the flow over the path, it can inferentially refer to an action associated with the setting up of a path, or it can refer to an association between two or more entities, with or without regard to any path between them.

Vint Cerf and Bob Kahn, A Protocol for Packet Network Intercommunication

Circuit Switching vs Packet Switching

In Packet Switching, there are no connections. What you call a connection is a consensual illusion between two end points.

Stuart Cheshire

Acknowledgment

Exponential Backoff

Congestion

Buffer