Saturday, March 14, 2020

Why only 2 Weighted Medians at most?


It is intuitive to observe that in a distinct ordering of numbers, there are at most two medians.If the number of elements is odd, i.e. $2n-1$, then $n^{th}$ element is the median.
Whereas, if the number of elements is even, i.e $2n$, then $n^{th}$ and $(n+1)^{th}$ elements form the median.

Apart from median, there is also a concept called "weighted median".

The Definition 

In a given distinct ordering of pairs $\{(i_1 ,w_1) , (i_2, w_2) .. , (i_n, w_n)\}$:
------> Let  $i$ represent any arbitrary identity of a given element; $w$ represents the weight of that element in the sequence; and $W$ the sum of weights of all the elements
------> Then, an element $i_m$ is a weighted median if $\displaystyle\sum_{1 \le j \lt m} w_j \le \frac{W}{2}$ and $\displaystyle\sum_{m \lt j \le n} w_j \le \frac{W}{2}$

The Interpretation

One good use of weighted median is to compute the median of all the values when there are multiples/repetitions of same values in the sampled data.
So if you have $n$ values with average repetition $w_{avg}$ unordered, the median could be computed at $O(n.w_{avg})$.
But if you organize or have the data as distinct values and their repetitions, the complexity to find the median could come down to $O(n)$

Let's see why this is so:

After expanding an ordering of the form: (Lets refer to this as abridged ordering)
 $\{(i_1 ,w_1) , (i_2, w_2) .. , (i_n, w_n)\}$ .....
to an ordering of the form: (lets call this expanded ordering)
 $\{i_1, i_1,...(w_1times), i_2, i_2, ....(w_2 times), ...., i_n, i_n, ... (w_n times) \}$
The total number of elements in the ordering would be:
$\displaystyle\sum_{1\le j \le n} w_j = W$.

Now we can easily say that the median of such an ordering would be $(\frac{W+1}{2})^{th}$ element if $W$ turns out to be odd.
And if its even, the median comprises of $(\frac{W}{2})^{th}$and $(\frac{W}{2} + 1)^{th}$elements.
Let a given repetition of $i_m$ be the median in the above expanded ordering, and lets also consider $W$ to be odd.
Then, the number of elements in the expanded ordering before and after the selected instance of $i_m$ must exactly be $W/2$.
Note that, for any element other than median, this property will not hold good.
Now if we extract out the copies of $i_m$ on its left, and right, if any, we can establish that the number of  non $i_m$ elements before and after $i_m$ is at most $W/2$.
Hence, if we calculate the weighted median of abridged ordering, it would be same as the median of the expanded ordering.
Similar argument can be made for when $W$ is even, and also the special case when such ordering has two distinct medians $i_m$ and $i_{m+1}$
From this analysis on using weights as repetitions, it is easy to see, that like median, the weighted medians can also be at most two.

The observation

But what if we focus on just the plain simple definition of weighted median without interpreting the weights as repetitions or such, and were to "observe" that at most, there can just be two weighted medians.... How would we establish that fact? Well, its not too hard.
Let $i_m$ be a weighted median in a given abridged ordering, then,

$\displaystyle\sum_{1 \le j \lt m} w_j + w_m + \displaystyle\sum_{m \lt j \le n} w_j = W$

Let $A = \displaystyle\sum_{1 \le j \lt m} w_j$, and $B = \displaystyle\sum_{m \lt j \le n} w_j$

  Then $A + w_m + B = W$ and by definition of weighted median, $A \le W/2$ and $B \le W/2$

Now, if we assume that: $w_m + B \le W/2$,  then we can claim that $w_{m-1}$ is also a weighted median.

This is because: $A \le W/2 \implies A - w_{m-1} \le W/2$ and of course by our assumption, we have $w_m + B \le W/2$

We have three relations to satisfy here: $1: A + w_m + B = W; 2: w_m + B \le W/2;$ and $3: A \le W/2$. 

All the three can be satisfied if and only if $4: A = W/2$ and $5: w_m + B = W/2$

Given that we already have weighted medians $w_{m-1}$ and $w_m$, the requirements in 4: and 5: above make it impossible to have any other weighted median.
Similar argument can also be made assuming that $A + w_m \le W/2$

For me this understanding proved to be more fun than actually writing the R-Select code to find out the weighted median itself, and hence come these notes πŸ˜€Though, i actually ended up describing the algorithm here πŸ˜›




Thursday, March 05, 2020

2-D Peak-A-Proof


Hmm 😡.. Capacity for proofs... how important is that?
No logic is sound unless it is proved. The ability to prove is a testament to one of the three things:
 a) clarity of thought,
 b) loudness of mouth,
 c) willingness to self deceive πŸ˜›.
But more often than not, i guess its useful when its a); c) is not bad either, for it counter balances its other opposite, which is self doubt πŸ˜‰
Somewhere between self doubt and self deceit lies self confidence, right? Not sure of b) but who knows, its probably used pretty widely, or? πŸ˜†


Ahem ahem... course correction.. back to the utility of proofs...
Take for example the rather simple 2-D Peak finding problem.
The 2-D Peak finding problem:
From a given n-by-n matrix A, find a (i.e., any) number A(i, j) such that all valid elements in A at positions A(i-1, j), A(i+1, j), A(i, j-1), A(i, j+1) are less than or equal to A(i, j).
Valid element means that the position computed is a valid legal position in the matrix.
For example: A(-1, 0) is NOT a valid element.
Simply put, an element is a peak, when the elements if any, immediately above, below, to the left, and to the right of it, are less than or equal to the given element.

In the example 5-by-5 matrix below, the peak elements are highlighted.
60 100 103 80 95
125 120 108 101 45
130 30 93 90 156
115 40 98 65 125
75 50 96 55 25

The Divide and Conquer algorithm(s) to efficiently find "a" peak is(are) rather simple, but, is the proof of their correctness as simple?
Well, i guess it varies from person to person. But for some reason, i have this inherent discomfort with anything more than one dimension.
And i get the urge to write about it as i have a feeling there maybe a few more confused souls like me who feel dazed thinking in more than one dimension πŸ˜›.

i tried for several hours to come up with a less than $O(n^2)$ algorithm, but just could not.

Started looking up for hints and tried the....
1-D Peak finding problem:
A peak in an array A means any A(i) such that both A(i-1) and A(i+1) are less than or equal to A(i).
For the beginning and end elements of the array, A(0) is a peak if A(1) ≤ A(0) and A(end) is a peak if A(end-1) ≤ A(end). Find such a peak from the array A.

1-D Peak finding algorithm
"A" (i.e. any) peak in this array can be found in O(logn) time using the following strategy:
1. Take a middle element, A(m) and check if it is a peak.If so, return A(m).
2.  If not, either A(m+1) or A(m-1) or both would be greater than A(m)
3. Continue with step 1 on the right half, if A(m+1) >A(m)
    Continue with step 1 on the left half, if A(m-1) > A(m)
    If both A(m-1), and A(m+1) are greater than A(m), either half is fine.
Is this algorithm correct? Seems trivial enough to prove so.

[proof:1-D Peak finding algorithm]
Suppose A(m) is not a peak and A(m+1) > A(m).
In this case, A(m+1) will be a peak unless its own right neighbor, A(m+2) > A(m+1).
Lets assume A(m+2) meets this condition. Inductively, for A(m+2) to not be a peak, A(m+3) > A(m+2) and it can go on and on, only till the end of the array where the end element will turn out to be a peak.
Hence, when we follow the half where A(m+1) > A(m), we are guaranteed to find a peak.
We can argue similarly for the case when A(m-1) > A(m).
[proof:end]

Surprisingly so, this new found wisdom on the 1-D case got me no where with the 2-D case, and finally i looked up the algorithm 😁
However, i still could not convince myself very easily that the 2-D algorithm i looked up is actually correct.
Guess this is always the case with borrowed analysis, because it is less of an analysis and more of an assurance.

Anyways putting a stop to the sob, here is the most simple of the divide and conquer algorithms for the 2-D case:
O(nlogn) 2-D Peak finding algorithm
1. Find the maximum element of the middle column, let it be A(i, m). 
  If its a peak, return it and terminate.
2.a) If A(i, m-1) >= A(i, m) and A(i, m+1) ≤ A(i, m); then recurse in the left half of matrix.
2.b) Else If A(i, m+1) > A(i, m) and A(i, m+1)  ≤= A(i, m); then recurse in the right half of matrix.
2.c) Else If A(i, m-1) >= A(i, m) and A(i, m+1)  >= A(i, m); then recurse in either left or right half. (i.e., either of the halves is guaranteed to have a peak)
3. If the recursion reduces the matrix to a single column, just return the max of that column and terminate.

Sounds neat and simple to code up, but is the algorithm correct?
How do we go about its correctness? It was a bit of mind muddling experience for me, because for once in the evening i thought i got the proof but couldn't really write it down. And when i tried to write it down before bed, i couldn't really come up with the proof again.. Wonder if the evening was a moment of self deceit or a volatile intuition πŸ€”. Well, so i decided to give it a go in the morning instead when luckily i guess i could reason the correctness correctly.

So here is a proof for the above binary search algorithm for the 2-D peak:
i changed my original proof a bit starting with the base case first. This way, it is simpler to understand and takes fewer words i think.
[proof:O(nlogn) 2-D Peak finding algorithm]
Consider the base case (Step 3), a single column and n rows of  elements. Lets call this column as c.
Its easy to see that the max of that column is a "peak" in that 1 column matrix. Lets call this peak as p.
Now, lets unwind the recursion, and add another column, c+1 of n elements to this matrix.
p ceases to be a peak in the new 2 column matrix only if column c+1 has an element greater than p in the same row as p.
But, Step 2.a) in the above algorithm ensures that not only the one next to p, but, all of the elements in column c+1 are less than p if column c+1 was skipped out of the search.
Fine, lets unwind a bit more, and add another column, c-1 of n elements to this matrix.
p ceases to be a peak in the new 3 column matrix only if column c-1 has an element greater than p in the same row as p.
However, Step 2.a) ensured that all the elements in column c-1 are indeed less than p if column c-1 was skipped out of the search.
Hence, inductively we can prove that when the algorithm hits the base case, it always finds a peak which is valid for the undivided input matrix
[proof:end]
The complexity of this algorithm is O(nlogn).
If you look at the design of this algorithm, you will notice, the main ingredient is to use the middle column as a virtual valley with the knowledge that the half next to this column has at least one element greater than any of the elements in this virtual valley, and hence there is a guarantee to find a peak in that particular half.

Now lets looks at the O(n) algorithm:
O(n)  2-D Peak finding algorithm
The O(n) algorithm extends this valley building effort both across the middle column as well as the middle row, and recurses into one of the quadrants where a peak is guaranteed to be found.
The algorithm goes as below:
1. Find the maximum element among all the elements of the middle row, $m_r$ and middle column, $m_c$
    If the maximum is at the intersection of the row and column, it is definitely a peak, so return it and terminate.
2. If the maximum is a peak, return it and terminate.
3. If the maximum is not a peak, recurse is the quadrant where adjacent neighbor of the maximum is greater than the maximum.
4. When recursion is left with only one row, return the max of that row and terminate. Or, if the recursion is left with only one column, return that max of that column and terminate. Or, if the recursion is only left with one element, that one element must be the peak, return it, and terminate.

Bit more details on step3
i am just writing more details on step3 to clarify and maybe also to aid the proof which is coming up.
3.a. If the maximum A($m_r$, j) is in the middle row, and j > $m_c$, and A($m_r$ + 1, j) > A($m_r$, j), then recurse in the quadrant bounded by rows $m_r$ +1 to end, and columns $m_c$ + 1 to end.Below 7 by 7 table gives an example of this case:
$m_c$ = $m_r$ = 4, maximum is A(4, 5) = 4520.
But A(5, 5) = 4610 > A(4, 5), Hence we recurse in the bottom right quadrant.

1923 3317 1784 4236 126 526 2583
4060 1068 1667 1263 3989 2461 4342
2764 139 4552 4029 502 2764 653
1347 2172 1614 2852 4520 4305 2199
4780 353 2606 1504 4610 3475 4515
1581 1132 719 38 1396 4285 4456
2607 3458 1651 3426 1078 4388 559
3328 2332 42 3865 378 3989 3959
3.b. Else if the maximum A($m_r$, j) is in the middle row, and j > $m_c$, and A($m_r$ - 1, j) > A($m_r$, j), then recurse in the quadrant bounded by rows 0 to $m_r$ -1, and columns $m_c$ + 1 to end. Below table gives an example of this case:
$m_c$ = 2, $m_r$ = 2, maximum is A(2,  3) = 4456.
But A(1, 3) = 4515 > A(2, 3), Hence we recurse in the top right quadrant.

4610 3475 4515
1396 4285 4456
1078 4388 559
378 3989 3959
Similarly it follows for other cases as below...
3.c. Else If the maximum A($m_r$, j) is in the middle row and j < $m_c$ , and A($m_r$ - 1, j) > A($m_r$, j), then recurse in the quadrant bounded by rows 0 to $m_r$ -1, and columns 0 to $m_c$ -1 
3.d. Else if the maximum A($m_r$, j) is in the middle row and j < $m_c$ , and A($m_r$ + 1, j) > A($m_r$, j), then recurse in the quadrant bounded by rows $m_r$ +1 to end, and columns 0 to $m_c$ -1 
3.e. Else if the maximum A(i, $m_c$) is in the middle column and i < $m_r$, and A(i, $m_c$ - 1) > A(i, $m_c$), then recurse in the quadrant bounded by rows 0 to $m_r$ -1  and columns 0 to $m_c$ -1
3.f. Else if the maximum A(i, $m_c$) is in the middle column and i > $m_r$, and A(i, $m_c$ - 1) > A(i, $m_c$), then recurse in the quadrant bounded by rows $m_r$ + 1 to end, and columns 0 to $m_c$ -1
3.g. Else if the maximum A(i, $m_c$) is in the middle column and i < $m_r$, and A(i, $m_c$ + 1) > A(i, $m_c$), then recurse in the quadrant bounded by rows 0 to $m_r$ -1  and columns $m_c$ + 1 to end.
3.h. Else if the maximum A(i, $m_c$) is in the middle column and i > $m_r$, and A(i, $m_c$ + 1) > A(i, $m_c$), then recurse in the quadrant bounded by rows $m_r$ + 1 to end, and columns $m_c$ + 1 to end.
       
Now that we stated the algorithm, rather elaborately, lets try to prove it by starting with base case.
[proof: O(n)  2-D Peak finding algorithm]
Consider the base case when only one row $r$ and columns $c_x$ to $c_y$ are left.
The maximum of that row, say, A(r,j) is the peak of that one row, r.
Lets unwind the recursion and add back the row r-1 to the array.
A(r, j) ceases to be a peak only if A(r-1, j) > A(r, j)
And steps 3.a, 3.d, 3.f, 3.h are the ones responsible for peeling off rows above.
Because of preconditions in the those steps, not only  A(r-1, j), but for all c such that $c_x$ < c < $c_y$; A(r-1, c) ≤= A(r, j)

On similar lines, the proof can be extended to the case when we add back row r+1, as well as when the base case is a column.
[proof:end]