Thursday, April 15, 2021

little joy of sorting

What good is sorting if you don't really need things sorted? Take a look at this problem:
/**
 * Return the number of specially colored things.
 * Each thing has exactly one color.
 * A 'thing' is comparable, by its color,
 *   with a 'color' or with another 'thing'.
 * 'color's can be compared with each other as well.
 */
int find(vector<color> special_colors, vector<thing> things)
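
To keep the snippets below compilable, here is a minimal sketch of the two types. This is just one possible shape (treating a color as its name); any comparable color type would do:

#include <string>

using color = std::string;   // assumption: a color is just its name

struct thing {
    color c;   // each thing has exactly one color

    // things compare by their color, with a color or with another thing
    bool operator==(const color& rhs) const { return c == rhs; }
    bool operator<(const color& rhs)  const { return c < rhs; }
    bool operator==(const thing& rhs) const { return c == rhs.c; }
    bool operator<(const thing& rhs)  const { return c < rhs.c; }
};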

Now we can implement this in two ways:
1) The all-mixed-up way  :-)
int find(vector<color> special_colors, vector<thing> things) {
    int special_things = 0;
    for (const auto& thing : things) {
        // compare every thing against every special color
        for (const auto& special_color : special_colors) {
            if (thing == special_color) {
                special_things++;
            }
        }
    }
    return special_things;
}

It gets the job done, but errr, it has a time complexity of O(m*n), where m = special_colors.size() and n = things.size().

(OR)

 2) The all nice and sorted way ;-)

int find(vector<color> special_colors, vector<thing> things) {
    sort(special_colors.begin(), special_colors.end());
    sort(things.begin(), things.end());
    auto thing = things.begin();
    auto special_color = special_colors.begin();
    int special_things = 0;
    // walk both sorted ranges in one merge-style pass
    while (thing != things.end() && special_color != special_colors.end()) {
        if (*thing == *special_color) {
            special_things++;
            thing++;   // keep special_color: more things may share this color
        } else if (*thing < *special_color) {
            thing++;
        } else {
            special_color++;
        }
    }
    return special_things;
}
Now, this has an *average* time complexity of O(m log m + n log n) for the sorting, plus O(m+n) for the counting (i.e., everything we do after sorting both vectors), which is much better than O(m*n).
 
Which goes to say, "Hey, better be sorted and save yourself some free time" ;-)

Quiz: for what kind of input will the counting step do the most work?

*average* - because sort() is usually implemented as a hybrid algorithm (introsort, for instance) tuned for optimal average performance.
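
And a quick sanity check. This driver and its sample data are mine, assuming the sketch types from the top of the post and the usual <vector>, <algorithm>, <iostream> includes with using namespace std:

int main() {
    vector<color> special_colors = {"red", "gold"};
    vector<thing> things = {{"red"}, {"blue"}, {"gold"}, {"red"}};
    cout << find(special_colors, things) << endl;   // prints 3
    return 0;
}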

4 comments:

Ravikiran said...

Can we do better than O(m log m + n log n)?

~~^mytreya^~~ said...

@Ravi Can we do better than O(m log m + n log n)?
Yes, if we can afford the memory overhead of building a dictionary.

Something like this:


int find(vector<color> special_colors, vector<thing> things) {
    int special_things = 0;
    unordered_map<thing, int> dict;   // needs a hash for thing, see below

    // count the things of each color
    for (const auto& thing : things) {
        auto it = dict.find(thing);
        if (it != dict.end()) {
            it->second++;
        } else {
            dict.insert(std::make_pair(thing, 1));
        }
    }
    // accumulate the counts of the specially colored things
    for (const auto& special_color : special_colors) {
        // build a lookup key; assumes a thing can be made from its color
        auto it = dict.find(thing{special_color});
        if (it != dict.end()) {
            special_things += it->second;
        }
    }
    return special_things;
}

Then the theoretical average runtime cost would be O(n) for building the map plus O(m) for looking up and accumulating, at a memory overhead of O(n).
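
One detail the snippet glosses over: unordered_map needs a hash for its key type. A minimal sketch, assuming a thing wraps a color string as in the sketch near the top of the post:

namespace std {
    template <>
    struct hash<thing> {
        size_t operator()(const thing& t) const {
            // hash a thing by its color, consistent with its operator==
            return hash<color>{}(t.c);
        }
    };
}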

~~^mytreya^~~ said...

and if going the dictionary way, I guess it's really important to also have

unordered_map<thing, int> dict;
dict.reserve(things.size());

so that an appropriate number of buckets is pre-created and insertion stays fast without rehashing..

Anonymous said...

Yes, I was about to write the same solution. We can achieve a better runtime by trading off some space complexity.