Text Book - Algorithm Design by Jon Kleinberg and Eva Tardos
Chapter 11 - Approximation Algorithms
Exercise
Q1. Suppose you're acting as a consultant for the Port Authority of a small Pacific Rim nation. They're currently doing a multibillion-dollar business per year, and their revenue is constrained almost entirely by the rate at which they can unload ships that arrive in the port.
Here's a basic sort of problem they face. A ship arrives, with n containers of weight w1, w2,..., wn. Standing on the dock is a set of trucks, each of which can hold K units of weight. (You can assume that K and each wi is an integer.) You can stack multiple containers in each truck, subject to the weight restriction of K; the goal is to minimize the number of trucks that are needed in order to carry all the containers. This problem is NP-complete (you don't have to prove this).
A greedy algorithm you might use for this is the following. Start with an empty truck, and begin piling containers 1,2,3,... into it until you get to a container that would over flow the weight limit. Now declare this truck "loaded" and send it off; then continue the process with a fresh truck. This algorithm, by considering trucks one at a time, may not achieve the most efficient way to pack the full set of containers into an available collection of trucks.
(a) Give an example of a set of weights, and a value of K, where this algorithm does not use the minimum possible number of trucks.
(b) Show, however, that the number of trucks used by this algorithm is within a factor of 2 of the minimum possible number, for any set of weights and any value of K.
Q2. At a lecture in a computational biology conference one of us attended a few years ago, a well-known protein chemist talked about the idea of building a "representative set" for a large collection of protein molecules whose properties we don't understand. The idea would be to intensively study the proteins in the representative set and thereby learn (by inference) about all the proteins in the full collection.
To be useful, the representative set must have two properties.
- It should be relatively small, so that it will not be too expensive to study it.
- Every protein in the full collection should be "similar" to some protein in the representative set. (In this way, it truly provides some information about all the proteins.)
More concretely, there is a large set P of proteins. We define similarity on proteins by a distance function d: Given two proteins p and q, it returns a number d(p, q) ≥ 0. In fact, the function d(·, ·) most typically used is the sequence alignment measure, which we looked at when we studied dynamic programming in Chapter 6. We'll assume this is the distance being used here. There is a predefined distance cut-off Δ that's specified as part of the input to the problem; two proteins p and q are deemed to be "similar" to one another if and only if d(p, q) ≤ Δ.
We say that a subset of P is a representative set if, for every protein p, there is a protein q in the subset that is similar to it-that is, for which d(p, q) ≤ . Our goal is to find a representative set that is as small as possible.
(a) Give a polynomial-time algorithm that approximates the minimum representative set to within a factor of O(log n). Specifically, your algorithm should have the following property: If the minimum possible size of a representative set is s∗, your algorithm should return a representative set of size at most O(s∗ log n).
(b) Note the close similarity between this problem and the Center Selection Problem-a problem for which we considered approximation algorithms in Section 11.2. Why doesn't the algorithm described there solve the current problem?
Q3. Suppose you are given a set of positive integers A ={a1, a2,..., an} and a positive integer B. A subset S ⊆ A is called feasible if the sum of the numbers in S does not exceed B:
Σai∈S ai ≤ B.
The sum of the numbers in S will be called the total sum of S.
You would like to select a feasible subset S of A whose total sum is as large as possible.
Q4. Consider an optimization version of the Hitting Set Problem defined as follows. We are given a set A = {a1,..., an} and a collection B1, B2,..., Bm of subsets of A. Also, each element ai ∈ A has a weight wi ≥ 0. The problem is to find a hitting set H ⊆ A such that the total weight of the elements in H, that is, Σai∈H wi, is as small as possible. (As in Exercise 5 in Chapter 8, we say that H is a hitting set if H ∩ Bi is not empty for each i.) Let b = maxi |Bi| denote the maximum size of any of the sets B1, B2,..., Bm. Give a polynomial-time approximation algorithm for this problem that finds a hitting set whose total weight is at most b times the minimum possible.
Q5. You are asked to consult for a business where clients bring in jobs each day for processing. Each job has a processing time ti that is known when the job arrives. The company has a set of ten machines, and each job can be processed on any of these ten machines.
At the moment the business is running the simple Greedy-Balance Algorithm we discussed in Section 11.1. They have been told that this may not be the best approximation algorithm possible, and they are wondering if they should be afraid of bad performance. However, they are reluctant to change the scheduling as they really like the simplicity of the current algorithm: jobs can be assigned to machines as soon as they arrive, without having to defer the decision until later jobs arrive.
In particular, they have heard that this algorithm can produce solutions with make span as much as twice the minimum possible; but their experience with the algorithm has been quite good: They have been running it each day for the last month, and they have not observed it to produce a make span more than 20 percent above the average load, 1/10Σi ti.
To try understanding why they don't seem to be encountering this factor-of-two behavior, you ask a bit about the kind of jobs and loads they see. You find out that the sizes of jobs range between 1 and 50, that is, 1 ≤ ti ≤ 50 for all jobs i; and the total load Σiti is quite high each day: it is always at least 3,000.
Prove that on the type of inputs the company sees, the Greedy-Balance Algorithm will always find a solution whose make span is at most 20 percent above the average load.
Q6. Recall that in the basic Load Balancing Problem from Section 11.1, we're interested in placing jobs on machines so as to minimize the make span- the maximum load on any one machine. In a number of applications, it is natural to consider cases in which you have access to machines with different amounts of processing power, so that a given job may complete more quickly on one of your machines than on another. The question then becomes: How should you allocate jobs to machines in these more heterogeneous systems?
Here's a basic model that exposes these issues. Suppose you have a system that consists of m slow machines and k fast machines. The fast machines can perform twice as much work per unit time as the slow machines. Now you're given a set of n jobs; job i takes time ti to process on a slow machine and time ½ti to process on a fast machine. You want to assign each job to a machine; as before, the goal is to minimize the make span-that is the maximum, over all machines, of the total processing time of jobs assigned to that machine.
Give a polynomial-time algorithm that produces an assignment of jobs to machines with a make span that is at most three times the optimum.
Q7. You're consulting for an e-commerce site that receives a large number of visitors each day. For each visitor i, where i ∈{1,2,..., n}, the site has assigned a value vi, representing the expected revenue that can be obtained from this customer.
Each visitor i is shown one of m possible ads A1, A2,..., Am as they enter the site. The site wants a selection of one ad for each customer so that each ad is seen, overall, by a set of customers of reasonably large total weight. Thus, given a selection of one ad for each customer, we will define the spread of this selection to be the minimum, over j = 1, 2,...,m, of the total weight of all customers who were shown ad Aj.
Q8. Some friends of yours are working with a system that performs real-time scheduling of jobs onmultiple servers, and they've come to you for help in getting around an unfortunate piece of legacy code that can't be changed.
Here's the situation. When a batch of jobs arrives, the system allocates them to servers using the simple Greedy-Balance Algorithm from Section 11.1, which provides an approximation to within a factor of 2. In the decade and a half since this part of the system was written, the hardware has gotten faster to the point where, on the instances that the system needs to deal with, your friends find that it's generally possible to compute an optimal solution.
The difficulty is that the people in charge of the system's internals won't let them change the portion of the software that implements the Greedy-Balance Algorithm so as to replace it with one that finds the optimal solution. (Basically, this portion of the code has to interact with so many other parts of the system that it's not worth the risk of something going wrong if it's replaced.)
After grumbling about this for a while, your friends come up with an alternate idea. Suppose they could write a little piece of code that takes the description of the jobs, computes an optimal solution (since they're able to do this on the instances that arise in practice), and then feeds the jobs to the Greedy-Balance Algorithm in an order that will cause it to allocate them optimally. In other words, they're hoping to be able to reorder the input in such a way that when Greedy-Balance encounters the input in this order, it produces an optimal solution.
So their question to you is simply the following: Is this always possible? Their conjecture is,
For every instance of the load balancing problem from Section 11.1, there exists an order of the jobs so that when Greedy-Balance processes the jobs in this order, it produces an assignment of jobs to machines with the minimum possible make span.
Decide whether you think this conjecture is true or false, and give either a proof or a counterexample.
Q9. Consider the following maximization version of the 3-DimensionalMatching Problem. Given disjoint sets X, Y, and Z, and given a set T ⊆ X × Y × Z of ordered triples, a subset M ⊆ T is a 3-dimensional matching if each element of X ∪ Y ∪ Z is contained in at most one of these triples. The Maximum 3-Dimensional Matching Problem is to find a 3-dimensional matching M of maximum size. (The size of the matching, as usual, is the number of triples it contains. You may assume |X|=|Y|=|Z| if you want.)
Give a polynomial-time algorithm that finds a 3-dimensional matching of size at least 1/3 times the maximum possible size.
Q10. Suppose you are given an n × n grid graph G, as in Figure 11.13.
Associated with each node v is a weight w(v), which is a nonnegative integer. You may assume that the weights of all nodes are distinct. Your goal is to choose an independent set S of nodes of the grid, so that the sum of the weights of the nodes in S is as large as possible. (The sum of the weights of the nodes in S will be called its total weight.)
Consider the following greedy algorithm for this problem.
The "heaviest-first" greedy algorithm:
Start with S equal to the empty set
While some node remains in G
Pick a node vi of maximum weight
Add vi to S
Delete vi and its neighbors from G
Endwhile
Return S
(a) Let S be the independent set returned by the "heaviest-first" greedy algorithm, and let T be any other independent set in G. Show that, for each node v ∈ T, either v ∈ S, or there is a node v' ∈ S so that w(v) ≤ w(v') and (v, v') is an edge of G.
(b) Show that the "heaviest-first" greedy algorithm returns an independent set of total weight at least ¼ times the maximum total weight of any independent set in the grid graph G.
Q11. Recall that in the Knapsack Problem, we have n items, each with a weight wi and a value vi. We also have a weight bound W, and the problem is to select a set of items S of highest possible value subject to the condition that the total weight does not exceed W-that is, ∑i∈S wi ≤ W. Here's one way to look at the approximation algorithm that we designed in this chapter. If we are told there exists a subset O whose total weight is ∑i∈O wi ≤ W and whose total value is ∑i∈O vi = V for some V, then our approximation algorithm can find a set A with total weight ∑i∈A wi ≤ W and total value at least ∑i∈A vi ≥ V/(1+ ε). Thus the algorithm approximates the best value, while keeping the weights strictly under W. (Of course, returning the set O is always a valid solution, but since the problem is NP-hard, we don't expect to always be able to find O itself; the approximation bound of 1+ ε means that other sets A, with slightly less value, can be valid answers as well.)
Now, as is well known, you can always pack a little bit more for a trip just by "sitting on your suitcase"-in other words, by slightly overflowing the allowed weight limit. This too suggests a way of formalizing the approximation question for the Knapsack Problem, but it's the following, different, formalization.
Suppose, as before, that you're given n items with weights and values, as well as parameters W and V; and you're told that there is a subset O whose total weight is ∑i∈O wi ≤ W and whose total value is ∑i∈O vi = V for some V. For a given fixed ε > 0, design a polynomial-time algorithm that finds a subset of items A such that ∑i∈A wi ≤ (1+ ε)W and ∑i∈A vi ≥ V. In other words, you want A to achieve at least as high a total value as the given bound V, but you're allowed to exceed the weight limit W by a factor of 1+ ε.
Q12. Consider the following problem. There is a set U of n nodes, which we can think of as users (e.g., these are locations that need to access a service, such as a Web server). You would like to place servers at multiple locations. Suppose you are given set S possible sites that would be willing to act as locations for the servers. For each site s ∈ S, there is a fee fs ≥ 0 for placing a server at that location. Your goal will be to approximately minimize the cost while providing the service to each of the customers. So far this is very much like the Set Cover Problem: The places s are sets, the weight of set s is fs, and we want to select a collection of sets that covers all users. There is one extra complication: Users u ∈ U can be served from multiple sites, but there is an associated cost dus for serving user u from site s. When the value dus is very high, we do not want to serve user u from site s; and in general the service cost dus serves as an incentive to serve customers from "nearby" servers whenever possible.
So here is the question, which we call the Facility Location Problem: Given the sets U and S, and costs f and d, you need to select a subset A⊆ S at which to place servers (at a cost of ∑s∈A fs), and assign each user u to the active server where it is cheapest to be served, mins∈A dus. The goal is to minimize the overall cost ∑s∈A fs + ∑u∈U mins∈A dus. Give an H(n)-approximation for this problem.
(Note that if all service costs dus are 0 or infinity, then this problem is exactly the Set Cover Problem: fs is the cost of the set named s, and dus is 0 if node u is in set s, and infinity otherwise.)