Official Solution:
Problem G: Phylogenetic Trees Inherited
The first thing to observe is that the different positions in every sequence are independent of each other. This reduces the tree of sequences to a tree of amino acids. At the root of the tree, or for that matter of any sub-tree, there may be many possible amino acids leading to optimal costs. Suppose, you have calculated for two sub-trees Tl and Tr the sets of amino-acids leading to optimal costs Al and Ar. Adjacent sub-trees Tl and Tr have as their father the node T. Now you want to find the set of amino-acids A that you can mark T with, leading to optimal costs for T.
There are two cases: if the intersection of Al with Ar is non-empty, define A as just this intersection, otherwise define A to be the union of Al and Ar. To see why this is true, observe the extra costs you get, when you assemble T from Tl and Tr. In the first case, you have 0 extra costs when you take an amino-acid from the intersection, but 1 or 2 extra costs when you do not. In the second case, you have 1 extra costs when you take an amino-acid from the union, but 2 extra-costs when you do not. Now, you may want to assemble T not from Tl and Tr but from some other sub-optimal trees. As you can easily verify, this leads to sub-optimal costs for T as well.
This reasoning is carried over straightforwardly to an induction proof and leads to a dynamic programming solution. Since the amino-acids are upper-case letters, you can represent sets of amino-acids as ints. The set operations you need are then easily performed as bitwise and respectively or. Whenever you do a union operation, your costs increase by 1.
Another, more straight-forward solution is to calculate for each node of the tree the optimal costs for every amino acid the node can be marked with. This is done by trying every possible combination of amino acids for the two sub-trees, assuming their optimal costs have already been calculated. Since this solution might turn out to be too inefficient, it can be improved upon by observing that a father node always can be marked with either the left or the right son's amino-acid - there is no need to take an amino acid that differs from both.
Judges' test data was constructed from a test-case with a few long sequences, a test-case with many short sequences, a test-case where every possible pair of amino-acids occured, and 100 random-generated test-cases where length and number of sequences are geometrically distributed. The total number of test-cases is 110. Since there may be multiple correct answers for the test cases, a special verification program was written by slightly modifying the judges' solution.
1 /* Accepted 1102 C++ 00:00.56 1040K */
2 #include <string>
3 #include <iostream>
4
5 using namespace std;
6
7 int main()
8 {
9 int n, l;
10 while((cin >> n >> l) && n && l)
11 {
12 int heap[2048], cost = 0;
13 string seq[1024];
14
15 for(int i = 0; i < n; i++)
16 cin >> seq[i];
17
18 for(int i = 0; i < l; i++)
19 {
20 for(int k = 0; k < n; k++)
21 heap[n + k] = 1 << (seq[k][i] - 'A');
22 for(int k = n - 1; k >= 1; k--)
23 if((heap[k] = heap[2 * k] & heap[2 * k + 1]) == 0)
24 {
25 cost++;
26 heap[k] = heap[2 * k] | heap[2 * k + 1];
27 }
28 char c = 'A';
29 while(heap[1] >>= 1)
30 c++;
31 cout << c;
32 }
33 cout << ' ' << cost << endl;
34 }
35
36 return 0;
37 }
38