JetBrains Academy: Edit distance alignment in Java - Kamil-Jankowski/Learning-JAVA GitHub Wiki

JetBrains Academy: Edit distance alignment in Java

Finding an alignment using a scoring function:

When calculating the edit distance for two strings s and t, we try to find the minimum number of operations that transform s into t. The same problem can be considered in another way. We may assign a cost to each operation and say that we want to find the sequence of operations transforming s into t having the minimal cost.

For example, we may say that an insertion, a deletion and a substitution of different symbols cost 1, a substitution of equal symbols costs 0. In this case, the sequence of operations transforming one string into another with the minimal cost will correspond to the edit distance between the strings.

In general case, a scoring function is a way to assign different costs to the edit operations. For instance, we may prefer to use insertions and deletions instead of substitutions. In this case, we may assign a lower cost to the first two operations and a higher cost to a substitution.

Your task here is to write a program that finds an optimal alignment between two strings according to some scoring function.

Input: two strings s and t: ∣s∣,∣t∣ ≤ 100.

Output: the first line should contain the minimal cost of the edit operations transforming ss into tt. The following lines should contain an alignment between the strings.

Use the following costs for the edit operations:

  • insertion/deletion: 2
  • substitution: 3 (for equal symbols use 0)
import java.util.*;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        String word1 = scanner.nextLine();
        String word2 = scanner.nextLine();
        
        Alignment alignment = Alignment.editDistanceAlignmentScoring(word1, word2);
        
        System.out.println(alignment.editDistance);
        System.out.println(alignment.source);
        System.out.println(alignment.target);
    }
}

class Alignment {
    public int editDistance;
    public String source;
    public String target;

    public Alignment(int editDist, String source, String target) {
        this.editDistance = editDist;
        this.source = source;
        this.target = target;
    }
    
    private static int matchScore(char a, char b) {
        return (a == b) ? 0 : 3;
    }
    
    public static Alignment editDistanceAlignmentScoring(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];

        for (int i = 0; i < s.length() + 1; i++) {
            d[i][0] = i * 2;
        }

        for (int j = 0; j < t.length() + 1; j++) {
            d[0][j] = j * 2;
        }

        for (int i = 1; i < s.length() + 1; i++) {
            for (int j = 1; j < t.length() + 1; j++) {
                int insCost = d[i][j - 1] + 2;
                int delCost = d[i - 1][j] + 2;
                int subCost = d[i - 1][j - 1] + matchScore(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(Math.min(insCost, delCost), subCost);
            }
        }

        return reconstructAlignmentScore(d, s, t);
    }
    
    private static Alignment reconstructAlignmentScore(int[][] d, String s, String t) {
        StringBuilder ssBuilder = new StringBuilder();
        StringBuilder ttBuilder = new StringBuilder();
        int i = s.length();
        int j = t.length();

        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + matchScore(s.charAt(i - 1), t.charAt(j - 1))) {
                ssBuilder.append(s.charAt(i - 1));
                ttBuilder.append(t.charAt(j - 1));
                i -= 1;
                j -= 1;
            } else if (j > 0 && d[i][j] == d[i][j - 1] + 2) {
                ssBuilder.append("-");
                ttBuilder.append(t.charAt(j - 1));
                j -= 1;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 2) {
                ssBuilder.append(s.charAt(i - 1));
                ttBuilder.append("-");
                i -= 1;
            }
        }

        String ss = ssBuilder.reverse().toString();
        String tt = ttBuilder.reverse().toString();

        return new Alignment(d[s.length()][t.length()], ss, tt);
    }
}

An approximate matching:

Sometimes when finding a substring in a string, we don't need to get an exact matching. In some cases, it is enough to find an approximate one. Your task here is to write a program that finds a substring in a string that is the most similar to another given string.

Input: two strings s and t: 0 < ∣s∣ ≤ ∣t∣ ≤ 1000.

Output: the first line should contain a number k, such that k is the minimal cost of edit operations transforming s into a substring of t among all the substrings of t. The following lines should contain an alignment of s with the corresponding substring. If there are several possible alignments (substrings), print any of them. Use the following costs for the edit operations:

  • insertion/deletion: 3
  • substitution: 5 (for equal symbols use 0)
import java.util.*;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        String s = scanner.nextLine();
        String t = scanner.nextLine();
        
        Approximation approximation = Approximation.editDistanceAlignmentApprox(s, t);
        
        System.out.println(approximation.minDistance);
        System.out.println(approximation.source);
        System.out.println(approximation.target);
    }
}

class Approximation {
    public int minDistance;
    public String source;
    public String target;
    public static int insOrDelCost = 3;
    public static int substitutionCost = 5;

    public Approximation(int editDistance, String source, String target) {
        this.minDistance = editDistance;
        this.source = source;
        this.target = target;
    }

    public static int match(char a, char b) {
        return (a == b) ? 0 : substitutionCost;
    }

    public static Approximation editDistanceAlignmentApprox(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        int minDistance = Integer.MAX_VALUE;
        int minIndex = -1;

        for (int i = 0; i < s.length() + 1; i++) {
            d[i][0] = i * insOrDelCost;
        }

        for (int j = 0; j < t.length() + 1; j++) {
            d[0][j] = 0;
        }

        for (int i = 1; i < s.length() + 1; i++) {
            for (int j = 1; j < t.length() + 1; j++) {
                int insCost = d[i][j - 1] + insOrDelCost;
                int delCost = d[i - 1][j] + insOrDelCost;
                int subCost = d[i - 1][j - 1] + match(s.charAt(i - 1), t.charAt(j - 1));
                d[i][j] = Math.min(Math.min(insCost, delCost), subCost);
            }
        }

        for (int i = 0; i < t.length() + 1; i++) {
            if (d[s.length()][i] < minDistance) {
                minDistance = d[s.length()][i];
                minIndex = i;
            }
        }

        return reconstructAlignment(d, s, t, minIndex);
    }

    private static Approximation reconstructAlignment(int[][] d, String s, String t, int minIndex) {
        StringBuilder ssBuilder = new StringBuilder();
        StringBuilder ttBuilder = new StringBuilder();
        int i = s.length();
        int j = minIndex;

        while (i > 0) {
            if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + match(s.charAt(i - 1), t.charAt(j - 1))) {
                ssBuilder.append(s.charAt(i - 1));
                ttBuilder.append(t.charAt(j - 1));
                i -= 1;
                j -= 1;
            } else if (j > 0 && d[i][j] == d[i][j - 1] + insOrDelCost) {
                ssBuilder.append("-");
                ttBuilder.append(t.charAt(j - 1));
                j -= 1;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + insOrDelCost) {
                ssBuilder.append(s.charAt(i - 1));
                ttBuilder.append("-");
                i -= 1;
            }
        }

        String ss = ssBuilder.reverse().toString();
        String tt = ttBuilder.reverse().toString();

        return new Approximation(d[s.length()][minIndex], ss, tt);
    }
}

Multiple sequence alignment:

For now, we considered the algorithm that allows finding an alignment only for two strings. But the algorithm can be extended for finding an alignment for a bigger number of sequences. Your task here is to implement an algorithm that builds an optimal alignment for three strings.

Input: three strings s, t and u. (|s|, |t|, |u| ≤ 20)

Output: an optimal alignment for the strings. Use the following costs for the edit operations:

  • insertion/deletion: 2
  • substitution: 1 (for equal symbols use 0)

NB: An alignment for three strings contains three symbols in each column. The cost of each column depends on how many different symbols are in a column. For example, if there are three different letters in a column, the cost of such a column is 3, since all of three possible pairs of letters cost 1. If there are two gap symbol and a letter in a column, the cost of such a column is 4, since each of the gap symbols with a letter costs 2, and the gaps symbols itself are considered to be equal.

TBD