------------------------------------------------------------------------
Find identical files
------------------------------------------------------------------------

Danny Goldstein posed the problem: "find files which are identical in my
file system". This is certainly a problem well suited to pedagogy. On the
other hand, even a cursory web search shows that it is a real problem for
which powerful tools have been developed by ace Linux programmers.
Nonetheless, it is a nice problem for an intermediate-level Unix
programmer to solve.

The solution has three steps:

1. Make an input list of files suspected to be duplicates. It is
   understood that most of the files are distinct.
2. Analyze the files for rudimentary signs of duplication. We do this
   hierarchically: use the size in bytes for round one, identify clusters
   (defined as sets of at least two files of the same size), and then
   print the checksum of each file in a cluster.
3. The final high-quality check is with a "diff".

Here, I present step 2 (since step 1 is not uniquely posed).

Consider three sets of files, distinguished by size:

group 1a: a1                   4 bytes
group 1b: a2 a3 (identical)    4 bytes
group 2a: a4 a5 a6             6 bytes
group 3:  a7                   2 bytes

$ cat a1
1
3
$ cat a2
1
2
$ cat a4
1
2
5
$ cat a5
1
2
9
$ cat a6
1
2
3
$ cat a7
1

Let "iif" (identify identical files) be the program. For purposes of
explanation I add line numbers, using nl.

$ cat iif | nl | tee iif_nl
     1  #!/bin/bash
     2  #
     3  #identify clusters of files with the same length & print their checksum
     4  #for this pedagogical problem the files are assumed to be "a*" and
     5  #resident in the same directory as this file
     6  ls -l a* |                      #cluster size is defined by file size in bytes
     7  awk '/^-/{                      #analyze only lines with file names
     8          a[$5]=a[$5]$NF" ";      #store file names, successively
     9          b[$5]++;                #number of files in each cluster
    10          }
    11          #checksum for clusters with >1 member
    12  END{
    13          for (i in a) {if (b[i]>1) {printf("cksum %s\n", a[i])}}
    14  }' \
    15  | sh -x

The key step here is to build two associative arrays indexed by the file
size in bytes. One array, a[], holds the file names separated by spaces,
and the other, b[], holds the number of such files (lines 8 and 9). For
clusters with more than one member (line 13) we pass the list of file
names to "cksum" for checksum calculation. Note that in line 7 the
analysis is restricted to lines with file names (Unix arcana: compare
ls -l * with ls -l a* to understand this subtle filtering).

In order to recover "iif":

$ gsed -E 's/^ +[0-9]+\t//' iif_nl | tee iif

$ ./iif
+ cksum a1 a2 a3
1862082970 4 a1
1864731933 4 a2
1864731933 4 a3
+ cksum a4 a5 a6
917128627 6 a4
1057854359 6 a5
846835361 6 a6

where each equal-length cluster is identified by a line beginning with
"+", and the subsequent lines list the checksum, the size, and the name
of each file. Here a2 and a3 share a checksum and are therefore the
candidates for the final check of step 3, whereas a4, a5 and a6 have the
same size but three different checksums.
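
Step 3 is not presented in this note. What follows is only a minimal
sketch of how it might look, not a definitive implementation. The name
"confirm_dups" is hypothetical and is not part of iif. The sketch reads
"checksum size name" lines (the format printed by cksum), pairs files
whose checksum and size both match, and confirms each pair byte for byte.
It uses "cmp -s" in place of "diff", since only a yes/no answer is
needed, and it assumes file names without spaces, as in the a* example.
The scratch directory and the printf lines merely recreate the example
files so that the sketch can be run on its own.

```shell
#!/bin/bash
# Sketch only: confirm_dups is a hypothetical helper, not part of iif.
# Input on stdin: "checksum size name" lines, as printed by cksum.
# Assumption: file names contain no spaces or newlines.
confirm_dups() {
    sort -k1,1n -k2,2n |                 # bring equal checksums together
    awk '{
        key = $1 " " $2                  # checksum and size
        if (key == prev) print first, $3 # candidate pair: group head, this file
        else { prev = key; first = $3 }  # start a new group
    }' |
    while read -r f g; do
        # byte-for-byte confirmation; cmp -s is silent and sets the exit status
        cmp -s "$f" "$g" && echo "identical: $f $g"
    done
}

# Recreate the example files in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '1\n3\n'    > a1
printf '1\n2\n'    > a2
printf '1\n2\n'    > a3
printf '1\n2\n5\n' > a4
printf '1\n2\n9\n' > a5
printf '1\n2\n3\n' > a6
printf '1\n'       > a7

# Feed the two clusters found by iif to the confirmation step.
result=$(cksum a1 a2 a3 a4 a5 a6 | confirm_dups)
echo "$result"                           # prints: identical: a2 a3
```

Each file is compared only with the first file of its checksum group.
That is enough when equal checksums nearly always mean equal files, but
a more careful version would compare all pairs within a group.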