------------------------------------------------------------------------
Find identical files 
------------------------------------------------------------------------

Danny Goldstein posed the problem: "find files which are identical
in my file system". This is certainly a problem well suited to
pedagogy. On the other hand, even a cursory web search shows that
it is a real problem, with powerful tools already developed by ace
Linux programmers. Nonetheless, the problem is a nice one for an
intermediate-level Unix programmer to solve.

The solution has three steps:
1. Make an input list of files suspected to be duplicates. It is
   understood that most of the files are distinct.
2. Analyze the files for rudimentary signs of duplication.
   We do this hierarchically: use byte size for round one, identify
   clusters (defined as those with at least two files), and then
   print their checksums.
3. The final, high-quality check is with a "diff".
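
Step 3 might be sketched as below. The cluster "b1 b2 b3" and the
comparison against the first member only are illustrative choices of
mine, not part of iif; "cmp -s" is used in place of "diff" because it
is quieter, but diff works equally well.

```shell
# Sketch of step 3: confirm candidates byte-for-byte.
# b1/b2/b3 are throwaway demo files (b1 and b2 identical).
tmp=$(mktemp -d)
cd "$tmp"
printf '1\n2\n' > b1
printf '1\n2\n' > b2
printf '1\n3\n' > b3

cluster="b1 b2 b3"
set -- $cluster
first=$1; shift
for f in "$@"; do
    # compare each remaining member against the first;
    # a full pairwise sweep is needed in general
    if cmp -s "$first" "$f"; then
        echo "$first and $f are identical"
    fi
done
```

A full solution would repeat the sweep with each member as the
reference, but this shows the shape of the final check.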

Here, I present step 2 (since step 1 is not uniquely posed). 

Consider three sets of files:
group 1a: a1                  4 bytes
group 1b: a2 a3 (identical)   4 bytes
group 2:  a4 a5 a6            6 bytes
group 3:  a7                  2 bytes

$ cat a1
1
3

$ cat a2
1
2

$ cat a4
1
2
5

$ cat a5
1
2
9

$ cat a6
1
2
3

$ cat a7
1
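
The sample files can be recreated with printf; each line is one digit
plus a newline, so the byte counts come out to 4, 4, 4, 6, 6, 6, and 2
as listed above (a3 is identical to a2, as stated).

```shell
# Recreate the seven example files in the current directory.
printf '1\n3\n'    > a1
printf '1\n2\n'    > a2
printf '1\n2\n'    > a3      # identical to a2
printf '1\n2\n5\n' > a4
printf '1\n2\n9\n' > a5
printf '1\n2\n3\n' > a6
printf '1\n'       > a7
```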

Let "iif" (identify identical files) be the program. For purposes
of explanation I add line numbers, using nl.

$ cat iif | nl | tee iif_nl

     1	#!/bin/bash
     2	# 
     3	#identify clusters of files with the same length & print their checksum
     4	#for this pedagogical problem the files are assumed to be "a*" and
     5	#resident in the same directory as this file
      	
     6	ls -l a*  |		#cluster size is defined by file size in bytes
     7	awk '/^-/{                      #analyze only lines with file names
     8		a[$5]=a[$5]$NF" ";	#store files names, successively
     9		b[$5]++;                #number of files in each cluster
    10	     }
    11			        #checksum for clusters with  >1 member
    12	     END{		
    13		for (i in a) {if (b[i]>1) {printf("cksum " a[i]"\n")}}
    14	     }'                  \
    15		| sh -x

The key step here is to build two associative arrays indexed by the
file size in bytes. One array, a[], holds the file names separated
by spaces and the other, b[], holds the number of such files (lines
8 and 9). For clusters with more than one member (line 13) we pass
the list of file names to "cksum" for checksum calculation. Note
that in line 7 the analysis is restricted to lines describing
regular files, i.e. lines whose mode field begins with "-" (Unix
arcana: compare ls -l * with ls -l a* to understand this subtle
filtering).
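
The accumulation pattern of lines 8 and 9 can be seen in isolation by
feeding awk two columns, size and name, directly (the sizes and names
below are taken from the example files):

```shell
# Minimal illustration of the two associative arrays:
# a[size] accumulates names, b[size] counts cluster members.
printf '4 a1\n4 a2\n4 a3\n6 a4\n2 a7\n' |
awk '{ a[$1] = a[$1] $2 " "; b[$1]++ }
     END { for (i in a) if (b[i] > 1) print i ": " a[i] }'
# only size 4 has more than one member here, so one line is printed
```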

To recover "iif" from the numbered listing, strip the nl prefixes:
$ gsed -E 's/^ +[0-9]+\t//' iif_nl | tee iif    

$ ./iif
+ cksum a1 a2 a3
1862082970 4 a1
1864731933 4 a2
1864731933 4 a3
+ cksum a4 a5 a6
917128627 6 a4
1057854359 6 a5
846835361 6 a6

where each equal-length cluster is announced by a line beginning
with "+" (the sh -x trace of the cksum command) and the subsequent
lines list the checksum, the size in bytes, and the name of each
file.
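
Note that in the output above only a2 and a3 share a checksum; a1
merely shares their length. A further awk pass over the cksum lines
(a sketch of mine, not part of iif) would cluster by the checksum
field itself, leaving only true duplicate candidates for the final
diff of step 3:

```shell
# Sketch: regroup "checksum size name" lines by checksum ($1).
# Clusters with >1 member are near-certain duplicates, to be
# confirmed with diff/cmp. Input lines copied from the iif output.
printf '1864731933 4 a2\n1864731933 4 a3\n1862082970 4 a1\n' |
awk '{ a[$1] = a[$1] $3 " "; b[$1]++ }
     END { for (i in a) if (b[i] > 1) print "duplicates: " a[i] }'
# here only the checksum shared by a2 and a3 survives
```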