Diff: DistributionChart

Differences between current version and previous revision of DistributionChart.


Newer page: version 3 Last edited on March 1, 2012 1:46 pm by PhilHollenback
Older page: version 2 Last edited on February 24, 2011 8:05 pm by PhilHollenback
@@ -34,30 +34,7 @@
 Improving this to tell you which files are the same is left as an exercise for the reader. 
  
 ----- 
  
-<?plugin RawHtml  
-<script>  
-var idcomments_acct = '011e5665a1128cdbe79c8077f0f04353';  
-var idcomments_post_id;  
-var idcomments_post_url;  
-</script>  
-<span id="IDCommentsPostTitle" style="display:none"></span>  
-<script type='text/javascript' src='http://www.intensedebate.com/js/genericCommentWrapperV2.js'></script>  
-?>  
+CategoryGeekStuff  
  
------  
-  
-<?plugin RawHtml  
-<center>  
-<script type="text/javascript"><!--  
-google_ad_client = "pub-5011581245921339";  
-google_ad_width = 728;  
-google_ad_height = 90;  
-google_ad_format = "728x90_as";  
-google_ad_channel ="";  
-//--></script>  
-<script type="text/javascript"  
- src="http://pagead2.googlesyndication.com/pagead/show_ads.js">  
-</script>  
-</center>  
-?>  
+CategoryBlog  

current version

Command Line Distribution Chart

Scenario: you have a whole bunch of files that are mostly identical. You want to know the distribution of identical files vs. non-identical files. How do you do that on the Unix command line?

Here's my solution:

First, run md5sum on all the files to get a hash of every file. Identical files will have the same hash.

Then, sort the results and find the unique ones. Count how many occurrences you find of each hash.

Here's the bash fragment to do this:

for i in *.dat
do
  # quote "$i" so filenames containing spaces or glob characters still work
  md5sum "$i" | awk '{ print $1 }'
done | sort | uniq -c

and here's what the output looks like:

      1 0f1c9426c5959d478d49f49063016563
     31 2846bde822c8d77c752fbb88e2d77997
      1 4be0e00d2cc87929e08b69d5e20700df
      1 5d3a104d7e3b5587791bc392c699736c
      3 9faa92c5423fc00e2ad1e47000e43cd4
      1 ccf2fb7b5278d8ceb48ce66bc141178f

This clearly shows that most of the files have the same content (31 of the 38 share a single hash) and there are just a few outliers.
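
If the list of hashes gets long, it can help to put the most common ones first. Here is a minimal sketch of that variation, assuming GNU md5sum and filenames without backslashes or newlines. The function name hash_distribution is made up for illustration; it passes all the files to a single md5sum call instead of looping, then sorts the counts in descending order.

```bash
# hash_distribution is a hypothetical helper name, not part of the original fragment.
# Prints "count hash" lines, most frequent hash first.
hash_distribution() {
  md5sum -- "$@" |      # one "hash  filename" line per file
    awk '{ print $1 }' | # keep only the hash
    sort |               # group identical hashes together
    uniq -c |            # count each group
    sort -rn             # largest count first
}

# Usage: hash_distribution *.dat
```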

Improving this to tell you which files are the same is left as an exercise for the reader.
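
For readers who would rather skip the exercise, here is one possible sketch, not the only way to do it. It assumes GNU md5sum output (a 32-character hash, two separator characters, then the filename) and filenames without backslashes or newlines. The function name group_by_hash is made up for illustration.

```bash
# group_by_hash is a hypothetical helper name, not part of the original fragment.
# Prints one line per distinct hash: "hash: file1 file2 ..."
group_by_hash() {
  md5sum -- "$@" | sort | awk '
    # a new hash starts a new output line
    $1 != prev { if (NR > 1) printf "\n"; printf "%s:", $1; prev = $1 }
    # the filename starts at column 35 (32 hash chars + 2 separator chars)
    { printf " %s", substr($0, 35) }
    END { if (NR > 0) printf "\n" }
  '
}

# Usage: group_by_hash *.dat
```

Each output line starts with a hash followed by every file that has that content, so the outliers are the lines with only one filename.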


CategoryGeekStuff

CategoryBlog


