dev@glassfish.java.net

Quick analysis of v3 hg repo

From: Paul Sterk <Paul.Sterk_at_Sun.COM>
Date: Mon, 10 Dec 2007 17:39:18 -0800

Bill,

Your postings to the dev@glassfish alias prompted me to run some numbers. I ran find on the binary file types in the /m/glassfish-svn-to-hg repo and fed the output to a shell script that summed the bytes. If you are curious about how I got my results, see [1].

File type   Number of files   Total size (Mbytes)
jar                     374                   300
dll                       1                     1
pdf                      46                    30
rar                      12                    13
swf                       3                     8
zip                      45                    13
Total                   481                   365

If we assume that these files should not be in the v3 hg developer repos (pdfs could go into a separate www repo), then without them the /m/glassfish-svn-to-hg repo is currently about 1GB. Subtracting the 600M taken up by the .hg directory leaves 400M of mostly text files. So, what is the breakdown?

File type (estimates)   Mbytes
Binary files               365
Text files                 400
hg history files           600
Total                     1365

Now, of the total non-history files in the repo (765M), 365M should not be there. That works out to 365M / 765M = 0.477, or 48%. If we removed the bloat, and assume the history shrinks in the same proportion, the revised size of the v3 .hg repository would be:

600M x .52 = 312M

So, the pruned size of the entire hg v3 repo (text files plus text file history) is about 400M + 312M = 712M.
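
For anyone who wants to rerun the estimate with different inputs, here is a minimal sketch of the same arithmetic. The function name estimate_pruned is only illustrative, and it assumes, as above, that the history shrinks in proportion to the binary share of the working files. Because it carries the unrounded share instead of .52, it reports about 314M and 714M rather than 312M and 712M.

```shell
#!/bin/bash
# estimate_pruned BINARY TEXT HISTORY (all in Mbytes)
# Assumption: removing the binaries shrinks the history by the same
# fraction that the binaries occupy among the non-history files.
estimate_pruned() {
  awk -v b="$1" -v t="$2" -v h="$3" 'BEGIN {
    share = b / (b + t)            # binary share of non-history files
    pruned_history = h * (1 - share)
    printf "binary share: %.3f\n", share
    printf "pruned history: %.0fM\n", pruned_history
    printf "pruned total: %.0fM\n", t + pruned_history
  }'
}

# Figures from the tables above: 365M binary, 400M text, 600M history.
estimate_pruned 365 400 600
```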

I expect that there are a number of opportunities to further reduce the size of the existing svn repo. I will send a follow-up email that estimates the size of each of the modularized repos.

Thanks,
Paul

-------------------------------------------------------------------------------------------------------------------------------------
Notes from Ken:

Note that one problem we have is people confusing the download size and the working copy size.
Download is what hg pull sees, and for an initial pull, should be more or less the contents of .hg.
hg pull may or may not compress on pull (it doesn't for ssh, but you can configure ssh to compress),
which for text files should give good results.  After the pull (or as part of an initial hg clone),
the hg update roughly doubles the size of the local repository.  I think everything we've seen says that
the history is a rather small part of the repo size, compared to the large number (38000 or so) of text
files.
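
For the ssh case, one way to turn compression on is to pass ssh's -C flag through hg's --ssh option (setting "ssh = ssh -C" in the [ui] section of an hgrc file has the same effect). A minimal sketch; clone_compressed is just an illustrative helper name, and the URL in the usage comment is a placeholder, not a real server.

```shell
#!/bin/bash
# clone_compressed URL [DEST]: clone over ssh with ssh-level compression.
# ssh -C asks ssh to compress the data stream; hg --ssh selects the
# ssh command that Mercurial runs for ssh:// URLs.
clone_compressed() {
  hg clone --ssh "ssh -C" "$@"
}

# Usage (hypothetical host and path):
#   clone_compressed ssh://hg.example.org/glassfish-v3 v3
```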

Just as an experiment, I cloned the hg repo and chopped out most of the big binaries
(This probably corrupted the repository, but I'm only interested in rough sizes here).
A tarball of the .hg directory takes up 304 Mbytes, which gzips to 230 MB.  This might
be closer to the size for getting all of the repository, not including further cleanups, and
especially modularization of the workspace.  For example, www is 191 MB, and that should
probably be a separate repository, since most developers (other than doc writers contributing
tutorials and such: we actually have outside contributors doing that for Grizzly) won't be working
on the docs.

If we delete the www directory, the tarball reduces to 205 MB, and gzips to 140 MB.
Further reductions by splitting into more repositories should get us to a typical developer
repository download size of 20-40 MB or so.
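
The tarball measurement above can be repeated on any clone with something like the sketch below. It streams the tarball through wc instead of writing it to disk; hg_sizes is an illustrative name, and the numbers are only a rough proxy for download size, as noted above.

```shell
#!/bin/bash
# hg_sizes REPO: print two numbers, the size in bytes of a tarball of
# REPO/.hg and the size of that same tarball after gzip.
hg_sizes() {
  repo=$1
  # $(( ... )) strips the padding that some wc implementations emit.
  raw=$(( $(tar -cf - -C "$repo" .hg | wc -c) ))
  packed=$(( $(tar -cf - -C "$repo" .hg | gzip -c | wc -c) ))
  echo "$raw $packed"
}

# Usage: hg_sizes /path/to/clone
```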
-------------------------------------------------------------------------------------------------------------------------------------

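[1]
# Write the size in bytes (field 5 of ls -l) of every jar file, one per line:
find . -name '*.jar' -exec ls -l {} \; | awk '{print $5}' > /home/psterk/jar.file.bytes

addFiles.sh:
#!/bin/bash
# Sum the whitespace-separated integers in the file named by $1.
set -- $(< "$1")    ## for multiple files use: set -- $(cat "$@")
q=$*                 # all the numbers joined by single spaces
printf "%s\n" $(( ${q// / + } ))    # turn "a b c" into "a + b + c" and evaluate

Usage:
addFiles.sh jar.file.bytes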
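
The steps in [1] have to be repeated for each file type. As a possible shortcut, the sketch below produces the count and byte total for several extensions in one run. It is not the script used for the numbers above; summarize_types is an illustrative name, and it assumes the usual ls -l layout where the size is field 5.

```shell
#!/bin/bash
# summarize_types DIR EXT...: print "ext count bytes" for each extension.
summarize_types() {
  dir=$1
  shift
  for ext in "$@"; do
    # ls -l on plain files prints one line per file with the size in field 5.
    find "$dir" -type f -name "*.$ext" -exec ls -l {} + |
      awk -v ext="$ext" '{ n++; bytes += $5 }
                         END { printf "%s %d %d\n", ext, n, bytes }'
  done
}

# Usage: summarize_types . jar dll pdf rar swf zip
```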