Bill,
Your postings to the dev@glassfish alias prompted me to run some
numbers. I did a find on the binary file types in the
/m/glassfish-svn-to-hg repo and then sent the output to a shell script
that summed the bytes. If you are curious about how I got my results, see
[1].
| File type | Number of files | Total size (Mbytes) |
|-----------|-----------------|---------------------|
| jar       | 374             | 300                 |
| dll       | 1               | 1                   |
| pdf       | 46              | 30                  |
| rar       | 12              | 13                  |
| swf       | 3               | 8                   |
| zip       | 45              | 13                  |
| Total     | 481             | 365                 |
Let us assume that these files should not be in the v3 hg developer
repos (pdfs could go into a separate www repo). The current size of the
/m/glassfish-svn-to-hg repo is about 1GB. If we subtract the current
size of the .hg repository, 600M, we are left with about 400M of mostly
text files. So, what is the breakdown?
| File type (estimates) | Mbytes |
|-----------------------|--------|
| Binary files          | 365    |
| Text files            | 400    |
| hg history files      | 600    |
| Total                 | 1365   |
Now, of the total non-history files in the repo (765M), 365M should
not be there. That works out to 365M / 765M = 0.477, or 48%. If we
removed the bloat, and the history shrank in the same proportion, the
revised size of the v3 .hg repository would be:
600M x 0.52 = 312M
So, the pruned size of the entire hg v3 repo (text files plus text file
history) is about 400M + 312M = 712M.
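The estimate above can be reproduced with a short script. This is only a sketch: the function name pruned_estimate is made up for illustration, and it carries the same assumption as the email, namely that history shrinks in proportion to the share of working files removed. It does not round the intermediate fraction to 0.52, so it prints 314M and 714M rather than the 312M and 712M quoted above.

```shell
#!/bin/bash
# Estimate the pruned repository size (all figures in Mbytes).
# Assumption (taken from the email): the history shrinks in proportion
# to the share of non-history files that are removed.
pruned_estimate() {
    # $1 = binary files, $2 = text files, $3 = hg history
    awk -v bin="$1" -v txt="$2" -v hist="$3" 'BEGIN {
        bloat = bin / (bin + txt)          # fraction of non-history files to drop
        pruned_hist = hist * (1 - bloat)   # history scaled by what remains
        printf "bloat fraction: %.3f\n", bloat
        printf "pruned history: %.0fM\n", pruned_hist
        printf "pruned total:   %.0fM\n", txt + pruned_hist
    }'
}

pruned_estimate 365 400 600
```

Changing the three arguments makes it easy to redo the estimate once better measurements are available.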
I expect that there are a number of opportunities to further reduce the
size of the existing svn repo. I will send a follow-up email that
estimates the size of each of the modularized repos.
Thanks,
Paul
-------------------------------------------------------------------------------------------------------------------------------------
Notes from Ken:
Note that one problem we have is people confusing the download size with
the working copy size.
The download is what hg pull transfers and, for an initial pull, should be
more or less the contents of .hg.
hg pull may or may not compress the transfer (it doesn't over ssh, but you
can configure ssh to compress), and compression should give good results
for text files. After the pull (or as part of an initial hg clone), the
hg update roughly doubles the size of the local repository. I think
everything we've seen says that the history is a rather small part of the
repo size, compared to the large number (38000 or so) of text files.
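One way to configure ssh to compress, as mentioned above, is to tell Mercurial to run ssh with its -C option. The fragment below is a minimal sketch for a per-user hgrc; whether it actually helps depends on the link speed and on how compressible the data is.

```ini
# ~/.hgrc (per-user Mercurial configuration)
[ui]
# Run ssh with -C so that the transport stream is compressed.
ssh = ssh -C
```

The same effect can be had for a single command by passing --ssh "ssh -C" to hg clone or hg pull.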
Just as an experiment, I cloned the hg repo and chopped out most of the
big binaries. (This probably corrupted the repository, but I'm only
interested in rough sizes here.)
A tarball of the .hg directory takes up 304 Mbytes, which gzips to 230
MB. This might be closer to the download size for the whole repository,
before any further cleanups and, especially, modularization of the
workspace. For example, www is 191 MB, and that should probably be a
separate repository, since most developers (other than doc writers
contributing tutorials and such: we actually have outside contributors
doing that for Grizzly) won't be working on the docs.
If we delete the www directory, the tarball reduces to 205 MB, and
gzips to 140 MB.
Further reductions by splitting into more repositories should get us to
a typical developer repository download size of 20-40 MB or so.
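The tarball measurements above can be repeated with something like the following. It is a rough sketch, not the commands actually used for these notes: the function name tarball_sizes is hypothetical, and the sizes are reported in bytes rather than MB.

```shell
#!/bin/bash
# Print the uncompressed and gzip-compressed tarball sizes (in bytes)
# of a directory, without writing the tarball to disk.
# Hypothetical helper, intended only for rough estimates.
tarball_sizes() {
    # $1 = directory to measure, e.g. .hg
    local raw gz
    raw=$(tar -cf - "$1" | wc -c | tr -d ' ')
    gz=$(tar -cf - "$1" | gzip -c | wc -c | tr -d ' ')
    printf "tar: %d bytes\ngzip: %d bytes\n" "$raw" "$gz"
}
```

Run from the top of a clone as: tarball_sizes .hg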
-------------------------------------------------------------------------------------------------------------------------------------
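[1]
Step 1: write the size in bytes of every jar file to a file (shown here
for jar files):

find . -name '*.jar' -exec ls -l {} \; | awk '{print $5}' > /home/psterk/jar.file.bytes

Step 2: sum the byte counts with addFiles.sh:

#!/bin/bash
# Load the whitespace-separated byte counts from file $1 into the
# positional parameters.
set -- $(< "$1")   ## for multiple files use: set -- $(cat "$@")
q=$*
# Replace each space with " + " and evaluate the result arithmetically.
printf "%s\n" $(( ${q// / + } ))

Usage:

addFiles.sh jar.file.bytes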
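
As a variant of the two steps above, the file count and byte total for one file type could be computed in a single pipeline. This is a sketch only: sum_by_type is a hypothetical helper, not part of the original workflow, and it assumes file names contain no newlines.

```shell
#!/bin/bash
# Print "<ext> <file count> <total bytes>" for one file extension.
# Hypothetical helper; not part of the original addFiles.sh workflow.
sum_by_type() {
    # $1 = directory to search, $2 = extension without the dot
    # ls -ln prints numeric owner/group ids, so the size is always field 5.
    find "$1" -type f -name "*.$2" -exec ls -ln {} + |
        awk -v ext="$2" '{ n++; bytes += $5 }
                         END { printf "%s %d %.0f\n", ext, n, bytes }'
}
```

For example, sum_by_type . jar prints one line with the extension, the number of jar files, and their combined size in bytes.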