======================
Storage speed analysis
======================
We performed a quick study to assess the time taken to transfer data to a remote Git repository.
Three methods are tested:

* a regular Git push,
* tracking the new files with Git LFS (Large File Storage) and pushing,
* putting the files in an S3 bucket using s3fs and pushing symbolic links.

We vary the number of added files from 1 to 100 and the total added size from 1 Mo to 100 Mo (megaoctets, i.e. megabytes).
The original size of the repository does not significantly affect the measured durations.
Therefore, the repository is cleaned before each test.
Then, a linear fit is performed on the time measurements for each method.
The results are reported in the following table.

.. table:: Time decomposition of different storage methods
   :align: center

   ======= ============ ============== ==================
   Method  Constant [s] Per Mo [ms/Mo] Per file [ms/file]
   ======= ============ ============== ==================
   Regular 2.0          110            0
   LFS     4.6          70             90
   S3      1.5          110            400
   ======= ============ ============== ==================

Part of the duration is independent of both the number of files and the total size (column "Constant"); it plausibly corresponds to local Git operations and to establishing the connection with the remote repository.
Unsurprisingly, another part is proportional to the total transferred size (column "Per Mo").
For the LFS and S3 methods, however, there is also a part that is proportional to the number of files (column "Per file").
This could be due to a new connection being initiated for each file, which results in a significant overhead when the number of files is large.
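
The decomposition above amounts to a simple additive model. The snippet below is a minimal sketch of it, assuming the duration is ``constant + per_mo * size + per_file * n_files`` with the coefficients copied from the table; the names ``FITTED`` and ``predicted_duration`` are hypothetical and not part of the study's scripts.

```python
# Fitted coefficients copied from the table above:
# (constant [s], per Mo [s/Mo], per file [s/file]).
FITTED = {
    "Regular": (2.0, 0.110, 0.000),
    "LFS": (4.6, 0.070, 0.090),
    "S3": (1.5, 0.110, 0.400),
}


def predicted_duration(method, size_mo, n_files):
    """Return the modelled transfer duration in seconds.

    Assumes the additive linear model described in the text; it is only
    meaningful inside the measured range (1 to 100 files, 1 to 100 Mo).
    """
    constant, per_mo, per_file = FITTED[method]
    return constant + per_mo * size_mo + per_file * n_files


# Compare the three methods for 100 files totalling 100 Mo.
for method in FITTED:
    duration = predicted_duration(method, size_mo=100, n_files=100)
    print(f"{method:8s}{duration:6.1f} s")
```

For 100 files totalling 100 Mo, this model gives about 13.0 s for a regular push, 20.6 s for LFS and 52.5 s for S3.
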
All the measurements and fitted curves are shown in the figures below.

.. figure:: figures/git_push_added_size.png
   :scale: 65 %
   :align: center

   Time taken to push files to a regular remote Git repository

   Dots are measurements, the line is a linear regression (note the logarithmic scale).

Here, the transfer duration does not depend on the number of files (trying to include this term in the regression leads to a very small component).
We suspect that Git compresses the data into a single payload before sending it to the remote repository, limiting the number of requests.

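
The regression with a per-file term can be sketched as an ordinary least-squares fit. The snippet below is an illustration only, not the script used for the study: it assumes the model ``t = constant + per_mo * size + per_file * n_files``, solves the normal equations in pure Python, and uses synthetic, noise-free timings generated from the "Regular" row of the table in place of the real measurements. The function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: vary both size and file count")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_time_model(sizes_mo, n_files, durations_s):
    """Least-squares fit of t = constant + per_mo * size + per_file * n_files.

    Returns (constant [s], per_mo [s/Mo], per_file [s/file]).
    """
    rows = [(1.0, float(s), float(n)) for s, n in zip(sizes_mo, n_files)]
    # Normal equations: (X^T X) coeffs = X^T t.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, durations_s)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Synthetic stand-in for the measurements (NOT real data): a grid of
# sizes and file counts, timed with the "Regular" row coefficients.
grid = [(s, n) for s in (1, 10, 100) for n in (1, 10, 100)]
sizes = [s for s, _ in grid]
counts = [n for _, n in grid]
timings = [2.0 + 0.110 * s for s, _ in grid]

constant, per_mo, per_file = fit_time_model(sizes, counts, timings)
print(f"constant = {constant:.3f} s")
print(f"per Mo   = {per_mo:.3f} s/Mo")
print(f"per file = {per_file:.3f} s/file")
```

On this synthetic input the fit returns the constant and per-Mo terms it was generated from, and a per-file coefficient that is zero up to rounding error. Real measurements are noisy, so the per-file component would be small rather than exactly zero.
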

.. figure:: figures/git_push_lfs_n_files_new.png
   :scale: 65 %
   :align: center

   Time taken to push files using LFS (Large File Storage)

In this case, the transfer duration also scales with the number of files.
When the number of files is large, the transfer duration is presumably dominated by the time taken to establish a connection with the remote repository for each file.


.. figure:: figures/git_push_s3_n_files_new.png
   :scale: 65 %
   :align: center

   Time taken to upload files to an S3 bucket and push symbolic links

The component of time proportional to the number of files is even larger than for LFS.
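
To make the weight of the per-file term concrete, the following sketch splits a modelled duration into its three components. It assumes the additive linear model and the coefficients of the table; ``time_shares`` and ``COEFFICIENTS`` are hypothetical names introduced for illustration.

```python
# Coefficients from the results table:
# (constant [s], per Mo [s/Mo], per file [s/file]).
COEFFICIENTS = {
    "Regular": (2.0, 0.110, 0.000),
    "LFS": (4.6, 0.070, 0.090),
    "S3": (1.5, 0.110, 0.400),
}


def time_shares(method, size_mo, n_files):
    """Return the fraction of the modelled duration due to each term."""
    constant, per_mo, per_file = COEFFICIENTS[method]
    parts = {
        "constant": constant,
        "per_mo": per_mo * size_mo,
        "per_file": per_file * n_files,
    }
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


# Many small files: 100 files totalling 1 Mo.
for method in COEFFICIENTS:
    share = time_shares(method, size_mo=1, n_files=100)["per_file"]
    print(f"{method:8s} per-file share: {share:.0%}")
```

With 100 files totalling 1 Mo, the per-file term accounts for roughly 66 % of the modelled LFS duration and 96 % of the S3 duration, against 0 % for a regular push.
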