- 
                Notifications
    You must be signed in to change notification settings 
- Fork 48
Add rsync option to package command and improve performance #303
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… bower_components
This reverts commit 2b38622.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mike-potter for working on this! It's been a goal of mine for awhile. I'm in the middle of a review, it might take me a few sessions to complete.
| - echo "Working copy disk usage"; du -chs . | ||
| - grunt --version | ||
| - grunt help | ||
| - grunt --quiet --timer | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove quiet? Seems like an unrelated change. We might make the test less verbose by setting the environment variable GDT_QUIET.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the --quiet because it was super hard to debug the failing travis tests. Travis already spits out a ton of stuff and seeing what the grunt tasks actually do and if they get errors is important I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was --quiet suppressing errors too? In our case it's used to suppress desktop notifications which are irrelevant from inside travis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package command had a bug that was not completing all of its tasks, so without seeing the task output it was difficult to debug. Also hard to see the actual tasks being run through the cloud of all the other output generated in the travis log. I don't consider those desktop notifications irrelevant for travis as they aided me in debugging. And it doesn't hurt to have them, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the --quiet option and GDT_QUIET env var are supposed to only affect desktop notifications (i.e. growl). If the option affects the error reporting from Grunt, we should change the option to avoid conflicts.
|  | ||
| **rsync**: If set to true, rsync will be used instead of file copying for | ||
| improved performance. This automatically enables the `compress` option and will | ||
| write the archive to the package destination. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why enable compress by default? The main use case I've had for this requires not doing it, so curious what the thought is. Don't recall if that can be turned off, we should note that in docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a default. You can still set "archive" to false if you don't want it. Just trying to support the best defaults for new projects. I want "rsync=true" to represent the default behavior for "high performance" that enables archive, vm_data, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suppose a project leaves it creating an archive, and turns on rsync. What is the use case we're aiming for? Operational code backups? I'm open to calling this a best default, I'm just trying to understand what we're talking about empowering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By creating a single archive and using rsync you get the highest performance. There is only a single file (the archive) the needs to be moved out of the VM disk and into the NFS mounted system which is hugely faster than moving the non-archived directory.
What they do with the archive is project specific. I expect we'll eventually add a "deploy" command that does all of the git commit/push work from inside grunt, but for now the archived package can be moved to a hosting site, or uncompressed locally and committed to a production repo.
Basically, we are just trying to minimize the total data flow to NFS along with the individual number of files (since each file has NFS overhead, so lots of small files are really bad performance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, writing the tarball being less I/O, offset by the time to create the tarball. Have you noted the time to create the tarball is less than syncing the files individually? I expect so.
| package when using the `rsync` option. By selecting a volume mounted in your | ||
| VM `/data` area, this can significantly improve performance. The archive is | ||
| still moved to the normal package destination when complete. Defaults to | ||
| `build/vm_data`. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GDT is not targeted at virtual/containerized environments. This piece needs a bit more contextual documentation and to remove /data as an explicit concept.
Seems strange to place it in build/vm_data, why not build/package/tmp or even get a formal /tmp directory generated via os.tmpdir()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the super important part of this PR. Without the virtual intermediate directory there was no way to get anything performant and people are bypassing gdt on their projects because of this. I know GDT is not "targetting" at containers, which is why this option isn't enabled by default.
The idea here is that in your docker container you mount a volume that gdt can use for various purposes to improve performance. Only tackling the packaging right now, but I can imagine other commands using this also. So don't want it to be within "build/package". The /data is just an example that pertains to our own container, but that isn't anywhere in the GDT except the docs as an example. Since our docs already reference many P2 practices, I don't see this as a problem right now (see the CI docs for example).
Still, specific updates to this doc file are welcome if you want to wordsmith it a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure all config options and documentation are not overly bound to containers or virtual environments specifically, even if the changes we have facilitate that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this less specific to Docker and the /data path set up by devtools, could this be recast as a temporary directory path (i.e. tempDir rather than vmData)? We can note in the documentation that the purpose for this path is to allow using a path that performs quickly if the project directory lives in a slow performing location (i.e. NFS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think calling it a tempDir option would be fine. I'll make that change to the code and docs. Thanks for the suggestion.
|  | ||
| ## Performance | ||
|  | ||
| When running `grunt package` within a docker container, the default file | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we focus more on NFS mounts and less on Docker? I'm fine with Docker in the docs, but it should be almost more like a sidebar detail. It's great that we're looking into accelerating this for use with NFS mounts, it would be great if we could identify similar tricks for Drush Make and Composer (build in an intermediate directory, transfer to the volume mounted space where speedup can better affect the complete codebase.
The major issue that originally drove me to look into rsync and other alternatives to copy was more about size of projects - a built up D8 project or a vanilla Atrium project will break Node under default Node.js settings rather than complete the operation. I'd like to note this -- it's going to be the more common problem than NFS outside Phase2 usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I covered this above, but the whole point of this PR was performance. Switching from copy to rsync gives a small performance improvement. The majority of the improvement comes from using the vm_data.
Yes, it's really about NFS and not docker, so the docs can be wordsmithed a bit.
And yes, this vm_data is definitely something we want to incorporate into other areas of gdt. The package command was just the big target because it was taking 40 minutes on a project.
| var archive = grunt.config('config.packages.archive') || useRsync; | ||
|  | ||
| // When using rsync, generate to a path that can be mounted in the VM. | ||
| var vmPath = grunt.config.get('config.packages.vmData') || 'build/vm_data'; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per earlier comments, I'd like a more contextual, less tech-specific name for this -- intermediary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specific suggestions are welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per other thread, we're calling it tempDir. Just closing the loop here :) Certainly shorter than intermediary.
|  | ||
| var excludePaths = ['bower_components', 'node_modules', '.gitkeep']; | ||
|  | ||
| if (useRsync) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As previously explored with https://github.com/phase2/grunt-drupal-tasks/blob/master/tasks/make.js#L71, we need a fallback for Windows which so far has been simply "Got Windows? Okay, we'll still use copy!" Because the configuration for copy and rsync are not fully compatible, I've been thinking of rsync not as a mode, but rsync to copy fallback as a breaking change so we can revise configuration requirements. Open to discussion on this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code I added makes them compatible. You still use srcFiles, projFiles, and the new excludePaths. The code builds the proper arguments for either copy or rsync. So yes, Windows cannot use the "rsync=true" option.
If there is a simple way to detect if rsync is installed I'd be happy to use that to set the default and have it fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a canRsync utility function used by the code I linked to. I'd like that to be checked as part of the condition whether we are using Rsync. We can make that function smarter to detect if rsync is installed in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I didn't see the canRsync. I'll add that.
| FYI: Travis tests are failing because of #304 and not because of anything in this PR. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to call it there on a review of the changeset, I think I'm going to need to look at the modified version of package.js directly to make sure I'm not misremembering anything.
| if (item.substr(0, 1) === '!') { | ||
| item = item.slice(1); | ||
| item = item.replace('**', '*'); | ||
| excludePaths.push(item); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really smart, however, the filesystem matching by the Node minimatch library has more breadth of syntax than is typically supported by Bash/Rsync/etc beyond just negation, such as fancy globbing and simple expressions. That's the incompatibility I alluded to earlier. I think we need to confirm that failure is reasonably graceful as a minimum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I think we can address that in the documentation. IMHO none of our config options should be tied to a specific implementation (node, rsync, etc). So we should specify how wildcards are handled and then convert to the needed format for either Copy or Rsync.
Right now the only issue is ** vs * and rsync seems to take the ** and still function, so I didn't need to change that.
If people want fancy expressions then we should use some form of RegEx that would be less implementation-specific. But that's work for later. For now if somebody wants to use a fancy Node minimatch expression then they just can't use the new rsync option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Confirm using a minimatch expression results in rsync failing in a reasonably informative and quick manner.
- Release notes and documentation detail the new restriction to stick with Bash-compatible file matching patterns with rsync mode.
| grunt.config('shell.mvArchive', { | ||
| command: 'mv ' + destPath + '.tgz ' + finalPath + '.tgz' | ||
| }); | ||
| tasks.push('shell:mvArchive'); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of note, mv will not work across NFS volume mounts. We will want to document that limitation a bit to help inform how projects are configured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that's what I thought too. But it seems to work across docker volumes. Maybe we should change it from a "mv" to a normal grunt copy. Then delete the tarball from the destPath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, maybe that restriction is only outside the container. mv is an order of magnitude faster than copy, so if it works I think we should keep it until we have reason to switch to copy.
A bigger issue is windows compat. For portability we should just switch to an fs.rename[Sync].
| } | ||
| tasks.push('copy:package'); | ||
|  | ||
| // If the `composer.json` file is being packaged, rebuild composer dependencies without dev. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy:package looks like it's been stripped down to some final copy operations of random project files. Maybe this should be renamed for clarity? copy:package-extras?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I even thought of "copy:projFiles" to be really specific. If we call it "package-extras" then having the config be "projFiles" might be confusing too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems fine. This is a stepping stone to #255, otherwise I'd suggest we drop projFiles and rename the config for clarity.
| // Setup default rsync command and options. | ||
| var rsync = 'rsync -ahWL --no-l --stats --chmod=Du+rwx '; | ||
| for (var pathIndex = 0; pathIndex < excludePaths.length; pathIndex++) { | ||
| rsync = rsync + "--exclude " + excludePaths[pathIndex] + ' '; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like only the final excluded file would be included here. Shouldn't we remove the loop and replace with an array join on whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is adding "--exclude path" to rsync for each item in the array. So maybe a join with "--exclude " would work? Otherwise this loop does actually work fine.
| srcFiles.unshift('**'); | ||
| // Add any additional excludePaths to srcFiles. | ||
| for (var i = 0; i < excludePaths.length; i++) { | ||
| srcFiles.push('!**/' + excludePaths[i]); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array merge seems more clean here. Also, something to dedup might be worthwhile to make sure any code that wants to remove exclusions only has to find a single instance of a mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally looped through excludePaths to add the '!**/' and then merged the array with srcFiles but this seemed cleaner. So I guess I'd welcome a code alternative to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like:
var _ = require('lodash'); at the top.
srcFiles = _.union(srcFiles, _.map(excludePaths, function (item) {
  return '!**/' + item;
}));
Union both merges and unique-ifies.
Looks like we should check on updating lodash.
| @mike-potter Could we target this for a 1.1.0 milestone? I think the current master version is in good shape for a rc1 release and a stable one after a bit of testing. | 
| Sure, we can do 1.1.0. I agree that we need to get 1.0.0 released so people can start using the composer-based support. And adding the "tempdir" option probably opens up other issues we'd want to work on for improving performance elsewhere in GDT that could benefit. | 
| A sample docker-composer.yml entry to mount the vm_data looks like this: | ||
| ``` | ||
| volumes: | ||
| - /data/PROJECTNAME/package:/var/www/build/vm_data | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've picked up this branch and am trying to pull it along. As part of this I'm reviewing and revising the docs to clarify my game plan. I'm confused why we would volume mount the intermediary directory. Wouldn't that slow down the rsync operation compared to leaving the tempDir somewhere in the container but not in a volume mount where NFS slowdowns can catch it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The /data volume is just in the VM, not on the local Mac NFS system. This simply makes the stuff in /var/www/build/vm_data persistent, similar to what we do with the MySQL database. Volumes are only slow if they map the local Mac folders
This is a fairly significant patch, but should not affect existing projects. It adds a "packages.rsync" option (False by default). When set to true, the package will be generated in a separate folder (packages.vmData defaults to "build/vm_data") via rsync and the compress option is enabled to generate an archive from the package, which is moved to the normal build/packages destination.
By mounting build/vm_data into the VM in your docker_composer.yml file, this allows the package to be built directly to the VM bypassing NFS to the local filesystem. Only the final archive.tgz is moved into the local file system. The vmData persists so the Build container can still perform additional tasks as needed, such as pushing the package into a remote repo. Or, the local tgz archive can be used to deploy the package.
On a current real project, this reduced the normal "grunt package" command running from the build container from a 40-minute task, down to a TWO minute task.
Given the significant performance improvement, I'm marking this PR for Milestone 1, so let's get reviews of this approach. (FYI it was modeled from some code in a "grunt deploy" task written for another project by Adam)