This post is anecdotal, but I figured it could still be helpful in some way, so here I go.
Classic story. At work, the CircleCI CI/CD pipeline of the project I work on became slower and slower as time went by. Recently, it reached a bit over forty minutes. I worked on it and brought it back under ten minutes. Here is what I did.
Looser cache key with unwanted-package cleanup
We had a very strict cache key over our Poetry lock file. (Poetry is the Python package manager we use.) That meant the cache was cold more often than we would have liked. If you read the restore_cache documentation, you can define multiple keys for the same operation; CircleCI tries them sequentially and stops as soon as one matches. Partial keys also work (see the examples in the docs). So, I added the partial version of our full key and, blam, more cache hits than before.
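In CircleCI config, that fallback looks roughly like this (the key prefix is made up for the example):

```yaml
# Keys are tried in order; the first (possibly partial) match wins.
- restore_cache:
    keys:
      # exact match on the lock file
      - deps-v1-{{ checksum "poetry.lock" }}
      # fallback: most recent cache whose key starts with this prefix
      - deps-v1-
```

The second key is what turns a would-be cold cache into a "mostly warm" one: you restore the closest previous environment and only install the delta.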
But there is a catch! The project is in Python, and as far as I know, pip-related tools are great at installing packages from requirements/lock files, but less good at removing unwanted ones. Here is a Poetry example to explain what I mean. Run poetry add requests, then run poetry run pip install ipdb. There you go, you have an "unwanted package": if you look at the lock file, ipdb is not present, and there is no command that will detect and remove it. So, back to our cache: with partial keys it's possible to restore an environment that contains an unwanted package and thus not test the real deal. So, I simply came up with a cleanup script that compares the environment with the lock file and removes what is unwanted. Problem solved. (So far… unsure if this will turn out to be a problem at some point.)
Parallelizing with workflows
I did it in two ways. First, our webapp is still packaged with the backend (I know, it's not a good pattern in general, but context is everything and for us it makes sense), so using workflows I created two jobs: one for the server and one for the webapp. There go a couple of minutes.
The second thing I did with parallelizing is related to our AI system. There are a ton of regression and variation tests that do not add to the coverage, so I moved them to their own job as well. There go a few more minutes.
So, with those parallel jobs we had to install the dependencies twice. Not so bad in itself, but not so great when you know that a cold cache means seven minutes of extra install time, on two machines, while other jobs may be waiting for CircleCI instance time.
Again using workflows, I extracted the dependency install into a prerequisite job. If the cache is warm, it runs fast; if not, it takes up to seven minutes. Then, I used persist_to_workspace to pass the built environment along to the downstream jobs.
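Put together, the workflow shape looks roughly like this (job names and the .venv path are invented for the example; executors and most steps are omitted):

```yaml
workflows:
  build-and-test:
    jobs:
      - install-deps
      # the three test jobs fan out in parallel once deps are ready
      - server-tests:
          requires: [install-deps]
      - webapp-tests:
          requires: [install-deps]
      - ai-regression-tests:
          requires: [install-deps]

jobs:
  install-deps:
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-v1-{{ checksum "poetry.lock" }}
            - deps-v1-
      - run: poetry install
      # hand the built environment to the downstream jobs
      - persist_to_workspace:
          root: .
          paths: [.venv]
  server-tests:
    steps:
      - checkout
      - attach_workspace:
          at: .
      - run: poetry run pytest
```

The dependencies are installed once, and every test job starts from the same attached workspace instead of rebuilding its own environment.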
From machine executor to docker executor
This is the change that had the biggest impact. Machine executors run on two cores, and the documentation states that the start time for each job is between 30 and 60 seconds. At this point our pipeline had four steps, so that's between two and four minutes of extra time. The docker executor, on the other hand, starts almost instantly. For the cores it's less clear: the header of the job's UI still says two cores, but ssh-ing into the jobs and looking at
/proc/cpuinfo shows 36. And what is crystal clear is the performance: the same jobs ran about 2-3x faster on the docker executor than on the machine executor. There was also an unexpected side effect: the persist_to_workspace step also improved dramatically, from about two minutes on the machine executor down to forty seconds. Attaching the workspace also went from thirty seconds down to ten. Another nice gain!
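The switch itself is a one-line change per job (the images below are examples, not necessarily the ones we use):

```yaml
jobs:
  tests-on-machine:
    machine:
      image: ubuntu-2204:current   # 30-60s spin-up per job
    steps:
      - checkout
  tests-on-docker:
    docker:
      - image: cimg/python:3.11    # starts almost instantly
    steps:
      - checkout
```

The main constraint to keep in mind is that the docker executor cannot run the Docker daemon directly; jobs that build images need setup_remote_docker, which comes up in the bonus section below.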
Bonus: More testing
While at it, a team member told me it would be great to test the docker image build all the time rather than just see if it fails at deploy time. All I did then was cut the build + push image step in two. The push step still runs only on certain branches (dev and prod) while the build step runs all the time. To avoid rebuilding the image in the push step, the build step runs docker save and moves the resulting file to the persistent workspace. The push step then attaches the workspace, runs docker load, and pushes the image.
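Sketched as CircleCI config, the split looks something like this (image name, tag, and branch filters are placeholders):

```yaml
jobs:
  build-image:
    docker:
      - image: cimg/base:current
    steps:
      - checkout
      - setup_remote_docker
      - run: docker build -t my-app:latest .
      # serialize the image so the push job can reuse it as-is
      - run: docker save my-app:latest -o image.tar
      - persist_to_workspace:
          root: .
          paths: [image.tar]
  push-image:
    docker:
      - image: cimg/base:current
    steps:
      - setup_remote_docker
      - attach_workspace:
          at: .
      - run: docker load -i image.tar
      - run: docker push my-app:latest

workflows:
  build-and-deploy:
    jobs:
      - build-image           # runs on every branch
      - push-image:
          requires: [build-image]
          filters:
            branches:
              only: [dev, prod]
```

This way a broken Dockerfile fails fast on every branch, while credentials and registry access stay confined to the deploy branches.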
I did not try taking the original pipeline and just changing the executor, but I'm certain that alone would already have yielded impressive results. We had a reason for not using the docker executor in the first place, which went away with time, but had we known how slow the machine executor was, maybe we would have dealt with the issue differently.