Git seems to be mentioned everywhere these days. Do you have a clear understanding of it? Review key concepts to leverage Git in your pipeline.
Have you heard about Git but don’t want to read a 500-page document about it? Are you happy with just enough knowledge of how it works, or do you just need to refresh the core concepts? Have you gone through some books and tutorials and are still confused about some basic concepts?
If the above describes you, then this article is perfect for you! This post is not intended to be a complete description of what Git is and how to use it, but rather a description of a few concepts that can be useful if you are new to Git. If you’re interested in gaining a deep understanding of Git, you should look at the official reference book and some of the many courses available about the topic: [Git](https://git-scm.com/)
There is also a free e-book recommended by Git called [Pro Git Book](https://git-scm.com/book/en/v2).
And of course, the GitHub docs are a great source for more details about forking a repo and using it in the GitHub workflow (including an introduction to pull requests, or PRs). Although there are other methods, this one is quite common in Open Source.
[Fork a repo - GitHub Docs](https://docs.github.com/en/get-started/quickstart/fork-a-repo)
In our team, we use [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
## What is Git?
Git is a tool for software version management that is used by millions of developers, both on the command line and through different GUIs. It is also at the core of GitOps and other Git-based infrastructure and configuration management workflows. Knowing Git is a key requirement for professional development and, more recently, for proficiency in environments using Infrastructure as Code.
In Monokle, we have identified a market need for better tools that can leverage Git to support configuration version management. We asked around and discovered that some people on our team who were not developers did not know enough about it. Not only that, Kubernetes is used and managed by non-developers, such as operators who are not familiar with software version management, TDD, BDD, and other similar practices that make it easier to work with software artifacts.
So, I decided to write a small article about the most basic concepts needed to understand what Git is and how it works, without going into the details of the really hard things, like conflict management.
I hope you like it!
## Getting started with Git
Git is a distributed version control system: it records changes to a file or set of files so you can recall a specific version later. It can store changes to any kind of file, although it is normally used for software source code. (Be careful: binary files are not easy to compare and diff, and they usually occupy a lot of space.)
In the past, it was normal to have version control systems where one central server stored the data. You simply connected to download and upload the version you were working with. More recently, that model has largely given way to distributed systems. Distributed means that any time you copy a repository to another machine, you create a full mirror of everything, including the whole history, and not just a subset. Any “clone” is thus a full backup that can be used in different workflows, with or without a central repository, and can even become the new central repository if you need it to.
Because Git works with any kind of file, you could for instance keep the different versions of your design documents, or every rewrite of your latest novel.
## Git repositories
Git works with **repositories**, which are the basic units that get updated and synchronized as you work. A repository is a folder containing your files plus some metadata. You can tell Git to ignore some of the files if you don’t want to share them (e.g. binary files, secrets, OS-specific files, and anything that is specific to you and not to the source code).
There are two ways to obtain a Git repository:
* You can turn a local directory into a Git repository (`git init`)
* You can **clone** an existing repository that is stored in another place. This will create a local copy with all the metadata already in place.
In both cases, you end up with a folder containing your files and some metadata inside a _.git_ subdirectory at the base of your folder. That subdirectory stores information about every tracked file and every committed change to it. Git stores this history compactly by compressing content and packing similar versions together, which works much better for text than for binaries (another reason why binaries are not a good fit).
There is a catch: *files are NOT automatically added to a repository when you create it*. You need to add them manually so they are tracked. In addition, changes to files are not recorded automatically in the Git metadata, so you have to stage and commit them yourself.
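As a minimal sketch of the first option, the commands below turn a throwaway directory into a repository and show that an existing file is not tracked automatically. The directory and file names (`my-project`, `notes.txt`) are hypothetical.

```shell
set -e                        # stop at the first failing command
cd "$(mktemp -d)"             # throwaway location for the experiment
mkdir my-project && cd my-project
echo "hello" > notes.txt      # a file that exists before the repo does

git init -q                   # creates the .git metadata directory
ls -d .git                    # .git
git status --short            # ?? notes.txt  (the file is untracked)
```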
## Recording changes to files
Files need to be tracked and updated explicitly. You can work on a file without Git storing the changes in its metadata (compare that with Google Docs, where every change is stored automatically as a new version every few seconds), so it is useful to be able to check the state of your files at any moment.
You can check the state of any file or folder in a repository using `git status <file>` (or `git status` for the whole repository).
Files follow a flow inside Git and can be in different states:
* By default, any new file in the repo is **Untracked** and Git won’t store any information about it. The output of `git status` will list it under _Untracked files_.
* You can _add_ files and folders to be tracked by Git, putting them in the **staging** area. When you stage a file, you store a snapshot of its content in Git, but that snapshot is not yet part of the history.
* If you stage a file, then edit it and stage it again, only the latest snapshot stays in the staging area.
Adds a snapshot of every file in the directory: `git add .`
Adds a snapshot of the file or folder: `git add <filename>`
* In order to be able to store and manage versions of a file, you must **commit** a new version.
* Committing will create a new version in the history of every file that is staged and identify that version with a hash.
* At that moment, the staged and committed versions will be the same.
`git commit -m "<commit message>"`
* `git commit` without `-m` opens your configured editor so you can write the commit message.
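The flow described in the list above can be sketched as follows. This is only an illustration in a throwaway repository; the file name and the user identity are hypothetical.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "first draft" > notes.txt
git status --short            # ?? notes.txt  (untracked)

git add notes.txt
git status --short            # A  notes.txt  (staged, not yet committed)

git commit -q -m "Add notes"
git status --short            # prints nothing: working copy, stage and commit agree
git log --oneline             # one line: abbreviated hash followed by "Add notes"
```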
You can edit a file at any moment:
* If the file is *Untracked*, it will continue to be so.
* If the file is *Staged*, Git will identify it as *Modified* (and will keep the staged snapshot).
* If the file is *Committed*, Git will identify it as *Modified*.
* A file can be in all of these states at the same time:
    * *Modified*: the filesystem version is not the same as the staged one.
    * *Staged*: the staged version has not been committed.
    * *Committed*: there is a version that has been included in a commit.
* You always commit the staged version, even if a newer modified version exists, so don’t forget to stage your latest changes before committing.
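To see a file in several states at once, a small experiment like this one can help. It is a sketch in a throwaway repository with a hypothetical file, where three different versions live in the commit, the staging area, and the working copy.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Version 1"      # committed: v1

echo "v2" > notes.txt
git add notes.txt                 # staged: v2

echo "v3" > notes.txt             # working copy: v3, not staged

git status --short                # MM notes.txt  (a staged change plus an unstaged one)
git commit -q -m "Version 2"      # commits the staged v2, not v3
git show HEAD:notes.txt           # v2
git status --short                #  M notes.txt  (v3 exists only in the working copy)
```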
It is possible to work with the different versions:
* You can copy a modified version to stage and from stage to the working version.
* You can commit a staged version as part of a commit and you can copy a committed version to stage.
* You can ignore a file so it does not appear in your status (this comes in handy to keep .DS_Store files or binary outputs from being committed and taking up space).
Files are ignored using a `.gitignore` file in the folder. It tells Git to ignore paths that match some patterns (like `*.a`), or to track a file even if another pattern would ignore it (`!mylib.a`).
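A quick way to try the two patterns mentioned above is sketched here, using a throwaway repository. The file `other.a` is a hypothetical name added for contrast with `mylib.a`.

```shell
set -e
cd "$(mktemp -d)"
git init -q

# Ignore every .a file, except mylib.a
printf '%s\n' '*.a' '!mylib.a' > .gitignore
touch other.a mylib.a

git status --short             # lists .gitignore and mylib.a, but not other.a
git check-ignore -v other.a    # shows which .gitignore rule matches other.a
```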
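It is even possible to get a report on the changes between the working version, the staging area, and the last commit.
`git diff` reports the lines added and removed in files that are modified but not yet staged, and `git diff --staged` does the same for staged changes compared with the last commit.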
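The following sketch shows how the report changes as a modification moves into the staging area. It assumes a throwaway repository and a hypothetical file.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "line one" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "line two" >> notes.txt
git diff                 # unstaged change: the report includes "+line two"

git add notes.txt
git diff                 # prints nothing: the change is now staged
git diff --staged        # staged change against the last commit: "+line two"
```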
## Commits and branches
Commits are stored in the history of Git and each one is identified by a unique identifier (a SHA-1 checksum). Each commit records a snapshot of the staged files and points to the commit that came before it, so together the commits form a history line.
Commits can be seen using the log command, which lists them in reverse chronological order (most recent first).
`git log`
Commits are great for comparing versions. At any time, you can compare any commit with any other and see what changed between them.
`git diff 0be526a 161bec8`
If you want to undo a previous change, you can use `git revert`, which creates a new commit that reverses the changes made by one or more earlier commits. This won’t rewrite your history: a new commit is added that undoes the changes, but the previous commits will still be there.
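As an illustration, this sketch commits a mistake and then reverts it in a throwaway repository. File contents and commit messages are hypothetical.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "good" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "mistake" >> notes.txt
git commit -q -a -m "Add a mistake"

git revert --no-edit HEAD     # new commit that undoes the last one
cat notes.txt                 # good
git log --oneline             # three commits: the revert, the mistake, the original
```

The mistaken commit is still in the log, which is exactly the point: the history grows, it is not rewritten.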
If there were no mistakes and you always knew exactly what the next step was, a single history line would be a great way of knowing what you have done and going back to it. Basically, you never lose any committed change and you can work on top of any stored version. A single history line, however, can be too limiting if you want to explore or fix things without breaking the code.
Sometimes, you want to be able to keep different parallel versions of the same files. For instance, you may be working on the new version of an application while still maintaining the existing one, and thus you need to be able to switch between the different versions of your application. In Git, these parallel lines of work are called **branches**.
Each commit stores the version of the files that exists in the staging area, some information about the user, and other metadata, with a pointer to the content (blobs) and the directory (tree), and a pointer to the commits that come before.
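If you are curious, you can look inside a commit with `git cat-file`. The sketch below, in a throwaway repository with a hypothetical file, prints the tree pointer, the parent pointer, and the user metadata of the latest commit.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "v2" >> notes.txt
git commit -q -a -m "Second commit"

git cat-file -p HEAD            # tree, parent, author, committer, then the message
git cat-file -p 'HEAD^{tree}'   # the directory listing: one blob entry for notes.txt
```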
A branch is then a movable pointer to one of those commits. The default branch is usually called _main_ (it used to be _master_, so you will find repos and Git installations that still use that name), but it makes no difference which branch you use as the base branch; they are all the same to Git. Each branch is just a history of commits, some of them shared between branches.
When you create a new branch (say, dev), a new pointer is created, and you now have two branches that share commits. So, we need to know exactly which history we are working on: that is what HEAD does for us. It points to the current branch.
1. Creates a new branch called dev: `git branch dev`
2. Moves HEAD to the new branch: `git checkout dev`
3. Changes back to main: `git checkout main`
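The three steps above can be tried in a throwaway repository. In this sketch the file name is hypothetical, and the first branch is explicitly named main so the example does not depend on the local Git default.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is called main
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"

git branch dev                    # new pointer to the same commit
git rev-parse main dev            # both lines show the same hash

git checkout -q dev               # HEAD now points to dev
echo "feature" >> app.txt
git commit -q -a -m "Work on dev" # only dev moves forward

git branch                        # the current branch (dev) is marked with *
git checkout -q main              # back to main
cat app.txt                       # v1
```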
From the moment we switch, whatever branch HEAD points to will be **the** branch: new commits are added to it and move its pointer forward. Everything we have done so far will continue working.
For instance, we will be able to have two separate histories, one in each branch, that evolve differently with the same workflow (modified, staged, committed).
The main reason to create a new branch is either to maintain a completely separate codebase (e.g. versions 1 and 2 of our application), or to work on a feature without affecting the work of others, while maintaining a working shared history (the main branch).
When you want to add changes in one branch to another, the way to do that is to **merge** one branch into another.
1. Go to the main branch: `git checkout main`
2. Apply the new commits from dev into main: `git merge dev`
So, if you want to merge one branch into another to create a new history that includes the commits from both branches, there are three scenarios to keep in mind:
1. The dev branch is simply a set of commits ahead of the main branch. After merging, the main branch is fast-forwarded to the same commit that dev points to. Dev remains the same. Main is now equivalent to dev.
2. The main branch also has new commits, but they do not conflict with the ones in the dev branch. After the merge, Git creates a merge commit on main that joins both histories. Dev remains the same. Main now includes all the commits from dev plus its own.
3. The main branch has commits that modify the same parts of the same files as dev, and thus there is a conflict. Git won’t know which change to keep (you can choose a merge strategy to help it decide), so it will stop, show you the conflicts, and ask you to resolve them. Dev remains the same. Main will have the commits from main and dev, plus a merge commit containing your resolution.
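The first two scenarios can be reproduced with the sketch below (the conflict case is left out on purpose). It uses a throwaway repository, and all file names and commit messages are hypothetical.

```shell
set -e
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main      # make sure the first branch is called main
git config user.name "Example User"        # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Scenario 1: dev is simply ahead of main, so the merge is a fast forward.
git checkout -q -b dev
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"
git checkout -q main
git merge -q dev                  # main now points to the same commit as dev
git rev-parse main dev            # identical hashes

# Scenario 2: both branches gain commits that touch different files.
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "Fix on main"
git checkout -q dev
echo "more" >> feature.txt
git commit -q -a -m "More feature work"
git checkout -q main
git merge -q --no-edit dev        # creates a merge commit with two parents
git log --oneline --graph         # shows the two lines of history joining
```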
## Remotes
So far, the developer (or designer) is working on a single computer. How can we share the code between developers if they are not working on the same computer? What happens when the code is stored in some other server (like GitHub or GitLab)? There must be a process to make sure that you have the code that is up to date with other versions of the same code and that you can publish your history to share with others. In Git, that means cloning a repository and fetching or pulling from it and pushing changes.
You can create a local copy by cloning a remote repo. This copies all the information in it except webhooks and other server-related configuration. It will be _your_ copy and you can work with it as your own. (We will be using GitHub, but you can use GitLab or any other Git server.)
1. `git clone https://github.com/kubeshop/monokle`
That creates a pointer to the remote repo, named _origin_ by default, so you can compare your copy with the one stored there. You can see the remotes configured and even have more than one to work with.
1. `git remote`
2. `git remote -v`
You can use `git remote` to add and delete remotes, and you can have more than one remote for the same repo. This will become important later, when you want to open a pull request.
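Here is a small sketch of managing remotes without needing network access. Two local bare repositories stand in for hosted repos, and the names `original.git`, `fork.git`, and `upstream` are hypothetical choices for this example.

```shell
set -e
cd "$(mktemp -d)"
git init -q --bare original.git       # stand-in for a hosted project repo
git init -q --bare fork.git           # stand-in for a second hosted repo

git clone -q fork.git work            # Git warns that the cloned repo is empty
cd work
git remote                            # origin
git remote add upstream ../original.git
git remote -v                         # origin and upstream, each with fetch and push URLs
git remote remove upstream
git remote                            # origin
```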
Getting additional information is easy. Simply fetch the new data from the remote repo.
1. `git fetch origin`
This step creates or updates remote-tracking branches such as origin/main, origin/dev, etc. It won’t change your local branches; rather, it keeps copies of the remote branches in your local repo so you can decide what next step you’d like to take.
At that moment you can do whatever you want with the branches including merging them into your local branch.
1. Get the branches from origin: `git fetch origin`
2. Change HEAD to the main branch so next actions happen there: `git checkout main`
3. Merge the remote history into our current branch: `git merge origin/main`
Depending on your branching strategy, it is quite possible that merging into your main branch will be a fast forward (the first merge scenario described earlier).
Even if everything goes fine, this is not a nice workflow, because you need to change branches. If you are developing in your own branch, you will have to set aside your uncommitted changes (for instance, by committing or stashing them) before changing branches, download the new changes, merge them, and go back to your working branch, and do that every time you synchronize your state with that of the server. Fortunately, there is a shortcut that will fetch the data from the server and try to merge it into your current branch in one go:
`git pull origin`
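The two-step update and the one-step shortcut can be compared with a sketch like this one. A local bare repository stands in for the server, and the clones `alice` and `bob`, the file, and the identity are all hypothetical. The branch is named explicitly in `git pull` so the example does not rely on tracking configuration.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Build a stand-in server with one commit on main, then clone it twice.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo "v1" > seed/app.txt
git -C seed add app.txt
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed server.git
git clone -q server.git alice
git clone -q server.git bob

# Alice publishes a new commit.
echo "v2" >> alice/app.txt
git -C alice commit -q -a -m "Second version"
git -C alice push -q origin main

# Bob updates in two steps: fetch, then merge the remote-tracking branch.
cd bob
git fetch -q origin
git log --oneline main..origin/main   # the commit Bob does not have yet
git merge -q origin/main              # fast forward

# Alice publishes again; this time Bob uses the shortcut.
echo "v3" >> ../alice/app.txt
git -C ../alice commit -q -a -m "Third version"
git -C ../alice push -q origin main
git pull -q origin main               # fetch and merge in one go
cat app.txt                           # v1, v2 and v3 on separate lines
```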
So now you have worked locally and fetched the new data merging it into your branch. You should have run your tests and everything is working. Now, you are ready to share your work with the world. In Git, you **push** your history to the remote server for a specific branch:
`git push <remote> <branch>`
The rules for pushing to different places can get complicated. To get started, remember that by default you will only be able to push if the remote branch can be fast-forwarded to yours, that is, if your branch already contains every commit on the remote branch.
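The fast-forward rule can be observed with the following sketch. As before, a local bare repository stands in for the server, and the clones, files, and identity are hypothetical. Bob's first push is refused because the server has a commit he lacks; once he has merged it, the push goes through.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo "v1" > seed/app.txt
git -C seed add app.txt
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed server.git
git clone -q server.git alice
git clone -q server.git bob

# Alice pushes first.
echo "from alice" > alice/a.txt
git -C alice add a.txt
git -C alice commit -q -m "Alice's change"
git -C alice push -q origin main

# Bob commits without Alice's commit, so his push is not a fast forward.
cd bob
echo "from bob" > b.txt
git add b.txt
git commit -q -m "Bob's change"
if git push -q origin main 2>/dev/null; then
  result="accepted"
else
  result="rejected"
fi
echo "first push: $result"            # first push: rejected

# After merging the remote history, the push is a fast forward again.
git pull -q --no-rebase --no-edit origin main
git push -q origin main
```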
### Pull requests
Now, this is becoming more exciting! You have a working branch with your new code or fix and you want to make sure that everybody will work with that version. There are lots of ways of doing that, but at least in Open Source, and GitOps, you achieve the goal through **pull requests**.
In a pull request, you ask another developer to merge a branch into the shared repo. The main thing to keep in mind here is that before the PR is merged, somebody will have the chance to review the code, discuss with the author how it has been implemented, ask for changes, or sometimes even alter it. In many cases, automatic tests and policies are applied to help with the discussion (e.g. to make sure that the code follows the format agreed by the team, that all tests are passing, that no component introduces a security risk, etc.), and sometimes more than one reviewer needs to give their OK to the new code.
Moreover, because a pull request can go from a branch in one fork of the code into a different repo, you don’t have (and you don’t need) write access to the original repo. Project owners can accept changes from third parties, friends and family, or anybody wanting to improve the code in a truly open source way. You do that by forking the code: creating a full copy that is not just a distributed copy of the original code but your own version of it. You can do it from the command line by cloning a repo, or in the GitHub interface, where there is a button to create a fork in your profile or in any group where you have the proper access.
You can then create a PR to ask for a merge from your branch in your fork into a branch in the original repo (mostly main but, again, all branches are the same to Git, so it can be any).
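The Git side of this workflow can be sketched locally. Opening the pull request itself happens on the hosting platform, so it is not shown. Here two bare repositories stand in for the original project and your fork, and the names `original.git`, `fork.git`, `upstream`, and `fix-typo` are hypothetical.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo "Helo world" > seed/README.md
git -C seed add README.md
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed original.git       # stand-in for the project's shared repo
git clone -q --bare original.git fork.git   # stand-in for your fork on the server

git clone -q fork.git work                  # your local copy; origin is the fork
cd work
git remote add upstream ../original.git     # second remote: the original project
git fetch -q upstream

git checkout -q -b fix-typo                 # work on a dedicated branch
echo "Hello world" > README.md
git commit -q -a -m "Fix typo"
git push -q origin fix-typo                 # publish the branch to your fork only

git log --oneline upstream/main..fix-typo   # the commits a pull request would propose
```

The branch now exists in the fork but not in the original repo, which is the situation a pull request is meant to resolve.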
## Conclusion
Git is a very powerful tool that underpins DevOps and GitOps in Kubernetes. Although it is used by many, its concepts are sometimes not fully understood, and it is always good to have a quick refresher on the key points to ensure best practices.
This article only describes the happy path of a Git workflow so you can start talking about it. In Monokle, we are working to create tools that allow GitOps and DevOps on Git in Kubernetes and we would love to hear your view on it.
Visit us to learn more about how to [simplify your Kubernetes deployment configuration](http://monokle.io/) or join the conversation in our [Discord server](https://discord.com/invite/6zupCZFQbe).
We also invite you to collaborate with us to improve our code in our [GitHub repo](https://github.com/kubeshop).