Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: directed acyclic graph (DAG) #1563

Merged
merged 14 commits into from
Apr 9, 2024
Merged

feat: directed acyclic graph (DAG) #1563

merged 14 commits into from
Apr 9, 2024

Conversation

pd93
Copy link
Member

@pd93 pd93 commented Mar 25, 2024

This PR overhauls how Task reads, parses and merges Taskfiles.

An example project

The following document will get quite technical, so I'm going to use an example to help explain. Imagine a project with the following Taskfile structure:

├── services
│   ├── serviceA
│   │   └── Taskfile.yml
│   ├── serviceB
│   │   └── Taskfile.yml
│   └── serviceC
│       └── Taskfile.yml
├── taskfiles
│   └── utils.yml
└── Taskfile.yml

We have three "services", each with their own Taskfile, and a root Taskfile that includes all of them. This allows us to call task from the project root and get access to all the tasks in each of the services. In addition to this, we also have a taskfiles directory with a utils.yml file that contains some common tasks that are shared between the services and the tasks in the root file.

This is a fairly typical, if not simple, project structure for a Taskfile-based project and on the surface, Task handles it pretty well. However, as we are about to see, under the hood, Task does not process this very efficiently, and when you add more services and more shared utilities, or run on hardware with limited processing capability, the inefficiencies can become problematic.

The current process

For the sake of brevity, I'm not going to write up the contents of each file. Instead, I am going to illustrate the includes by drawing a graph:

flowchart TD
    R(Taskfile.yml)
    A(services/serviceA.yml)
    B(services/serviceB.yml)
    C(services/serviceC.yml)
    U(taskfiles/utils.yml)
    R --> A
    R --> B
    R --> C
    R --> U
    A --> U
    B --> U
    C --> U
Loading

You can see that our root Taskfile (Taskfile.yml) is including each of the service Taskfiles and the utilities Taskfile. Each of the service Taskfiles is then also including the utilities Taskfile. The total setup consist of 5 files.

When Task is called, the first thing it does it look for the root Taskfile (aka. the "entrypoint"). The contents of this files are then read into memory and parsed using the taskfile/ast package. So far, so good.

Once the entrypoint has been parsed, we can now start to evaluate any includes that may be specified in the file. In this example, we have 4 includes:

  • services/serviceA.yml
  • services/serviceB.yml
  • services/serviceC.yml
  • taskfiles/utils.yml

Now, for each of these taskfiles, we will recursively repeat the process of reading the files into memory and parsing them. Once all the includes have been followed, we then unwind the stack and merge the ASTs together. This results in one giant Taskfile the contains all the tasks from all Taskfiles in the project.

If you haven't spotted it yet, there is a major problem with this approach. Let's jump into the code and add a log line at the beginning of each call to the function that reads a Taskfile that print the name of the file being read. Here are the results:

Taskfile.yml
services/service1/Taskfile.yml
taskfiles/utils.yml
services/service2/Taskfile.yml
taskfiles/utils.yml
services/service3/Taskfile.yml
taskfiles/utils.yml
taskfiles/utils.yml

As you can see, despite only having 5 files in the project, Task has read a file 8 times. The taskfiles/utils.yml file has been read 4 times! The number of times a file is read will correspond directly to the total number of includes in all your Taskfiles combined (or the number of arrow heads in the graph diagram) +1 for the entrypoint.

Analysing the problem

There are many simple solutions to the problem described above. For example, we could just add a simple file cache so that a file is not read multiple times. However, I think that the problems with the current approach go deeper than just the inefficiency of reading files multiple times.

I briefly mentioned that the ASTs are merged together before any tasks are run. There are a few disadvantages to this approach:

  • Merging is very memory intensive and involves deep copying structures between Taskfiles.
  • The code is complex and difficult to maintain.
    • This increases the chance of bugs.
    • and increases the barrier to entry for contributors.
  • You lose information about the location of data when merging.
    • The includes the scopes of variables and tasks.
    • Currently, we attempt to save some of this information by storing it in the merged Taskfile, but this just adds to the complexity/maintenance cost of the code.
  • Most of the time, you don't actually need the entire Taskfile tree to be read.
    • Often, we just want to run a Task in the root Taskfile. Why bother reading and merging all the other Taskfiles?

It doesn't matter how good our file caching is if we end up calling a merge method for each and every include. Finding a way to store the entire Taskfile tree in memory without merging them together would be a much better solution.

A new approach

Taskfiles are no different to other programming languages or config-based tools that have includes or imports. We usually refer to the structures that store dependencies as graphs (as-per the example diagram). Specifically, these are dependency graphs, which are a type of directed acyclic graph (or DAG for short).

DAGs have a number of properties that make them very useful:

  • As the name suggests, they are acyclic, which means we get cycle prevention for free. This solves issues like this which occur because of our naive approach to cycle detection.
  • It gives us file caching for free. A DAG allows us to store/cache the AST of a files once it has been read. Since the key of the node in the graph is the file's location, this will be retrieved next time its needed rather than being read from the filesystem again.
  • DAGs are a well understood data structure and there are many algorithms and libraries available for working with them.
  • They allow us to easily output and visualise the Taskfile tree for debugging or illustration purposes.
  • Most importantly, they allow us to store the entire Taskfile tree in memory without merging them together. This means that we can solve all of the issues mentioned in the previous section.

This PR

An important point to make - This PR does not remove AST merging. The DAG implementation has taken a lot of work to complete. The first DAG changes were made by me nearly a year ago and I have been slowly cherry-picking refactors and improvements from the dag branch into main. This PR represents the final piece of work to bring the DAG implementation to release.

However, this is still just a stepping stone. Currently, the DAG is still resolved into a single AST before tasks are run. The next step would be to refactor the code that fetches/compiles tasks to run off the DAG instead of the merged AST. From here, there are a number of things that we can do...

Future work

This work enables or makes easier a number of future improvements:

@pd93 pd93 marked this pull request as ready for review March 25, 2024 21:13
@pd93 pd93 requested a review from andreynering March 25, 2024 21:14
Copy link
Member

@andreynering andreynering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it look a lot of time. Thanks for your perseverance! This is going to be a great improvement. 👏 👏 👏

@pd93 pd93 merged commit 4024b4f into main Apr 9, 2024
13 checks passed
@pd93 pd93 deleted the dag branch April 9, 2024 11:37
pd93 added a commit that referenced this pull request Apr 9, 2024
@marco-m
Copy link
Contributor

marco-m commented May 9, 2024

@pd93 kudos, great work!

@ReillyBrogan
Copy link
Contributor

Is #852 resolved with this?

@pd93
Copy link
Member Author

pd93 commented May 13, 2024

Is #852 resolved with this?

@ReillyBrogan Unfortunately not. I've responded in #852 to keep everything together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants