Distribution
Learn how to efficiently distribute workflow nodes across multiple machines using splitter nodes and batch output patterns.
Now that we’ve covered how machines work, let’s look at how to distribute runs across multiple machines.
There are two main ways to distribute nodes inside a workflow so they run at maximum capacity:
- Using Single Splitter Node
- Using batch output pattern
Using Single Splitter Node
A splitter node splits its input data into multiple strings. Let’s say you have a file called to-distribute.txt that contains the following data:
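The original file isn’t reproduced here; as an illustration, a three-domain list (placeholder domain names, one per line) might look like this:

```shell
# Hypothetical contents for to-distribute.txt; the tutorial's splitter fans
# out three times, so we use three placeholder domains, one per line.
printf 'app.example.com\napi.example.com\nmail.example.com\n' > to-distribute.txt
cat to-distribute.txt
```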
And you want to distribute these domains to multiple machines and to different tools. You can use a splitter node to split the data into multiple strings and then connect these strings to different nodes (tools).
Because splitter nodes can overwhelm both the infrastructure you are targeting and the platform itself, you should primarily use file inputs, which are more efficient and easier to manage.
Before we look at the second way of distributing nodes, let’s see how to convert a distributed string back into a file so it can be used as a file input.
Here, we used a node called string-to-file, which is a very basic one: it converts the string back into a file and helps other tools continue the distribution. In our case, both string-to-file and nuclei will execute three times, once for each of the domains in the to-distribute.txt file.
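The string-to-file step is essentially a one-liner. A minimal sketch of what it does, with an assumed variable name and output path rather than the node’s actual internals:

```shell
# Sketch of a string-to-file style step: take the single string the splitter
# emitted (hypothetical variable below) and write it to a file so downstream
# tools with file inputs can consume it.
SPLITTER_STRING="app.example.com"   # one item from the splitter (illustrative)
mkdir -p out
printf '%s\n' "$SPLITTER_STRING" > out/output.txt
cat out/output.txt
```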
Nuclei produces a folder output (you can check more on how inputs and outputs work in the inputs and outputs tutorial). Let’s say we want to stop at the nuclei results and merge everything. We can do that by connecting the recursively-cat-all script (or any script) to the nuclei output, that is, by connecting the custom-script node to the folder output.
This script will go through all of the folders inside its input, such as in/nuclei-1/1/output.txt, in/nuclei-1/2/output.txt, and in/nuclei-1/3/output.txt, and merge them into one file, out/output.txt.
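A recursively-cat-all style merge can be sketched as follows. The folder layout mirrors the in/nuclei-1/N/output.txt structure above; the sample findings are made up:

```shell
# Recreate the example folder structure with dummy results.
mkdir -p in/nuclei-1/1 in/nuclei-1/2 in/nuclei-1/3 out
echo "finding-1" > in/nuclei-1/1/output.txt
echo "finding-2" > in/nuclei-1/2/output.txt
echo "finding-3" > in/nuclei-1/3/output.txt

# Walk every folder under in/ and concatenate each output.txt into one file.
find in -type f -name 'output.txt' | sort | xargs cat > out/output.txt
cat out/output.txt
```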
If you look at the recursively-cat-all script, you can see that the visualization showing the distribution is gone. This means that the splitter chain has stopped and that this script received all of the results in the file and folder structure mentioned above.
Using batch output pattern
Now, let’s take a look at a more complex pattern. This pattern is used when you have a lot of data to distribute and you want to distribute it more efficiently.
This pattern is almost always tailored to tools that have file inputs, where you want to distribute the inputs in batches.
Let’s use a bigger file as an example:
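The file itself isn’t reproduced here; as a stand-in, nine placeholder domains (the filename is hypothetical) might look like this:

```shell
# Hypothetical stand-in for the bigger input file: nine placeholder domains.
printf 'a.example.com\nb.example.com\nc.example.com\nd.example.com\ne.example.com\nf.example.com\ng.example.com\nh.example.com\ni.example.com\n' > bigger-list.txt
wc -l < bigger-list.txt
```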
In this file, we have 9 domains. We want to distribute these domains to multiple machines and different tools. We can use the batch output pattern to split the data into multiple files and then connect these files to different nodes (tools).
Our fleet has 3 machines, and we want to distribute these domains across all 3. We can use the batch pattern to split the data into 3 files, each containing 3 domains. This pattern consists of 3 main nodes:
generate-line-batches
This node calculates the number of lines in the input file, to be used by the batch-output node. It generates a file describing the line batches derived from the input file’s length.
batch-output
This node splits the input file into multiple files, each containing a specified number of lines.
And of course, for this we will also need the file-splitter node.
This is how the pattern looks:
If we take a closer look at the generate-line-batches script, we can see that it counts the lines in the input file and then creates a file describing the batches. If you are not familiar with bash scripting, don’t worry: the only variable you will need to change is BATCH_SIZE, which is the number of input lines that go into each batch.
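The actual script isn’t reproduced here, but its logic can be sketched like this, assuming illustrative file names and a from-to range format:

```shell
# Sketch of generate-line-batches style logic: emit one "from-to" line range
# per batch. BATCH_SIZE is the only knob you would normally change.
BATCH_SIZE=3

# Self-contained sample input: nine placeholder lines.
printf 'a\nb\nc\nd\ne\nf\ng\nh\ni\n' > input.txt

TOTAL=$(wc -l < input.txt)
START=1
: > batches.txt
while [ "$START" -le "$TOTAL" ]; do
  END=$((START + BATCH_SIZE - 1))
  [ "$END" -gt "$TOTAL" ] && END=$TOTAL
  echo "${START}-${END}" >> batches.txt
  START=$((END + 1))
done
cat batches.txt
```

With 9 input lines and BATCH_SIZE=3 this produces the three ranges 1-3, 4-6, and 7-9, one per batch.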
Notice also how the initial list is being passed to the batch-output node. So even if you don’t understand the script, you can see that it receives:
- The entire list of domains
- The from-to line ranges for each batch, calculated by the generate-line-batches script
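Conceptually, once a node has the full list and one from-to range, extracting a batch is just a line-range slice. A sketch with illustrative file names:

```shell
# Sketch of batch-output style slicing: pull one "from-to" range out of the
# full list. In the workflow the range comes from generate-line-batches.
printf 'a\nb\nc\nd\ne\nf\ng\nh\ni\n' > input.txt
RANGE="4-6"          # illustrative range for the second batch
FROM=${RANGE%-*}     # text before the dash
TO=${RANGE#*-}       # text after the dash
sed -n "${FROM},${TO}p" input.txt > batch.txt
cat batch.txt
```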
Now we can connect other tools; in our example, we will use the nuclei tool to scan these domains.
Continuing the distribution
Let’s take a much bigger list of subdomains, for example 10000. We want to scan for web servers, run a nuclei scan, and then merge all of the results.
What we need to do:
- Set the BATCH_SIZE to 100 in the generate-line-batches script. In this case, we will have 100 domains in each batch, and httpx and nuclei will each execute 100 times, which is less than the current splitter limit.
- Connect httpx to the batch-output output, and then connect nuclei to the batch-output output.
- Connect the recursively-cat-all script to the nuclei output, and then connect the custom-script to the folder output.
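The arithmetic in the steps above can be sanity-checked with a ceiling division (numbers taken from this example):

```shell
# With 10000 subdomains and BATCH_SIZE=100, ceiling division gives the number
# of batches, which is how many times each connected tool executes.
TOTAL=10000
BATCH_SIZE=100
BATCHES=$(( (TOTAL + BATCH_SIZE - 1) / BATCH_SIZE ))
echo "$BATCHES"
```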
Let’s do even more: crawl all of the web servers found by httpx, and then create two separate files in the recursively-cat-all script, one for nuclei and one for katana.
When we connect katana to httpx, it will also execute 100 times, just like nuclei will.
Let’s now change the label of the recursively-cat-all script to merge-all-results, and change the script so we end up with two files:
- nuclei-results.txt
- katana-results.txt
When you click on the node, you can see what the folder structure looks like. We have two folders here with this structure:
So inside our new merge-all-results script we will have:
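The exact script isn’t shown here, but a merge-all-results style script separating the two tools’ outputs could look like this; the folder names follow the in/nuclei-*/N and in/katana-*/N layout, and the sample contents are made up:

```shell
# Recreate a tiny version of the two-folder structure with dummy results.
mkdir -p in/nuclei-1/1 in/katana-1/1 out
echo "nuclei-finding" > in/nuclei-1/1/output.txt
echo "katana-url"     > in/katana-1/1/output.txt

# Merge each tool's outputs into its own results file.
find in -path '*nuclei*' -name 'output.txt' | sort | xargs cat > out/nuclei-results.txt
find in -path '*katana*' -name 'output.txt' | sort | xargs cat > out/katana-results.txt
cat out/nuclei-results.txt out/katana-results.txt
```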
Let’s take a look at how it looks in a workflow.
We hope this tutorial clarifies the mechanics of workflow distribution and what makes the Trickest platform hyperscalable. Imagine running these workflows on hundreds of machines. That’s what we do for you, so you can focus on the results instead of infrastructure management.