Now that we’ve learned how machines work, we can learn how to distribute runs across multiple machines.

There are two main ways to distribute nodes inside a workflow so they run at maximum capacity:

  • Using Single Splitter Node
  • Using batch output pattern

Using Single Splitter Node

A splitter node is a node that splits the input data into multiple strings. Let’s say you have a file called to-distribute.txt that contains the following data:

example.com
test.com
trickest.com

And you want to distribute these domains across multiple machines and to different tools. You can use a splitter node to split the data into multiple strings and then connect those strings to different nodes (tools).

While the nodes are being distributed, the platform will provide a visualization showing that multiple nodes are running in parallel.

Because splitter nodes can, in various cases, overwhelm both the infrastructure you are targeting and the platform itself, you should primarily use file inputs, which are more efficient and easier to manage.

Before we look at the second way of distributing nodes, let’s see how to convert a distributed string back into a file so it can be used as a file input.

Here, we used a node called string-to-file. It is a very basic node that converts the string back into a file and helps other tools continue the distribution. In our case, both string-to-file and nuclei will execute three times, once for each of the domains in the to-distribute.txt file.
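If you are curious what such a conversion boils down to, here is a minimal sketch; the variable name string_input is an assumption made for illustration only, not necessarily what the actual node uses:

# Hypothetical sketch: write the incoming string into a file under out/
# so that downstream tools can consume it as a file input.
echo "$string_input" > out/output.txt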

The platform will continue to distribute the nodes until the output of the last node is connected to a folder output. You can read more about how inputs and outputs work in the inputs and outputs tutorial.

Let’s say we want to stop the distribution at the nuclei results and merge everything. We can do that by connecting recursively-cat-all (or any script) to the nuclei output and then connecting that custom script to the folder output.

This script will go through all the folders inside its in directory and merge the files they contain (in/nuclei-1/1/output.txt, in/nuclei-1/2/output.txt, in/nuclei-1/3/output.txt) into one file, out/output.txt.
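As a minimal sketch, such a merge script could boil down to a single command, assuming the distributed results land in the in folder structure described above:

# Concatenate every distributed result file under in/ into a single output file.
find in -type f -exec cat {} + > out/output.txt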

Take a look at the recursively-cat-all script: the visualization showing the distribution is gone. This means the splitter chain has stopped and that this script received all of the results in the file and folder structure mentioned above.

Using batch output pattern

Now, let’s take a look at a more complex pattern. It is used when you have a lot of data and you want to distribute it more efficiently.

This pattern is almost always tailored to tools that take file inputs, where you want to distribute the input in batches.

Let’s use a bigger file as an example:

example.com
uber.com
trickest.com
trickest.io
somesite.com
acme.com
whatever.com
yahoo.com
something.com

In this file, we have 9 domains that we want to distribute across multiple machines and different tools. We can use the batch output pattern to split the data into multiple files and then connect those files to different nodes (tools).

Our fleet has 3 machines, and we want to distribute these domains across them. We can use the batch pattern to split the data into 3 files, each containing 3 domains. This pattern consists of 3 main nodes:

generate-line-batches

This node calculates the line ranges that will be used by the batch-output node. Based on the number of lines in the input file and the chosen batch size, it generates a file with one from,to range per batch.

batch-output

This node splits the input file into multiple files, each containing a specified number of lines (one file per batch).

And of course, for this we will also need to use file-splitter.

This is what the pattern looks like:

If we take a closer look at the generate-line-batches script, we can see that it counts the lines in the input file and then creates a file with the from,to line range for each batch.

BATCH_SIZE=3

# Merge all input files into a single list and count its lines
find in -type f -exec cat {} + > /tmp/merged.txt
FILE_SIZE=$(grep -c "" /tmp/merged.txt)

# A batch can never be bigger than the file itself
if [ "$BATCH_SIZE" -gt "$FILE_SIZE" ]; then
  BATCH_SIZE="$FILE_SIZE"
fi

if [ "$FILE_SIZE" -eq 1 ] || [ "$FILE_SIZE" -eq 0 ]; then
  echo "1,1" | tee out/output.txt
else
  # Print one "from,to" line range per batch
  for ((i=1;i<=FILE_SIZE;i+=BATCH_SIZE))
  do
    END=$((i+BATCH_SIZE-1))
    # Cap the last range at the end of the file
    if [ "$END" -gt "$FILE_SIZE" ]; then
      END="$FILE_SIZE"
    fi
    echo "$i,$END"
  done | tee out/output.txt
fi

If you are not familiar with bash scripting, don’t worry; the only variable you need to change is BATCH_SIZE, which is the number of lines from the input file that will go into each batch.
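For the 9-domain example above, with BATCH_SIZE=3 the script writes three ranges into out/output.txt:

1,3
4,6
7,9

Each of these lines then drives one batch of the distribution.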

Notice also how the initial list is being passed to the batch-output node. So even if you don’t understand the script, you can see that this node receives two inputs (a rough sketch of how they could be used follows the list below):

  1. The entire list of domains
  2. The from,to line ranges for each batch, calculated by the generate-line-batches script
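To make this concrete, here is a rough, hypothetical sketch of what a batch-output style step could do with those two inputs; the variable name range is an assumption for illustration only, not necessarily the node’s real interface:

# Hypothetical sketch: extract one batch of lines from the full list.
# $range is assumed to hold a single "from,to" pair produced by generate-line-batches, e.g. "4,6".
find in -type f -exec cat {} + > /tmp/merged.txt   # the entire list of domains
FROM="${range%,*}"                                 # part before the comma, e.g. 4
TO="${range#*,}"                                   # part after the comma, e.g. 6
sed -n "${FROM},${TO}p" /tmp/merged.txt > out/output.txt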

Now we can connect other tools; in our example, we will use the nuclei tool to scan these domains.

When using the batch output pattern, keep in mind that the number of batches generated should not exceed 4000.

Continuing the distribution

Let’s take a much bigger list of subdomains, for example 10,000. We want to scan them for web servers, run a nuclei scan, and then merge all of the results.

What we need to do:

  1. Set BATCH_SIZE to 100 in the generate-line-batches script. In this case, we will have 100 domains in each batch, and httpx and nuclei will each execute 100 times, which is below the current splitter limit.
  2. Connect httpx to the batch-output output, and then connect nuclei to the batch-output output.
  3. Connect the recursively-cat-all script to the nuclei output, and then connect that custom script to the folder output.

Let’s do even more: crawl all of the web servers found by httpx, and then create two separate files in the recursively-cat-all script, one for nuclei and one for katana.

When we connect katana to httpx, it will also execute 100 times, just like nuclei.

Let’s now change the label of the recursively-cat-all script to merge-all-results and change the script so we can have two files:

  • nuclei-results.txt
  • katana-results.txt

When you click on the node, you can see what the folder structure looks like.

Here we have two folders with this structure:

in
  nuclei-1
    1/output.txt
    2/output.txt
    3/output.txt
    ...
    10/output.txt
  katana-1
    1/output.txt
    2/output.txt
    3/output.txt
    ... 
    10/output.txt

So inside our new merge-all-results script we will have:

cat in/nuclei-1/*/output.txt > out/nuclei-results.txt
cat in/katana-1/*/output.txt > out/katana-results.txt

Let’s take a look at how this looks in a workflow.

We hope this tutorial clarifies the mechanics of workflow distribution and what makes the Trickest platform hyperscalable. Imagine running these workflows on hundreds of machines. That’s what we do for you, so you can focus on the results instead of infrastructure management.