doc version: 0.1.0
doc date: 28.3.2022
The current system consists of Argo Workflows for workflow orchestration and Rancher for node spawning.
While Argo has served us well and worked as expected, it is too complicated a system for our, at the moment, basic needs. Beyond that, implementing new features is very hard whenever they turn out to be an edge case for the Argo maintenance team.
Rancher is a cluster orchestrator, and the nature of our work does not correlate directly with its goals. Multiple issues were noted while using it, such as:
- Node behavior is difficult to predict because of the abstraction wrapped around the underlying node management system (EKS autoscaling groups), which itself provisions the underlying K8S clusters; this produces a variety of issues.
- Increased system complexity, in the sense that we now have to manage a local Rancher cluster as well as the rest of the worker clusters.
- The cost of the infrastructure needed to manage these clusters is significant, and it is not scalable because of the nodes Rancher needs in order to maintain the underlying clusters.
- Isolation needs force us to run multiple Argo Workflows servers, one per namespace, which increases the number of worker controller nodes, most of which just sit idle.
- General complexity is probably the worst issue of all: this system is far too complex for the use case we actually have.
To tackle the issues above, the system described below is proposed; it is expected to:
- Reduce system complexity by removing the K8S worker clusters and running on independent nodes instead.
- Reduce infrastructure cost.
- Make it easier to implement features in the workflow system, since we would be the ones maintaining it.
- Provide an overall more stable system, as fewer moving parts are involved compared to running a K8S worker cluster.
TWE is imagined as a pluggable server-agent architecture. It would consist of the following services:
- Server (OS) - a stateless workflow orchestrator and artifact manager that would be part of our existing platform K8S cluster.
- Agent (OS) - present on the executor node to manage the running of containers and, later, artifact management.
- NodeController - responsible for node lifecycle management.
We would aim to make the server as flexible as possible and put interfaces in front of as many components as we can, basically making it pluggable. By doing this we keep control over the open source (OS) project we maintain and over the enterprise features we offer to our clients, among other things.
Communication would initially be implemented over HTTP using a REST methodology, with the intention of moving to RPC using GraphQL once the project matures. To ensure this communication is safe, we would use HTTPS, limit node communication (ideally via cloud provider mechanisms such as security groups on AWS), and implement per-job RBAC on the server side.
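As a rough sketch of the per-job RBAC idea (all names here are hypothetical, not an existing API), the server could keep a table of per-job grants and consult it before serving any job-scoped request:

```python
# Hypothetical sketch of per-job RBAC on the server side.
# A real implementation would back this with the server's database.

ALLOWED_ACTIONS = {"read", "write", "cancel"}

class JobRBAC:
    def __init__(self):
        # (principal, job_id) -> set of allowed actions
        self._grants = {}

    def grant(self, principal, job_id, action):
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self._grants.setdefault((principal, job_id), set()).add(action)

    def is_allowed(self, principal, job_id, action):
        return action in self._grants.get((principal, job_id), set())

rbac = JobRBAC()
rbac.grant("agent-1", "job-42", "write")
print(rbac.is_allowed("agent-1", "job-42", "write"))  # True
print(rbac.is_allowed("agent-2", "job-42", "write"))  # False
```

The key property is that a grant is scoped to a single job, so a compromised agent cannot touch other jobs' resources.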
Artifacts would mostly be managed through the server in order to limit access to the keys/secrets used to communicate with the different storage interfaces, such as S3 or Nexus artifact storage. We also have the option of configuring the agent to store or pull artifacts directly on the node, which would speed up transfers of large artifacts. We could even have the server generate keys scoped to specific artifacts and send them to the agent, achieving the same effect with more security.
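One way the scoped-key idea could work, sketched with a self-signed HMAC token instead of a real S3 pre-signed URL (the token format and function names are assumptions):

```python
# Hypothetical sketch: the server signs a token that grants access to one
# artifact path until an expiry time; the agent presents it when pulling
# the artifact directly. Real deployments could use S3 pre-signed URLs.
import hashlib, hmac, time

SECRET = b"server-side-secret"  # stays on the server, never shipped to agents

def make_artifact_token(artifact_path, expires_at, secret=SECRET):
    msg = f"{artifact_path}|{expires_at}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{artifact_path}|{expires_at}|{sig}"

def verify_artifact_token(token, requested_path, now, secret=SECRET):
    path, expires_at, sig = token.rsplit("|", 2)
    expected = hmac.new(secret, f"{path}|{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # tampered token
    return path == requested_path and now < int(expires_at)

token = make_artifact_token("jobs/42/output.tar.gz", int(time.time()) + 600)
print(verify_artifact_token(token, "jobs/42/output.tar.gz", int(time.time())))  # True
print(verify_artifact_token(token, "jobs/42/other.bin", int(time.time())))      # False
```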
Node management would ideally be handled by a separate service utilising Pulumi. Besides creating nodes, it would be in charge of keeping node information on usage, ownership, and reservations. Backend services (Hive API) would communicate with it and request nodes of type n for the client with identifier x.
This component is expected to be a stateless service, storing its state in a relational DB like PSQL. Its main responsibilities are:
- Assigning jobs to agents.
- Managing artifacts that jobs create.
- Managing the agent pool.
- Resolving workflow template variables.
- Orchestrating the whole workflow process from a logging/state perspective.
The engineering effort would consist of building the DAG system, template resolution, the logic behind workflow orchestration, the REST server implementation, and state management. Of those efforts, the DAG and template resolution mechanisms pose the most difficulty.
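A minimal sketch of the two hardest pieces, template resolution and DAG ordering, assuming a `{{var}}` placeholder syntax and a step-to-dependencies mapping (both are illustrative choices, not a fixed design):

```python
# Sketch of template resolution and dependency-based step ordering.
import re
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def resolve_template(text, variables):
    # Replace {{name}} placeholders; fail loudly on unknown variables.
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"unresolved template variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{(\w+)\}\}", substitute, text)

def execution_order(steps):
    # steps: {step_name: [names of steps it depends on]}
    return list(TopologicalSorter(steps).static_order())

steps = {"fetch": [], "scan": ["fetch"], "report": ["scan", "fetch"]}
print(execution_order(steps))  # e.g. ['fetch', 'scan', 'report']
print(resolve_template("scan {{target}}", {"target": "example.com"}))
```

`TopologicalSorter` also raises `CycleError` on cyclic dependencies, which gives cycle detection in workflow templates for free.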
This will enable us to use multiple different backends to store artifacts, which will come in very useful with on-prem clients who also need storage within their own domain. Here we want to cover basic functionality like CRUD operations over artifacts, plus some more complex features in the future, including generating access keys/secrets for specific artifact access that would later be forwarded to the agent for direct artifact importing.
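The pluggable backend idea could be expressed as a small interface that each storage backend (local, S3, Nexus) implements; the sketch below shows only a hypothetical in-memory local backend:

```python
# Sketch of a pluggable artifact-storage interface; backend names and
# method signatures are assumptions, not a finalized design.
from abc import ABC, abstractmethod

class ArtifactStore(ABC):
    @abstractmethod
    def put(self, path, data): ...
    @abstractmethod
    def get(self, path): ...
    @abstractmethod
    def delete(self, path): ...
    @abstractmethod
    def list(self, prefix=""): ...

class LocalArtifactStore(ArtifactStore):
    """In-memory stand-in; an S3 or Nexus backend would implement the same interface."""
    def __init__(self):
        self._blobs = {}
    def put(self, path, data):
        self._blobs[path] = data
    def get(self, path):
        return self._blobs[path]
    def delete(self, path):
        self._blobs.pop(path, None)
    def list(self, prefix=""):
        return sorted(p for p in self._blobs if p.startswith(prefix))

store = LocalArtifactStore()
store.put("jobs/42/out.txt", b"hello")
print(store.list("jobs/42/"))  # ['jobs/42/out.txt']
```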
This enables support for multiple node lifecycle management engines, including a local engine. When implementing on-prem, we could leave it to the client to develop their own node interface if needed. The interface should support reserving nodes for jobs, getting agent whereabouts, and so on. It will enable us to use multiple node types: static nodes, which represent what we have at the moment (a node tied to a specific job until it finishes execution), and dynamic nodes, which can serve as ad-hoc execution nodes and can be paired with things like AWS reserved instance capacity or spot instances.
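A sketch of what that node interface could look like, with a local engine standing in for a cloud-backed (e.g. Pulumi-driven) one; all type and method names are assumptions:

```python
# Sketch of a node lifecycle interface; "static" vs "dynamic" mirrors the
# node types described above.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    node_type: str        # "static" or "dynamic"
    job_id: str = None    # static nodes stay tied to one job

class NodeEngine(ABC):
    @abstractmethod
    def reserve(self, job_id, node_type): ...
    @abstractmethod
    def release(self, node_id): ...

class LocalNodeEngine(NodeEngine):
    """In-process stand-in for a cloud engine; on-prem clients could supply their own."""
    def __init__(self):
        self._nodes, self._counter = {}, 0
    def reserve(self, job_id, node_type="static"):
        self._counter += 1
        node = Node(f"node-{self._counter}", node_type, job_id)
        self._nodes[node.node_id] = node
        return node
    def release(self, node_id):
        self._nodes.pop(node_id, None)

engine = LocalNodeEngine()
n = engine.reserve("job-42", "static")
print(n.node_id, n.job_id)  # node-1 job-42
```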
It enables us to keep state in different forms, such as a DB or local storage like SQLite, which is important for the OS initiative. In essence this one is of lesser importance at first, but it will make the OS product much easier to use later.
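For illustration, a state store backed by stdlib SQLite could look roughly like this (the schema and names are hypothetical); a PSQL variant would implement the same methods:

```python
# Sketch of a pluggable state store; the SQLite variant uses only the
# standard library, which matters for a self-contained OS distribution.
import sqlite3

class SqliteStateStore:
    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, status TEXT)")

    def set_status(self, job_id, status):
        # Upsert so repeated status updates overwrite the previous one.
        self._db.execute(
            "INSERT INTO jobs (job_id, status) VALUES (?, ?) "
            "ON CONFLICT(job_id) DO UPDATE SET status = excluded.status",
            (job_id, status))
        self._db.commit()

    def get_status(self, job_id):
        row = self._db.execute(
            "SELECT status FROM jobs WHERE job_id = ?", (job_id,)).fetchone()
        return row[0] if row else None

store = SqliteStateStore()
store.set_status("job-42", "running")
print(store.get_status("job-42"))  # running
```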
We expect the backend services to post the workflow template to the twe-server, collect information like the artifact path and host, stream output from the job, and check the job status.
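That backend-service interaction could be wrapped in a thin client along these lines; the endpoints are hypothetical, and the HTTP layer is injected so the sketch stays self-contained:

```python
# Sketch of a backend-service client for the twe-server. Endpoint paths
# and response shapes are assumptions, not an existing API.
class TweClient:
    def __init__(self, transport):
        # transport: callable(method, path, body) -> dict response,
        # in production an HTTPS call to the twe-server.
        self._transport = transport

    def submit_workflow(self, template):
        return self._transport("POST", "/workflows", template)

    def job_status(self, job_id):
        return self._transport("GET", f"/jobs/{job_id}/status", None)

# Stub transport standing in for HTTPS calls to the server.
def fake_transport(method, path, body):
    if method == "POST" and path == "/workflows":
        return {"job_id": "job-42"}
    if path.endswith("/status"):
        return {"status": "running"}
    return {}

client = TweClient(fake_transport)
job = client.submit_workflow({"steps": {"scan": []}})
print(job["job_id"], client.job_status(job["job_id"])["status"])  # job-42 running
```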
This is the node-side component. The plan is to have it preinstalled on machine images across the cloud providers; we can also implement a self-upgrading mechanism utilizing the server in order to achieve instant upgrades in critical situations. Its main functionality is running containers on the engine, monitoring their execution flow, and collecting and sending artifacts to storage. It will communicate over HTTPS on port n, and we need to make sure to restrict traffic towards our infrastructure/platform network to that single port/protocol combination.
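The agent's job-handling flow described above can be sketched as follows, with stubs standing in for the container engine and artifact uploader (none of these names are a fixed API):

```python
# Sketch of the agent's job flow: run the container, wait for it to
# finish, then ship artifacts. Real implementations would wrap the
# container engine and the artifact storage behind these callables.
def run_job(job, engine, upload_artifact):
    container_id = engine["run"](job["image"], job["command"])
    exit_code = engine["wait"](container_id)          # monitor execution
    for path in job.get("artifacts", []):
        upload_artifact(job["job_id"], path)          # send results to storage
    return {"job_id": job["job_id"], "exit_code": exit_code}

uploaded = []
stub_engine = {"run": lambda image, cmd: "c1", "wait": lambda cid: 0}
result = run_job(
    {"job_id": "job-42", "image": "alpine", "command": ["true"],
     "artifacts": ["out.txt"]},
    stub_engine,
    lambda job_id, path: uploaded.append((job_id, path)),
)
print(result["exit_code"], uploaded)  # 0 [('job-42', 'out.txt')]
```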
We want to provide support for running on container runtimes like CRI-O; here we need to make sure that each runtime provides the necessary abstractions. Needless to say, Docker Engine will be enough to begin with. We also need to make sure that we are able to capture the stderr of each container and store it on the host machine.
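As a stand-in sketch of the stderr-capture requirement, using a plain subprocess in place of a container (with Docker Engine this would instead stream the container's log output to the same host-side file):

```python
# Sketch: run a workload, capture its stderr, and persist it on the host.
import os, subprocess, sys, tempfile

def run_and_capture_stderr(argv, log_dir):
    proc = subprocess.run(argv, capture_output=True, text=True)
    log_path = os.path.join(log_dir, "stderr.log")
    with open(log_path, "w") as f:
        f.write(proc.stderr)       # stored on the host machine
    return proc.returncode, log_path

with tempfile.TemporaryDirectory() as d:
    code, log = run_and_capture_stderr(
        [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n')"], d)
    content = open(log).read().strip()
    print(code, content)  # 0 boom
```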
Making this project Open Source (OS) is easy due to its modularity: we could offer the community a core product that is modular but ships with the minimum set of modules. The proposed modules are:
- a local module for node lifecycle management
- a local module for artifact storage
- an sqlite module for state storage
- a docker module for the container engine
We could ship some modules, like server.node, with basic capabilities only; the enterprise versions of these would have support for spot instances, dynamic instances, and reserved capacity. In this way we would not risk the quality of the enterprise product.