Roadmap 2

Goals:

Milestone: node pool sizing, node affinity, cloud burn, and operations

Milestone: development productivity

Repository Organization

  • Application platform services

  • Database configuration (schemas, users)

  • Terraform

    • for AWS (CDN, IAM, SQS)

    • for DO (k8s)

Node Affinity

Select an approach and develop rules for assigning pods to appropriate node pools

  1. Microservices -> Microservices node pool

    1. APIs

    2. Consumers

    3. Cron jobs

  2. Cloud Services -> Cloud services node pool

    1. RabbitMQ

    2. Ambassador

    3. Airflow

    4. CKAN (?)

  3. Monitoring Services -> Monitoring node pool

    1. ELK Stack

    2. Prometheus

    3. Grafana
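One way the pool assignments above could be expressed is as a small helper that produces the `affinity` stanza for a pod spec. This is a minimal sketch, not a chosen approach: the category names, pool names, and the helper `node_affinity_for` are hypothetical, and the label key is assumed to be the one DigitalOcean Kubernetes applies to nodes (worth confirming with `kubectl get nodes --show-labels`).

```python
# Hypothetical category -> node pool names, mirroring the list above.
POOL_FOR_CATEGORY = {
    "microservices": "microservices",    # APIs, consumers, cron jobs
    "cloud-services": "cloud-services",  # RabbitMQ, Ambassador, Airflow, CKAN (?)
    "monitoring": "monitoring",          # ELK stack, Prometheus, Grafana
}

# Assumption: nodes carry their pool name under this label key.
POOL_LABEL_KEY = "doks.digitalocean.com/node-pool"


def node_affinity_for(category: str) -> dict:
    """Return a value for the `affinity` field of a pod spec.

    Uses a hard requirement, so pods stay pending if the pool has no room.
    """
    pool = POOL_FOR_CATEGORY[category]
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": POOL_LABEL_KEY,
                                "operator": "In",
                                "values": [pool],
                            }
                        ]
                    }
                ]
            }
        }
    }
```

A softer variant would use `preferredDuringSchedulingIgnoredDuringExecution` so that pods can spill onto other pools when the preferred pool is full; which of the two fits is part of the approach still to be selected.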

Efficient Operation

Identify automated processes for managing resources (such as indices) and for cleaning them up
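As an illustration of what automated index cleanup could look like, the sketch below selects indices that are older than a retention window. It assumes, hypothetically, that index names end in a date such as `logs-2024.01.15`; the function name and retention value are also placeholders. Elasticsearch's built-in index lifecycle management may cover this without custom code.

```python
import re
from datetime import date, datetime, timedelta

# Assumption: dated indices end in "-YYYY.MM.DD".
INDEX_DATE = re.compile(r"-(\d{4}\.\d{2}\.\d{2})$")


def expired_indices(names, today, retention_days):
    """Return the index names whose date suffix is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        match = INDEX_DATE.search(name)
        if not match:
            continue  # leave indices without a date suffix alone
        created = datetime.strptime(match.group(1), "%Y.%m.%d").date()
        if created < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    names = ["logs-2024.01.01", "logs-2024.01.20", ".kibana"]
    print(expired_indices(names, today=date(2024, 1, 31), retention_days=14))
```

The function only decides what is eligible for deletion; the actual delete call and any dry-run safeguard are left out on purpose.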

Access

Develop an approach to easily provide access to resources for developer productivity.

  1. Types of access

    1. DO Control Panel (dev)

    2. k8s Control Panel (dev)

    3. User specific databases

    4. Development database access

    5. IAM (S3 buckets)

    6. RabbitMQ development queues

    7. ELK stack

      1. Dev, test: available to all developers who are part of the GitHub organization

      2. Production: only certain developers, when required

    8. Grafana -> Solved via GitHub

      1. All 3 environments

    9. Etc

  2. Access Request / Granting Process

    1. Elevated Access team for people who are granted access to restricted resources

    2. Issue tracking

Developer Productivity

Environments

  • Develop an improved approach to provisioning development environments with multiple microservices.

Grafana dashboards

Documentation on how to use the observability tools (Grafana, ELK stack)

Service kits

Have the pipeline check k8s YAML in the PR workflow for validity and disallowed settings

  • No imagePullPolicy: Always

  • Require node affinity

  • Make sure kustomize runs successfully

  • Lint for template manifests that have not been populated with deployment values
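The checks listed above could be sketched roughly as follows. The function takes a manifest that has already been parsed into a dict (for example by a YAML loader) plus the raw text for the placeholder lint. The helper names and the placeholder syntax (`${...}` or `{{ ... }}`) are assumptions for illustration; the `kustomize build` check is not covered here because it means running the CLI in the pipeline.

```python
import re

# Assumption: unpopulated template values look like "${NAME}" or "{{ name }}".
PLACEHOLDER = re.compile(r"\$\{[^}]+\}|\{\{[^}]+\}\}")


def pod_spec(manifest):
    """Return the pod spec for common workload kinds, or None for other kinds."""
    kind = manifest.get("kind")
    spec = manifest.get("spec") or {}
    if kind == "CronJob":
        job_spec = (spec.get("jobTemplate") or {}).get("spec") or {}
        return (job_spec.get("template") or {}).get("spec") or {}
    if kind in ("Deployment", "StatefulSet", "DaemonSet", "Job"):
        return (spec.get("template") or {}).get("spec") or {}
    if kind == "Pod":
        return spec
    return None


def check_manifest(manifest, raw_text=""):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    spec = pod_spec(manifest)
    if spec is not None:
        containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
        for container in containers:
            if container.get("imagePullPolicy") == "Always":
                problems.append(
                    f"container {container.get('name')!r}: imagePullPolicy Always is not allowed"
                )
        if "nodeAffinity" not in (spec.get("affinity") or {}):
            problems.append("pod spec has no nodeAffinity")
    for match in PLACEHOLDER.finditer(raw_text):
        problems.append(f"unpopulated template value: {match.group(0)}")
    return problems
```

A PR job would run this over every rendered manifest and fail when any list is non-empty. Note that Kubernetes defaults `imagePullPolicy` to `Always` when the field is omitted and the image tag is `:latest` or missing, so a stricter version of the check might also require pinned tags.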

https://argo-cd.readthedocs.io/en/stable/

Cloud Burn

Observability - resources are expensive

  • Should we filter logs as they are ingested into the ELK stack?
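If filtering at ingestion turns out to be worthwhile, the rule set might look something like the sketch below. The field names (`level`, `path`), the dropped levels, and the noisy paths are all hypothetical; in practice the same logic would more likely live in the log shipper or ingest pipeline configuration than in application code.

```python
# Hypothetical filter rules: which records are not worth storing.
DROP_LEVELS = {"DEBUG", "TRACE"}
NOISY_PATHS = {"/health", "/ready", "/metrics"}


def should_ship(record: dict) -> bool:
    """Return True if the log record should be sent on to the ELK stack."""
    if str(record.get("level", "")).upper() in DROP_LEVELS:
        return False
    if record.get("path") in NOISY_PATHS:
        return False
    return True


def filter_records(records):
    """Keep only the records that pass should_ship, preserving order."""
    return [record for record in records if should_ship(record)]
```

Measuring the share of records a rule set like this would drop, before enabling it, would show whether the storage savings justify the loss of detail.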

Operating Cloud Services

Improve the Airflow DAG deployment strategy

Set up the RabbitMQ admin panel behind a reverse proxy and expose it via a mapping in Ambassador

Microservices Auth

Auth0 - paid

Okta - paid

Keycloak - open source

Application Level Monitoring

Can we capture metrics for application-level activities, such as the number of tokens transferred, new captures ingested, website access, and active organizations in the admin panel?
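Since Prometheus is already on the monitoring list, one option is for each service to expose counters for these activities in the Prometheus text exposition format. The sketch below is a standard-library stand-in to show the shape of the idea; the class `CounterRegistry` and the metric names are hypothetical, and a real service would more likely use a client library such as `prometheus_client`.

```python
from collections import defaultdict


class CounterRegistry:
    """Minimal in-memory counters rendered in Prometheus text exposition format."""

    def __init__(self):
        self._counts = defaultdict(float)

    def inc(self, name, amount=1, **labels):
        # Label values are stored as strings so that keys sort consistently.
        key = (name, tuple(sorted((k, str(v)) for k, v in labels.items())))
        self._counts[key] += amount

    def render(self):
        lines = []
        seen = set()
        for (name, labels), value in sorted(self._counts.items()):
            if name not in seen:
                lines.append(f"# TYPE {name} counter")
                seen.add(name)
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            series = f"{name}{{{label_text}}}" if label_text else name
            lines.append(f"{series} {value:g}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    registry = CounterRegistry()
    registry.inc("tokens_transferred_total", 3)
    registry.inc("captures_ingested_total", organization="org-a")
    print(registry.render(), end="")
```

Counters suit events such as transfers and ingested captures; a value like "active organizations" goes up and down, so it would be a gauge rather than a counter.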

Terraform

  • Protect credentials in encrypted files

  • Separate customizations per environment for each module

  • Collect all Terraform into one directory

  • Create utility scripts

Notes:

Node Affinity

Node Pools

  • We need bigger nodes for monitoring; identify available funds now

DevOps/Developer Productivity process and leader(s)

  • Provisioning resources

  • Initial automation setup

  • Knowledge base tools / ticket

  • Feature diff when deploying services into test / production

  • End to end testing

  • Partial staging - admin, map, wallet

  • Notify the GitHub Action when a deployment fails on the k8s side

  • Airflow engineers: what do they need?

Organize/Standardize Terraform

ELK stack for audit logging (persistence and naming scheme, logging approach)

Additional Services

  • CKAN
