At Causal, we’re constantly pushing the limits of our platform, which calls for creative and smart problem solving. One recent major win was scaling our spreadsheet calculation engine to handle billions of cells. Succeeding in these journeys is not only rewarding and a technical feat, but table stakes for our customers, who require more from the system.
Infrastructure and code go hand in hand in that mission, so DevOps is a critical part of the equation. To date, the team has built out solid foundations to make these goals possible.
Multiple (often many) production deployments per day are guided by a fully automated CI/CD pipeline running largely in CircleCI. After PRs are thoroughly peer reviewed and signed off, changes are deployed to production within minutes of clicking merge. We hit elite marks on the DORA scale. Code is vetted through a variety of test suites, including e2e verification via Cypress. Developer onboarding is super fast, thanks to a CLI that does a lot of the heavy lifting. Production performance and issues are monitored by Sentry, GCP, UptimeBot, and more.
We’re at the point of needing to take our infrastructure to the next level to support the demands of our customers and our upcoming roadmap. Top of mind on that wish list:
Building out the system with Infrastructure as Code (IaC), likely with Terraform
- Why? Right now, environments are configured and updated manually, which isn’t viable as the complexity of the stack grows, and which blocks several of the needs listed below.
PR Review apps / Ephemeral environments - with cloud spend efficiency
- Why? We take testing seriously, especially from a customer/UX standpoint. It’s easy for an engineer to spin up and test locally, but not for non-technical groups. A shared staging environment helps, but it requires coordination and time to oversee the process.
Isolating expensive requests to dedicated services and endpoints
- Why? Larger workloads, by request type and often by the scale of data being operated on, can hold up smaller requests that should otherwise be cheap and quick to compute.
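One way to sketch this idea: classify incoming requests by an estimated cost and route the heavy ones to a dedicated pool so cheap requests never queue behind them. The paths, threshold, and pool names below are illustrative assumptions, not our actual routing rules.

```typescript
// Hypothetical routing sketch: expensive requests go to a dedicated service
// pool; everything else stays on the default pool. Threshold and paths are
// made-up examples.

interface IncomingRequest {
  path: string;
  cellCount: number; // rough size of the model being operated on
}

const HEAVY_CELL_THRESHOLD = 1_000_000;

// Endpoints that are expensive regardless of payload size (illustrative).
const HEAVY_PATHS = new Set(["/api/recalculate", "/api/export"]);

function targetService(req: IncomingRequest): "heavy-pool" | "default-pool" {
  if (HEAVY_PATHS.has(req.path) || req.cellCount > HEAVY_CELL_THRESHOLD) {
    return "heavy-pool";
  }
  return "default-pool";
}
```

In practice this decision would live at the load balancer or API gateway, but the classification logic is the same.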
Rate limiting requests by customer and workload expense
- Why? Similar to the above: a single request that requires a billion operations shouldn’t be rate-limited the same as one that only necessitates ten.
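A minimal sketch of what cost-weighted limiting could look like, assuming a per-customer token bucket where a request consumes tokens proportional to its estimated operation count. The capacity and refill rate below are illustrative, not real numbers.

```typescript
// Cost-weighted token bucket sketch: a billion-op request drains the budget
// far faster than a ten-op one. Capacity/refill values are assumptions.

class CostBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,        // max tokens a customer can bank
    private refillPerSecond: number  // tokens restored per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;
  }

  // Returns true if the request is admitted; cost ≈ estimated operations.
  tryConsume(cost: number): boolean {
    this.refill();
    if (cost > this.tokens) return false;
    this.tokens -= cost;
    return true;
  }
}
```

A production version would keep buckets in shared storage (e.g. Redis) so limits hold across instances.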
Offloading CPU intensive calculations to workers/queues/services
- Why? Particularly in the Node world, we don’t want to block the main thread and hold up all responses; instead, we offload the responsibility to a language better suited to CPU crunching (like Go). In other words, let Node thrive in its async, I/O-heavy realm.
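The shape of that pattern, in miniature: the request handler only does cheap bookkeeping (enqueue and return), while a separate worker loop drains the queue and does the heavy crunching. This in-memory queue is purely illustrative; a real setup would use a durable queue with a dedicated worker service, possibly written in Go.

```typescript
// Illustrative offloading sketch: request path enqueues, worker dequeues.

type Job = { id: number; payload: number[] };

class JobQueue {
  private jobs: Job[] = [];
  private nextId = 1;

  // Called from the request path: cheap, non-blocking bookkeeping only.
  enqueue(payload: number[]): number {
    const id = this.nextId++;
    this.jobs.push({ id, payload });
    return id;
  }

  // Called by the worker loop: pulls one job, or undefined if empty.
  dequeue(): Job | undefined {
    return this.jobs.shift();
  }
}

// The CPU-heavy part lives with the worker, never the request handler.
function computeSum(payload: number[]): number {
  return payload.reduce((acc, n) => acc + n, 0);
}
```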
Anonymizing data for real world replications in lower environments
- Why? We’ve all been there: it’s only easily reproducible in production, but data privacy is paramount.
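A sketch of the core anonymization trick: deterministic hashing, so the same production value always maps to the same pseudonym and relationships between rows survive, while the real PII never leaves production. The field names and salt handling here are illustrative assumptions.

```typescript
import { createHash } from "node:crypto";

// Illustrative anonymization sketch for seeding lower environments.

interface UserRow {
  id: string;
  email: string;
  name: string;
  cellCount: number; // non-sensitive metrics pass through untouched
}

// Same input → same output, so joins across tables still line up.
function pseudonym(value: string, salt: string): string {
  return createHash("sha256").update(salt + value).digest("hex").slice(0, 12);
}

function anonymize(row: UserRow, salt: string): UserRow {
  return {
    id: row.id, // surrogate keys are not PII
    email: `${pseudonym(row.email, salt)}@example.com`,
    name: `user-${pseudonym(row.name, salt)}`,
    cellCount: row.cellCount,
  };
}
```

The salt should be kept out of lower environments so pseudonyms can’t be reversed by re-hashing known values.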
Building out better performance testing infrastructure
- Why? Scale and performance are non-negotiable; we need to be sure of changes before exposing them to customers.
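At its simplest, that kind of check boils down to timing a hot path over many iterations and comparing the mean against a budget. The harness below is a toy sketch; real infrastructure would use a dedicated load-testing tool against production-like environments, and any budget would be calibrated, not guessed.

```typescript
// Toy micro-benchmark sketch: mean milliseconds per iteration of fn.
function benchmark(fn: () => void, iterations: number): number {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6 / iterations; // ns → ms, per iteration
}
```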
More robust alerting and monitoring
- Why? Increasing the signal-to-noise ratio ensures we’re quickly informed of issues before customers are even aware of them.
Scaling Postgres and SQL optimizations
- Why? With Postgres as a core data backer, fetching and caching results optimally will always be an important focus.
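One piece of that picture is a read-through cache in front of the database: hot query results are cached with a TTL so repeated fetches skip Postgres entirely. This in-memory version is a sketch under assumed names; a production version might be backed by Redis with explicit invalidation.

```typescript
// Read-through cache sketch: serve from cache on a hit, fall through to the
// database (here, an injected fetcher) on a miss.

type Fetcher<T> = (key: string) => T;

class QueryCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();
  public misses = 0; // how many times we actually hit the database

  constructor(private fetch: Fetcher<T>, private ttlMs: number) {}

  get(key: string): T {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    this.misses++; // cache miss: go to Postgres
    const value = this.fetch(key);
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```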
Do these problems excite you?! Do you have the experience it takes to lead these initiatives and see them through to completion and beyond? We have a number of open roles, including a DevOps/infrastructure engineer. Drop an email to firstname.lastname@example.org (CTO) if you’re interested in joining our ever-growing, passionate, and fun team!