Site Reliability Engineer

Job description

Our true purpose at Castor

Castor is one of the leading platforms for data collection in medical research. We believe standardizing and reusing datasets is key to overcoming the healthcare challenges of the future.


How we operate

Our main Electronic Data Capture (EDC) application runs on a proven stack consisting of Ubuntu, Nginx, PHP and MySQL. For our cloud installations, we provision these setups with Terraform and use Ansible for server configuration management.


Because we process medical data, we have clients in different regions across the globe, often with specific regulatory constraints around where and how their research data is stored. To meet these customer demands we combine traditional and cloud-based hosting solutions.


Most of our clients prefer to run in Azure, but we also use Google Cloud Platform for Kubernetes hosting of greenfield projects, blob storage for scalable file uploads, and its Key Management Service (KMS) to further secure our data.


For our metrics we’ve begun standardizing on Prometheus and we’re moving towards Loki for log aggregation. We use PagerDuty for alerting, communicate via Slack and host our code on GitHub.


Why we’re growing our team

With our recent expansion have come new challenges, both in how we organize ourselves and in how we manage and scale our infrastructure in the future.


To further these efforts we have formed a Platform team consisting of SRE and Software Engineering, which we are now looking to grow by adding another SRE.


Due to the sensitive nature of medical data, Castor is certified for both ISO 9001 (quality) and ISO/IEC 27001 (information security). We also have to adhere to a number of other regulations, including Good Clinical Practice (GCP) guidelines.


Our goal is to unite these requirements with emerging SRE practices, such as infrastructure as code, to create a well-designed and well-documented system while still remaining flexible to change.


How you will contribute

Our absolute commitment to patient data security and privacy informs our vendor selection: we work with certified datacenter and cloud providers. To achieve real impact in medical research, Castor needs to operate securely around the world.


Historically, our production platform has run on top of managed hosting services. This model doesn’t scale well for our global footprint, which is why we are currently expanding our in-house knowledge and transitioning to Infrastructure-as-a-Service providers.


As a Site Reliability Engineer, you’ll have the ability to shape our operations and continuously deliver a working product. Working very closely with the development teams, you’ll collaborate in supporting and structuring our efforts around automation, observability and security. With your help we plan to scale the Castor platform to the next level.


Some things we worked on recently

Whilst there are many operational challenges as we continue to grow and scale at Castor, our Platform team has made great improvements to a variety of our systems already. To give you some examples of what we achieved last month:


  • Migrated our DNS to AWS Route53
  • Set up automatic documentation pipelines using MkDocs
  • Moved our CI/CD pipelines from Jenkins to CircleCI
  • Built a key-service on AWS Lambda to store disk encryption keys off-site for an otherwise region-local setup


Your background

You have helped run web-facing services under production workloads and have experienced the challenges that come with maintaining and scaling these systems. Making and owning decisions about systems architecture together with your team is something you enjoy and feel comfortable with.

Qualities we’re looking for include:


  • A good grasp of how *NIX systems operate
  • The ability to evaluate and implement best practices for IT operations
  • A working knowledge of both cloud-native and traditional systems architecture and the trade-offs between them
  • Experience with a configuration management framework such as Ansible, Chef, Puppet or SaltStack
  • The ability and desire to work with a wide range of open source technologies
  • A strong privacy and security mindset
  • Experience with some aspects of observability and distributed systems: from monitoring, logging and metrics instrumentation to resilience to failure
  • A good understanding of how relational databases operate
  • Experience with at least one programming or scripting language, preferably Python or Go(lang) 
  • Knowledge that a list of skills and requirements doesn’t mean you have to tick every single box to apply ;)


How we say thank you

At Castor we truly live our core values, believing we can achieve anything with a healthy and happy team. With this in mind, we offer the following benefits:


  • Awesome new office near Amsterdam Amstelstation - the Castor Burrow!
  • A competitive salary plus an 8% annual holiday bonus 
  • An annual company bonus plan, rewarding your efforts to help Castor grow
  • An advantageous allocation in our Employee Stock Option Plan 
  • 36 days of annual leave (including 6 national holidays)
  • Interested in ‘lifelong learning’? You’ll love using our development and training budget
  • We love good food! So, we provide lunch and healthy snacks in the office every day 
  • Want to keep your PJs on? Then work from home one day each week!
  • Flexible approach to working - nobody is tracking your time except you.
  • A new MacBook or Dell laptop, we’re a tech start-up after all ;) 
  • Like to feel zen? You’ll love our daily meditation, in our very own office ‘relax room’
  • How do we care about your wellbeing outside of work? A company subscription to Calm
  • Grow together as a team during our annual company retreat
