Senior Site Reliability Engineer

Qualys, Inc. Foster City, CA 94404

Posted 2 weeks ago

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Description

We are seeking a highly motivated and talented Site Reliability Engineer to work on Qualys' Cloud Platform & Middleware technologies. Working with a team of engineers and architects, you will combine software development and systems engineering skills to build and run scalable, distributed and fault-tolerant systems.

The ideal candidate will write software to optimize day-to-day work through better automation, monitoring, alerting, testing and deployment.

Responsibilities

Co-develop and participate in the full lifecycle development of cloud platform services, from inception and design through deployment, operation and improvement, by applying scientific principles.

Increase the effectiveness, reliability and performance of cloud platform technologies by identifying and measuring key indicators, making changes to the production systems in an automated way and evaluating the results.

Support the cloud platform team before technologies are pushed for production release through activities such as system design, capacity planning, automation of key deployments, building a strategy for production monitoring and alerting, and participating in the testing/verification process.

Ensure that the cloud platform technologies are maintained properly by measuring and monitoring availability, latency, performance and system health.

Advise the cloud platform team on improving the reliability of the systems in production and scaling them based on need.

Participate in the development process by supporting new features, services and releases, and hold an ownership mindset for the cloud platform technologies.

Develop tools and automate the process for achieving large-scale provisioning and deployment of cloud platform technologies.

Participate in the on-call rotation for cloud platform technologies. During incidents, lead incident response and contribute to detailed, blameless postmortem analysis reports that are brutally honest.

Propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting and root cause analysis

Requirements

5 years of relevant experience in running distributed systems at scale in production.

Expertise in one of the following programming languages: Java, Python or Go.

Proficient in writing Bash scripts

Good understanding of SQL and NoSQL systems

Good understanding of systems programming (network stack, file system, OS services)

Understanding of network elements such as firewalls, load balancers, DNS, NAT, TLS/SSL and VLANs

Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and determining the root cause of incidents.

Knowledge of JVM concepts like garbage collection, heap, stack, profiling, class loading, etc.

Knowledge of best practices related to security, performance, high-availability, and disaster recovery.

A proven record of handling production issues, planning escalation procedures, conducting post-mortems, impact analysis, risk assessments and other related procedures.

Able to drive results and set priorities independently

BS/MS degree in Computer Science, Applied Math or related field

Bonus Points if you have:

Experience with managing large-scale deployments of search engines like Elasticsearch

Experience with managing large-scale deployments of message-oriented middleware such as Kafka

Experience with managing large-scale deployments of RDBMS such as Oracle

Experience with managing large-scale deployments of NoSQL databases such as Cassandra

Experience with managing large-scale deployments of in-memory caches such as Redis and Memcached

Experience with container and orchestration technologies such as Docker and Kubernetes

Experience with monitoring tools such as Graphite, Grafana and Prometheus

Experience with HashiCorp technologies such as Consul, Vault, Terraform and Vagrant

Experience with configuration management tools such as Chef, Puppet or Ansible

In-depth experience with continuous integration and continuous deployment pipelines

Exposure to Maven, Ant or Gradle for builds

Annual Salary Guidelines: $115,000 - $135,000

Qualys is an Equal Opportunity Employer. Please see our EEO policy.

