Site Reliability Engineer - 2056

SugarCRM Houston, TX 77020

Posted 3 months ago

About SugarCRM, Inc.

SugarCRM is a customer experience leader enabling businesses to create profitable customer relationships by delivering highly relevant, personalized experiences throughout the customer journey. We empower companies to strengthen existing customer relationships, create new ones through actionable insights and intelligent automation and better understand the customer at every stage of the journey. This enables businesses to accelerate demand generation, grow revenue, deliver superior customer care and increase loyalty. Our easy-to-use, intuitive platform makes customer experience easy and accessible for everyone, allowing marketing, sales and services professionals to focus on high-impact, value-adding activities that create customers for life.

Where do you fit?

To join our growing team, SugarCRM is currently seeking an experienced Site Reliability Engineer. This role can be based in one of our U.S. offices or be fully remote.

Impact you will make in the role:

  • Manage applications in a CentOS Linux-based environment

  • Build repeatable infrastructures with Ansible

  • Develop and execute plans for rolling out new technologies rapidly

  • Improve monitoring infrastructure, build out data aggregation and alerting rules

  • Work closely with engineering to build scalable solutions

  • Triage tickets raised by our support organization and implement fixes

  • Support our private and public cloud environments and customers

  • Mentor other members of the Operations team

  • Participate in an on-call rotation

Expertise you will bring in:

  • BA/BS in Computer Science with Network Engineering or Information Systems emphasis, or equivalent work experience

  • Extensive knowledge of container orchestration technologies, including Docker and Kubernetes

  • 6+ years of experience in an Operations or Systems Administration role

  • Superior Unix administration skills

  • Extensive knowledge of common Internet Protocols

  • Extensive knowledge of TCP/IP

  • Experience with virtualization and cloud technologies

  • Hardware management, network switch and router administration experience

  • Experience with Apache, MySQL, and PHP in a production environment at scale

  • Strong knowledge of version control systems and hands-on experience with Git

  • Experience writing code for infrastructure automation

  • Understanding of how to architect and implement highly available, scalable, and secure networks across multiple cloud environments

  • Strong affinity for, and experience working with, continuous integration and continuous deployment environments

  • An understanding of microservice architectures and the complexities of deploying them

  • Extensive programming experience in PHP, Ruby, Python, and Shell

  • Full stack troubleshooting and instrumentation experience

  • Extensive experience with IT automation technologies like Puppet, Salt, Chef, or Ansible

  • Experience with data aggregation, alerting, and reporting, and with supporting technologies such as Sensu and Graphite

Nice to haves:

  • Experience in an on-call rotation

  • Experience with Elasticsearch or Apache Solr

  • Experience with Spinnaker and/or other CI/CD tools

  • Previous experience as a mentor or advisor

  • Current contributor to open source projects (a GitHub account you can link us to would be ideal)

Location: Cupertino, CA; Raleigh, NC; Atlanta, GA; Orlando, FL; or Remote, U.S.

We are an Equal Opportunity, Affirmative Action employer. Minorities, women, veterans and individuals with disabilities are encouraged to apply.

Benefits and Perks:

Beyond a stellar work environment, friendly people, and inspiring, innovative work, we have some great benefits and perks:

  • Competitive salaries

  • Excellent medical, dental and vision coverage for you and your family, along with other benefit plans like 401(k) match

  • Unlimited Paid Time Off

  • Wellness Reimbursement Program

  • Onsite Programs, depending on location, such as Dry Cleaning, Car Washes, Massage, Yoga, and more

  • Multi-platform Career & Personal Development Program

  • Regular social events

  • Ownership is the greatest self-identity at SugarCRM - you are making an impact now

  • We are a merit-based company - many opportunities to learn, excel and grow your career

Note to Recruiters and Placement Agencies: SugarCRM does not accept unsolicited agency resumes. Please do not forward unsolicited agency resumes to our website or to any SugarCRM employee. SugarCRM will not pay fees to any third-party agency or firm and will not be responsible for any agency fees associated with unsolicited resumes. Unsolicited resumes received will be considered property of SugarCRM and will be processed accordingly.


Similar Jobs

Site Reliability Engineer

Discover Financial Services

Posted 1 week ago

At Discover, be part of a culture where diversity, teamwork and collaboration reign. Join a company that is just as focused on its employees as it is on its customers and is consistently awarded for both. We're all about people, and our employees are why Discover is a great place to work. Be the reason we help millions of consumers build a brighter financial future and achieve yours along the way with a rewarding career.

Site Reliability Engineers (SREs) are a hybrid of systems and software engineers who are responsible for scaling, automation, and production issue support for applications. SREs have an intense passion for finding and improving efficiencies with infrastructure, development and deployment automation. As an SRE, you will lead the efforts of application deployment, reliability, scalability, availability and performance alongside the engineering and infrastructure teams.

Site Reliability Engineers will work closely with our engineering teams to build mature, production-ready services and applications. As part of the SRE team, you will help define our standards for monitoring, alerting, scalability, and production-readiness. You will monitor and report on the uptime of our systems and services, the performance of our applications, and the capacity of our platform. You will be empowered (yes, empowered) to apply software engineering techniques and discipline to production operations and help us deliver the world's greatest solutions. You will provide feedback into the architecture and application design for each next generation of Payment Services development.

If you are the type of person who loves driving technology problem-solving sessions and has a tireless passion to increase the performance, resiliency and availability of IT solutions serving the greatest Customers and Partners in the World, we believe our SRE opportunity will allow you to be the superstar of all superstars!
To be clear, the position is responsible for the provisioning, benchmarking, tuning, and improving the end-to-end customer experience for our Payment Services platforms. In our industry, where millions of dollars move every day and milliseconds count in every transaction, you are always looking for ways to ensure our customers get the best response time. You will also be deeply involved in system roadmap planning and release management activities. Overall, you will become a rock star subject matter expert on the operation of these world-class core systems powering our great Fortune 300 Company (which really operates like a startup). Additionally, you will promote a risk-aware culture and ensure efficient and effective risk and compliance management practices by adhering to required standards and processes.

To be successful (and we know you can be), you will need to have a strong IT understanding with work experience in off- and on-premises cloud-based and virtual system infrastructure and peripheral services including network, firewall, and database management. We also need you to understand the application development and quality assurance ends of the spectrum, as you will need to interface with that crew as well. During problem escalations you are the driver of the team that finds the root cause, restores functionality, and proposes the long-term solution.

Sounds awesome, doesn't it? We think so, but we ultimately need you to make this a reality. You will be exposed to the latest technologies in the Industry while helping us create the next generation of Payment Service solutions (mobile payments, remote commerce, IoT payments, etc.). It's all cutting edge and you have the opportunity to be right in the middle of it. If you're motivated by leading your work vs. following a checklist, and enjoy advocating for and driving change as well as inventing features or projects that solve a business challenge, join our team.
Do not hesitate, as the naming rights to this team are still open to the early hires!!!

Responsibilities:

  • Ability to enhance and maintain complex software components and distributed systems

  • Create and manage a continuous build, integration, test, and deployment system

  • Monitor, alert, analyze and troubleshoot large-scale distributed systems

  • Work with clustering technologies: high availability, resiliency and horizontal scaling. Good understanding of defining and executing High Availability, Disaster Recovery, Sustained Resiliency, and Chaos Engineering tests

  • Control application code deployment servers and code deployment methods

  • OS tuning, optimization and system requirements for vertical scaling

  • Understand networking concepts and have experience with the HTTP protocol

  • Lead and participate in performance tests; identify bottlenecks, opportunities for optimization and capacity demands

  • Monitor and report on SLA/SLO for a given application's services. Work with business and product owners to establish key performance indicators

  • Work with team and leadership to develop the long-term Site Reliability Engineering road map

  • Maintain (evaluate and upgrade) all platform-required applications and libraries (Java, Python, etc.)

  • Partner with security engineers and develop plans and automation to aggressively and safely respond to new risks and vulnerabilities

  • Control application log collection and analysis

  • Automate processes and systems configuration/deployment

  • Design and architect operational solutions for managing applications and infrastructure, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks

  • Create and maintain monitoring technologies and processes that improve the visibility of our applications' performance and business metrics and keep operational workload reasonable
  • Define and drive adoption of best-in-class monitoring frameworks to accomplish end-to-end application or service monitoring and noiseless alerting with proper telemetry

  • Analyze and participate in periodic on-call duties to prevent, solve and automate the response to problems in mission-critical services and automated deployment

  • Work with Release Manager and development teams to deploy software releases

  • Self-manage the effort split between operational work and engineering work

Additionally:

  • Operational Performance & Stability: Works with other members of their assigned Value Stream to ensure that the in-scope applications/platforms are meeting performance and stability requirements. This includes managing Major Incidents to Mitigation/Resolution. Problem Management: Performs Post-Incident Reviews of all Major Incidents and determines Action Items required to avoid similar issues/minimize downtime for future Incidents. Monitors and Metrics: Works with Application Development to ensure that assigned applications/platforms have the appropriate monitoring and metrics in place to appropriately measure performance and stability. Identify Functional and Non-Functional Improvements: Acts as the Operations representative in Value Stream planning and prioritization sessions to ensure that Operational needs of assigned applications/platforms are addressed as needed. Holds quarterly Operational Performance Reviews with Value Stream management.

  • Release Planning & Coordination: Works with other members of their assigned Value Stream to ensure that the Production releases for their in-scope applications/platforms are properly planned and coordinated. This includes holding Change/Release implementation reviews to ensure thorough and appropriate implementation plans. Provides review and sign-off/approval of change tickets for the assigned Value Stream. Represents the Value Stream in Change Advisory Board Meetings. Participates in Program Increment Planning Sessions as a liaison for Operations and Infrastructure support. Provides information regarding upcoming critical changes to the Value Stream.

  • Operational Readiness: Ensures that applications/platforms in the Value Stream are Operationally ready for Production. This includes Annual Review of all SOPs/Knowledge Articles. Monitoring review for any new Feature launch or other significant change that may impact monitoring. SOP/Knowledge Article review for any new Feature launch or other significant change that may impact support documentation. Training of Command Center and Application 1st level Support on new SOPs, Knowledge Articles, and any other support-related needs. Performs Monthly Capacity Analysis of applications/platforms within the Value Stream. Creates and Maintains Operationally focused ELK Dashboards for the Value Stream.

  • Responsible for the Operational Stability and Performance of one or more Critical Business Services used by Discover Customers and Employees

Minimum Qualifications

At a minimum, here's what we need from you:

  • Bachelor's Degree in Business, Computer Information Systems, Computer Science, MIS, Engineering, Science, or related field

  • 2+ years of experience in Information Technology, or related field

  • In lieu of a degree, 4+ years of experience in Information Technology, or related field

Preferred Qualifications

If we had our say, we'd also look for:

  • 4+ years of experience in Technology, or related field

  • Proficiency in one or more general purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java

  • Automation tools experience such as Chef, Puppet, Ansible
  • Experience developing monitoring tools and log analysis tools to manage operations

  • Continued curiosity regarding new technologies and evolving best practices

  • 2 years of coding experience using a strongly typed language (Java, Golang)

  • 2 years of experience in SRE, DevOps, or a similar role

  • 2 years of experience with scripting languages like Python / Bash

  • Familiarity with design principles of monitoring and alerting systems

  • Deep knowledge of distributed pub-sub message systems

Discover Financial Services is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, among other things, or as a qualified individual with a disability.

Discover Financial Services, Houston, TX
