Convergenz Reston, VA 20190
In this technical incident management function, manage incidents to resolution in a 24/7/365 environment using incident management processes and guide incident triage calls from a technical perspective. Share technical details obtained from monitoring tools and dashboards to aid troubleshooting, outline details of resolution activities, and provide timely status updates to stakeholders. Recommend and implement improved processes, assist with postmortem-related activities, and support efforts related to operational improvements. Manage efforts to maintain applications in production, including troubleshooting stoppages, repairing bugs, documenting application performance, and coordinating with technology infrastructure management.
KEY JOB FUNCTIONS
Manage IT production incidents to resolution in a 24/7/365 environment using the standard incident management processes and inform management at all levels of status, impact and resolution actions.
Lead and guide incident triage calls from a technical perspective, analyzing different components of the infrastructure and application environment using a variety of monitoring tools and processes.
Perform eyes-on-glass monitoring of the health of applications and the underlying infrastructure. Proactively look for hardware, software, and environmental alerts or malfunctions. Analyze dashboards and reporting/monitoring tools to identify trends and patterns in application health and performance.
Troubleshoot incidents and quickly identify root causes using operations, wire data analytics, application performance management, and event correlation monitoring tools.
Perform analysis of data, evaluating multiple application protocols including web, database, storage, and supporting infrastructure such as DNS, LDAP, SSL, SMTP, and FTP. Participate in findings review sessions.
Review performance and trends of multiple application protocols and provide recommendations for service improvement.
Assist with instrumentation of critical applications, including setting up custom dashboards for the customer's business, operations, and development.
Act as the 24x7 escalation point for production support to diagnose and resolve complex issues by providing factual data from the various monitoring and instrumentation systems.
Influence other technical teams on the calls and articulate troubleshooting steps effectively.
Ensure monitoring alerts and systems events are assessed, prioritized and managed. Drive the continuous improvement of services and processes in order to increase platform stability and realize operational efficiencies.
Lead required technical follow-up calls for high profile incidents.
Assist with documentation of root cause analysis (RCA) and ensure follow-up on problem tickets and problem management tasks.
Ensure appropriate functional and management escalation takes place as per the standards and procedures.
Follow up on items that could negatively impact production operations, assist with postmortem-related activities, and support efforts related to operational improvements.
Based on recommendations from management, implement new and improved processes, change processes, perform new tasks, create reports and address ad-hoc requests.
Bachelor's Degree or equivalent required
6+ years of related experience
SPECIALIZED KNOWLEDGE & SKILLS
7+ years of working experience with different IT infrastructure components such as Unix/Linux servers, Wintel servers, networks, firewalls, routers, load balancers, VPN, Apache, WebLogic, LDAP, Active Directory, Exchange, Oracle/MS SQL databases, SAN, virtualization, email systems, enterprise monitoring, and access management solutions for single sign-on. Subject matter expertise is not required, but experience with at least eight of the above is preferred.
Working experience with a wide variety of monitoring and data/log analysis tools such as Extrahop, Dynatrace, Netcool suite, Catchpoint, Moogsoft (Event Management and ChatOps), ELK, and Splunk, among others.
Proven methodical approach to problem identification, problem solving and resolution.
Experience working with cloud infrastructure environments.
6+ years of working experience with applications in a production support environment using the above technologies. Management and troubleshooting of middleware products in UNIX and Linux environments. Knowledge of Service Oriented Architecture (SOA), Java, etc.
Ability to analyze different components of the infrastructure and application environments during incident triage calls.
Aptitude to influence other technical teams on the incident calls and articulate troubleshooting steps effectively.
Experience and confidence working with all levels of management; excellent written and verbal skills.
Able to communicate technical issues to senior management quickly and concisely in non-technical terms, and to run large incident conference calls with a wide range of personnel and management levels.
Ability to work occasional nights and weekends when major incidents occur.
Strong relationship management skills and aptitude to multi-task and work well in a high stress environment, both within teams and independently.
Proficiency with Word, Excel, and PowerPoint, and in presenting to senior management using data and information from these tools.
ITIL or PMP certification is desired.
Financial services industry experience is preferred.
Linux/Unix, Extrahop, Dynatrace, Moogsoft, Splunk