Introduction

In a production environment, it is not always easy to determine why the sql job has failed and what has happened during the job step execution. The checklist provided here will help in arriving at a decision or give clear direction to troubleshoot

Checklist: First thing first, so let us go sequentially:

  1. Check that the job is enabled.
  2. If the job runs fine while we start it manually then verify that the job’s schedule is enabled.
  3. Go through the history of the job to know the last successful run.
  4. To know more about the error, go through the SQL server error log, SQL server agent error log, windows application log. IF you think that your sql server job entries disappearing mysteriously then check  the job history log configuration.
  5. If not all the jobs are working, then check the SQL agent service is up. If not running check the account under which service is running , it should be a member of Domain user group. Other possibility could be if the password of the Domain account has been changed and not updated here. In worst case, MSDB database may get corrupt or unavailable due to which sql server agent service may not start.
  6. Some common reasons could be timeout errors or the deadlock, look for the logged events or activity monitor and work accordingly. This happens especially when the job is long running and configured to work under high isolation transactional level and queries scanning the bigger tables. Try to break the functionality, change the order of the query or rewrite them for efficiency.
  7. If the scope of the job is spread across different servers then check that if there are any issues with security or linked server configuration
  8. If invoking system commands or external application, sql agent should have appropriate permissions to carry out the task.
  9. As a last step, check if there is a data issue, to troubleshoot this take the code that you suspect and check in your test server with the latest database restored from production.

Recommendations:

  1. It is good to follow the change management process in order to be in sync with the every small change made to the system configuration and will help in troubleshooting. Before troubleshooting any failures we can see the latest changes happened in the platform with respect to database and account expiry due date.
  2. Since MSDB will be storing all the data related to SQL jobs so make sure that the database configurations support the growth of .mdf and .ldf files in the disk.
  3. Work on the amount of data to retain in history.
  4. Take the regular backup of Master and MSDB databases, because in case you fall into situation where you are rebuilding the master database with setup, it will rebuild all system databases in that instance.
  5. Make a daily check on your scheduled jobs.
  6. Create custom category for the sql jobs specific to your application to segregate them from regular maintenance jobs.
  7. Make sure that amount of data to rollback on each step has been taken care.
  8. Periodically collect the performance monitor statistics for agent jobs, which use significant amount of resources, as part of performance and improvement strategies.
  9. Try that no two job’s schedule overlaps that work on same set of tables.
  10. Plan to schedule regular maintenance jobs during off peak hours or during weekends or as per business requirements.
  11. Have your sql server agent send you a notification about success/failure of the job.
  12. 12. Keep sql server updated with the latest service pack.

Code Snippets:

/**** Below query can be used to directly pull the failed job on a

particular day and other relevant information. It can be modified to

match the exact requirement ****/

SELECT sj.[name],

sh.step_id,

sh.step_name,

sh.run_date,

sh.run_time,

sh.sql_severity,

sh.message,

sh.server

FROM msdb.dbo.sysjobs sj

INNER JOIN msdb.dbo.sysjobhistory sh

ON sh.job_id = sj.job_id

INNER JOIN msdb.dbo.sysjobsteps ss

ON sj.job_id = ss.job_id

AND sh.step_id = ss.step_id

WHERE sh.run_status = 0 /*Failure-0, Success-1*/

AND sh.run_date > 'Provide Date here'

ORDER BY sh.instance_id ASC

Conclusion

By following the checklist and resolution steps i hope we can easily troubleshoot SQL Server Job Failures. Drop down your comments

kick it on DotNetKicks.com

Related Posts

  1. SQL Server 2000 Maintenance Jobs Failed with Error 22029
  2. SQL Server 2005 database DBA Checklist for database administrator
  3. Troubleshooting SSRS Error “The report server is not responding”
  4. Database Mail in SQL Server 2005/2008
  5. Encryption and Decryption in SQL Server 2005

Tags: , , ,

Leave a Reply

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>