Blog

OPTIMIZATION OF QUERIES/ TABLES/ STORED PROCEDURES IN SQL SERVER

05/03/2015 01:39

SOME MORE TIPS FOR OPTIMIZATION OF QUERIES/ TABLES/ STORED PROCEDURES IN SQL SERVER

a. A table should have one clustered index and an appropriate number of non-clustered indexes. Create them on the columns your queries actually use, in this priority order: the WHERE clause, then the JOIN clause, then the ORDER BY clause, and finally the SELECT list.
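
As a rough illustration, assume a hypothetical dbo.Orders table (all names below are made up): one clustered index on the key, plus a non-clustered index shaped around a query that filters on CustomerID and orders by OrderDate.

    CREATE TABLE dbo.Orders
    (
        OrderID     INT           NOT NULL,
        CustomerID  INT           NOT NULL,
        OrderDate   DATETIME      NOT NULL,
        TotalAmount DECIMAL(18,2) NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID)
    );

    -- Supports: SELECT OrderID, TotalAmount FROM dbo.Orders
    --           WHERE CustomerID = @CustomerID ORDER BY OrderDate;
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
        ON dbo.Orders (CustomerID, OrderDate)
        INCLUDE (TotalAmount);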

b. Avoid using triggers if possible; incorporate the trigger's logic into a stored procedure instead.

    1. When creating a stored procedure, we can specify the WITH RECOMPILE option. Such a stored procedure never benefits from the cached execution plan: each time it is executed, the cached plan is invalidated (flushed) and a new plan is built from the parameters passed in, if any. This option rarely gives a big benefit, but it can be useful when the procedure returns very different results, or executes only selected parts of its body, depending on the input parameters, for example statements inside IF blocks or CASE branches driven by those parameters.

    2. This option is useful when a table's properties change and that table is used by many stored procedures. Instead of recompiling each dependent stored procedure by hand, a simple sp_recompile on the table is enough, with no server restart.
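
For example, after altering a hypothetical dbo.Orders table, marking its dependent objects for recompilation is a one-liner:

    -- Mark every stored procedure and trigger that references the table for
    -- recompilation on its next execution (dbo.Orders is a placeholder name)
    EXEC sp_recompile N'dbo.Orders';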

    3. Mixing DDL and DML statements together inside a stored procedure can also cause the stored procedure to recompile.

    4. Always create a stored procedure with a fully qualified name: the schema name should prefix the procedure name. This helps SQL Server resolve the name quickly when the procedure is called, because it knows exactly which schema to look in.

    5. Tables should have proper indexes, and those indexes should be rebuilt or reorganized from time to time, because they become fragmented after heavy data insertion or deletion.
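
A minimal maintenance sketch, again using made-up table and index names:

    -- Rebuild all indexes on a table to remove fragmentation
    ALTER INDEX ALL ON dbo.Orders REBUILD;

    -- For lighter maintenance of a single index
    ALTER INDEX IX_Orders_CustomerID_OrderDate ON dbo.Orders REORGANIZE;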

    6. Include a SET NOCOUNT ON statement: with every SELECT and DML statement, SQL Server returns a message indicating the number of rows affected. This information is helpful while debugging the code but useless afterwards. Setting SET NOCOUNT ON disables it. For stored procedures that contain many statements or Transact-SQL loops, SET NOCOUNT ON can provide a significant performance boost because network traffic is greatly reduced.
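
A small sketch of where the statement belongs, using a hypothetical procedure and table:

    CREATE PROCEDURE dbo.usp_GetOrdersByCustomer
        @CustomerID INT
    AS
    BEGIN
        -- Suppress the "(n row(s) affected)" messages sent back to the client
        SET NOCOUNT ON;

        SELECT OrderID, OrderDate, TotalAmount
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID;
    END;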

    7. Try to avoid SQL Server cursors whenever possible: a cursor uses a lot of resources to maintain the current record position in a recordset, which hurts performance. If records really must be processed one by one in a loop, use a WHILE loop. Wherever possible, replace the cursor-based approach with a set-based approach, because the SQL Server engine is designed and optimized to perform set-based operations very fast. Note that a cursor is essentially just another kind of WHILE loop.

    8. Keep transactions as short as possible: transaction length affects blocking and deadlocking. Exclusive locks are not released until the end of the transaction, and at higher isolation levels shared locks are also held for the life of the transaction. A lengthy transaction therefore means locks held for longer, which turns into blocking and, in some cases, deadlocks. For faster execution and less blocking, keep the transaction as short as possible.

    9. Use TRY...CATCH for error handling: prior to SQL Server 2005, error handling made up a big portion of the code because an error check had to be written after every T-SQL statement, and more code always consumes more resources and time. SQL Server 2005 introduced TRY...CATCH as a simpler way to achieve the same thing.
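
A minimal TRY...CATCH sketch around a transaction (the table and key value are placeholders):

    BEGIN TRY
        BEGIN TRANSACTION;

        UPDATE dbo.Orders
        SET TotalAmount = TotalAmount * 1.10
        WHERE OrderID = 42;   -- placeholder key value

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;

        -- Surface the error details instead of checking @@ERROR after every statement
        SELECT ERROR_NUMBER()  AS ErrorNumber,
               ERROR_MESSAGE() AS ErrorMessage;
    END CATCH;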

c. A table should have a primary key. 

d. Try to use constraints instead of triggers whenever possible. Constraints are more efficient than triggers and enhance performance, so prefer constraints over triggers wherever you can. 

e. Try to use table variables instead of temporary tables. Table variables require fewer locking and logging resources than temporary tables, so use them whenever possible.
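
A quick sketch of the table-variable approach, assuming the same hypothetical dbo.Orders table as above:

    -- Table variable instead of a #temp table for a small intermediate result set
    DECLARE @RecentOrders TABLE
    (
        OrderID   INT PRIMARY KEY,
        OrderDate DATETIME NOT NULL
    );

    INSERT INTO @RecentOrders (OrderID, OrderDate)
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= DATEADD(DAY, -7, GETDATE());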

f. Avoid the use of views where possible, or replace views with the underlying tables. 

g. Try to avoid the DISTINCT clause wherever possible. DISTINCT causes extra work and degrades performance, so use it only when it is necessary or unavoidable. 

h. Try to add a SET NOCOUNT ON statement to your stored procedures, as it suppresses the message indicating the number of rows affected by each SQL statement. It also reduces network traffic, because the client no longer receives those messages. 

i. Try to use the TOP keyword or the SET ROWCOUNT statement in SELECT statements if you only need the first n rows. This can improve query performance, because a smaller result set is returned, and it also reduces traffic between the server and the clients. 
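
For example (hypothetical table again):

    -- Return only the 10 most recent orders instead of the whole table
    SELECT TOP (10) OrderID, OrderDate, TotalAmount
    FROM dbo.Orders
    ORDER BY OrderDate DESC;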

j. Try to use user-defined functions (UDFs) to encapsulate code for reuse. A UDF contains one or more SQL statements that can be reused, and calling a UDF can also reduce network traffic. 

k. If possible, move the logic of a UDF into a stored procedure as well. 

l. If you need to delete all rows of a table, use the TRUNCATE TABLE command instead of DELETE. TRUNCATE TABLE is faster because it removes all rows from the table without logging each individual row delete. 

m. Remove any unnecessary joins from tables. 

n. If a query uses a cursor, see whether there is another way to avoid it (for example SELECT ... INTO or INSERT ... INTO). Try to avoid cursors whenever possible, as SQL Server cursors can degrade performance compared to plain SELECT statements. Try to use correlated sub-queries or derived tables instead for row-by-row operations on tables. 

o. When writing a sub-query (a SELECT statement within the WHERE or HAVING clause of another SQL statement): 

    1. Try to use a correlated sub-query (one that refers to at least one value from the outer query) when the expected result is relatively small and/or the other criteria are efficient, i.e. when the tables inside the sub-query have efficient indexes. 

    2. Try to use a non-correlated sub-query (one that does not refer to the outer query) when dealing with large tables from which you expect a large result (many rows) and/or when the tables inside the sub-query do not have efficient indexes. 

         3. Ensure that multiple subqueries are in the most efficient order. 

    4. Remember that rewriting a sub-query as a join can sometimes increase efficiency (see the sketch after this list). 
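
Here is a small sketch of point 4, using made-up dbo.Customers and dbo.Orders tables; the two statements return the same rows:

    -- Sub-query version
    SELECT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
    WHERE c.CustomerID IN (SELECT o.CustomerID
                           FROM dbo.Orders AS o
                           WHERE o.OrderDate >= '20150101');

    -- Equivalent join version (DISTINCT keeps the result sets identical)
    SELECT DISTINCT c.CustomerID, c.CustomerName
    FROM dbo.Customers AS c
    INNER JOIN dbo.Orders AS o
        ON o.CustomerID = c.CustomerID
    WHERE o.OrderDate >= '20150101';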

p. Use the char/varchar data types instead of nchar/nvarchar if you do not need to store Unicode data. A char/varchar value uses one byte per character, whereas an nchar/nvarchar value uses two bytes per character, so char/varchar columns use half the space of nchar/nvarchar columns for the same data.

q. Try to use stored procedures instead of heavy ad hoc queries. They reduce network traffic, because the client sends only the stored procedure name (plus some parameters) instead of long query text. Stored procedures can also enhance security: for example, we can give different users different permissions to execute a procedure that works with a restricted set of columns and data. 

r. Try to return an integer value through a RETURN statement instead of returning it as part of a record set. The RETURN statement exits the stored procedure unconditionally, so statements following it are not executed. RETURN is generally used for error codes, but it can return an integer for any other reason, and using it can improve performance because SQL Server does not have to create a record set. 

s. Try to drop indexes that are not being used. Each index takes up disk space and slows down DML operations, so unused indexes should be dropped. We can use the Index Tuning Wizard to help identify indexes that are not being used by our SQL queries. 
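
As an alternative to the wizard, one way to spot candidate indexes (from SQL Server 2005 onwards) is the sys.dm_db_index_usage_stats DMV; note it only reflects usage since the last instance restart, so treat the output as a hint, not a verdict.

    -- Non-clustered indexes in the current database with no seeks, scans or lookups
    -- recorded since the last SQL Server restart
    SELECT OBJECT_NAME(i.object_id) AS TableName,
           i.name                   AS IndexName
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS s
           ON s.object_id   = i.object_id
          AND s.index_id    = i.index_id
          AND s.database_id = DB_ID()
    WHERE i.type_desc = 'NONCLUSTERED'
      AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
      AND ISNULL(s.user_seeks + s.user_scans + s.user_lookups, 0) = 0;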

t. Try to create indexes on columns with integer values rather than character values. Integer values are smaller than character values, which reduces the number of index pages needed to store the index keys. That in turn reduces the number of reads required to traverse the index and improves overall index performance. 

u. If several tables are joined very frequently, consider creating indexes on the joined columns; this can significantly improve the performance of queries against the joined tables. 

v. Try to avoid applying operations to columns in predicates wherever possible. Some operations prevent the use of an index on a column even if one exists, for example ltrim(rtrim(FieldColumnName)), and this degrades performance. Similarly, instead of using the condition cast(DateColumn as varchar(20)) = @dateStr, convert @dateStr to a datetime expression and compare it to the DateColumn value directly. 
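
A sketch of the date example, with a hypothetical dbo.Orders table and OrderDate column:

    -- Non-sargable: the CAST on the column blocks an index seek on OrderDate
    -- WHERE CAST(OrderDate AS varchar(20)) = @dateStr

    -- Sargable alternative: convert the parameter once and compare as a range
    DECLARE @dateStr varchar(20) = '2015-03-05';
    DECLARE @day     datetime    = CONVERT(datetime, @dateStr);

    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE OrderDate >= @day
      AND OrderDate <  DATEADD(DAY, 1, @day);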

CONCLUSION Query optimization has a very big impact on the performance of a DBMS, and it continuously evolves with new, more sophisticated optimization strategies. Query optimization is a common task performed by database administrators and application designers to tune the overall performance of the database system. Even with a powerful infrastructure, performance can be significantly degraded by inefficient queries. So, we should try to follow the general tips mentioned above to get better query performance. Optimization can be achieved with some effort if we make it a general practice to follow these rules. The techniques described in this post allow basic optimization of queries, tables, indexes and stored procedures for performance gains. The main focus was on query optimization. 

Package Tuning for OLE-DB

05/03/2015 01:14
Let’s discuss a few necessary steps for optimizing the extract and load process in a package. Before you make any changes to your package, always use the built-in SSIS features to monitor and log metadata about package execution (SSIS logging, SSIS performance counters, and SQL Server Profiler).
 
Steps you should check while creating an SSIS package: 
 
1. Remove unwanted columns from the source and set data types appropriately; this will help with buffer management. Also try to use a SQL command to fetch the source data, because the "Table or view" and "Table name or view name from variable" data access modes in the OLE DB source behave like SELECT * and pull all the columns.
2. Check that your system has sufficient memory available so that you can configure a small number of large buffers. Also try tuning the DefaultBufferMaxSize and DefaultBufferMaxRows properties of the data flow task, then test your package.
3. Checkpoints: these allow your package to restart from the last point of failure on the next execution. Enabling this feature saves the time spent on tasks that already executed successfully, because execution resumes from the task that failed last time. You enable it by setting three package properties (CheckpointFileName, CheckpointUsage and SaveCheckpoints).
4. The SSIS project also has a RunInOptimizedMode property, applicable at design time only; if you set it to TRUE, all data flow tasks run in optimized mode irrespective of their individual settings at the data flow task level.
5. There are a few settings on the OLE DB destination that can definitely impact the performance of your package:
 
·  Data Access Mode: use the 'fast load' option for loading data into the destination table; it works like a BULK INSERT statement.
·  Keep Identity: by default this is unchecked, which means the destination table (if it has an identity column) will generate identity values on its own.
·  Keep Nulls: unchecked means that if the source sends a NULL for a column, the column's default value is inserted into the destination table instead. If you check this option, the default constraint on the destination column is ignored and the NULL from the source is preserved and inserted.
·  Table Lock: check this option unless the same table is being used by another process at the same time. It makes SSIS acquire a table lock on the destination table instead of many row-level locks, which could otherwise turn into lock escalation problems.
·  Check Constraints: un-check it if you are sure that the incoming data will not violate the constraints of the destination table.
 
There are two more settings in the OLE DB destination which are very important to understand to improve performance.
a. Rows per batch – the default value is -1, which means all incoming rows are treated as a single batch. You can change this behavior and break the incoming rows into multiple batches; the only allowed value is a positive integer specifying the maximum number of rows per batch.
b. Maximum insert commit size – the default value is 2147483647 (the largest 4-byte integer value), which means all incoming rows are committed once on successful completion. You can specify a smaller positive value to commit after that many records instead.
 
Other factors also impact performance, one of them being resource availability, so please test these changes properly before putting them into your production environment.

Execute SSIS via Stored Procedure (SSIS 2012)

05/03/2015 00:59

The SSIS 2012 Catalog's integration with SQL Server brings the advantage of being able to execute SSIS packages natively from within T-SQL, along with a host of other functionality including improved monitoring and logging, integrated security and, of course, the new deployment model.

Executing SSIS packages by calling a stored procedure can be handy in many situations, but it also comes with some (albeit minor) stipulations and quirks. This post goes through the basics of executing SSIS packages through stored procedures, while also touching lightly on some of the issues I encountered while trying out this new SSIS execution method.

 

SSIS packages are executed asynchronously through a collection of stored procedures. These procedures perform functions such as initializing the execution context of a package, setting package/project/environment parameters, and kicking off a particular execution context.

So let's begin with an example. We start by creating an SSIS package execution context using the stored procedure [SSISDB].[catalog].[create_execution]:

    DECLARE @exec_id BIGINT

    EXEC [SSISDB].[catalog].[create_execution] 
        @package_name=N'Test.dtsx',     --SSIS package name TABLE:(SELECT * FROM [SSISDB].internal.packages)
        @folder_name=N'Warehouse ETLs', --Folder where the package lives TABLE:(SELECT * FROM [SSISDB].internal.folders)
        @project_name=N'ImportPackages',--Project where the SSIS package lives TABLE:(SELECT * FROM [SSISDB].internal.projects)
        @use32bitruntime=FALSE, 
        @reference_id=NULL,             --Environment reference; if NULL then no environment configuration is applied.
        @execution_id=@exec_id OUTPUT   --This parameter is output and contains the execution_id of your SSIS execution context.

    SELECT @exec_id

This does not actually run the SSIS package; instead it creates an execution context with a specific execution_id value, which is returned to the caller. This execution_id is then used for various operations such as adding parameters to the execution context, starting the execution of the package and checking on the status of the package.

Once an execution is created, it has an Execution Status of Created (1). You can query the execution status (in numerical form) with the following query:

  SELECT [STATUS]
  FROM [SSISDB].[internal].[operations]
  WHERE operation_id = @exec_id

There is a handy tip by Jamie Thomson about SSIS Execution Statuses.

Just one more thing to note here: if you get any of the project/folder/package names wrong, you'll get an error that looks like this:
Cannot access the package or the package does not exist. Verify that the package exists and that the user has permissions to it.

Just make sure you get the names correct; you can use the SELECT statements shown in the comments above to verify the names to use for each parameter.

Now in order to add parameters to an SSIS package execution context, you can call the following stored procedure:

    DECLARE @IntParam sql_variant = 2 --Some random parameter value, needs to be in sql_variant format

    EXEC [SSISDB].[catalog].[set_execution_parameter_value] 
        @exec_id,  -- The execution_id value we received by calling [create_execution]
        @object_type=30,  --30 is Package parameters; 20 is Project parameters; 50 is server-level execution properties such as LOGGING_LEVEL
        @parameter_name=N'ParamDateOffset',  --Parameter name
        @parameter_value=@IntParam

The code above is pretty much self-explanatory; again, make sure you get the object_type parameter correct here (i.e. whether it is a package-level parameter or a project-level parameter, etc.), otherwise you'll get a nasty error about the parameter not existing or you not having permission to set it.

Finally when you are ready to execute an SSIS package, you can use the following simple stored procedure:

    EXEC [SSISDB].[catalog].[start_execution] @exec_id

This will change the package Execution Status from Created (1) to Running (2).

It is important to note that if a package is left un-executed (still in the Created (1) Execution Status), the next time the SSIS maintenance job runs it will change the Execution Status to Ended Unexpectedly (6).

From here you can monitor the SSIS package through the relevant log tables in the SSIS Catalog database, or use the friendlier SSRS reports to follow its progress. If everything goes well, the SSIS package's Execution Status changes to Succeeded (7).
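
If you need to wait for the outcome from T-SQL, a rough polling sketch along these lines can work; catalog.executions is the public view over the execution data, and the execution_id below is a placeholder:

    DECLARE @exec_id BIGINT = 12345;   -- replace with the execution_id returned by create_execution
    DECLARE @status  INT    = 1;       -- 1 = Created

    WHILE @status IN (1, 2, 5, 8)      -- Created, Running, Pending, Stopping
    BEGIN
        WAITFOR DELAY '00:00:05';

        SELECT @status = [status]
        FROM [SSISDB].[catalog].[executions]
        WHERE execution_id = @exec_id;
    END

    SELECT @status AS final_status;    -- 7 = Succeeded, 4 = Failed, 6 = Ended unexpectedly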

SSIS Windows Authentication Account Requirement

In order to execute SSIS packages, you will need to execute the above stored procedures using a Windows Authentication Account rather than SQL Server Authentication Account, otherwise you will be at the receiving end of the following error:

The operation cannot be started by an account that uses SQL Server Authentication. Start the operation with an account that uses Windows Authentication.
A .NET Framework error occurred during execution of user-defined routine or aggregate “start_execution_internal”:
System.Data.SqlClient.SqlException: The operation cannot be started by an account that uses SQL Server Authentication. Start the operation with an account that uses Windows Authentication.

Although annoying, this requirement makes sense considering SSIS executes outside the context of SQL Server and as part of an OS process, which means that a Windows account is necessary in order to execute a package.

I tried changing the execution context (through EXECUTE AS) but that didn’t seem to help, so I resorted to using a Windows Authenticated Account in order to execute an SSIS package via stored procedure.

Ways to Optimize and Improve Performance of your SSIS Package

05/03/2015 00:00

I was brought in on this project to suggest techniques to improve the performance of a scaled-out SSIS environment, basically by increasing the throughput to the staging database. Below I discuss some of the general approaches I took to achieve that goal. Some of the advice might be a bit random, and some might not be pertinent to your particular situation, but overall you should find a gem or two on optimising SSIS performance in there somewhere!

 

Minimize SSIS File Movement

This was a big issue in the environment I was analyzing. Since the SSIS packages process and transform raw text files, there was a lot of data movement: files were distributed to the SSIS VMs for processing, then moved back to the distribution server for archiving.

Because there was a large number of files, the SSIS process had to share I/O and network resources with the file copy process, causing sluggishness while both processes ran at the same time.

Ideally, all file movement should be minimized if I/O is a bottleneck in your SSIS environment. This can be accomplished through fast shared storage that appears local to each of the SSIS VMs; Hyper-V, for example, has a clustered disk solution where each VM sees the disk as local storage while it is in fact shared across multiple VMs that can edit files concurrently. It is important to note that a normal network share causes files to be copied locally (to where the SSIS process is executing) before execution can begin, which causes a huge drop in throughput.

Another thing to consider is whether all the file movement steps are actually required by your execution logic. For example, in my scenario we had:

Distribution Box (Downloading data) –(move)> SSIS Box (Processing data) –(move)> Distribution Box (Archiving data)

We reworked this path to minimize network traffic by copying the file from the distribution box to the SSIS box and archiving it at the same time; once SSIS finishes processing a file, it simply deletes it. This means files are transferred over the network only once.

Additionally, SSIS, like many data processing solutions, works better on larger files than on a large number of small files, because it takes time to establish a connection to each new file. To shave off those extra milliseconds, it is recommended to combine the data into a smaller number of larger files; it makes a real difference when processing a very large number of files.

Drop/Create Indexes on Staging Tables

Pushing data into an indexed table takes much more time than pushing the data into an un-indexed table and rebuilding the indexes afterwards.

This is because, while pushing data into an indexed table, the database engine has to maintain the index alongside the updates to the table, which causes a considerable delay.

Additionally, depending on your insert pattern, inserting data into an indexed table may leave you with a fragmented index, which is slower to access than a non-fragmented index (obviously), hence it is highly recommended to follow the index drop/recreate procedure below (a T-SQL sketch follows the list):

  • Drop all indexes on destination table
  • Push data (perform ETL operations)
  • Create all indexes on destination table
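
One way to script that cycle without saving and re-creating index definitions is to disable and then rebuild the non-clustered indexes around the load. A minimal sketch with made-up table and index names (never disable the clustered index, as that makes the table unreadable):

    -- Before the load: disable the non-clustered indexes on the staging table
    ALTER INDEX IX_StagingSales_BusinessKey ON dbo.StagingSales DISABLE;

    -- ... the SSIS data flow loads dbo.StagingSales here ...

    -- After the load: rebuild, which re-enables the index and leaves it defragmented
    ALTER INDEX IX_StagingSales_BusinessKey ON dbo.StagingSales REBUILD;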

Dedicate VM Resources

While analyzing the live environment I was trying to optimise, I noticed that VMware (our virtualization provider) seems to de-allocate and re-allocate CPU resources on the SSIS boxes depending on activity (and the activity trend over time). This lets the host redistribute resources around the VM stack, giving boxes that need more resources a way to get what they need while maintaining operations on existing VMs.

This is all well and good if you have a steady workload that doesn't change over time, but if it does, the box will suffer from "resource suffocation" until the VM controller decides to re-allocate resources back to it. This re-allocation takes some time, during which your process is probably running at very low throughput.

If you have a variable load, it is highly recommended to allocate non-reclaimable resources (memory and CPU) to those boxes; this reduces resource contention caused by dynamic VM resource re-arrangement and provides consistent performance figures.

Enable Bulk Operations on the SSIS Login

By granting the login that SSIS uses to load data into the database (the destination connection's login) the BulkAdmin server role, you enable SSIS to load the data in bulk where possible.

This recommendation works in synergy with using the fast load data access mode in the destination component of the SSIS data flow, which allows SSIS to use the BULK INSERT statement.
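
Granting the role itself is a one-liner; a sketch assuming SQL Server 2012+ syntax and a placeholder login name:

    -- Add the SSIS destination login to the bulkadmin server role
    ALTER SERVER ROLE [bulkadmin] ADD MEMBER [DOMAIN\ssis_etl_svc];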

Set FastParse Option on the Flat File Source

The give-away is in the name: the FastParse option allows the flat file source output to be parsed more quickly (sometimes significantly so) at the expense of support for locale-specific data formats (such as numeric, date and time data). This option is definitely worth considering if you do not have locale-specific data or can control the format of the input files. It is also worth keeping in mind when building an output file that is intended to be consumed by an SSIS package.

The option can be located for each column (or node) under the following path: Flat File Source -> Show Advanced Editor -> Input and Output Properties (tab) -> Flat File Source Output (node) -> Output Columns (node).

FastParse Option

Configure and Tweak SSIS Destination Operators

There are a few options you could tweak on the SQL Server Data Destination to better accommodate the data you are trying to load, these are:

  • MaximumInsertCommitSize: determines how much data is pushed into the table before SSIS commits the transaction. By default this is set to 2147483647, but it should be tweaked to better suit the type of data being pushed. This value's effect becomes more important in an environment where multiple SSIS packages are loading data into one table and each package is trying to take a table lock on the destination table. It is also important to note that this value influences how much is rolled back if SSIS fails before committing the transaction.
  • Check Constraints: flags whether the destination component should check the constraints imposed on the destination table when inserting data. Un-ticking this box speeds up loading into the destination table in cases where there are constraints on the table.

Configure Connection Packet Size Property

By default, an SSIS connection is set up with a packet transfer size (Packet Size) of 4,096 bytes (4 KB). That is fine for old networks that can't handle larger packets, but these days you will be hard pressed to find an enterprise-grade network that can't handle packet sizes eight times that.

Packaging data into bigger packets over the network reduces the overhead associated with packing and unpacking those packets, leading to better throughput.

It is generally recommended to increase this value to 32,767 (32 KB), which is the maximum allowed value. It is also highly recommended to monitor your network and ensure no contention or packet loss is happening due to this newly configured value.

Destination Table Partitioning

The importance of partitioning cannot be stressed enough in a scaled-out SSIS environment. Whether each SSIS package inserts into a separate staging table which is then switched into one main staging table (ideal), or SSIS itself inserts into specific partitions of the staging table (in which case you'll need to deal with table-lock escalation), partitioning, when done correctly, should eliminate any bottlenecks caused by SQL engine table-lock waits, which can be considerable in a multi-SSIS environment.
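
For the separate-staging-table-per-package approach, the switch itself is a metadata-only operation; a rough sketch with hypothetical table names and partition number (both tables must have identical structure, and the source table needs a CHECK constraint matching the target partition's range):

    -- Switch a fully loaded per-package staging table into partition 3 of the main staging table
    ALTER TABLE dbo.StagingSales_Stream1
        SWITCH TO dbo.StagingSales PARTITION 3;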

Complete Index Concepts in SQL Server

19/02/2015 18:54

What is the difference between an index seek and an index scan?

When will SQL Server choose an index seek versus an index scan?

How do you rebuild indexes?

Why does an index seek perform faster than an index scan?

What is a row lookup?

What is the difference between a clustered index and a heap?

What is the difference between a clustered index and a non-clustered index? 

Why is a non-clustered index scan faster than a clustered index scan?

Difference Between index seek and index scan.docx (731052)

 

What issues do we face when loading data from an Excel source?

19/02/2015 18:17

What is OPENROWSET?

What is IMEX=1 (Import/Export mode)?

What is the Excel connection string?

Why do we get NULLs in the target table while loading from Excel?

How do you load alphanumeric data types into the target table?

What version is the Excel source, and which version is your machine running (32-bit or 64-bit)?

How do you load data starting from a particular row and column?

How do you load Excel worksheets with different sheet names in a workbook?

What is the enumerator in a Foreach Loop container?

 

loading from excel to table.docx (1905380)