Best Practices Informatica
Best Practices:
The following lists some performance considerations when building mappings.
1). Filter Expressions – try to evaluate them in a port expression:
Try to create the filter (true/false) answer inside a port expression upstream. Complex filter expressions slow down the mapping. Again, expressions/conditions operate fastest in an Expression Object with an output port for the result. Turns out – the longer the expression, or the more complex – the more severe the speed degradation. Place the actual expression (complex or not) in an EXPRESSION OBJECT upstream from the filter. Compute a single numerical flag: 1 for true, 0 for false as an output port. Pump this in to the filter – you should see the maximum performance ability with this configuration.
2). Remove all “DEFAULT” value expressions where possible.
Having a default value – even the “ERROR(xxx)” command slows down the session. It causes an unnecessary evaluation of values for every data element in the map. The only time you want to use “DEFAULT value is when you have to provide a default value for a specific port. There is another method: placing a variable with an IIF(xxxx, DEFAULT VALUE, xxxx) condition within an expression. This will always be faster (if assigned to an output port) than a default value.
3). Variable Ports are “slower” than Output Expressions.
Whenever possible, use output expressions instead of variable ports. The variables are good for “static – and state driven” but do slow down the processing time – as they are allocated/reallocated each pass of a row through the expression transform.
4). Datatype conversion – perform it in a port expression.
Simply mapping a string to an integer, or an integer to a string will perform the conversion, however it will be slower than creating an output port with an expression like: to_integer(xxxx) and mapping an integer to an integer. It’s because PMServer is left to decide if the conversion can be done mid-stream which seems to slow things down.
5). Unused Ports.
Surprisingly, unused output ports have no affect on performance. This is a good thing. However in general it is good practice to remove any unused ports in the mapping, including variables. Unfortunately – there is no “quick” method for identifying unused ports.
6). String Functions.
String functions definitely have an impact on performance. Particularly those that change the length of a string (substring, ltrim, rtrim, etc.). These functions slow the map down considerably, the operations behind each string function are expensive (de-allocate, and re-allocate memory within a READER block in the session). String functions are a necessary and important part of ETL, we do not recommend removing their use completely, only try to limit them to necessary operations. One of the ways to tune these is to use “varchar/varchar2″ data types in your database sources, or to use delimited strings in source flat files (as much as possible). This will help reduce the need for “trimming” input. If your sources are in a database, perform the LTRIM/RTRIM functions on the data coming in from a database SQL statement, this will be much faster than operationally performing it mid-stream.
7). IIF Conditionals are costly.
When possible – arrange the logic to minimize the use of IIF conditionals. This is not particular to Informatica. It is costly in ANY programming language. It introduces “decisions” within the tool; it also introduces multiple code paths across the logic (thus increasing complexity). Therefore – when possible, avoid utilizing an IIF conditional – again, the only possibility here might be (for example) an ORACLE DECODE function applied to a SQL source.
8). Sequence Generators slow down mappings.
Unfortunately there is no “fast” and easy way to create sequence generators. The cost is not that high for using a sequence generator inside of Informatica, particularly if you are caching values (cache at around 2000) – seems to be the suite spot. However – if at all avoidable, this is one “card” up a sleeve that can be played. If you don’t absolutely need the sequence number in the map for calculation reasons, and you are utilizing Oracle, then let SQL*Loader create the sequence generator for all Insert Rows. If you’re using Sybase, don’t specify the Identity column as a target – let the Sybase Server generate the column. Also – try to avoid “reusable” sequence generators – they tend to slow the session down further, even with cached values.
9). Test Expressions slow down sessions.
Expressions such as: IS_SPACES tend slow down the mappings, this is a data validation expression which has to run through the entire string to determine if it is spaces, much the same as IS_NUMBER has to validate an entire string. These expressions (if at all avoidable) should be removed in cases where it is not necessary to “test” prior to conversion. Be aware however, that direct conversion without testing (conversion of an invalid value) will kill the transformation. If you absolutely need a test expression for a numerical value, try this: IIF( * 1 >= 0,,NULL) preferably you don’t care if it’s zero. An alpha in this expression should return a NULL to the computation. Yes – the IIF condition is slightly faster than the IS_NUMBER – because IS_NUMBER parses the entire string, where the multiplication operator is the actual speed gain.
10). Reduce Number of OBJETS in a map.
Frequently, the idea of these tools is to make the “data translation map” as easy as possible. All to often, that means creating “an” expression for each throughput/translation (taking it to an extreme of course). Each object adds computational overhead to the session and timings may suffer. Sometimes if performance is an issue / goal, you can integrate several expressions in to one expression object, thus reducing the “object” overhead. In doing so – you could speed up the map.
11). Update Expressions – Session set to Update Else Insert.
If you have this switch turned on – it will definitely slow the session down – Informatica performs 2 operations for each row: update (w/PK), then if it returns a ZERO rows updated, performs an insert. The way to speed this up is to “know” ahead of time if you need to issue a DD_UPDATE or DD_INSERT inside the mapping, then tell the update strategy what to do. After which you can change the session setting to: INSERT and UPDATE AS UPDATE or UPDATE AS INSERT.
12). Multiple Targets are too slow.
Frequently mappings are generated with multiple targets, and sometimes multiple sources. This (despite first appearances) can really burn up time. If the architecture permits change, and the users support re-work, then try to change the architecture -> 1 map per target is the general rule of thumb. Once reaching one map per target, the tuning gets easier. Sometimes it helps to reduce it to 1 source and 1 target per map. But – if the architecture allows more modularization 1 map per target usually does the trick. Going further, you could break it up: 1 map per target per operation (such as insert vs update). In doing this, it will provide a few more cards to the deck with which you can “tune” the session, as well as the target table itself. Going this route also introduces parallel operations. For further info on this topic, see my architecture presentations on Staging Tables, and 3rd normal form architecture (Corporate Data Warehouse Slides).
13). Slow Sources – Flat Files.
If you’ve got slow sources, and these sources are flat files, you can look at some of the following possibilities. If the sources reside on a different machine, and you’ve opened a named pipe to get them across the network – then you’ve opened (potentially) a can of worms. You’ve introduced the network speed as a variable on the speed of the flat file source. Try to compress the source file, FTP PUT it on the local machine (local to PMServer), decompress it, and then utilize it as a source. If you’re reaching across the network to a relational table – and the session is pulling many rows (over 10,000) then the source system itself may be slow. You may be better off using a source system extract program to dump it to file first, then follow the above instructions. However, there is something your SA’s and Network Ops folks could do (if necessary) – this is covered in detail in the advanced section. They could backbone the two servers together with a dedicated network line (no hubs, routers, or other items in between the two machines). At the very least, they could put the two machines on the same sub-net. Now, if your file is local to PMServer but is still slow, examine the location of the file (which device is it on). If it’s not on an INTERNAL DISK then it will be slower than if it were on an internal disk (C drive for you folks on NT). This doesn’t mean a unix file LINK exists locally, and the file is remote – it means the actual file is local.
14). Too Many Aggregators.
If your map has more than 1 aggregator, chances are the session will run very slowly – unless the CACHE directory is extremely fast, and your drive seek/access times are very high. Even still, placing aggregators end-to-end in mappings will slow the session down by factors of at least 2. This is because of all the I/O activity being a bottleneck in Informatica. What needs to be known here is that Informatica’s products: PM / PC up through 4.7x are NOT built for parallel processing. In other words, the internal core doesn’t put the aggregators on threads, nor does it put the I/O on threads – therefore being a single strung process it becomes easy for a part of the session/map to become a “blocked” process by I/O factors.
15). Maplets containing Aggregators.
Maplets are a good source for replicating data logic. But just because an aggregator is in a maplet doesn’t mean it won’t affect the mapping. The reason maplets that don’t affect speed of the mappings is they are treated as a part of the mapping once the session starts. In other words, if you have an aggregator in a maplet, followed by another aggregator in a mapping you will still have the problem mentioned above in #14. Reduce the number of aggregators in the entire mapping (including maplets) to 1 if possible. If necessary split the mapping up in to several different mappings, use intermediate tables in the database if required to achieve processing goals.
16). Eliminate “too many lookups”.
What happens and why? Well – with too many lookups, your cache is eaten in memory – particularly on the 1.6 / 4.6 products. The end result is that there is no memory left for the sessions to run in. The DTM reader/writer/transformer threads are not left with enough memory to be able to run efficiently. PC 1.7, PM 4.7 solve some of these problems by caching some of these lookups out to disk when the cache is full. But you still end up with contention – in this case, with too many lookups, you’re trading in Memory Contention for Disk Contention. The memory contention may be worse than the disk contention because the system OS end’s up thrashing (swapping in and out of TEMP/SWAP disk space) with small block sizes to try and locate your lookup row. As the row goes from lookup to lookup, the swapping / thrashing gets worse.
17). Lookups & Aggregators Fight.
The lookups and the aggregators fight for memory space as discussed above. Each requires Index Cache, and Data Cache and they “share” the same HEAP segments inside the core. Particularly in the 4.6 / 1.6 products and prior – these memory areas become critical, and when dealing with many rows – the session is almost certain to cause the server to “thrash” memory in and out of the OS Swap space. If possible, separate the maps – perform the lookups in the first section of the maps, position the data in an intermediate target table – then a second map reads the target table and performs the aggregation (also provides the option for a group by to be done within the database)… Another speed improvement…