A Lookup is a passive transformation, whereas a Joiner is an active transformation.
For every record, a Lookup returns only one record regardless of whether a match is found or not. A Joiner, depending on the type of join defined, can return more, less, or no records at all.
A Joiner can join pipelines from the same source (provided it is sorted).
A Joiner can not lookup newly inserted rows, a (dynamic) Lookup can.
A Lookup can have persistent caches, which can be used to improve the performance of succeeding jobs.
2. What are persistent caches and what are the advantages and disadvantages of using them?
A Persistent Cache is a cache file that is not deleted after a session run. This file can therefore be reused by succeeding jobs, removing the need to rebuild the cache, and thereby greatly improving performance and greatly reducing the run time.
If records are inserted or updated into the table in between session runs and the cache file isn’t rebuilt (since it is persistent), the lookups might return incorrect (i.e. old) values.
To avoid this, use dynamic lookups to update the cache whenever an insert or update is expected.
3. Is this the only use of dynamic lookups? To update persistent cache files? Also, should dynamic lookups be persistent?
Not all dynamic lookups need to be persistent. A cache can be dynamic (meaning it changes during a session run) but not persistent (meaning, it will be deleted after the session run).
Dynamic lookups, whether persistent or not, are also used in mappings that update or insert records into the table being looked up on. That way, where during the session run the table is changed, the cache files are updated correspondingly.
4. How would you handle duplicate records?
Use a Sorter and check the Distinct option.
Use an Aggregator and check the Group By fields for the keys.
The problem with the first two solutions is that the duplicate records are lost. To keep duplicate records, use a Sorter, then an Expression, then a Router. In the Expression, use the following ports: V_CURR_VALUE=VALUE then DUPLICATE_FLAG=IIF(V_CURR_VALUE=V_PREV_VALUE, 1, 0) then V_PREV_VALUE=VALUE in this order.
5. What are the types of Lookups? Lookup caches?
Connected and Unconnected Lookups. Unconnected lookups can only return one lookup port, while Connected lookups need to be connected to the pipeline.
Cache Enabled and Cache Disabled Lookups
For caches: Static (cache files are not updated during the session run) or Dynamic (cache files are updated during the session run), and Persistent (cache files are not deleted after session run) or Non-Persistent (cache files are deleted after session run).
6. How do you handle Performance issues?
Enable Collect Performance Data in the session properties window, run the job, and review the logs. Look for busy areas and bottlenecks.
Add partitions to create extra read, write, and transform threads, to speed up the entire process.
Run several “streams” of the same mapping at the same time, instead of letting one session/mapping handle the entire load.
Use persistent caches on large tables. If necessary, create a recache job to build the cache first before running the actual job.
7. What is the difference between a Source Qualifier and a Joiner?
A source qualifier can join two sources from the same database, while a Joiner can join heterogenous sources (different databases, or even flat files).
A source qualifier must be next to a source, while a Joiner can perform the join anywhere in the mapping.
A source qualifier can have a specified SQL override. A Joiner can only have pre-defined join types.
8. How would you transpose, say, a 3 by 3 table?
Use a sequence transformation to create an index and determine which row number is being analyzed.
Connect the 3 source columns to a router with the generated NEXTVAL. Create three router groups, NEXTVAL=1, 2, and 3.
Connect each output to a Normalizer with a port that occurs 3 times. Join the three Normalizers using two Joiners (sorted input) using the GK_ID field as condition.
The final output is the transposed table.
9. What is data warehousing and what are its advantages?
(source: Wikipedia) Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
Data warehouses are designed in such a way that the task of obtaining any information becomes simpler or easier.
This is done with the help of either Normalized tables, or Facts and Dimensions.
10. What are Facts and Dimensions? What is the difference between a Fact and a Dimension?
Facts are transactional tables that contain measures – data that can be analyzed, aggregated or summarized. An example would be a table that handles each sales transaction of a store. One transaction per record.
Dimensions are reference tables that contain additional information regarding a specific field. For example, a table that handles customer information (name, address, etc).
In a Star Schema, the Facts contain keys to difference Dimensions, in addition to the measures. To get the whole picture, one simply has to “look up” the Dimension tables using these keys. To complete the previous example, the Fact table would have: Customer_ID, Items_Purchased_ID, Transaction_Amount. We can then use the Customer_ID and look up the Dimension to get the customer’s name, address, etc.
In a Snowflake Schema, some Dimensions have keys that point to other Dimensions. The Facts remain the same.
Some Dimensions remain constant over a long time, that is, they change slowly. These are called Slowly Changing Dimensions (SCD), and are handled in several different ways.
11. What are the types of Slowly Changing Dimensions (SCD)?
SCD Type 0 are dimensions that contain records that do not change. Instead, new records are created, meaning, no records are updated, all new records are inserted.
SCD Type 1 are dimensions that do not keep historical data. New records are inserted if they do not exist, else, they are updated.
SCD Type 2 are dimensions that keep historical data by using extra columns (usually VERSION_START_DATE, VERSION_END_DATE, and CURRENT_INDICATOR). When a new “version” is inserted, the old version is invalidated but not replaced.
SCD Type 3 are dimensions that keep historical data by storing them in extra columns (i.e. PREV_SALARY, CURR_SALARY). This usually keeps a limited, pre-defined number of old values before they are eventually overwritten.
SCD Type 4 are dimensions that keep historical data by storing them in an extra table. New data are stored in a current, “status” table.
SCD Type 6 is a combination of types 1, 2, and 3.
No comments:
Post a Comment