Optimization Theory - sufiyan63/SAP-Hana-Cloud GitHub Wiki

Optimization Theory


  • Monitoring unfolding

    • Use Analyze ⇒ Explain plan in SQL console
  • Transparent Filter

    • Again used in aggregation node to get accurate results
    • Store A has product hat and sales 2000
    • Store B has product hat and sales 1000
    • If we request for number of unique product from store A and B
    • We get result 2 which is incorrect. The result should be 1
    • Select the Store Column in mapping section and enable Transparent Filter.
    • The result is one which is correct.
  • CHECK_ANALYTICAL_MODEL

    • tips to improve the optimization of object
    • Procedure with parameter
    • 'list'
    • schema name
    • object name
    • out
    • call CHECK_ANALYTICAL_MODEL('list','schema','object',?)
  • Optimize Join

    • Optimize Join Columns Option

      • In SAP HANA is a configuration setting used in calculation views to improve performance when joins are executed on specific columns. When this option is enabled for a join column, it provides the SAP HANA engine with additional metadata about the column, such as its usage frequency or distribution. This helps the engine optimize query execution plans.
      • Steps:
      • Open Calculation View Editor
      • Select Join Node
      • Select the join column from one of the tables in the join.
      • Right-click the column and look for the option Optimize Join Column or a similar setting in the column properties.
      • By default the joint field like employee_ID, Adress_ID is included in the aggregation even if not requested by the query
      • When the Optimize Join Columns option is active, pruning of join columns between two data sources, A and B, occurs when all four following conditions are met:
      • The cardinality on B side (the side of the join partner from which no column is requested) is ..1
      • The join column from A is NOT requested by the query.
      • Only columns from one join partner, A, are requested.
      • The join type is Referential, Outer or Text
      • No measures at all are requested or only measure with count distinct aggregation
    • Greedy Join Pruning

      • If you want to prune the joints at any cost irrespective of the joint type and cardinality But force Pruning by left, right and both
      • It allows the SQL optimizer to prune a joined source if it is not queried at all (no column requested), regardless of the cardinalities and join type settings.
      • When greedy pruning is enabled, it does not prevent other join pruning mechanisms. Therefore, even if a condition is not met for greedy join pruning (for example, the Greedy Pruning is Left but you request columns from the left table of the join), join pruning could still happen because the cardinalities, join type and so on, allow it.
      • The results might be affected you need to thoroughly test it
      • Select the join link
      • Select the join type ⇒ inner
      • Select cardinality
      • Enable Greedy pruning ⇒ both sides
      • Another way which takes high priority
      • Select the semantics node
      • Select view properties
      • Select advanced tab
      • Select Execution hint
      • Greedy Pruning both sides
      • Greedy Pruning at Different Levels
  • Parallelization

    • SAP Hana Cloud optimizes performance through automatic parallelization
    • Forced Parallelization
    • Use "Partition local Execution" flag in calculation views select the table node in mapping tab
    • Data Source must be a table and not a view
    • Only one parallelization block per query
    • Start with table defined node – Set the flag for Partition local Execution
    • Perform aggregation and calculation in the middle node
    • End with Union node, Set the flag for Partition local Execution, Indicating termination of parallelism, Note Union node should have only one data source
  • Performance analysis

    • Button in Graphical calculation view editor
    • In each node you will get performance analysis tab
    • Identify settings leading to poor performance during calculation view development
    • React to design time warnings and information to optimise performance
    • Access detailed information on performance analysis tab for all nodes except Symantec node
    • Information Provided

      • Table types (row or column)
      • Joint types (inner or outer) and join cardinality
      • Table partitioning details
      • Clear warnings for very large table
      • missing or misaligned cardinalities
      • Consider partitioning and applying filters for large tables to process only necessary data
  • Plan and SQL Analyzer

    • Provides brief info about operators and related information
    • Analyse performance of SQL queries, including those generated by calculation views
    • gain insights into query execution type, step duration, bottlenecks, parallelization, and processing engines
    • Provide views to analyse SQL queries. Drill down into query execution, timeline and table usage
    • Requires installation of SAP Hana Performance Tools Extension in sap business application studio
    • Select the .hdbcalculationview file and open SQL editor
    • Select Hana SQL analyzer tab in the left pane
    • Click on Analyze and select Generate SQL Analyzer file
    • The generated file is available in the main Hana instance
    • SAP Hana cloud central ⇒ Database explorer ⇒ DBADMIN
    • Expand Database Diagnostic Files ⇒ Expand host ⇒ other ⇒ download the generated file
    • Import that in BAS ⇒ open in SQL analyzer
  • Snapshots

    • Store calculation view result for history or performance reason
    • Take snapshots of data at different point of time
    • Semantics ⇒ View properties ⇒ Snapshots
    • New query name ⇒ create snapshot after deployment (will create snapshot table) ⇒ keep snapshot during redeployment
    • Deploy the snapshot view individually to generate the snapshot table
    • Deploy the calculation view to generate the snapshot table
    • You can also find the procedures generated for the snapshot table (Insert, Truncate, drop, delete)
    • execute snapshot manually or scheduled for regular updates
    • Query snapshot table for historical data while using calculation view for real time data
  • Best Practices for Modeling

    • Always maintain accurate cardinality
    • Try to reduce the number of join columns
    • Avoid joining on calculated columns
    • Avoid type conversions at runtime
    • Avoid mixing serious CDS views with calculation views, because both are performed on the different engine
    • Break down large models into individual calculation views so that they are easier to understand and also allow you to define reusable components, thus avoiding redundancy and duplicated maintenance as a consequence.
    • For performance issues, we generally recommend that you create dedicated calculation views or tables to generate the list of help values for variables and input parameters.
    • Setting up and maintaining replication is possible by executing SQL statements in the SQL Console. But this approach means that the objects that are created in the database will be owned by your user ID and this can lead to problems when others need to take over the objects. Also, transportation of the database objects isn’t possible.
    • Best practice is to develop the objects using source files and then build/deploy to generate the database artifacts in an HDI container.
  • Column Pruning

    • Do not map columns that are not needed
    • Avoid adding too many columns which can lead to large granular data sets when only aggregated data is needed
  • Switch Calculation View Expressions to SQL

    • For SAP HANA on-premise, it is possible to write expression using either plain SQL or the native HANA language called column engine language. SAP recommends that you only use SQL and not column engine language to ensure best performance (Column Engine language limits query unfolding).
    • When you build calculation view expressions on SAP HANA Cloud, you can only choose SQL expression language. Column engine language is not available in SAP HANA Cloud.
    • Calculation views that were imported to SAP HANA Cloud from SAP HANA on-premise and include column engine language expressions will still run, but you should change the expression language to SQL from the column engine, to achieve better performance. When you select SQL language, the language selector is then grayed out so that you cannot return to Column Engine language.
    • You should be able to recreate column engine expressions to plain SQL. For example, the concatenate operator + is replaced with the plain SQL operator || or you can use the CASE keyword instead of If…Then.
  • Optimal Aggregations

    • Always reduce the data set by introducing aggregations early in the modelling stack.
    • If you consume a calculation view that produces attributes, do not change them to measures and vice versa.
    • Choose a cube with star join to build dimensional OLAP models. Do not try to create OLPA models using standard join nodes which creates a relational model. These may not perform as well.
  • Performance Analysis

    • You switch your calculation view to Performance Analysis mode by pressing the button that resembles a speedometer in the toolbar. When you do this, the Performance Analysis tab appears for all nodes except the semantics node. Choose this tab to view detailed information about your calculation view, in particular the settings and values that can have a big impact on performance.
    • Examples of the information presented on the Performance Analysis mode tab include the following:
    • The type of tables used (row or column)
    • Join types used (inner, outer, and so on)
    • Join cardinality of data sources selected (n:m and so on)
    • Whether tables are partitioned, and also how the partitions are defined, and how many records each partition contains
    • The size of tables with clear warnings highlighting the very large tables
    • Missing cardinalities or cardinalities that do not align with the current data set
    • Joins based on calculated columns
    • Restricted columns based on calculated columns
    • This information enables you to make good decisions that supports high performance. For example, if you observe that a table you are consuming is extremely large, you might want to think about adding some partitions and then apply filters so that you process only the partitions that contain data you need.
  • Controlling Parallelization in a Data Flow

    • SAP HANA Cloud always attempts to automatically apply parallelization to queries in order to optimize performance
    • Within a calculation view, it is possible to force parallelization of data processing by setting a flag Partition Local Execution to mark the start and also the end of the section of a data flow where parallelization is required. The reason you do this is to improve the performance of a calculation view by generating multiple processing threads at specific positions in the data flow where performance bottlenecks can occur.
    • The parallelization block begins with a calculation view node that is able to define a table as a data source. It is not possible to use any other type of data source such as a function or a view.
    • In the Properties of the chosen start node, a flag Partition Local Execution is set to signal that multiple threads should be generated from this point onwards. It is possible to define a source column as a partitioning value. This means that a partition is created for each distinct value of the selected column. For example, if the column chosen was COUNTRY, then a processing thread is created for each country. Of course, it makes sense to look for partitioning columns where the data can be evenly distributed. The partitioning column is optional. If it is not selected, then the partitioning defined for the table is used.
    • To end the parallelization block, you use a union node. However, unlike a regular union node that would always have at least two data sources, a union used to terminate the parallel processing block is fed only from one data source. The union node combines the multiple generated threads but the multiple inputs are not visible in the graphical editor and so the union node appears to have only one source from the node below.
    • In the Properties of the union node, a flag Partition Local Execution is set to signal the ending of the parallelization block.
    • Only one parallelization block can be defined per query. This means that you cannot stop and the start another block either in the same calculation view or across calculation views in the complete stack.
    • You cannot nest parallelization blocks, for example, start a parallelization block then start another one inside the original parallelization block.
    • It is possible to create multiple start nodes within a single parallelization block by choosing different partitioning columns for each start node. If you create multiple start nodes, then all threads that were generated are combined in a single ending union node.
    • To check the partitioning of the data, you can define a calculated column within the parallelization block with the simple column engine expression partitionid(). In the resulting data set, you will then see a partition number generated for each processing thread.
  • Table Partitioning

    • it is also possible to subdivide the rows of column tables into separate blocks of data.
    • Partitioning only applies to column store tables and not row store tables. This is why it is recommended that you define tables as a column store that will be used for querying large data sets. Tables as a column store that will be used for querying large data sets.
    • Generating partitions is usually the responsibility of the SAP HANA Cloud administrator. There are many decisions to be made relating to partitions including the type of partition. For example, hash, round robin, or range. Partitions can also include sub-partitions. The administrator uses monitoring tools to observe the performance of partitions and makes adjustments. What the modeler will need to do is to provide the information relating to the use of the data – for example, how queries will access the data so that the partitions can be defined optimally.
  • Caching View

    • Reduce system load and speedup query execution by caching complex calculation view results
    • Decreased CPU and memory consumption
    • Applied to highly processed and reduced data example aggregations and filtering
    • Caching shouldn't be applied to raw data but to data that has been highly processed and reduced through aggregations and filtering. In practice, this means applying caching to the top-most nodes of a calculation view.
    • Cache can only be used for calculation views that don't check for analytic privileges. This means analytic privileges should be defined on the top-most view only, in a calculation view stack. The optimal design would be to define the cache setting at the highest calculation view possible in the stack, but not at the very top where analytic privileges are checked. This means that the user privileges are always checked against the cached data, if the cache is useable.
    • Firstly, calculation view caching can be defined at the column level. This means queries that use only a sub-set of the cached columns can be served by the cache. It's recommended to cache only the columns that are frequently requested in order to reduce memory consumption and speed up cache refresh.
    • If you don't specify any columns in the cache settings, then all columns are cached.
    • The cache size can be further reduced by defining filters in the cache settings. Queries that use either exactly the cache filters, or a subset of the filters, are served by the cache. Cache is only used if the query filter matches exactly the filter that is defined for the cache, or reduces the data further.
    • It's currently not possible to specify that cache should be invalidated if new data is loaded to the underlying tables. This feature is expected to come later. For now, we just have a time-based expiry (in minutes).
    • Semantics ⇒ Static cache
    • Firstly, the basic setting Enable Cache must be set in the calculation view to be cached.
    • Then, the query must be able to be fully-unfolded. This means that the entire query must be fully translatable into a plain SQL statement and not require the calculation of interim results. You can use the Explain Plan feature of the SQL Analyzer to check if unfolding can occur.
    • There must be no calculations that depend on a specific level of calculated granularity as the cache is stored at the level of granularity requested by the first query. This may not align to the aggregation level required by subsequent queries and could cause incorrect results.
    • Hint is added to your SQL Query:
    • SELECTWITH HINT (RESULT_CACHE)
    • A database hint is used in the top-most view, which consumes to-be-cached view
    • View A: A complex calculation view that processes raw data.
    • View B: Another calculation view that aggregates and filters data from View A.
    • View C: The top-most view that uses the results from View B.
    • Enable caching for View B to store its results.
    • Use a database hint in View C to utilize the cached results of View B.
    • Configuration parameter is set:
    • indexserver.ini -> [result_cache] -> enabled
  • Calculation View Snapshots

    • To store historical values. Or you could need performance but not on real-time data, just on a latest snapshot of the system taken in the morning, for example.
    • First, you need to deploy your calculation view to generate the snapshot table.
    • Procedures are also generated to truncate the snapshot table or to trigger insertion of data.
    • You can execute the snapshot manually or schedule it to get new data regularly.
    • You can then query the snapshot table to get historical data.
    • You can always continue using the calculation view as usual to get real-time data.
    • Steps
    • Open your calculation view
    • Go to the view properties tab and select the snapshot tab
    • Add a new query
    • Define the SELECT statement for the query
    • Give me a name to your query
    • Decide if you want to provision the table or deployment
    • Decided which mode you want the procedures to be created in
    • Decide if you want an interface view
    • INSERT
    • The INSERT procedure can be used at any time to execute the query and store the results into the snapshot table.
    • If you call the procedure to insert data twice, without truncating data, you'll have duplicates. So first call TRUNCATE Procedure and then execute INSERT
    • If you don't want to delete data between two deployments, you can use the Keep Snapshot During Re-deployment option.
    • Interface View

      • Additional calculation view is generated which is a union of snapshot and original calculation view
      • Additional Column called 'SOURCE' is generated in the Interface View which has constant values 'BASE' and 'SNAPSHOT'
      • Additional parameter called 'I_SOURCE' is added to the interface view which is a static list of filter values
      • Additional Filter expression is added to filter the source
  • MDS Cubes for a Calculation View

    • store pre-processed data in a format that can serve all these complex drill-down queries and allows to quickly answer these most frequently required drilldown situations
    • You can define multiple MDS Cubes for the same calculation view. Deployment of the calculation view builds the view and generates all MDS cubes that are defined for it, initially without data.
    • the data is not stored as a flat datastructure, but in a multidimensional, star-join-like structure that is optimized for tailored data access.
    • When the source data changes after populating the MDS cube, the loaded data is not automatically updated.
    • Analytic privilege settings of a calculation view are applied automatically also to all MDS Cubes that apply to within the calculation view
    • An API is available to manage and maintain MDS Cubes. This API implements an extended CRUD (Create, Read, Update, Delete) service, which can be used by both applications using MDS Cube and directly by the customer.
    • CALL MANAGE_MDS_CUBE('
      {
      "Cube": {
      "Command": "Reload",
      "Target": {
      "DataSource": {
      "SchemaName": "LIQUID_HORIZON",
      "ObjectName": "model.mdscube::LIQUID_HORIZON_LTD/mdscube/LIQUID_HORIZON_MDS_CUBE"
      }
      }
      }',?);
    • For monitoring, check the SYS_EPM.MDSMETADATA table
    • Limitations
    • MDS Cubes cannot contain calculated measures. Instead, you can generate the calculation in SAC.
    • MDS Cubes support variables, but no input parameters
    • Joins with two or more join condition fields are not supported.
    • VARBINARY columns can not be used.
    • Authorizations
    • To load or delete an MDS cube, you need the EXECUTE privilege for procedure MANAGE_MDS_CUBE. You can grant it as shown in the code snippet:
    • GRANT EXECUTE ON MANAGE_MDS_CUBE TO <DB_USER>;
    • Additionally, you need privilege CREATE ANY for the HDI schema in which the Calculation view of the MDS Cube resides. Moreover, you must have SELECT privileges for the data source of the MDS Cube.
  • Query Unfolding and folding

    • Query unfolding is when SAP HANA ignores the pre-designed structure of the calculation view and then runs the query directly on those raw tables instead of using the calculation view’s built-in logic.
    • A benefit of unfolding to SQL is that only the SQL engine is needed and no other SAP HANA engine is called, such as the Calculation Engine or other SAP HANA Cloud engines such as Spatial or Graph
    • If a column view is used as a data source instead of a table, then you know this is not an unfolded query as the result has been materialized as an interim view and not read directly from a table. A completely unfolded query only accesses tables.
    • The calculation view is designed to show a summary, like “how many activities each student does.” But suppose you run a query asking for something very specific, like “all students with a particular activity on a specific date,” and the calculation view doesn’t have that info pre-organized. SAP HANA might unfold the query. It goes straight to the raw Students and Enrichment tables, joins them (e.g., matching STUDENTS_ID = STU_ID), and filters the data step by step.
    • Query folding is when SAP HANA uses its pre-designed, optimized structure to get the results. It doesn’t break things down to the raw tables—it trusts the “recipe” already built into the view.
    • Using the same calculation view with Students and Enrichment tables, let’s say you want to know “how many activities each student does.” The calculation view might already have this summary pre-calculated or optimized. When you run this query, SAP HANA folds it—meaning it stays within the calculation view (e.g., one called exampleNoKAnonymity) and quickly gives you the answer without touching the raw tables.
⚠️ **GitHub.com Fallback** ⚠️