Overview:
Informatica Big Data Management enables companies to process large, diverse, and fast-changing data sets so they can gain clear insight into the data. Use Big Data Management to perform big data integration and transformation without writing or maintaining external code.
Use Informatica BDM to collect diverse data faster, build business logic in a visual environment, and eliminate hand-coding to gain insights from the data.
Consider implementing a big data project in the following scenarios:
- The volume of data that you want to process is greater than 10 terabytes.
- You need to analyze or capture data changes in microseconds.
- The data sources are varied and range from unstructured text to social media data.
We can perform run-time processing in the native environment or in a non-native environment. The native environment is the Informatica domain where the Data Integration Service performs all run-time processing. Use the native run-time environment to process data that is less than 10 terabytes. A non-native environment is a distributed cluster outside of the Informatica domain, such as Hadoop or Databricks, where the Data Integration Service can push run-time processing. Use a non-native run-time environment to optimize mapping performance and process data that is greater than 10 terabytes.
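As a rough illustration of that sizing guideline, the choice can be expressed as a simple rule. This is only a sketch mirroring the guideline above; the function name and threshold constant are assumptions, not an Informatica API, and Big Data Management itself applies the choice through mapping run-time properties.

```python
# Illustrative sketch of the run-time environment guideline described above.
# Not an Informatica API; BDM selects the environment via mapping properties.
def suggest_runtime_environment(data_volume_tb: float) -> str:
    if data_volume_tb < 10:
        # Smaller data sets run natively on the Data Integration Service.
        return "native"
    # Larger data sets are pushed to a non-native cluster such as Hadoop or Databricks.
    return "non-native"

print(suggest_runtime_environment(2))   # native
print(suggest_runtime_environment(50))  # non-native
```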
Advantages:
In the vast data ecosystem of Hadoop and NoSQL, attention has shifted to big data processing engines. There are now many big data engines, which makes it hard for organizations to choose the right processing engine for their big data integration requirements. Spark is a newer technology that enables use cases such as machine learning and graph processing and provides scale for big data.
However, our latest Big Data Management product shows 2-3 times faster performance than Spark for batch ETL processing by using Informatica Blaze on YARN.
- Performance: Performance is an important factor to assess for any big data processing engine. Consider the following aspects:
– Concurrency: Running many jobs concurrently is routine in data integration scenarios. Spark has some limitations with respect to concurrent job execution, so any data integration tool that packages only Spark will run into issues.
– Memory use: There are use cases for in-memory processing engines such as Spark, but not every use case is a fit for Spark processing. Spark needs a lot of memory because it loads data into memory and keeps it there for caching until it is explicitly released. Spark is well suited to use cases such as interactive queries and machine learning workloads that pass over the same data many times (the sketch after these points illustrates both the caching behavior and a common concurrency setting).
– Performance benchmark: TPC benchmarks are the best way to provide a vendor-neutral assessment of performance and the price-to-performance ratio.
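A minimal PySpark sketch of the two points above, assuming a hypothetical Parquet data set at /data/events with event_date and status columns. This is generic Spark usage, not Informatica BDM code: the FAIR scheduler setting is one common way to let concurrent jobs share executors, and cache() shows the kind of in-memory reuse that suits iterative workloads.

```python
from pyspark.sql import SparkSession

# Generic Spark sketch, not Informatica BDM code.
# FAIR scheduling lets concurrent jobs share executors instead of
# queueing strictly first-in, first-out.
spark = (
    SparkSession.builder
    .appName("caching-and-concurrency-sketch")
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)

# Hypothetical input path and columns.
events = spark.read.parquet("/data/events")

# cache() keeps the DataFrame in executor memory after the first action.
# This pays off only when the same data is scanned many times
# (interactive queries, iterative machine learning); otherwise it just
# consumes memory.
events.cache()

# Two passes over the same cached data.
daily_counts = events.groupBy("event_date").count()
error_counts = events.filter("status = 'ERROR'").groupBy("event_date").count()

daily_counts.show()
error_counts.show()

# Release executor memory once the repeated passes are done.
events.unpersist()
```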
- Layer of abstraction: The development of YARN for resource management and the migration from the MapReduce programming framework to Spark as the new processing engine are both recent developments. If a software vendor supports an abstraction layer over the processing engine, you can future-proof your big data management platform against changing big data technologies (see the sketch after this list).
- Extent of functionality:
– Big data is much more than just big data integration.
– For a big data project to succeed, a software vendor must talk to you about the big data management framework, which consists of big data integration, big data governance, and big data security.
– Breadth of functionality is a key factor when talking to vendors.
– Look for software that provides functionality for your entire company, from the business analyst who needs to profile data on Hadoop or provide data quality rules and governance for the big data landscape, to the developer who needs to parse complex files or build complex transformations.
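To make the abstraction-layer point concrete, here is a minimal, hypothetical sketch: mapping logic is written once against an engine-agnostic interface, so the underlying execution engine (MapReduce, Spark, Blaze, or whatever comes next) can be swapped without rewriting that logic. The class, method, and mapping names are illustrative assumptions, not Informatica APIs.

```python
from abc import ABC, abstractmethod

# Hypothetical illustration of an execution-engine abstraction layer.
# None of these class names are Informatica APIs.
class ExecutionEngine(ABC):
    @abstractmethod
    def run(self, mapping_name: str) -> None:
        ...

class SparkEngine(ExecutionEngine):
    def run(self, mapping_name: str) -> None:
        print(f"Submitting {mapping_name} to the Spark engine")

class BlazeEngine(ExecutionEngine):
    def run(self, mapping_name: str) -> None:
        print(f"Submitting {mapping_name} to the Blaze engine on YARN")

def execute_mapping(engine: ExecutionEngine, mapping_name: str) -> None:
    # The caller never names a specific engine, so the engine can be
    # swapped without touching the mapping logic.
    engine.run(mapping_name)

execute_mapping(SparkEngine(), "m_customer_load")
execute_mapping(BlazeEngine(), "m_customer_load")
```

Swapping BlazeEngine for SparkEngine requires no change to execute_mapping, which is the kind of future-proofing an abstraction layer provides as engines evolve.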
Unfortunately, that Spark benchmark was misleading for the following reasons:
– The benchmark compares a disk I/O-intensive MapReduce engine against the in-memory Spark engine on a single memory-optimized Amazon instance (VM).
– The use case was executed with only 12 million records on a cluster with 4 CPUs, 30.5 GB of memory, and 200 GB of storage, which is hardly representative of a real-world big data environment.