Large-scale data integration a key challenge

6th November 2015 By: Schalk Burger - Creamer Media Senior Deputy Editor

Traditional approaches to data management in organisations will have to change to deal with the challenges of big data and of incorporating distributed processing into companies’ information technology systems, says University of Arkansas at Little Rock master data management specialist Professor John Talburt.

Current paradigms and typical relational database management systems cannot effectively handle the vast volumes of structured and unstructured data that companies want to process to produce accurate information in near real time.

He says the traditional approach requires every record to be matched one-to-one against an index before further processing; with millions of records, however, no company has the sheer processing horsepower required to do this quickly.

“A single universal index is impractical for big data management, and lots of problems still have to be solved before companies can effectively manage big data. An iterative approach to indexing of data could be a potential answer and could also be used as part of distributed processing, where partial or focused processing of data happens within portions of a company’s infrastructure,” Talburt explains.
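To illustrate the idea of indexing and matching data in smaller portions rather than against a single universal index, the minimal Python sketch below partitions records into “blocks” on a cheap attribute and only compares records within each block. It is not Talburt’s own system; the records, the blocking key and the match rule are hypothetical, chosen purely to show how partitioned matching cuts down comparisons and why each block could be processed on a different part of a company’s infrastructure.

```python
from collections import defaultdict

# Hypothetical customer records; in practice these would come from many sources.
records = [
    {"id": 1, "name": "Jane Smith",  "email": "jane@example.com"},
    {"id": 2, "name": "J. Smith",    "email": "jane@example.com"},
    {"id": 3, "name": "Peter Jones", "email": "p.jones@example.com"},
]

def blocking_key(record):
    """Partition records by a cheap attribute (here: first letter of surname),
    so matching only happens inside each block, not across all records."""
    return record["name"].split()[-1][0].upper()

def is_match(a, b):
    """Toy match rule: same e-mail address. Real systems use weighted,
    multi-attribute comparisons."""
    return a["email"] == b["email"]

# Build blocks: each block can be indexed and matched independently,
# which is what makes the work easy to distribute across machines.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

matches = []
for block in blocks.values():
    for i, a in enumerate(block):
        for b in block[i + 1:]:
            if is_match(a, b):
                matches.append((a["id"], b["id"]))

print(matches)  # [(1, 2)] -- records 1 and 2 refer to the same person
```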

Distributed processing and master data management systems that can use these emerging-technology architectures are two key topics being explored in academia and research organisations, notes Talburt.

Meanwhile, he says ontology – a formal representation of concepts and the relationships between them, typically used to establish relevance between discrete topics – could be part of the answer to processing large volumes of data in traditional relational database systems.

“Ontology can enable much more effective processing, especially of unstructured data, as selection criteria would not need to be based on identical attributes of data sources, but rather on the relevance of the source in relation to the processing query. Ontology can, thus, be used to determine the relevance outside of the typical structured one-to-one relations.”
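As a rough illustration of selecting data on relevance rather than on identical attribute values, the short Python sketch below uses a hand-built mini-ontology, represented simply as a dictionary mapping terms to broader concepts. The vehicle terms are hypothetical and the approach is only a sketch of the principle, not a production ontology system.

```python
# Hypothetical mini-ontology: each term is linked to the broader concepts it
# falls under, so two records can be judged relevant to the same query
# without sharing an identical attribute value.
ONTOLOGY = {
    "sedan":     {"passenger car", "vehicle"},
    "hatchback": {"passenger car", "vehicle"},
    "lorry":     {"commercial vehicle", "vehicle"},
}

def concepts(term):
    """Return the term itself plus all broader concepts it maps to."""
    return {term} | ONTOLOGY.get(term, set())

def relevant(term_a, term_b):
    """Two values are considered relevant to each other if their concept
    sets overlap, even though the literal strings differ."""
    return bool(concepts(term_a) & concepts(term_b))

print(relevant("sedan", "hatchback"))  # True  -- both are passenger cars
print(relevant("sedan", "lorry"))      # True  -- both are vehicles
print(relevant("sedan", "bicycle"))    # False -- no shared concept
```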

Similarly, ontology-based distributed-processing and mobile-computing systems would be able to assess portions of transactional data for relevance within the context of the user or the applications in operation, thus providing coherent and usable information in near real time at these distributed points.

Parallel processing, spanning both emerging distributed architectures and conventional centralised processing of data, will be necessary to deal with the challenges of big data and business requirements. These and other emerging trends in big-data management and processing will drive changes in traditional relational database architectures and management, as more of the processing power of these systems is moved closer to the data sources instead of remaining solely centralised, as is currently the case, notes Talburt.
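To show what moving processing closer to the data can look like in practice, the minimal Python sketch below uses the standard multiprocessing module to summarise hypothetical data partitions in parallel, so that only small per-partition summaries are combined centrally. The partition contents are invented for illustration; in a real system each partition would live on, and be processed by, the node that stores it.

```python
from multiprocessing import Pool

# Hypothetical partitions of a large dataset; in a real deployment each
# partition would be processed on the node that holds it.
partitions = [
    [4, 8, 15, 16],
    [23, 42, 7, 1],
    [99, 3, 12, 30],
]

def local_summary(partition):
    """Work done 'close to the data': each worker summarises only its own
    partition instead of shipping raw records to a central server."""
    return {"count": len(partition), "total": sum(partition)}

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        summaries = pool.map(local_summary, partitions)

    # Only the small per-partition summaries are combined centrally.
    combined = {
        "count": sum(s["count"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
    }
    print(combined)  # {'count': 12, 'total': 260}
```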

Mobile computing and hive computing are examples of these trends, and changes could also include assigning processing capabilities to various portions of stored data, rather than trying to process all the data in the current “data lakes”.

“Companies will have to explore parallelism, as traditional processing and programming paradigms will not be able to provide the speed or efficiency required,” says Talburt, adding that the real change must happen in schools and tertiary courses to promote training in these new technologies and paradigms.