Moving relational data into Hadoop isn’t a slam dunk. To avoid programming complexities, a major Texas university turned to commercial tools to stock its Hadoop data lake.
Data lakes, with all kinds of unconventional new data types, are generating buzz these days. But turning a Hadoop data lake into something useful can mean importing a data type that is neither new nor out of the ordinary: relational data.
After all, users want to combine traditionally structured data with newer unstructured data for analysis in the data lake, while also gathering enterprise data from silos beyond their own department’s boundaries.
Relational data is more familiar, but it isn’t necessarily a slam dunk for Hadoop data lake ingestion, as a large Texas school has found. The general message is that open source Hadoop components requiring special skills may not be the best way to fill a data lake with SQL data. More specifically, Apache Sqoop, a common open source tool for pulling relational data into Hadoop, can hit barriers in some organizations.
“We started out our data ingestion with Apache Sqoop, but it required that we do custom coding,” said Juergen Stegmair, who leads the database administration team at the University of North Texas (UNT).
Stegmair said that Sqoop’s command-line-based programming required considerable coding in a style unfamiliar to his staff. Coding wasn’t beyond them, but UNT’s overall aim was to “avoid custom programming as much as possible,” he said.
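For context, a single-table Sqoop import is driven entirely by command-line arguments along these lines. This is an illustrative sketch only, not UNT’s actual job: the connection string, credentials path, table and directory names are hypothetical placeholders.

```sh
# Hypothetical example of a Sqoop import of one Oracle table into HDFS.
# Every value below is a placeholder, not UNT's real configuration.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost.example.edu:1521/ORCL \
  --username etl_user \
  --password-file /user/etl/.oracle_pass \
  --table STUDENT_ENROLLMENT \
  --target-dir /data/lake/raw/student_enrollment \
  --split-by ENROLLMENT_ID \
  --num-mappers 4
```

Multiply a command like that across dozens of source tables, plus the scripting needed around each one for scheduling and error handling, and the custom-coding burden Stegmair describes becomes clearer.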
Shifting lakes of data
Shifting data to Hadoop data lakes is still a new experience for many teams, especially in IT shops within public universities, so adhering to a plan was important. That plan to achieve what Stegmair described as “a forward-looking architecture” began to germinate two years ago.
The architecture would incorporate open source Hadoop technology and add semistructured and unstructured data to the school’s analytics portfolio. One pressing issue was handling a higher velocity of data intake.
UNT, located in Denton, opted to build a Hadoop data lake using software from Hadoop distribution provider Hortonworks. Strategic planning was followed by a first-stage implementation, which UNT began in September 2016 and which focused on integrating data from existing SQL Server and Oracle databases.