Arenadata Unveils ADB-Spark Connector for Data Exchange between Greenplum and Spark

Arenadata has released ADB-Spark Connector, a new product to exchange data between Arenadata DB (ADB, a Greenplum-based analytical MPP DBMS) and Apache Spark (a distributed data processing framework, which is part of the Hadoop ecosystem).
ADB-Spark Connector is designed for fast, parallel data transfer between Spark and Arenadata DB. Previously, such export and import tasks were handled, to some extent, with the Greenplum Platform Extension Framework (PXF).

The connector is built for Scala 2.11.x and 2.12.x with Twitter Finagle and ScalikeJDBC, and exchanges data through an HTTP server that implements the gpfdist protocol. Unlike other existing ADB exchange methods, it enables parallel writing to Greenplum segments without Master participation, supports flexible partitioning when reading data from Greenplum into Spark, and does not require installing the gpfdist utility on each Spark node, among other advantages.

The gpfdist protocol is served with the Finagle framework, which performed better than the initially selected Akka HTTP when ADB segments open multiple simultaneous sessions.
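
The announcement does not describe the connector's partitioning methods in detail, so the following is only a minimal Python sketch of two generic ways a table read can be split into independent, non-overlapping partitions. The function names are hypothetical and are not part of the connector. The first relies on gp_segment_id, the Greenplum system column that records which segment stores a row; the second splits a numeric column into contiguous ranges.

```python
from typing import List


def segment_predicates(num_segments: int) -> List[str]:
    """Return one WHERE predicate per Greenplum segment.

    Assumes segment content ids run from 0 to num_segments - 1.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    return [f"gp_segment_id = {seg}" for seg in range(num_segments)]


def range_predicates(column: str, lower: int, upper: int,
                     num_partitions: int) -> List[str]:
    """Split the half-open interval [lower, upper) into contiguous ranges.

    Every value in the interval matches exactly one predicate.
    """
    if num_partitions < 1 or upper <= lower:
        raise ValueError("need num_partitions >= 1 and upper > lower")
    total = upper - lower
    # Integer boundaries; the last boundary is always exactly `upper`.
    bounds = [lower + (total * i) // num_partitions
              for i in range(num_partitions + 1)]
    # Skip empty ranges, which appear when there are more partitions than values.
    return [f"{column} >= {lo} AND {column} < {hi}"
            for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


if __name__ == "__main__":
    for predicate in range_predicates("id", 0, 10, 3):
        print(f"SELECT * FROM sales WHERE {predicate}")
```

Each predicate could back one Spark partition, so the partitions can be fetched in parallel without reading the same row twice. The table name sales and the column id are placeholders.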

ADB-Spark Connector main functions include:

  • Reading data from Greenplum to Spark with various partitioning methods supported;
  • Writing data from Spark to Greenplum using several write modes: Append, Overwrite, and ErrorIfExists;
  • Push-down operator support;
  • Extracting additional metadata from Greenplum, including statistics and data distribution schemes;
  • Automatic data schema generation;
  • Optimizing the count aggregate function execution.
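
The three write modes listed above share their names with Spark's standard save modes. As a rough illustration of how they differ, here is a minimal Python sketch that models a database as an in-memory dictionary. The write function and the dictionary model are assumptions made for illustration; the sketch shows only the visible outcome of each mode, not how the connector carries it out.

```python
from typing import Dict, List, Tuple

Row = Tuple
Tables = Dict[str, List[Row]]


def write(tables: Tables, name: str, rows: List[Row], mode: str) -> None:
    """Apply `rows` to the table `name` according to the given write mode."""
    if mode == "Append":
        # Keep existing rows and add the new ones; create the table if missing.
        tables.setdefault(name, []).extend(rows)
    elif mode == "Overwrite":
        # Discard whatever was there and keep only the new rows.
        tables[name] = list(rows)
    elif mode == "ErrorIfExists":
        # Refuse to touch a table that already exists.
        if name in tables:
            raise ValueError(f"table {name!r} already exists")
        tables[name] = list(rows)
    else:
        raise ValueError(f"unknown write mode: {mode!r}")


if __name__ == "__main__":
    db: Tables = {}
    write(db, "events", [(1, "a")], "ErrorIfExists")  # creates the table
    write(db, "events", [(2, "b")], "Append")         # two rows now
    write(db, "events", [(3, "c")], "Overwrite")      # one row now
    print(db)
```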

Dmitry Pluzhnikov, Director of System Architecture Department at Arenadata

“Our solution will be useful to those customers who combine Arenadata Hadoop and Arenadata DB when building their corporate data warehouses. ADB-Spark Connector enables fast bidirectional communication between them and therefore the most efficient data reading and writing.”

Compared to the Pivotal Spark-Greenplum connector, its closest commercially available rival, ADB-Spark Connector provides more flexible partitioning (five methods instead of two), supports more data types (including interval and array), and adds extra functionality, such as support for batch mode in Spark, statistics collection for building query plans with Catalyst, and arbitrary SQL query execution through an ADB Master node.

ADB-Spark Connector currently supports Spark 2.3.x and 2.4.x. Near-term plans include adding support for Spark 3.x and implementing streaming functionality.
