Supported Data Processing Engines
In this guide, we explore the core concepts and capabilities of Databricks and Livy, the two external processing engines supported by DataFuse AI. Both are widely used for large-scale data processing and analytics, and understanding how they work will help you choose the right engine and get the most out of your data processing workflows.
1. Databricks Overview
What is Databricks?
Databricks is a unified analytics platform that provides an end-to-end solution for data engineering, machine learning, and analytics. Built on top of Apache Spark, it offers a collaborative environment to handle big data processing at scale. It provides cloud-based tools for data engineers and scientists to run complex computations, build machine learning models, and collaborate on data projects.
Key Concepts in Databricks
1.1 Apache Spark on Databricks
At the core of Databricks is Apache Spark, a distributed computing engine designed for processing large volumes of data. Spark enables both batch and streaming data processing.
- Batch Processing: Databricks can execute large-scale, non-real-time computations, such as data transformations and aggregations.
- Stream Processing: Databricks can also process real-time streaming data, making it suitable for time-sensitive data pipelines (e.g., fraud detection or recommendation engines).
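For illustration, here is a minimal PySpark sketch of both modes. The table name, schema, and storage path are hypothetical placeholders; on Databricks a SparkSession is already provided as `spark`, and `getOrCreate()` simply returns it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` already exists; elsewhere getOrCreate() builds one.
spark = SparkSession.builder.getOrCreate()

# Batch processing: aggregate a (hypothetical) transactions table.
daily_totals = (
    spark.read.table("sales.transactions")           # placeholder table name
         .groupBy("sale_date")
         .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.write.mode("overwrite").saveAsTable("sales.daily_totals")

# Stream processing: continuously read JSON events from cloud storage
# and print each incoming micro-batch to the console.
events = (
    spark.readStream.format("json")
         .schema("event_id STRING, event_time TIMESTAMP, amount DOUBLE")
         .load("s3://example-bucket/events/")         # placeholder path
)
query = events.writeStream.format("console").outputMode("append").start()
```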
1.2 Databricks Notebooks
Databricks Notebooks provide an interactive environment where data engineers, analysts, and scientists can collaborate. They support various languages such as SQL, Python, R, and Scala.
- Interactive Data Science: Users can write and execute code, visualize results, and document their process within the same interface.
- Collaboration: Multiple users can work on the same notebook, making it easy to share insights and build workflows together.
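A typical notebook cell mixes code, results, and visualization in one place. The sketch below assumes the hypothetical `sales.transactions` table from the previous example and uses the `display()` helper that is built into Databricks notebooks.

```python
# Run a Spark SQL query in a notebook cell, then render the result interactively.
df = spark.sql("""
    SELECT sale_date, SUM(amount) AS total_amount
    FROM sales.transactions        -- placeholder table
    GROUP BY sale_date
    ORDER BY sale_date
""")
display(df)  # interactive table with one-click charting in the notebook UI
```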
1.3 Databricks Clusters
A cluster in Databricks is a set of computational resources and configurations used to run Spark jobs. You can choose from various cluster types, including:
- Interactive Clusters: Used for running interactive data science notebooks.
- Job Clusters: Used for scheduled, production-level tasks and batch processing.
Clusters can be scaled up or down based on workload requirements, ensuring efficient resource utilization.
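Clusters can be created from the workspace UI or programmatically. The sketch below is a rough example against the Databricks Clusters REST API; the workspace URL, token, runtime version, and node type are placeholders, and the exact API version and fields may differ in your workspace, so check the Databricks REST API reference before relying on it.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

# Create a small autoscaling cluster via the Clusters API.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "datafuse-processing",
        "spark_version": "13.3.x-scala2.12",  # pick a runtime available in your workspace
        "node_type_id": "i3.xlarge",          # cloud-specific instance type
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
    timeout=30,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```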
1.4 Databricks Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It allows you to build reliable, scalable, and high-performance data lakes.
- ACID Transactions: Ensures data consistency and reliability.
- Schema Enforcement and Evolution: Validates the schema of incoming writes and supports explicit, opt-in schema evolution for ingested data.
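A short PySpark sketch of these two properties is shown below; the table path is a placeholder and the tiny DataFrames exist only for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small DataFrame as a Delta table (the path is a placeholder).
df = spark.createDataFrame([("2024-01-01", 120.0)], ["sale_date", "total_amount"])
df.write.format("delta").mode("overwrite").save("/mnt/lake/daily_totals")

# Read it back; Delta's ACID guarantees mean readers always see a consistent,
# committed snapshot of the table.
snapshot = spark.read.format("delta").load("/mnt/lake/daily_totals")

# Schema enforcement rejects appends with unexpected columns unless you
# explicitly opt in to schema evolution with mergeSchema.
extra = spark.createDataFrame(
    [("2024-01-02", 80.0, "EUR")], ["sale_date", "total_amount", "currency"]
)
(extra.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save("/mnt/lake/daily_totals"))
```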
How to Use Databricks for Data Processing
- Create a Databricks Workspace: Set up your Databricks workspace in a cloud environment (Azure, AWS, or Google Cloud).
- Create Clusters: Create clusters within Databricks to handle specific processing jobs. You can configure clusters with varying resources based on your data processing needs.
- Run Notebooks: Create and execute interactive notebooks where you can write code in Python, R, Scala, or SQL to manipulate data, perform transformations, and visualize results.
- Integrate with External Data Sources: Databricks integrates seamlessly with cloud storage (e.g., AWS S3, Azure Data Lake) and databases (e.g., PostgreSQL, MySQL).
- Utilize Delta Lake: Use Delta Lake to ensure your data lake is reliable and optimized for both batch and real-time processing.
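Putting these steps together, the following sketch reads raw files from cloud storage, pulls a reference table from PostgreSQL over JDBC, and persists the joined result to Delta. The bucket, connection details, column names, and paths are all placeholders, and the JDBC driver must be available on your cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read raw files from cloud object storage (placeholder bucket/path).
orders = spark.read.option("header", "true").csv("s3://example-bucket/raw/orders/")

# Read a reference table from PostgreSQL over JDBC (placeholder connection details).
customers = (
    spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
         .option("dbtable", "public.customers")
         .option("user", "readonly_user")
         .option("password", "<secret>")
         .load()
)

# Combine both sources and persist the result to the lake as a Delta table.
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.format("delta").mode("overwrite").save("/mnt/lake/enriched_orders")
```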
2. Livy Overview
What is Livy?
Livy is an open-source REST interface for Apache Spark that allows users to interact with a Spark cluster remotely via HTTP. It provides a simple API for submitting jobs, running queries, and managing Spark sessions, making it easier to integrate Spark with other applications.
Key Concepts in Livy
2.1 Spark Sessions in Livy
A Spark session in Livy is a connection to a Spark cluster that allows users to run Spark jobs interactively. When you create a session, Livy starts a Spark application on the cluster that manages the execution of the jobs you subsequently submit to that session.
- Interactive Jobs: Submit interactive queries to the Spark cluster via Livy’s REST API.
- Batch Jobs: Livy can submit batch jobs that run on Spark clusters without requiring direct access to the Spark environment.
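As a rough illustration, the request below creates an interactive PySpark session through Livy's REST API and then checks its state. The Livy host and port are placeholders (8998 is Livy's default port).

```python
import requests

LIVY_URL = "http://livy.example.com:8998"  # placeholder Livy endpoint

# Ask Livy to start an interactive PySpark session on the cluster.
resp = requests.post(
    f"{LIVY_URL}/sessions",
    json={"kind": "pyspark"},
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
session = resp.json()
print("Session id:", session["id"], "state:", session["state"])

# The session begins in the "starting" state; poll until it becomes "idle".
state = requests.get(f"{LIVY_URL}/sessions/{session['id']}/state", timeout=30).json()
print(state)  # e.g. {"id": 0, "state": "idle"} once the session is ready
```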
2.2 Submitting Jobs via REST API
Livy exposes a RESTful API, which allows you to submit jobs to Spark clusters remotely. Jobs can be written in Scala, Python, R, or SQL, and Livy abstracts away much of the complexity of interacting with the Spark cluster directly.
- Submit Jobs: Submit Scala, Python, R, or SQL jobs via a simple HTTP request.
- Job Management: Once a job is submitted, Livy returns an identifier (a statement or batch ID) that can be used to track the job's status and retrieve its results.
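The sketch below submits a small piece of PySpark code to an existing interactive session and polls until the result is available. The Livy endpoint and session ID are placeholders taken from the previous example.

```python
import time
import requests

LIVY_URL = "http://livy.example.com:8998"  # placeholder Livy endpoint
SESSION_ID = 0                             # id returned when the session was created

# Submit a small PySpark statement to the running session.
stmt = requests.post(
    f"{LIVY_URL}/sessions/{SESSION_ID}/statements",
    json={"code": "spark.range(1000).selectExpr('sum(id) as total').show()"},
    headers={"Content-Type": "application/json"},
    timeout=30,
).json()

# Poll the statement until it reaches a terminal state, then read its output.
while True:
    status = requests.get(
        f"{LIVY_URL}/sessions/{SESSION_ID}/statements/{stmt['id']}", timeout=30
    ).json()
    if status["state"] in ("available", "error", "cancelled"):
        break
    time.sleep(1)

print(status["output"])  # stdout / result data produced by the statement
```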
2.3 Livy Sessions and Jobs
- Session Management: Livy maintains Spark sessions, allowing you to reuse sessions for multiple jobs.
- Job Execution: Once the session is ready, you can execute one or more jobs in that session.
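Because sessions can be reused, a common pattern is to pick up an existing idle session instead of starting a new one for every job. A minimal sketch of that pattern, with a placeholder Livy endpoint, might look like this:

```python
import requests

LIVY_URL = "http://livy.example.com:8998"  # placeholder Livy endpoint

# List existing sessions and reuse an idle PySpark session if one is available.
sessions = requests.get(f"{LIVY_URL}/sessions", timeout=30).json()["sessions"]
idle = [s for s in sessions if s["kind"] == "pyspark" and s["state"] == "idle"]

if idle:
    session_id = idle[0]["id"]  # reuse an existing session
else:
    created = requests.post(
        f"{LIVY_URL}/sessions", json={"kind": "pyspark"}, timeout=30
    ).json()
    session_id = created["id"]  # fall back to creating a new one

# session_id can now be used to submit any number of statements (jobs).
print("Using Livy session", session_id)
```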
2.4 Authentication and Security
Livy supports multiple authentication mechanisms to secure access to Spark clusters, including:
- None: No authentication (typically used for local or unsecured setups).
- NGINX: Reverse proxy authentication using NGINX.
- LDAP: Integration with LDAP for user-based authentication.
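When Livy sits behind an NGINX reverse proxy that enforces HTTP Basic authentication (for example, backed by LDAP), every request must carry credentials. The details depend entirely on how your proxy is configured; the snippet below is only a sketch with placeholder host and credentials.

```python
import requests
from requests.auth import HTTPBasicAuth

LIVY_URL = "https://livy.example.com"  # placeholder: Livy behind an NGINX proxy

# The proxy validates the credentials before forwarding the request to Livy.
resp = requests.get(
    f"{LIVY_URL}/sessions",
    auth=HTTPBasicAuth("alice", "<password>"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```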
How to Use Livy for Data Processing
- Set Up Livy: Install Livy on your Spark cluster (either on-premises or in the cloud). Ensure that Livy is configured with the necessary authentication mechanisms (NGINX, LDAP) as per your organization’s requirements.
- Start a Livy Session: Use the Livy API to create a session. This session will allow you to run Spark queries and jobs remotely.
- Submit Spark Jobs: Once the session is established, you can submit Spark jobs (written in Scala, Python, R, or SQL) through the Livy API.
- Track Job Progress: Use the returned statement or batch ID to poll job status through the Livy API as the job executes.
- Retrieve Results: After a job completes, retrieve the results from Livy’s API. Livy stores job results that can be accessed via HTTP requests.
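Besides the interactive session flow shown earlier, self-contained applications can be submitted as batch jobs. The sketch below submits a batch via the `/batches` endpoint, polls its state, and fetches the driver log; the application file path and arguments are placeholders, and the file must be reachable by the cluster (e.g. on HDFS or object storage).

```python
import time
import requests

LIVY_URL = "http://livy.example.com:8998"  # placeholder Livy endpoint

# Submit a batch job: Livy launches the application on the Spark cluster.
batch = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///apps/datafuse/etl_job.py",  # placeholder, must be visible to the cluster
        "args": ["--date", "2024-01-01"],
        "name": "datafuse-etl",
    },
    headers={"Content-Type": "application/json"},
    timeout=30,
).json()

# Track the batch until it reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state", timeout=30).json()
    if state["state"] in ("success", "dead", "killed"):
        break
    time.sleep(5)

# Fetch the last few driver log lines for inspection.
log = requests.get(f"{LIVY_URL}/batches/{batch['id']}/log", timeout=30).json()
print("Final state:", state["state"])
print("\n".join(log["log"][-20:]))
```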
Comparison: Databricks vs. Livy
| Feature | Databricks | Livy |
|---|---|---|
| Platform Type | Managed cloud platform | Open-source REST interface for Spark |
| Cluster Management | Fully managed clusters with auto-scaling | User-managed Spark clusters |
| Ease of Use | Highly user-friendly with a web interface (Notebooks) | Requires more setup and interaction via APIs |
| Supported Languages | Python, Scala, R, SQL (plus libraries such as MLlib and TensorFlow) | Python, Scala, R, SQL |
| Real-Time Streaming | Supports real-time stream processing with Structured Streaming | Streaming jobs can be submitted to the cluster, but Livy adds no streaming-specific features |
| Integration | Integrates with cloud platforms (AWS, Azure, GCP), databases, and data lakes | Integrates with existing Spark clusters |
| Authentication | Supports OAuth, personal access tokens, and Azure Active Directory | Supports None (unsecured), NGINX reverse proxy, and LDAP |
| Data Lake Support | Seamless integration with cloud-based data lakes (e.g., Delta Lake) | No built-in data lake support; relies on external configuration |
Conclusion
Both Databricks and Livy are powerful tools for big data processing, offering different capabilities and configurations for users.
- Databricks is a fully-managed platform offering easy-to-use tools for data engineering, machine learning, and analytics, while also handling all the infrastructure concerns for you.
- Livy is a flexible, open-source interface for remotely submitting Spark jobs to existing Spark clusters, allowing greater control and integration possibilities, though requiring more setup.
As a DataFuse AI user, leveraging these engines allows you to process large datasets, perform advanced analytics, and build complex machine learning models with ease. Whether you choose Databricks for a managed, fully integrated solution or Livy for more control over your Spark cluster, both engines can significantly enhance your data processing workflows.
Need Help Setting Up Your Engine?
Follow the steps outlined in our Quick Start Guide - Setup Your First Engine to set up your first engine using Databricks or Livy.