Apache CarbonData Documentation

Apache CarbonData is a new big data file format for fast interactive queries. It uses advanced columnar storage, indexing, compression and encoding techniques to improve computing efficiency, which helps speed up queries by an order of magnitude over petabytes of data.

Getting Started

File Format Concepts: Start with the basics of the CarbonData file format and its storage structure. This will help you understand the other parts of the documentation, including the deployment, programming and usage guides.

Quick Start: Run an example program on your local machine or study some examples.

CarbonData SQL Language Reference: CarbonData extends the Spark SQL language and adds several DDL and DML statements to support operations on CarbonData tables. Refer to the Reference Manual to understand the supported features and functions.

Programming Guides: Read our guides on the supported Java APIs and C++ APIs to learn how to integrate CarbonData with your applications.

Integration

  • CarbonData can be integrated with popular execution engines like Spark, Presto and Hive.
  • CarbonData can be integrated with popular storage engines like HDFS, Huawei Cloud (OBS) and Alluxio.
    Refer to the Installation and Configuration section to understand all modes of integrating CarbonData.

Contributing to CarbonData

The Apache CarbonData community welcomes all kinds of contributions from anyone with a passion for a faster data format. Contributing to CarbonData doesn't just mean writing code. Helping new users on the mailing list, testing releases, and improving documentation are also welcome. Please follow the Contributing to CarbonData guidelines before proposing a design or code change.

Compiling CarbonData: This guide will help you compile CarbonData and generate the jars for testing.

External Resources

Wiki: You can read the Apache CarbonData wiki page for the upcoming release plan, blogs and training materials.

Summit: Presentations from past summits and conferences can be found here.

Blogs: Blogs by external users can be found here.

Performance reports: TPC-H performance reports can be found here.

Trainings: Training records on design and code flows can be found here.