How will you choose among the various file formats for storing and processing data using Apache Hadoop?



The choice of a particular file format depends on the following factors:

i) Schema evolution: whether fields can be added, altered, or renamed.

ii) Usage pattern: for example, accessing 5 columns out of 50 versus accessing most of the columns.

iii) Splittability: whether the file can be split so that it is processed in parallel.

iv) Read/write/transfer performance versus the storage space saved by block compression.
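To make the usage-pattern factor concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which a row-oriented file must scan every column of every row, while a columnar file scans only the requested columns. The function name `cells_scanned` and the row count are hypothetical, chosen only for illustration; real formats add compression, encoding, and metadata effects that this ignores.

```python
def cells_scanned(num_rows, total_cols, needed_cols, columnar):
    """Count the cells a scan touches under a simplified storage model."""
    if columnar:
        # Columnar layout: only the requested columns are read.
        return num_rows * needed_cols
    # Row layout: every column of every row is read, then filtered.
    return num_rows * total_cols


rows = 1_000_000  # hypothetical table size
row_cost = cells_scanned(rows, total_cols=50, needed_cols=5, columnar=False)
col_cost = cells_scanned(rows, total_cols=50, needed_cols=5, columnar=True)

print(row_cost // col_cost)  # 10: the row layout scans ten times as many cells
```

When most of the 50 columns are needed, the two counts converge and the columnar advantage largely disappears, which is why the access pattern matters when picking a format.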

File formats that can be used with Hadoop include CSV, JSON, columnar formats, Sequence files, Avro, and Parquet.

CSV Files

CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable not to use header and footer lines, because when the file is read line by line each of those lines would be treated as a data record.
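As a small illustration of the header and footer advice, the sketch below strips those lines from CSV text before it is handed to Hadoop. The helper `strip_header_footer` and the sample data are hypothetical, and the code assumes that no field contains an embedded newline, so every physical line is exactly one record.

```python
def strip_header_footer(text, header_lines=1, footer_lines=1):
    """Return only the data lines of a CSV document.

    Assumes one record per physical line (no quoted newlines).
    """
    lines = text.splitlines()
    end = len(lines) - footer_lines
    return lines[header_lines:end]


# Hypothetical export with one header line and one trailer line.
sample = "id,name\n1,alice\n2,bob\nTOTAL,2\n"

data_lines = strip_header_footer(sample)
print(data_lines)  # ['1,alice', '2,bob']
```

The cleaned lines can then be written out and loaded, so that every line in the stored file is a genuine record.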
