How will you choose among the various file formats for storing and processing data using Apache Hadoop?
The decision to choose a particular file format is based on the following factors:
i) Schema evolution: the ability to add, alter, and rename fields.
ii) Usage pattern, such as accessing 5 columns out of 50 versus accessing most of the columns.
iii) Splittability, so that the file can be processed in parallel.
iv) Read/write/transfer performance versus the storage space saved by block compression.
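To make the usage-pattern factor concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Hadoop or any real file format: the table size, the chosen column indexes, and all function names are hypothetical and exist only for illustration. It counts how many stored values must be scanned to read 5 columns out of 50 from a row-oriented layout versus a column-oriented layout.

```python
# Toy model (assumption: in-memory lists stand in for on-disk layouts).

def build_rows(num_rows, num_columns):
    """Create a hypothetical table as a list of records (row-oriented)."""
    return [[r * num_columns + c for c in range(num_columns)]
            for r in range(num_rows)]

def to_columns(rows):
    """Re-arrange the same data as a list of columns (column-oriented)."""
    return [list(col) for col in zip(*rows)]

def read_from_rows(rows, wanted):
    """Row layout: every record is scanned in full, then projected."""
    touched = 0
    result = []
    for record in rows:
        touched += len(record)              # the whole record is read
        result.append([record[c] for c in wanted])
    return result, touched

def read_from_columns(columns, wanted):
    """Columnar layout: only the requested columns are scanned."""
    touched = 0
    selected = []
    for c in wanted:
        touched += len(columns[c])          # only this column is read
        selected.append(columns[c])
    result = [list(rec) for rec in zip(*selected)]
    return result, touched

rows = build_rows(num_rows=1000, num_columns=50)
columns = to_columns(rows)
wanted = [0, 7, 13, 21, 42]                 # 5 of the 50 columns

row_result, row_touched = read_from_rows(rows, wanted)
col_result, col_touched = read_from_columns(columns, wanted)

print(row_touched, col_touched)             # 50000 5000
```

Both reads return the same records, but in this model the columnar layout scans a tenth of the values. If the query needed most of the 50 columns instead, the two counts would be nearly equal and the columnar layout would offer little benefit.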
File formats that can be used with Hadoop – CSV, JSON, columnar formats, sequence files, Avro, and Parquet.
CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable not to use header and footer lines