100 [UPDATED] HADOOP (BIG DATA) Interview Questions and Answers for Experienced

HADOOP Interview Questions and Answers for Experienced :-


1. What is Big Data?

Any data that cannot be stored in a traditional RDBMS is termed Big Data. Most of the data we use today has been generated in the past 20 years, and this data is mostly unstructured or semi-structured in nature. More than the volume of the data, it is the nature of the data that determines whether it is considered Big Data or not.

2. What do the four V’s of Big Data denote?

IBM has a nice, simple explanation for the four critical features of big data:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data (the speed at which data arrives)
c) Variety – Different forms of data
d) Veracity – Uncertainty of data

3. How does big data analysis help businesses increase their revenue? Give an example.

Big data analysis is helping businesses differentiate themselves. For example, Walmart, the world’s largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, providing customized recommendations and launching new products based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. Many more companies such as Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase and Bank of America use big data analytics to boost their revenue.

4. Name some companies that use Hadoop.
  1. Yahoo (one of the biggest users and a contributor of more than 80% of the Hadoop code)
  2. Facebook
  3. Netflix
  4. Amazon
  5. Adobe
  6. eBay
  7. Hulu
  8. Spotify
  9. Rubikloud
  10. Twitter
5. Differentiate between Structured and Unstructured data.

Data that can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.

6. On what concept does the Hadoop framework work?

The Hadoop framework works on the following two core components:

1) HDFS – The Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.

2) Hadoop MapReduce – This is the Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform two separate tasks: a map job and a reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job, as in the sketch below.
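To make the map and reduce jobs concrete, here is a minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API. The class names, paths and the word-count logic are illustrative only and not part of the original answer.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map job: break each input line into (word, 1) key-value pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce job: combine the tuples emitted by the map job into a smaller set of tuples.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

It would typically be packaged as a jar and submitted with something like hadoop jar wordcount.jar WordCount /input /output.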

7. What are the main components of a Hadoop application?

Hadoop applications use a wide range of technologies that provide a great advantage in solving complex business problems.

Core components of a Hadoop application are-

1) Hadoop Common

2) HDFS

3) Hadoop MapReduce

4) YARN

Data Access Components are - Pig and Hive

Data Storage Component is - HBase

Data Integration Components are - Apache Flume, Sqoop, Chukwa

Data Management and Monitoring Components are - Ambari, Oozie and Zookeeper.

Data Serialization Components are - Thrift and Avro

Data Intelligence Components are - Apache Mahout and Drill.

8. What is Hadoop Streaming?

The Hadoop distribution has a generic application programming interface for writing map and reduce jobs in any desired programming language like Python, Perl, Ruby, etc. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.

9. What is the best hardware configuration to run Hadoop?

The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.

10. What are the most commonly defined input formats in Hadoop?

The most common Input Formats defined in Hadoop are:

  • Text Input Format – This is the default input format defined in Hadoop; each line of the file becomes a record.
  • Key Value Input Format – This input format is used for plain text files wherein each line is split into a key and a value (by default at the first tab character).
  • Sequence File Input Format – This input format is used for reading Hadoop SequenceFiles (binary key-value files).
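As a brief illustration, the input format is selected on the Job object. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API (the job name is illustrative and the rest of the job setup is omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "kv input");
    // TextInputFormat is the default; here each line is instead split into
    // a key and a value at the first tab character.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // ... mapper, reducer and input/output paths would be configured here ...
  }
}
```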

Hadoop HDFS Interview Questions and Answers :-


11. What is a block and block scanner in HDFS?

Block - The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).

Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.

12. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

NameNode: The NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. The NameNode uses two files for the namespace:

fsimage file – It keeps track of the latest checkpoint of the namespace.

edits file – It is a log of changes that have been made to the namespace since the last checkpoint.

Checkpoint Node-

The Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. The Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.

BackupNode:

The Backup Node also provides checkpointing functionality like the Checkpoint Node, but it additionally maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.

13. What is commodity hardware?

Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware does include RAM, because there are specific services that need to be executed in RAM. Hadoop can be run on any commodity hardware and does not require supercomputers or a high-end hardware configuration to execute jobs.

14. What are the port numbers for the NameNode, Task Tracker and Job Tracker?

NameNode 50070

Job Tracker 50030

Task Tracker 50060

15. Explain about the process of inter cluster data copying.

HDFS provides a distributed data copying facility through DistCP from a source to a destination. When this data copying takes place between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both the source and the destination to have a compatible or identical version of Hadoop.

16. How can you overwrite the replication factors in HDFS?

The replication factor in HDFS can be modified or overwritten in 2 ways:

1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below.

$ hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)

2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below.

$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
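The same change can also be made programmatically through the HDFS Java API; a minimal sketch, assuming /my/test_file already exists (the path and the new factor are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Change the replication factor of a single file to 2.
    boolean changed = fs.setReplication(new Path("/my/test_file"), (short) 2);
    System.out.println("Replication factor updated: " + changed);
    fs.close();
  }
}
```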

17. Explain the difference between NAS and HDFS.

NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines, so there is data redundancy because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computation is moved to the data.

18. Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3.

The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, ensuring high data availability. For every block that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, then only a single copy of the data will exist. Under these circumstances, if the DataNode holding that copy crashes, the single copy of the data is lost.

19. What is the process to change files at arbitrary locations in HDFS?

HDFS does not support modifications at arbitrary offsets in a file or multiple writers; files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.

20. Explain about the indexing process in HDFS.

The indexing process in HDFS depends on the block size. HDFS stores the last part of the data block, which further points to the address where the next part of the data chunk is stored.

21. What is rack awareness, and on what basis is data stored in a rack?

All the DataNodes put together form a storage area, i.e. the physical location of the DataNodes is referred to as a rack in HDFS. The rack information, i.e. the rack ID of each DataNode, is acquired by the NameNode. The process of selecting closer DataNodes based on the rack information is known as rack awareness.

The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting the NameNode, the client allocates 3 DataNodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack. This is generally referred to as the Replica Placement Policy.

Hadoop MapReduce Interview Questions and Answers :-


22. Explain the usage of Context Object.

The Context object is used to help the mapper interact with the rest of the Hadoop system. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object also holds the configuration details for the job and the interfaces through which the mapper emits its output.
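A brief mapper sketch showing typical Context usage, i.e. emitting output and updating a counter; the counter group and name are illustrative assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineLengthMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit output through the Context object.
    context.write(new Text("lineLength"), new LongWritable(value.getLength()));
    // Report application-level status by updating a counter.
    context.getCounter("AppStats", "LinesProcessed").increment(1);
  }
}
```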

23. What are the core methods of a Reducer?

The 3 core methods of a reducer are –

1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.

Function definition: protected void setup(Context context)

2) reduce() – This is the heart of the reducer; it is called once per key with the associated list of values.

Function definition: protected void reduce(Key key, Iterable<Value> values, Context context)

3) cleanup() – This method is called only once at the end of the reduce task, for clearing all temporary files.

Function definition: protected void cleanup(Context context)
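A reducer sketch showing the three core methods together; the summation logic and the comments about temporary files are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void setup(Context context) {
    // Called once before any reduce() call: read configuration, distributed cache, etc.
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Called once per key with the associated list of values.
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context) {
    // Called once after all keys are processed: release resources, delete temporary files.
  }
}
```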

24. Explain about the partitioning, shuffle and sort phases.

Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.

Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.

Partitioning Phase - The process that determines which reducer instance will receive which intermediate keys and values is referred to as partitioning. The destination partition is the same for any given key, irrespective of the mapper instance that generated it.

25. How to write a custom partitioner for a Hadoop MapReduce job?

Steps to write a Custom Partitioner for a Hadoop MapReduce Job-

  • A new class must be created that extends the pre-defined Partitioner class.
  • The getPartition method of the Partitioner class must be overridden.
  • The custom partitioner can be added to the job either as a config file in the wrapper that runs Hadoop MapReduce, or programmatically using the job's setPartitionerClass method, as sketched below.
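A minimal custom partitioner sketch that routes keys to reducers by the first character of the key; the key/value types and the routing rule are illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // The same key always maps to the same partition, regardless of which mapper produced it.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}
```

It would be registered in the driver with job.setPartitionerClass(FirstCharPartitioner.class).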

26. What is the relationship between a Job and a Task in Hadoop?

A single job can be broken down into one or many tasks in Hadoop.

27. Is it important for Hadoop MapReduce jobs to be written in Java?

It is not necessary to write Hadoop MapReduce jobs in Java; users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.

28. What is the process of changing the split size if there is limited storage space on commodity hardware?

If there is limited storage space on commodity hardware, the split size can be changed by implementing a custom splitter. The call to the custom splitter can be made from the main method.

29. What are the primary phases of a Reducer?

The 3 primary phases of a reducer are –

1)Shuffle

2)Sort

3)Reduce

30. What is a TaskInstance?

The actual Hadoop MapReduce tasks that run on each slave node are referred to as task instances. Every task instance runs in its own JVM process; by default, a new JVM process is spawned for each task.

31. Can reducers communicate with each other?

Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.

Hadoop HBase Interview Questions and Answers :-


32. When should you use HBase, and what are the key components of HBase?

HBase should be used when the big data application has –

1)A variable schema

2)When data is stored in the form of collections

3)If the application demands key based access to data while retrieving.

Key components of HBase are –

Region - This component contains the in-memory data store (MemStore) and the HFiles.

Region Server-This monitors the Region.

HBase Master-It is responsible for monitoring the region server.

Zookeeper- It takes care of the coordination between the HBase Master component and the client.

Catalog Tables - The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

33. What are the different operational commands in HBase at the record level and the table level?

Record Level Operational Commands in HBase are –put, get, increment, scan and delete.

Table Level Operational Commands in HBase are-describe, list, drop, disable and scan.

34. What is a Row Key?

Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells with the same RowKey are co-located on the same server. Internally, the RowKey is a byte array.
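A short HBase client sketch showing the RowKey passed as a byte array; the table name, column family and values are illustrative, and it assumes the HBase 1.x/2.x client API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // The row key is supplied to Put as a byte array.
      Put put = new Put(Bytes.toBytes("user#1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
    }
  }
}
```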

35. Explain the difference between RDBMS data model and HBase data model.

RDBMS is a schema-based database, whereas HBase has a schema-less data model.

RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.

RDBMS stores normalized data whereas HBase stores de-normalized data.

36. Explain about the different catalog tables in HBase?

The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

37. What are column families? What happens if you alter the block size of a ColumnFamily on an already populated database?

The logical division of data is represented through a key known as a column family. Column families form the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size. When compaction takes place, the old data also takes the new block size, so that the existing data is read correctly.

38. Explain the difference between HBase and Hive.

HBase and Hive are completely different Hadoop-based technologies: Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.

39. Explain the process of row deletion in HBase.

On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

40. What are the different types of tombstone markers in HBase for deletion?

There are 3 different types of tombstone markers in HBase for deletion-

1) Family Delete Marker - This marker marks all the columns of a column family.

2) Version Delete Marker - This marker marks a single version of a column.

3) Column Delete Marker - This marker marks all the versions of a column.

41. Explain about HLog and WAL in HBase.

All edits in the HStore are stored in the HLog. Every region server has one HLog, and the HLog contains entries for edits of all regions performed by that particular Region Server. WAL stands for Write Ahead Log; all the HLog edits are written to the WAL immediately. In the case of deferred log flush, WAL edits remain in memory until the flush period.

Hadoop Sqoop Interview Questions and Answers :-


42. How can Sqoop be used in a Java program?

The Sqoop jar should be included in the classpath of the Java program. After that, the Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop programmatically, just as they would be on the command line.
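A sketch of invoking Sqoop from Java via Sqoop.runTool(), assuming Sqoop 1.4.x on the classpath; the JDBC URL, credentials, table name and target directory are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopInvoker {
  public static void main(String[] args) {
    // The same arguments that would be passed on the command line.
    String[] sqoopArgs = new String[] {
        "import",
        "--connect", "jdbc:mysql://localhost/mydb",
        "--username", "dbuser",
        "--password", "dbpass",
        "--table", "orders",
        "--target-dir", "/data/orders"
    };
    int exitCode = Sqoop.runTool(sqoopArgs, new Configuration());
    System.exit(exitCode);
  }
}
```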

43. What is the process to perform an incremental data load in Sqoop?

The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.

Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-

1) Mode (incremental) – The mode defines how Sqoop will determine what the new rows are. The mode can have a value of append or lastmodified.

2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.

3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation.

44. Is it possible to do an incremental import using Sqoop?

Yes, Sqoop supports two types of incremental imports-

1)Append

2)Last Modified

Append should be used in the import command to insert only new rows; lastmodified should be used when existing rows may also have been updated.

45. What is the standard location or path for Hadoop Sqoop scripts?

/usr/bin/Hadoop Sqoop

46. How can you check all the tables present in a single database using Sqoop?

The command to check the list of all tables present in a single database using Sqoop is as follows-

sqoop list-tables --connect jdbc:mysql://localhost/user

47. How are large objects handled in Sqoop?

Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store-

1) CLOBs – Character Large Objects

2) BLOBs – Binary Large Objects

Large objects in Sqoop are handled by importing them into a file referred to as a “LobFile”, i.e. a Large Object File. The LobFile has the ability to store records of huge size; each record in the LobFile is a large object.

48. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?

Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.

49. Differentiate between Sqoop and distCP.

DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.

50. What are the limitations of importing RDBMS tables into HCatalog directly?

RDBMS tables can be imported into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.

HADOOP Interview Questions and Answers for Freshers and Experienced :-

51.What is throughput? How does HDFS get a good throughput?
A. Throughput is the amount of work done in unit time. It describes how fast data can be accessed from the system and is usually used to measure the performance of a system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute the tasks assigned to them independently and in parallel. This way the work is completed in a very short period of time, and HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.

52.What is streaming access?
A.As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

53.What is a commodity hardware? Does commodity hardware include RAM?
A.Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware. We don't need super computers or high-end hardware to work on Hadoop. Yes, Commodity hardware includes RAM because there will be some services which will be running on RAM.

54.What is a Namenode?
A.Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.

55.What is a metadata?
A.Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.

56.What is a Datanode?
A.Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible for serving read and write requests for the clients.

57.What is a daemon?
A.Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is services and in Dos is TSR.

58.What is a job tracker?
A.Job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task tracker. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and MapReduce Service. If the job tracker goes down all the running jobs are halted. It receives heartbeat from task tracker based on which Job tracker decides whether the assigned task is completed or not.

59.What is a task tracker?
A.Task tracker is also a daemon that runs on datanodes. Task Trackers manage the execution of individual tasks on slave node. When a client submits a job, the job tracker will initialize the job and divide the work and assign them to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with job tracker by sending heartbeat. If the job tracker does not receive heartbeat from task tracker within specified time, then it will assume that task tracker has crashed and assign that task to another task tracker in the cluster.

60.What are the benefits of block transfer?
A. A file can be larger than any single disk in the network. There's nothing that requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks also provide fault tolerance and availability: to insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

61.How indexing is done in HDFS?
A.Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

62.Is client the end user in HDFS?
A.No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or datanode (task tracker).

63.When we send a data to a node, do we allow settling in time, before sending another data to that node?
A. Yes, we do.

64.Are Namenode and job tracker on the same host?
A.No, in practical environment, Namenode is on a separate host and job tracker is on a separate host.

65.What is a heartbeat in HDFS?
A. A heartbeat is a signal indicating that a node is alive. A datanode sends a heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive a heartbeat, it concludes that there is some problem with the datanode or that the task tracker is unable to perform its assigned task.

66.If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
A. In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

67.Does hadoop always require digital data to process?
A. Yes, Hadoop always requires digital data to be processed.

68.On what basis Namenode will decide which datanode to write on?
A.As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.

69.Doesn't Google have its very own version of DFS?
A.Yes, Google owns a DFS known as Google File System (GFS) developed by Google Inc. for its own use.

70.What is a rack?
A.Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

71.On what basis data will be stored on a rack?
A.When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be stored. While placing the datanodes, the key rule followed is for every block of data, two copies will exist in one rack, third copy in a different rack. This rule is known as Replica Placement Policy.

72.Do we need to place 2nd and 3rd data in rack 2 only?
A.Yes, this is to avoid datanode failure.

73.What if rack 2 and datanode fails?
A.If both rack2 and datanode present in rack 1 fails then there is no chance of getting data from it. In order to avoid such situations, we need to replicate that data more number of times instead of replicating only thrice. This can be done by changing the value in replication factor which is set to 3 by default

74.What is a Secondary Namenode? Is it a substitute to the Namenode?
A.The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes down.

75.What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
A.In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and Passive Namenodes kind of a structure. If the active Namenode fails, passive Namenode takes over the charge.

76.What is Key value pair in HDFS?
A. A key-value pair is the intermediate data generated by maps and sent to reducers for generating the final output.

77.If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
A.No, not at all! 64 mb is just a unit where the data will be stored. In this particular situation, only 50 mb will be consumed by an HDFS block and 14 mb will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

78.Give examples of some companies that are using Hadoop structure?
A.A lot of companies are using the Hadoop structure such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.

79.What are some of the characteristics of the Hadoop framework?
A. The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce, and the infrastructure is based on Google's distributed file system (GFS). Hadoop handles large files and high data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can easily be added to it.

80.What is the meaning of speculative execution in Hadoop? Why is it important?
A. Speculative execution is a way of coping with individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines that are not performing as fast as the others. This may delay a full job because of only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.

81.What is a IdentityMapper and IdentityReducer in MapReduce?
A. org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value. org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
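A sketch showing those defaults in the old mapred API; setting the classes explicitly, as below, has the same effect as leaving them unset (the input and output paths are placeholders):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityJob.class);
    conf.setJobName("identity pass-through");
    // These are the defaults used when no mapper/reducer class is set.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    // With the default TextInputFormat, keys are byte offsets and values are lines.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```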

82.If you only had 32 megabytes of memory how would you sort one terabyte of data?
A. What most people say: I don't know (in fact, most candidates either get it right or don't). What you should say: take a smaller chunk of data and sort it in memory, so you partition the input into lots of little sorted data sets, then merge those sorted lists into one big list before writing the results back to disk. Why you should say it: any candidate who knows Hadoop at a deep level will understand the depth of what Hadoop does, Sammer believes. It is a great qualifying question and demonstrates an understanding of how you manage data at that scale.
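A compact Java sketch of that approach (an external merge sort): sort chunks that fit in memory, spill each sorted chunk to disk, then k-way merge the sorted runs. The file names, chunk size and line-based string records are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class ExternalSort {

  public static void main(String[] args) throws IOException {
    // Phase 1: sort chunks that fit in memory. Phase 2: merge the sorted runs.
    List<Path> runs = createSortedRuns(Paths.get("input.txt"), 100_000);
    mergeRuns(runs, Paths.get("sorted.txt"));
  }

  // Read the input in chunks small enough to sort in memory, spill each sorted chunk to disk.
  static List<Path> createSortedRuns(Path input, int linesPerChunk) throws IOException {
    List<Path> runs = new ArrayList<>();
    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      List<String> chunk = new ArrayList<>(linesPerChunk);
      String line;
      while ((line = in.readLine()) != null) {
        chunk.add(line);
        if (chunk.size() == linesPerChunk) {
          runs.add(spill(chunk));
          chunk.clear();
        }
      }
      if (!chunk.isEmpty()) {
        runs.add(spill(chunk));
      }
    }
    return runs;
  }

  // Sort one chunk in memory and write it out as a temporary sorted run.
  static Path spill(List<String> chunk) throws IOException {
    Collections.sort(chunk);
    Path run = Files.createTempFile("sorted-run-", ".txt");
    Files.write(run, chunk, StandardCharsets.UTF_8);
    return run;
  }

  // K-way merge of the sorted runs using a priority queue keyed on each run's current line.
  static void mergeRuns(List<Path> runs, Path output) throws IOException {
    PriorityQueue<Map.Entry<String, BufferedReader>> heap =
        new PriorityQueue<>(Map.Entry.<String, BufferedReader>comparingByKey());
    for (Path run : runs) {
      BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
      String first = reader.readLine();
      if (first != null) {
        heap.add(new AbstractMap.SimpleEntry<>(first, reader));
      } else {
        reader.close();
      }
    }
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      while (!heap.isEmpty()) {
        Map.Entry<String, BufferedReader> smallest = heap.poll();
        out.write(smallest.getKey());
        out.newLine();
        String next = smallest.getValue().readLine();
        if (next != null) {
          heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
        } else {
          smallest.getValue().close();
        }
      }
    }
  }
}
```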

83.Have you ever participated in open source in any way?
A. What most people say: no, or, I'm familiar with open source. What you should say: here is an example of a project I did for a previous employer with open source; I have also contributed code. Why you should say it: passion goes a long way, says Sammer. It gives us a high-level gauge of interest in what they do for a living, and people who do that tend to be a much better fit for us. Generally I, as well as the rest of Cloudera, believe there are a lot of ways to participate: you can contribute code, devote time to answering questions, or write documentation. It is very impressive to see in a candidate.

84.What is BloomMapFile used for?
A. The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.

85.Does ILLUSTRATE run MR job?
A. No, ILLUSTRATE will not run any MapReduce job; it pulls from internal (sampled) data. On the console, ILLUSTRATE just shows the output of each stage and not the final output.

86.Is the keyword DEFINE like a function name?
A. Yes, the keyword DEFINE is like a function name. Once you have registered the jar, you have to define it. Whatever logic you have written in the Java program is in the exported jar that you registered; the compiler first checks for the function in the exported jar, and when the function is not present in the library, it looks into your jar.

87.What co-group does in Pig?
A.Co-group joins the data set by grouping one particular data set only. It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the record of the first data set with the common data set and the second bag consists of the records of the second data set with the common data set.

88.Can we say cogroup is a group of more than 1 data set?
A.Cogroup is a group of one data set. But in the case of more than one data sets, cogroup will group all the data sets and join them based on the common field. Hence, we can say that cogroup is a group of more than one data set and join of that data set as well.

89.Can reducers talk to each other?
A. No, reducers always run in isolation.

90.How many JVMs can run on a slave node at most?
A. One or multiple task instances can run on each slave node. Each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.

91.How many daemon processes run on a Hadoop cluster?
A. Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM. The following 3 daemons run on master nodes: NameNode – stores and maintains the metadata for HDFS; Secondary NameNode – performs housekeeping functions for the NameNode; JobTracker – manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker. The following 2 daemons run on each slave node: DataNode – stores actual HDFS data blocks; TaskTracker – responsible for instantiating and monitoring individual map and reduce tasks.

92.What do you mean by TaskInstance?
A. Task instances are the actual MapReduce tasks which run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a task instance); this is to ensure that a process failure does not take down the entire task tracker. Each task instance runs in its own JVM process. There can be multiple task instances running on a slave node, based on the number of slots configured on the task tracker. By default, a new task instance JVM process is spawned for each task.

93.How many instances of Tasktracker run on a Hadoop cluster?
A.There is one Daemon Tasktracker process for each slave node in the Hadoop cluster.

94.How many Reducers should be configured?
A. The right number of reducers seems to be 0.95 or 1.75 multiplied by (number of nodes * mapreduce.tasktracker.reduce.tasks.maximum). With 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reducers increases the framework overhead, but improves load balancing and lowers the cost of failures.
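A small sketch of applying that heuristic when configuring a job; the node count and slots-per-node values are assumptions supplied by the caller:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    int dataNodes = 10;          // assumed number of worker nodes
    int reduceSlotsPerNode = 2;  // assumed mapreduce.tasktracker.reduce.tasks.maximum

    Job job = Job.getInstance(new Configuration(), "tuned reducers");
    // 0.95 lets all reduces launch immediately; 1.75 trades that for better load balancing.
    int reducers = (int) Math.floor(0.95 * dataNodes * reduceSlotsPerNode);
    job.setNumReduceTasks(reducers);
    System.out.println("Configured reducers: " + reducers);
  }
}
```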

95.What does job conf class do?
A.MapReduce needs to logically separate different jobs running on the same cluster. Job conf class helps to do job level settings such as declaring a job in real environment. It is recommended that Job name should be descriptive and represent the type of job that is being executed.

96.What does conf.setMapperClass do?
A. conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.

97.What do sorting and shuffling do?
A.Sorting and shuffling are responsible for creating a unique key and a list of values. Making similar keys at one location is known as Sorting. And the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as Shuffling.

98.Why can't we do aggregation (addition) in a mapper? Why do we require a reducer for that?
A. We cannot do aggregation (addition) in a mapper because sorting does not happen in the mapper; sorting happens only on the reducer side. A mapper is initialized for each input split, so while doing aggregation we would lose the values from the previous split: for every split a new mapper gets initialized, and there is no track of the previous rows' values.

99.What do you know about NLineInputFormat?
A. NLineInputFormat splits N lines of input as one split, so each mapper receives exactly N lines.
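A minimal sketch of configuring that behaviour so each mapper receives 10 lines; the job name and the value 10 are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "n-line splits");
    job.setInputFormatClass(NLineInputFormat.class);
    // Each input split (and therefore each mapper) gets 10 lines.
    NLineInputFormat.setNumLinesPerSplit(job, 10);
    // ... mapper, reducer and paths would be configured here ...
  }
}
```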

100.Who is using Hadoop? Give some examples.
A. A9.com, Amazon, Adobe, AOL, Baidu, Cooliris, Facebook, NSF-Google, IBM, LinkedIn, Ning, PARC, Rackspace, StumbleUpon, Twitter, Yahoo!

Hadoop Flume Interview Questions :-


1) How can Flume be used with HBase?

2) Explain about the different channel types in Flume. Which channel type is faster?

3) Which is the reliable channel in Flume to ensure that there is no data loss?

4) Explain about the replication and multiplexing selectors in Flume.

5) Differentiate between FileSink and FileRollSink.

6) How can a multi-hop agent be set up in Flume?

7) Does Apache Flume provide support for third party plug-ins?

8) Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain how.

Hadoop Zookeeper Interview Questions :-


1) What is the role of Zookeeper in HBase architecture?

2)Explain about Zookeeper in Kafka

3)Explain how Zookeeper works.

4)List some examples of Zookeeper use cases.

5)How to use Apache Zookeeper command line interface?

6) What are the different types of ZNodes?

7) What are watches?

8) What problems can be addressed by using Zookeeper?


Interview Questions on Hadoop Pig :-


1)Explain the need for MapReduce while programming in Apache Pig.

2)Explain about co-group in Pig.

3)Explain about the BloomMapFile.

4)Differentiate between Hadoop MapReduce and Pig

5) What is the usage of the foreach operation in Pig scripts?

6)Explain about the different complex data types in Pig.

7) What does Flatten do in Pig?

Interview Questions on Hadoop Hive :-

1)Explain about the different types of join in Hive.

2)How can you configure remote metastore mode in Hive?

3)Explain about the SMB Join in Hive.

4)Is it possible to change the default location of Managed Tables in Hive, if so how?

5) How does data transfer happen from Hive to HDFS?

6)How can you connect an application, if you run Hive as a server?

7) What does the overwrite keyword denote in a Hive load statement?

8) What is SerDe in Hive? How can you write your own custom SerDe?

9)In case of embedded Hive, can the same metastore be used by multiple users?

Hadoop YARN Interview Questions :-


1) What are the additional benefits YARN brings to Hadoop?

2)How can native libraries be included in YARN jobs?

3)Explain the differences between Hadoop 1.x and Hadoop 2.x

Or

4)Explain the difference between MapReduce1 and MapReduce 2/YARN

5) What are the modules that constitute the Apache Hadoop 2.0 framework?

6) What are the core changes in Hadoop 2.0?

7)How is the distance between two nodes defined in Hadoop?

8)Differentiate between NFS, Hadoop NameNode and JournalNode.

