Big Data Interview Questions and Answers
What is Speculative Execution in Hadoop?
Ans:- Hadoop does not try to diagnose or fix slow-running tasks. Instead, it tries to detect when a task is running slower than expected and launches an equivalent task as a backup; the backup task is called a speculative task. This process is called speculative execution in Hadoop.
There may be various reasons for a task slowing down, including hardware degradation or software misconfiguration, but the cause can be difficult to pinpoint because the task still completes successfully, just later than expected. Whichever copy of the task finishes first is used, and the other copy is killed.
What is the difference between TextInputFormat and KeyValueTextInputFormat in Hadoop?
Ans:- The TextInputFormat class converts every row of the source file into a key/value pair, where the LongWritable key is the byte offset of the record and the Text value is the entire record itself.
The KeyValueTextInputFormat is an extension of TextInputFormat. It is useful when we have to fetch every source record as a Text/Text pair, where the key and value are populated by splitting the record on a fixed delimiter (a tab character by default).
Consider the below file contents:
AL#Alabama
AR#Arkansas
FL#Florida
If TextInputFormat is configured, you might see the key/value pairs as:
0 AL#Alabama
11 AR#Arkansas
23 FL#Florida
If KeyValueTextInputFormat is configured with conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "#"), you might see the results as:
AL Alabama
AR Arkansas
FL Florida
A log file contains entries like "user A visited page 1", "user B visited page 3", "user C visited page 2", "user D visited page 4". How will you implement a Hadoop job for this to answer the following queries in real time: which page was visited by user C more than 4 times in a day, and which page was visited by only one user exactly 3 times in a day?
Ans:- Assume the raw log has been normalized into (IP, URL) pairs, one visit per line, for example:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
Run the below queries to load the data into Hive and build an aggregated view:
> create table urldata (ip string, url string);
> load data local inpath '/path/to/data' into table urldata;
> create view v1 as select ip, url, count(*) as c from urldata group by ip, url;
The view v1 then holds the visit count per (ip, url) pair, for example:
192.168.198.92 www.google.com 1
192.168.198.92 www.yahoo.com 1
192.168.72.177 www.facebook.com 1
192.168.72.177 www.yahoo.com 3
192.168.72.224 www.facebook.com 1
192.168.72.224 www.gmail.com 1
192.168.72.224 www.google.com 1
192.168.72.224 www.m4maths.com 1
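With the view in place, the two questions can be answered with simple filters over the aggregated counts. The following is a minimal HiveQL sketch, assuming (purely for illustration) that user C corresponds to the IP 192.168.198.92 and that the table holds a single day of data; for multi-day logs a date column or daily partition would be added:
> -- pages visited by user C (here: a specific IP) more than 4 times
> select url from v1 where ip = '192.168.198.92' and c > 4;
> -- pages visited exactly 3 times by exactly one user
> select url from (select url, count(*) as num_users from v1 where c = 3 group by url) t where num_users = 1;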
Implement a word count program in Apache Hive.
Ans:-
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE wordcounts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
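Once the wordcounts table is built, a quick sanity check is to look at the most frequent words. A minimal sketch (the backticks around the count column simply avoid any clash with the reserved word in newer Hive versions):
SELECT word, `count` FROM wordcounts ORDER BY `count` DESC LIMIT 10;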
Which object will you use to track the progress of a job?
Ans:- JobClient (in the old MapReduce API; the Job class serves this purpose in the new API).
Difference between Bucketing and Partitioning and when will you use each of these?
Ans:- Partitioning divides a large amount of data into multiple slices based on the value of one or more table columns.
Assume you are storing information about people across the entire world, spread over 196+ countries and spanning around 500 crore entries. If you want to query people from a particular country (say Vatican City), then in the absence of partitioning you have to scan all 500 crore entries even to fetch a thousand entries for that country. If you partition the table on country, you can fine-tune the querying process by checking the data of only that one country partition. A Hive partition creates a separate directory for each partition column value.
When creating partitions you have to be very cautious about how many partitions get created, as too many partitions mean too many sub-directories in the table directory, which puts unnecessary overhead on the NameNode since it must keep all file-system metadata in memory.
* Pros:
Distributes the execution load horizontally.
Faster execution of queries when a partition holds a low volume of data, e.g. getting the population of "Vatican City" returns very fast instead of searching the entire population of the world.
* Cons:
Possibility of creating too many small partitions, i.e. too many directories.
Effective for low-volume data in a given partition, but queries such as a group by over a high-volume partition still take a long time to execute, e.g. grouping the population of China takes much longer than grouping the population of Vatican City. Partitioning does not solve the responsiveness problem when the data is skewed towards a particular partition value.
EXAMPLE:-
Create the managed table as below with the partition column as state.
CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
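To show how the partition is used, here is a minimal HiveQL sketch of loading one partition and querying it; the file path and the state value 'AL' are made up for illustration:
LOAD DATA LOCAL INPATH '/path/to/zipcodes_AL.csv'
INTO TABLE zipcodes PARTITION(state='AL');
SELECT * FROM zipcodes WHERE state = 'AL'; -- scans only the state=AL directory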
Hive Bucketing (i.e. Clustering):
Bucketing decomposes data into more manageable, roughly equal parts by specifying the number of buckets to create.
EXAMPLE:-
In the below example, we are bucketing on the zipcode column on top of partitioning by state.
CREATE TABLE zipcodes(
RecordNumber int,
Country string,
City string,
Zipcode int)
PARTITIONED BY(state string)
CLUSTERED BY (Zipcode) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ;
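One practical benefit of bucketing is efficient sampling and bucketed joins. A minimal sketch of sampling a single bucket from the table defined above (on older Hive versions, hive.enforce.bucketing should be set to true before inserting so that data actually lands in the declared buckets):
SELECT * FROM zipcodes TABLESAMPLE(BUCKET 1 OUT OF 10 ON zipcode);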
Some of the differences between partitioning and bucketing:
i) In partitioning, a directory is created on HDFS for each partition, whereas in bucketing a file is created on HDFS for each bucket.
ii) You can have one or more partition columns in partitioning; bucketing hashes one or more columns into a fixed number of buckets.
iii) You can't directly control how many partitions are created (it depends on the data), whereas in bucketing you control the number of buckets by specifying the count.
iv) Partitioning uses "PARTITIONED BY", whereas bucketing uses "CLUSTERED BY".
How Does YARN Work?
Ans:- A basic workflow for deployment in YARN starts when a client application submits a request to the ResourceManager.
i) The ResourceManager instructs a NodeManager to start an Application Master for this request, which is then started in a container.
ii) The newly created Application Master registers itself with the RM. The Application Master then contacts the HDFS NameNode to determine the location of the needed data blocks and calculates the number of map and reduce tasks needed to process the data.
iii) The Application Master then requests the needed resources from the RM and continues to communicate the resource requirements throughout the life-cycle of the container.
iv) The RM schedules the resources along with the requests from all the other Application Masters and queues their requests. As resources become available, the RM makes them available to the Application Master on a specific slave node.
v) The Application Master contacts the NodeManager for that slave node and requests it to create a container by providing variables, authentication tokens, and the command string for the process. Based on that request, the NodeManager creates and starts the container.
vi) The Application Master then monitors the process and reacts in the event of failure by restarting the process on the next available slot. If it fails after four different attempts, the entire job fails. Throughout this process, the Application Master responds to client status requests.
Once all tasks are completed, the Application Master sends the result to the client application, informs the RM that the application has completed its task, deregisters itself from the Resource Manager, and shuts itself down.
The RM can also instruct the NodeManager to terminate a specific container during the process in case of a processing priority change.
What is the difference between Apache Hive and HBase?
Ans:- i) Unlike Hive, HBase operations run in real-time on its database rather than as MapReduce jobs.
ii) Apache Hive is a data warehouse system that's built on top of Hadoop. Apache HBase is a NoSQL key/value store on top of HDFS or Alluxio.
iii) Apache Hive provides SQL features to Spark/Hadoop data. HBase can store or process Hadoop data with near real-time read/write needs.
iv) Hive should be used for analytical querying of data collected over a period of time. HBase is primarily used to store and process unstructured Hadoop data as a lake.
v) HBase is perfect for real-time querying of Big Data. Hive should not be used for real-time querying.
vi) Hive is a query engine, whereas HBase is a data store, particularly for unstructured data.
vii) Apache Hive is mainly used for batch processing, i.e. OLAP, whereas HBase is extensively used for transactional processing, i.e. OLTP, where query response times need to be highly interactive.
viii) Unlike Hive, operations in HBase run in real-time on the database instead of being transformed into MapReduce jobs.
ix) HBase is for real-time querying and Hive is for analytical queries.
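The two can also be combined: Hive can expose an HBase table through the HBase storage handler, so batch SQL runs over data that HBase serves in real time. A minimal HiveQL sketch, assuming the hive-hbase-handler jar is on the classpath; the table name, column family, and mapping below are made up for illustration:
CREATE EXTERNAL TABLE hbase_page_views (page_key STRING, views BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,stats:views')
TBLPROPERTIES ('hbase.table.name' = 'page_views');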
What is Distributed Cache in Hadoop?
Ans:- Distributed Cache in Hadoop is a facility provided by the MapReduce framework. Distributed Cache can cache files when needed by the applications. It can cache read only text files, archives, jar files etc.
Once we have cached a file for our job, Apache Hadoop makes it available on each datanode where map/reduce tasks are running, so we can access the cached files from all the datanodes in our MapReduce job.
By default, the distributed cache size is 10 GB. If we want to adjust the size of the distributed cache, we can do so using the local.cache.size property.
Advantages of Distributed Cache
No single point of failure - as the distributed cache runs across many nodes, the failure of a single node does not result in a complete failure of the cache.
Data consistency - it tracks the modification timestamps of cache files, which must not change while a job is executing. Using a hashing algorithm, the cache engine can always determine on which node a particular key-value resides, and since there is always a single state of the cache cluster, it is never inconsistent.
Store complex data - it distributes simple, read-only text files as well as complex types like jars and archives. These archives are then un-archived on the slave nodes.
What are InputSplits in Hadoop?
Ans:- InputSplits are created by logical division of data, which serves as the input to a single Mapper job. Blocks, on the other hand, are created by the physical division of data. One input split can spread across multiple physical blocks of data.
The basic need for input splits is to feed accurate logical locations of the data to the Mapper so that each Mapper can process a complete set of records, even when they spread over more than one block. When Hadoop submits a job, it splits the input data logically (input splits) and each split is processed by a Mapper. The number of Mappers is equal to the number of input splits created.
The Hadoop framework divides a large file into blocks (64 MB or 128 MB) and stores them on the slave nodes. HDFS is unaware of the content of the blocks. While writing data into a block, it may happen that a record crosses the block limit, so that part of the record is written to one block and the rest to another.
Hadoop tracks this split of data through the logical representation of the data known as an input split. When the MapReduce client calculates the input splits, it checks whether the entire record resides in the same block. If the record spills over and some part of it is written into another block, the input split captures the location information of the next block and the byte offset of the data needed to complete the record. This usually happens with multi-line records, as Hadoop handles the single-line record scenario directly.
Usually, the input split is configured to be the same as the block size, but consider what happens if the input split is larger than the block size. The input split represents the amount of data that will go to one mapper. Consider the below example:
• Input split = 256MB
• Block size = 128 MB
Then the mapper will process two blocks that can be on different machines, which means that to process the split, the mapper will have to transfer data between machines. Hence, to preserve data locality and avoid unnecessary data movement, we usually keep the input split the same size as the block.
Even with the data locality optimization, some data travels over the network, which is an overhead; this data transfer is temporary.
What is a rack?
Ans:- A rack is a collection of around 40-50 DataNodes connected to the same network switch. If the network switch goes down, the whole rack becomes unavailable. A large Hadoop cluster is deployed across multiple racks.
What is Rack Awareness in Hadoop HDFS?
Ans:- In a large Hadoop cluster, there are multiple racks. Each rack consists of DataNodes. Communication between the DataNodes on the same rack is more efficient as compared to the communication between DataNodes residing on different racks.
To reduce network traffic during file reads/writes, the NameNode chooses the closest DataNode for serving a client read/write request. The NameNode maintains the rack ID of each DataNode to obtain this rack information. This concept of choosing the closest DataNode based on rack information is known as Rack Awareness.
How many joins does MapReduce have and when will you use each type of join?
Ans:- MapReduce has two types of joins:
i) Map Side Join
ii) Reduce Side Join
i) Map Side Join: As the name implies, the join operation is performed in the map phase itself. The mapper performs the join, and it is mandatory that the input to each map is partitioned and sorted on the join key. Use the map side join when one of the datasets is small enough to be held in memory (typically distributed to every mapper via the Distributed Cache), because it avoids the shuffle and reduce phases entirely and is therefore faster.
ii) Reduce Side Join: As the name suggests, in the reduce side join the reducer is responsible for performing the join. It is comparatively simpler and easier to implement than the map side join because the sort and shuffle phase sends values with identical keys to the same reducer, so the data is already organized for us. Use the reduce side join when both datasets are large; it works for any input size but incurs the full shuffle cost.
Implement word count program in Apache Hive?
Ans:- CREATE TABLE FILES (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE FILES;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, ' ')) AS word FROM FILES) w
GROUP BY word
ORDER BY word;
How do you search a large number of files (say 100,000 files) that are spread across multiple servers in Hadoop?
Ans:- First, gather all the files into unified storage such as HDFS (or possibly S3/Hive) by pulling them in via SFTP, FTP, or similar transfer mechanisms; this step is called data ingestion.
Once the data sits in unified storage such as HDFS, you can apply MapReduce, Spark, or another framework to search and process the files based on your requirements.
With a wide variety of data sources and servers, nothing useful can be done until the files are first gathered (data ingestion) and then processed using one of the available frameworks.
What is a heartbeat in Hadoop parlance?
Ans:- A heartbeat is a signal sent periodically from a DataNode to the NameNode (and from a TaskTracker to the JobTracker in MRv1) to indicate that the node is alive and functioning. If the NameNode or JobTracker stops receiving heartbeats from a node, that node is considered dead or out of service.
How do you fix NameNode when it is down?
Ans:- The following steps can be followed to fix NameNode:
* Use FsImage, the file system's metadata replica, to start a new NameNode
* Configure the DataNodes (and clients) to acknowledge this new NameNode
* The new NameNode will begin serving requests and the cluster will go back to normal after it has completely loaded the last FsImage checkpoint
In some cases, NameNode revival can take a lot of time.
What is a Checkpoint?
Ans:- A checkpoint is the last saved state of the file system metadata. Checkpointing takes the current FsImage and the edit log of the namespace and compacts both into a new FsImage. This is a continuous process. The task of creating a checkpoint is performed by the Secondary NameNode.
What is the AVRO file format?
Ans:- The AVRO file format is best suited for long-term schema storage. AVRO files store the metadata (schema) with the data and also allow an independent schema to be specified in order to read the files.
List the general steps to debug code in Hadoop?
Ans:- Following are the steps involved in debugging:
• Check the list of MapReduce jobs currently running
• If orphaned jobs are running, check the ResourceManager by executing the following command: ps -ef | grep -i ResourceManager
• Check the log directory to detect any error messages that may be shown
• Based on the logs found in the above step, check the worker node involved in the action that may contain the buggy code
• Log in to the worker node and check the NodeManager process by executing: ps -ef | grep -i NodeManager
• Examine the MapReduce logs to find out the source of the error.
This is the process for most error-detection tasks in the Hadoop cluster system.
Different Hadoop configuration files?
Ans:-
- hadoop-env.sh
- mapred-site.xml
- core-site.xml
- yarn-site.xml
- hdfs-site.xml
- masters and slaves files