Thursday, 10 November 2016

Hadoop Basic Building Blocks

Hello readers! This is the 4th article of our tutorial series on Hadoop and in this article we will learn about Hadoop's basic building blocks, along with their internal structure and working.
What is hadoop?
        Hadoop is a framework to store and process very large data sets in a distributed environment made up of commodity hardware. It stores data in a distributed file system and, unlike traditional clusters, it performs computation on the nodes where the data is stored.


Why Hadoop?
1. Hadoop can store enormous volumes of unstructured data in a short time.
2. Hadoop's distributed computing model processes big data very fast. Its processing power depends on the number of nodes in the cluster: the more nodes you add, the more processing power you get.
3. The Hadoop framework is designed so that it can be scaled both vertically and horizontally.
4. Unlike traditional relational data stores, Hadoop can store any type of data without any pre-processing.
5. Hadoop is highly fault tolerant: data replicated across the cluster protects it against hardware failure.

Where it is used and how?
        Beyond its early uses, such as extracting relevant knowledge from data and spam filtering, Hadoop is now being used by many organisations:
1. One of the most popular uses of Hadoop is building web-based recommendation engines. Facebook uses it for the 'people you may know' feature, LinkedIn for 'jobs you may be interested in', and Flipkart, Amazon, Myntra and eBay for 'people also viewed' recommendations; Hadoop is used everywhere.
2. It helps us create a Data Lake, which stores data in its original format to give data scientists an unrefined view of the data so that they can understand and analyse it properly.
3. Hadoop also plays a crucial role as a data warehouse; tools built on top of Hadoop (Flume, Sqoop, Hive) give us Extract-Transform-Load (ETL) facilities.
4. Many companies use Hadoop as cheap data storage, since low-cost commodity hardware makes it affordable.

Challenges in Hadoop:
1. A major challenge is that Hadoop is not well suited to interactive analytical tasks; it is designed for large, simple batch jobs.
2. Data security is another prime challenge in Hadoop. It uses the Kerberos authentication protocol to secure the Hadoop environment.
3. There is no tool in the Hadoop ecosystem to enforce data standards and data quality.
4. A very basic problem is that it is difficult to find people who have Java skills and the ability to do productive work in MapReduce.
Hadoop 2.0 has 4 basic building blocks, as mentioned below:
1. Hadoop Common: the basic libraries and utilities used by Hadoop.
2. Hadoop Distributed File System (HDFS): a scalable distributed file system used to store data. It stores data on multiple nodes, maintaining replicas, without any pre-processing.
3. MapReduce: a framework to process huge amounts of data in parallel on a cluster of commodity hardware.
4. Yet Another Resource Negotiator (YARN): the resource manager for the processes running on Hadoop.
Fig 1: Hadoop 1.0 and hadoop 2.0 basic building blocks
Let us discuss the Hadoop 2.0 building blocks displayed in the above figure.
1. Hadoop Common: Hadoop Common is the core of the Hadoop framework and provides its most essential services. Along with that, it abstracts the underlying OS and its file system. The essential JARs and scripts required to start and stop Hadoop are also present in Hadoop Common. It also provides documentation for the projects contributed to Hadoop.

2. HDFS: HDFS is a distributed file system used to store very large files. Depending on the size of a file, it divides the file into parts (blocks) and stores those parts across the cluster. It provides fault tolerance by replicating those parts of the file across the cluster. The default replica count is three, but you can change it. For fault tolerance, HDFS stores the first replica in the same rack (we will discuss this in the next post on HDFS), so that the failure of a node can be overcome quickly and processing can continue, and stores the second replica on another rack. A small worked example of block splitting and replication follows.
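To get a feel for how block splitting and replication multiply the storage a file consumes, here is a small illustrative Python calculation. The 128 MB block size and replication factor of 3 used below are the common Hadoop 2.x defaults, but both are configurable, so treat the numbers as an example only.

# Illustrative only: how a single large file maps onto HDFS blocks and replicas.
BLOCK_SIZE_MB = 128   # common Hadoop 2.x default, configurable per cluster and per file
REPLICATION = 3       # default replica count, also configurable

def hdfs_footprint(file_size_mb):
    # Return (number of blocks, raw storage consumed in MB) for one file.
    blocks = -(-file_size_mb // BLOCK_SIZE_MB)   # ceiling division: the last block may be smaller
    raw_storage = file_size_mb * REPLICATION     # every block is stored three times
    return blocks, raw_storage

blocks, raw_storage = hdfs_footprint(500)        # a 500 MB file
print("500 MB file -> %d blocks, %d MB of raw cluster storage" % (blocks, raw_storage))
# Output: 500 MB file -> 4 blocks, 1500 MB of raw cluster storage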
HDFS in Hadoop 1.0 contains a NameNode, DataNodes and a Secondary NameNode. The NameNode acts as the master (a powerful server); its job is to maintain the metadata of HDFS in the fsimage and edit logs. The metadata contains the namespace information and block information of HDFS. The NameNode does not store the actual data, but it knows where the data blocks are stored. It is a very critical component of Hadoop: its failure makes the whole cluster inaccessible. Since the NameNode acts as the master of HDFS, it is a single point of failure and must therefore be built on reliable hardware. DataNodes act as slaves and are affordable computers (commodity hardware).
DataNodes are responsible for storing the actual data in HDFS (so they are configured with large hard disks). They must stay in continuous communication with the NameNode to indicate their availability in the cluster. To do so, they send a heartbeat signal (which indicates the DataNode is alive) to the NameNode. While starting, a DataNode also sends the NameNode a report of the blocks it holds. Failure of a DataNode does not affect the availability of data to a job, because the NameNode handles it by locating a replica of the data block elsewhere in the cluster and providing that replica so the job can continue.
Secondary NameNode is actually a misleading name, and it confuses newbies. The name suggests that it is a backup node for the NameNode, but it is not. The Secondary NameNode's task is to keep the fsimage up to date by merging it with the edit logs; the edit log stores the sequence of file system changes made since the last checkpoint. Because the Secondary NameNode periodically stores this merged checkpoint, it is also known as the checkpoint node.

3. MapReduce: MapReduce is a framework for processing large amounts of data in parallel on a cluster of commodity hardware. A MapReduce job is made up of map tasks and reduce tasks. A map task takes its input as (key, value) pairs and generates intermediate output in the same (key, value) format. The MapReduce library is responsible for consolidating the intermediate values belonging to the same key and providing them as input to the reduce task. When the reduce task receives this input, it merges and operates on the values of each intermediate key. The reduce phase always runs after the map phase has completed.
A program written with the MapReduce framework automatically executes in parallel on the cluster. The system takes care of partitioning the data, scheduling tasks across the cluster, handling failures and managing inter- and intra-machine communication. The programmer never needs to know about the parallelism of the program or where it is executing. The MapReduce framework is powerful enough to process petabytes of data on a cluster of thousands of nodes.
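The classic illustration is word count. The sketch below imitates the map -> shuffle/sort -> reduce flow in plain Python on a single machine; it only simulates the (key, value) data flow and is not code written against the Hadoop Java API.

from collections import defaultdict

# A local, single-machine simulation of the MapReduce word-count flow.
lines = ["hadoop stores big data", "hadoop processes big data"]

# Map phase: each input line is turned into intermediate (key, value) pairs -> (word, 1)
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle/sort phase: the framework groups all values that share the same key
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: each key's list of values is merged into a single result
for word in sorted(grouped):
    print("%s\t%d" % (word, sum(grouped[word])))
# Prints one line per word: big 2, data 2, hadoop 2, processes 1, stores 1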

4. YARN: YARN was introduced in second-generation Hadoop (hadoop-2.0) by the Apache open source software foundation. YARN is a resource management framework used to manage and provide resources to applications at execution time. YARN provides APIs and daemon processes for writing generic MapReduce applications, and it is also capable of running applications that do not follow the MapReduce model. An application can be a simple MapReduce job or a Directed Acyclic Graph (DAG) generated by one of the various tools that run their queries on Hadoop, Hive being one example.
The very basic idea behind YARN is to decouple resource management, job scheduling and job monitoring into separate daemons. This brought the ResourceManager and the per-application ApplicationMaster into the picture. Resource scheduling, monitoring and management are done by the ResourceManager. The NodeManager is a daemon that runs on every machine/node; its responsibility is to launch, manage and monitor containers and to report on them to the ResourceManager.
One of the key differences between YARN and MR1 is the ApplicationMaster. The execution of an application is driven by its ApplicationMaster: it obtains containers from the ResourceManager and executes the application in them. The ApplicationMaster is the first process that starts, in a container, after the application is submitted (this is known as a managed AM), and it exits after all of the application's tasks complete.

Here we come to the end of the article. We learned about the what, why and where of Hadoop and its basic building blocks. In the next article we will discuss HDFS and MapReduce in detail. Please share your views and feedback in the comment section below and stay tuned for more articles on Hadoop.

Wednesday, 19 October 2016

Hadoop : Installation of Hadoop

Hello readers! This is the 3rd article of our tutorial series on Hadoop and in this article, we will be learning about the installation of Hadoop on an Ubuntu system. The real power of Hadoop can be experienced by running it on a cluster of commodity hardware, but a standalone system is sufficient to run all components of Hadoop. On the same setup, we can explore HDFS and MapReduce terminologies. Hadoop can run on any Linux distribution with minor changes in the configuration files, but for this article we will be using Ubuntu-16.04 LTS and Hadoop-2.7.2. Hadoop can also be installed on Windows OS.




Prerequisite

1. Create a new user named 'hadoop':
admin@codeninja:~$ adduser hadoop
    

You will get an error saying 'adduser: Only root may add a user or group to the system.' To tackle this error, use 'sudo':

admin@codeninja:~$ sudo adduser hadoop    
[sudo] password for admin:       
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
The home directory `/home/hadoop' already exists.  Not copying from `/etc/skel'.
Enter new UNIX password:       
Retype new UNIX password:       
passwd: password updated successfully
Changing the user information for hadoop
Enter the new value, or press ENTER for the default
 Full Name []:
 Room Number []:
 Work Phone []:
 Home Phone []: 
 Other []: 
 Is the information correct? [Y/n] y     
admin@codeninja:~$ 

2. Give sudo (root) access to the 'hadoop' user
1. Login as root
admin@codeninja:~$ su root
Password: 
root@codeninja:/home/admin# 

2. Add user 'hadoop' to the sudo group
root@codeninja:/home/admin# sudo usermod -a -G sudo hadoop

3. Update sudoers file
root@codeninja:/home/admin# vi /etc/sudoers 
Add the line below to the file:
hadoop ALL = (ALL:ALL) ALL

3. Install JAVA
Hadoop is developed in Java, so Java must be installed on your system. Before installing it, we first refresh the list of available packages using the command given below:

hadoop@codeninja:/home/admin$ sudo apt-get update
   

This command refreshes the package index from the repositories. Next we need to install Java. To check whether Java is already installed on your system, run java -version or javac. If it is not installed, you can install it using the following commands:

hadoop@codeninja:/home/admin$ sudo apt-get install default-jre 
hadoop@codeninja:/home/admin$ sudo apt-get install default-jdk

4. Setting up SSH:
Generally, Hadoop components reside on different systems. All these components communicate with each other through inter-process communication, for which password-less SSH access is essential. If password-less SSH is not configured, you will be asked for a password every time, so we recommend establishing password-less communication between the systems. For this we create an RSA key pair (public/private) and append the public key of one system to the ~/.ssh/authorized_keys file of the other system. Since this is a standalone setup, we copy the RSA public key to the ~/.ssh/authorized_keys file of the same system.
First of all, install openssh-server package using below command:

hadoop@codeninja:/home/admin$ sudo apt-get install openssh-server
    

To verify a successful installation, run ssh localhost in a terminal. By default, password-less SSH is not enabled, so it will ask you for the password.

hadoop@codeninja:/home/admin$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:AKZv9jio2/iM1o/WbwsIe0bnxJtivc/kqZOcS8OFpYU.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
root@localhost's password: 

Now, we enable password-less SSH to avoid password prompts. To begin with, we generate an RSA key pair with the ssh-keygen command (for example, ssh-keygen -t rsa, pressing Enter to accept the default file location and an empty passphrase); the output looks like the following.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.
The key fingerprint is:
SHA256:zDUf6fSVrrpVSCo6h/dkNlXJxMp3IBFR0tWQE5RM2W4 admin@codeninja
The key's randomart image is:
+---[RSA 2048]----+
|            =XBXo|
|            .o@.=|
|          o =o.O.|
|       o . =o=+oE|
|        S. .ooo+.|
|        o . . o  |
|       + o = o   |
|        + = +    |
|           +.    |
+----[SHA256]-----+

Now put newly generated key in authorized_keys.

hadoop@codeninja:/home/admin$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hadoop@codeninja:/home/admin$ ssh localhost
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-36-generic i686)
* Documentation:  https://help.ubuntu.com
* Management:     https://landscape.canonical.com
* Support:        https://ubuntu.com/advantage
37 packages can be updated.
16 updates are security updates.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Hadoop Installation

1. Create directory
Create a directory named hadoop in /usr/local and give the 'hadoop' user ownership of it

root@codeninja:/home/admin# mkdir /usr/local/hadoop
root@codeninja:/home/admin# chown hadoop /usr/local/hadoop   # Give ownership of /usr/local/hadoop to the 'hadoop' user

2. Login as hadoop user and change directory
hadoop@codeninja:~$ su hadoop
password:
hadoop@codeninja:~$ cd /usr/local/hadoop/

3. Download hadoop
We are going to use wget, which downloads files into the current working directory. Our current working directory is now /usr/local/hadoop (to cross-check, use the 'pwd' command).

hadoop@codeninja:/usr/local/hadoop$ sudo wget http://apache.claz.org/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
   

Extract hadoop-2.7.2.tar.gz in the same directory using tar:

hadoop@codeninja:/usr/local/hadoop$ tar -xvf hadoop-2.7.2.tar.gz
   

4. Now we have to change or update the following files to complete the Hadoop installation:
  1. ~/.bashrc
  2. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
  3. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/core-site.xml
  4. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/mapred-site.xml
  5. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
  6. /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/yarn-site.xml
5. Get JAVA_HOME
Before updating JAVA_HOME in .bashrc, we need to find the path where Java is installed. The update-alternatives --config java command gives us the full path of the java binary, out of which we take only the JVM directory, i.e. /usr/lib/jvm/java-7-openjdk-i386 (the part before /jre/bin/java).

hadoop@codeninja:/usr/local/hadoop$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-openjdk-i386/jre/bin/java
Nothing to configure.
hadoop@codeninja:/usr/local/hadoop$ 

6. Update ~/.bashrc
Now we need to set eight different environment variables in the .bashrc file located at ~/.bashrc. These are: JAVA_HOME, HADOOP_INSTALL, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, YARN_HOME, HADOOP_COMMON_LIB_NATIVE_DIR and HADOOP_OPTS (the PATH is also extended with Hadoop's bin and sbin directories).

hadoop@codeninja:/usr/local/hadoop$ gedit ~/.bashrc

Now mention below environment variables in .bashrc file

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_INSTALL=/usr/local/hadoop/hadoop-2.7.2
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

7. Reload .bashrc
After setting the above environment variables, we need to reload the .bashrc file to make them available. For that we use the source command as shown below.

hadoop@codeninja:/usr/local/hadoop$ source ~/.bashrc

8. Update hadoop-env.sh
We need to provide the JAVA_HOME path to Hadoop so that it is available whenever Hadoop starts. To do this, just export the JAVA_HOME path in the hadoop-env.sh file.

hadoop@codeninja:/usr/local/hadoop$ sudo gedit hadoop-2.7.2/etc/hadoop/hadoop-env.sh

This will open the file in gedit; then put export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386 in place of the existing export JAVA_HOME line (around line 25).

9. Update core-site.xml
The core-site.xml file contains Hadoop configuration properties that Hadoop reads when it starts up. The default configuration is fine for our purposes, so just put the following snippet in core-site.xml.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8010</value>
   </property>
</configuration>

10. Update mapred-site.xml
This file tells Hadoop which framework to use for MapReduce jobs. We therefore need to update mapred-site.xml, but it is not present in the Hadoop distribution by default; only its template is, from which we create mapred-site.xml. So first create mapred-site.xml using the following command.

hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml

Now update the mapred-site.xml file. Use following command to do so.

hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ sudo gedit mapred-site.xml

Put below snippet in mapred-site.xml.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

11. Update hdfs-site.xml
It is important to update hdfs-site.xml for every node or host you want to use in the Hadoop cluster. In it we specify the local paths of the directories used by the NameNode and the DataNode on this computer. First of all we need to create the namenode and datanode directories; use the commands below. The -p option creates the specified directory along with any missing intermediate directories.

hadoop@codeninja:/usr/local$ sudo mkdir -p hadoop_store/hdfs/namenode
hadoop@codeninja:/usr/local$ sudo mkdir -p hadoop_store/hdfs/datanode
hadoop@codeninja:/usr/local$ sudo chmod -R 777 hadoop_store

Now update the hdfs-site.xml with following command.

hadoop@codeninja:/usr/local/hadoop$ sudo gedit hadoop-2.7.2/etc/hadoop/hdfs-site.xml

Put below snippet in hdfs-site.xml

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
       <name>dfs.namenode.name.dir</name>
      <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
   </property>
</configuration>

12. Update yarn-site.xml
This file contains site specific YARN configuration properties. To update this file use below command.

hadoop@codeninja:/usr/local/hadoop$ sudo gedit hadoop-2.7.2/etc/hadoop/yarn-site.xml

Now update the yarn-site.xml with following snippet.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

13. Format Namenode
As the final step before starting Hadoop, we need to format the NameNode to make a clean start. Use the command below:

hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ hdfs namenode -format
   

The above command will end with a message like the one below.

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at codeninja/127.0.1.1
************************************************************/

14. Start hadoop
To start Hadoop, first refresh the .bashrc file, then run the start-dfs.sh and start-yarn.sh commands.
hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ source ~/.bashrc
   

Now run

hadoop@codeninja:~$ start-dfs.sh 
hadoop@codeninja:~$ start-yarn.sh 

15. Check whether hadoop is started or not
Run the jps command (Java Virtual Machine Process Status Tool); it lists all Java processes of the current user.
hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode

The presence of all the above processes indicates a successful installation of Hadoop.

16. To stop Hadoop, use the command below:

hadoop@codeninja:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop$ stop-all.sh
   

This is the end of the Hadoop installation article. In the next post we will discuss Hadoop's basic building blocks. Please share your views and feedback in the comment section below and stay tuned for more articles on Hadoop.


Wednesday, 21 September 2016

Python Function : Return Values and 'return' Statement

Hello readers! You are reading the 23rd article of our tutorial series on Python and we are going to revisit Python functions in this one. In two of our previous articles, we learned about Python function definition & calls and Python namespace & scope. This article continues that thread, as we are going to look at return values and the return statement. I recommend going through the articles mentioned above to get a clearer idea of this one, as it builds on them.

python-return-statement-return-values

Lets start this discussion with the important takeaways from the earlier articles on Python function :
  1. A function can be defined using the def keyword, the name of the function, followed by a pair of parentheses.
  2. A function can have parameters to store the values of the arguments when the function is called.
  3. Then follows an indented block of statements, where the entire logic is incorporated.
  4. We can call a function with its name and the parentheses, like myFunc().
  5. We can pass arguments to send some information to the function.
  6. Any variables used inside a function are stored in its local namespace.
  7. We have a global namespace where all the variable names declared in the program body (outside functions) reside.
  8. We have an enclosed namespace when two or more functions are nested; the variable names used in enclosing functions are stored in the enclosed namespace.
  9. The built-in namespace consists mainly of Python's built-in functions and exceptions.
  10. A name is searched in these namespaces in a specific order as per the LEGB rule, i.e. Local, Enclosed, Global and Built-in.

Return Values and return Statement

So far, we have used functions to perform some task and print the result, be it a simple string, the table of a number or some power of a number, using the print statement. A print statement writes the output to the screen (stdout), which cannot be used for further processing. Say we have a function power(base, exp) that calculates the expth power of base and another function cubert(num) that calculates the cube root of a number, and we wish to calculate a number using power() and pass it as an argument to cubert(). As we are only printing the result on stdout, we cannot use it in another function, even if we had stored it in a variable inside power() (remember local namespace and global namespace?). So, we need a mechanism with which the result of one function can be used elsewhere in the program, as and when needed. That mechanism is the return statement, sketched briefly below.
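Here is a minimal sketch of that power()/cubert() combination; cubert() is a hypothetical helper that uses a floating-point cube root, so the result is rounded before printing. It already uses the return statement explained next.

>>> def power(base, exp) :
...     return base ** exp
...
>>> def cubert(num) :
...     return num ** (1.0 / 3)      # floating-point cube root
...
>>> result = cubert(power(3, 3))     # 3 ** 3 = 27, and the cube root of 27 is 3
>>> print round(result, 6)
3.0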
The return statement is optional to use. When used, it exits the current function and returns to the position where the function was called, i.e. to the caller. It also sends a value back to the caller, which can then be used anywhere in the program or by another function. The value to be handed back (known as the return value) is decided by the programmer and is the argument of the return statement, e.g. return object_name. When return is not used, the function returns None by default. To demonstrate this, let's check the example of our power() function created in the previous article.

Example 1 : If return is not used, None is returned.
>>> def power(base, exp) :
...     print str(base) + ' to the power of ' + str(exp) + ' is ' + str(base ** exp)
...
>>> myResult = power(3, 4)
3 to the power of 4 is 81
>>> print myResult
None
In the above example, the function power() just prints the result on the screen; the return statement is not used. Since the print statement doesn't return anything and return is not used, the default value None is returned to the caller, which then gets assigned to the global variable myResult and displayed.

Example 2 : Returning a value
>>> def power(base, exp) :
...     print str(base) + ' to the power of ' + str(exp) + ' is ' + str(base ** exp)
...     return base ** exp
...
>>> myResult = power(3, 4)
3 to the power of 4 is 81
>>> print myResult
81
In the above example, we actually used the return statement, which returns the result of the exponentiation operation to the caller. Thus, myResult is assigned the value returned by power() and that value is printed. Had we written any statements after return, they would not have executed, as the quick example below shows.
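A small demonstration of that last point, where a print statement placed after return is simply never reached:

>>> def power(base, exp) :
...     return base ** exp
...     print 'This line is never reached'
...
>>> power(3, 4)
81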

Example 3 : Using return values from another function
# Function to calculate the 'n'th power of a number
>>> def power(base, exp) :
...     return base ** exp
...

# Function to add two numbers
>>> def addition(a, b) :
...     return a + b
...

# Function that uses power() and addition() to add the squares of two numbers
>>> def myFunction(num1, num2):
...     sqr_num1 = power(num1, 2)
...     sqr_num2 = power(num2, 2)
...     result = addition(sqr_num1, sqr_num2)
...     return result
...
# Function call
>>> myFunction(3, 4)
25
>>> myFunction(8, 6)
100
In this example, we have two functions, power() and addition(), which return the results of the exponentiation and addition operations respectively. The third function, myFunction(), uses these two functions to calculate the sum of the squares of two numbers and returns the result.

With this, we end our discussion of Python function return values and the return statement. In this article, we learned how the result of the operations done by a function can be further processed in the program using return values, which we cannot achieve using the print statement. We have already discussed the basics of function arguments; we will be learning about them in much more detail in the next article. Please share your opinions and views in the comment section below and stay tuned for more interesting articles on Python. Thank you!

Tuesday, 20 September 2016

Python Namespace & Scope : Local, Global & Enclosed

Hello readers! This is the 22nd article of our tutorial series on Python and in this article, we will be learning about local and global variables. This article is related to our last article on Python functions, so I recommend having a read of it first, so that things will be easier to understand. In this article, we will first understand namespaces, where all the declared variables are actually stored and retrieved as and when required. Then we will get to know two important terms - local variables and global variables.

python-namespace-scope

Name and Namespace

We know that everything in Python is an object and, to identify an object easily, we assign a name to it, which we call a variable. In our article on Python variables, we studied that a variable is an identifier of an object and that the object can be accessed using the name assigned to it. Consider that we have declared a variable as myString = "I <3 Python!". With this, a str type object "I <3 Python!" is stored in memory, which can also be accessed using the name myString. Now, we can perform any operation on the object or call the corresponding methods on it by using its name. For example, in order to convert every letter in the string to upper case, we can use myString.upper().
Now, if we create some more objects and assign names to them, they will be stored in memory and will be accessible using the names assigned to them. With this, you can consider a 'namespace' to be the place where all these names are stored. In simple words, a namespace is the collection of all the names in the form of dictionary items, as name : object. To understand this better, I would like to introduce the globals() function here, which will display this mapping. For now, don't bother about why the name is globals(); we will see that later. This is just to show you what a namespace looks like.
>>> myString = "I <3 Python!"
>>> myList = [1, 2, 3, 4]
>>> myTuple = ('A', 'B', 'C')
>>> globals()
{'myTuple': ('A', 'B', 'C'), 'myList': [1, 2, 3, 4], '__builtins__': <module '__builtin__' (built-in)>, '__package__': None, 'myString': 'I <3 Python!', '__name__': '__main__', '__doc__': None}
In the above example, we created three objects and assigned the names myString, myList and myTuple to them. To display the namespace, we called the Python built-in function globals(), which returned a dict, with items in name : object format, 'myTuple': ('A', 'B', 'C') being one of them. Thus, as soon as we create a name and assign a value to it, it becomes part of a namespace. We have zero or more local namespaces and a global namespace. Based on where in the code the assignment happens, the namespace the variable will be a part of is decided. Let us discuss this in the next section.

Global Namespace and Local Namespace
As mentioned above, which namespace a name belongs to is decided by where the variable is declared in the program. This raises one more question: where can we declare a name? The answer is that a name can be defined in the main program body, in a function or in a module (a module can be considered a pre-existing Python program file from which an object can be imported). If a name is declared in the main program body, it will be part of the global namespace, while a name defined inside a function becomes part of that function's local namespace. Every function has its own local namespace, so that the values assigned to a name inside a function do not get overwritten by the names declared in other functions or in the main program body (the global namespace). In conclusion -

1. We have a global namespace, where all the names declared in the main program body reside.
2. Every function is associated with its own namespace, known as local namespace. We can have multiple local namespaces in a program, as we can have zero or more functions in our piece of code.
3. A name declared in a function can only be accessed within that function.
4. A name can be a part of one or more namespaces, but their values do not collide with each other. For example, a name myVar declared in the main program body (global namespace) and a name myVar declared in a function myFunc are absolutely different entities and their values do not clash with one another.
5. This also means that, in the case of nested functions (a function inside another function), each of them has its own namespace. In this case, the namespace of the enclosing function is often called the Enclosed namespace.
6. Every call to a function creates a new local namespace.
7. We also have Built-in namespace, which is reserved by Python, consisting mainly of exceptions and built-in functions.
There might be numerous namespaces created in a program, but not all of them can be accessed from anywhere in the program. This introduces a new term, Scope, which defines the section of the program from where a namespace can be accessed. We discuss this in the next section, after a quick look below at the local namespace of a function.
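Just as globals() shows the global namespace, the built-in locals() function returns the local namespace of the function it is called from; here is a small sketch of the difference:

>>> myVar = "Global"
>>> def myFunction() :
...     myVar = "Local"
...     print 'Local namespace of myFunction : ' + str(locals())
...
>>> myFunction()
Local namespace of myFunction : {'myVar': 'Local'}
>>> globals()['myVar']
'Global'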

Scope

In the above section, we have come to know that a program may have a number of namespaces, which are independent of each other and not accessible from everywhere in the program. A scope decides the part of the program from where a name can be accessed. So, based on where the name is referenced in the program, the name is searched in the namespaces in a specific order, as per the LEGB (Local -> Enclosed -> Global -> Built-in) rule.
As per LEGB rule, the order of search should be -
1. Local scope (or namespace) - The name is searched in the local namespace, if it is referenced inside a def.
2. Enclosed scope - If not present in local namespace, the namespace of enclosing def is searched.
3. Global scope - If not found, global namespace - the top level namespace in a file, is looked into.
4. Built-in scope - If still not found, it searches the Built-in scope, which is nothing but the Python builtin module.
5. If the name is not found in any of the above-mentioned scopes, a NameError exception is raised. We will now see some examples in order to verify everything we have studied in this article so far.
Example 1 : Local and Global Scope
>>> myVar = "Global"
>>> def myFunction() :
...     myVar = "Local"
...     print 'Value of myVar inside function = ' + myVar
...
>>> print 'Value of myVar outside function = ' + myVar
Value of myVar outside function = Global
>>> myFunction()
Value of myVar inside function = Local
In the above example, we declared a name myVar = "Global" in the main program body and myVar = "Local" inside the function myFunction. When we use the variable in the main program, its value from the global namespace is used, whereas when the function is called, the value from its local namespace is used, as per the LEGB rule.
Example 2 : Local, Enclosed and Global Scope
>>> myVar = "Global"
>>> def outerFunction():
...     myVar = "Enclosed"
...     print 'Value of myVar in outerFunction() = ' + myVar
...     def innerFunction():
...             myVar = "Local"
...             print 'Value of myVar in innerFunction() = ' + myVar
...     innerFunction()
...
>>> print 'Value of myVar in main program = ' + myVar
Value of myVar in main program = Global
>>> outerFunction()
Value of myVar in outerFunction() = Enclosed
Value of myVar in innerFunction() = Local
In the above example, myVar is declared three times, with the values "Global" in the main program body, "Enclosed" inside outerFunction and "Local" in innerFunction, to show which namespace's value myVar takes when referenced at different locations in the program.
Example 3 : Local, Enclosed, Global and Built-in Scope
As discussed earlier, the __builtin__ module contains all the names Python reserves as built-ins. To see all of them, we import the module and pass it as an argument to the dir() function. We will observe built-in exceptions and functions in the long list returned.
# Snipping the long list
>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', ... , 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']
Now, we create our own pow() function that does the reverse operation, i.e. exponent to the power of base, and check whether it gets picked up instead of the one in the built-in scope.
>>> def pow(base, exp):
...     print 'Result from the pow function = ' + str(exp ** base)
...
>>> def myFunction(base, exp):
...     print 'Calling pow() from myFunction...'
...     pow(base, exp)
...     print 'Result from myFunction = ' + str(base ** exp)
...
>>> myFunction(3, 4)
Calling pow() from myFunction...
Result from the pow function = 64
Result from myFunction = 81
From the above results, we observe that the pow() function from the global scope is found and executed instead of the pow() function in the built-in scope, exactly as the LEGB order predicts.
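Continuing the same interactive session, if we still need the original built-in after shadowing it this way, we can (in Python 2) reach it explicitly through the __builtin__ module, or delete our own name so that the LEGB search falls through to the built-in scope again:

>>> import __builtin__
>>> __builtin__.pow(3, 4)      # the original built-in pow(), 3 to the power of 4
81
>>> del pow                    # remove our own global pow()
>>> pow(3, 4)                  # the lookup now falls through to the built-in scope
81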
With this, we've come to the end of this article. In this article, we learned mainly about namespaces and scope. Essentially, we understood how names are searched for in the scopes, based on where they are referenced in the program. In the next article, we will study return values and the return statement. Please share your views and opinions in the comment section below and stay tuned for more articles. Thank you.

Friday, 16 September 2016

Hadoop-Bigdata : Introduction and Application

Hello readers! This is the 2nd article of our tutorial series on Hadoop and in this article we will be discussing one of the most prominent and heavily crunched technologies, Bigdata. We have already come across this term in the last article, Introduction to Hadoop. Bigdata is closely related to Hadoop, as Hadoop is used to process Bigdata, hence it is of utmost importance to know about Bigdata.

hadoop-bigdata-introduction-application

  Bigdata is a data set which is beyond the storing and processing capacity of commodity hardware and of the traditional tools that process data effectively. In day-to-day life, while purchasing a mobile, a laptop or a vehicle, or booking movie tickets in advance, we always check and compare the price, quality, reviews, ratings and many other things. All these activities are related to data, which helps us take an appropriate decision.
Someone new to Bigdata considers it just huge, untidy data that needs to be analyzed. Though that is correct to some extent, what actually is Bigdata, what does it look like, and what properties does it have? We are going to discuss all of this along with introductory solutions, and later on we will discuss those solutions in detail.

What kind of data is Bigdata?

Large (Volume), rapidly generated (Velocity) data of different types (Variety) and uncertain quality (Veracity) is Bigdata, which we have to turn into useful and meaningful information (Value), i.e. knowledge. Among the 5 V's mentioned in parentheses, the last V, Value, is very crucial: our ultimate aim is to extract this V, the valuable information, out of all the Bigdata we have. First of all, let us look at those 5 V's of Bigdata in detail:

1. Volume : The size of the data generated and stored largely defines whether it is Bigdata or not. Bigdata is so huge that it is hard to imagine: it is measured not in GBs (gigabytes) or TBs (terabytes) but in EBs (exabytes) and ZBs (zettabytes). Such a volume of data is impossible to store on the storage hardware of a single machine.
As an example, consider the data generated on Facebook. As per the records, Facebook has 1.71 billion monthly active users and, on average, 57% of them like posts, 33% comment on them and 33% share them. All these activities are stored in the form of logs, creating huge log files of approximately 600 TB/day, i.e. roughly 18 PB in a month. This is how the volume of Bigdata is occupying the digital world.

2. Velocity : This is the rate at which data is being generated by social networking sites, businesses, networks, machines and, above all, human interaction. As discussed above, the volume of data generated in a day at Facebook is almost 600 TB, while wireless sensors generate about 2 PB a day across the globe. The stock market, weather reports and networks also generate data at similar rates, hence contributing to Bigdata.

3. Variety : It depicts the diversity in the types of data generated. One of the best examples: we post images, videos and text on Facebook, and hashtags, speeches and typos on Twitter. The data is of different kinds, and the volume is humongous.

4. Veracity : It shows how messed up, or how correct, the data is. As huge data is generated from reliable as well as misleading sources and in many forms, it is very hard to trust its quality and accuracy. At such volumes, the veracity and integrity of the data become serious concerns.

5. Value : Whatever we do, be it learning Bigdata technologies, developing different tools or manipulating huge, messed up, unstructured data, the goal is to analyse it and filter out valuable information, i.e. knowledge, which is the outcome. This outcome is incorporated into the recommendation engines of Flipkart, Amazon, Myntra and many other organisations to target customers. It plays an important role in optimising business processes, analysing social media data and making predictions about the stock market. We will discuss the benefits of analysing Bigdata in a later section of this article.

Types of data : 

1. Structured data: Data in relational form, i.e. data that can be stored in tabular (row-column) format. An employee database, a telephone directory and airline schedules are some of the best examples of structured data.
2. Semi-structured data: Data that is not organized enough to be stored in an RDBMS, yet not so scrambled as to be called unstructured. XML files, CSV files and the metadata of a table can be considered semi-structured data.
3. Unstructured data: Absolutely untidy data which cannot be arranged in relational or tabular form at any cost is considered unstructured data. Log files and free text files are typical examples. (A small illustration of all three types follows below.)
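As a rough Python illustration (all the records below are made up for the example), the same kind of information can take all three shapes:

# Illustrative, made-up records showing the three shapes of data.

# Structured: fixed rows and columns, as in an RDBMS table
employee_row = (101, "Asha", "Pune", 55000)            # (id, name, city, salary)

# Semi-structured: tagged/keyed, but not bound to a rigid table schema
employee_json = {"id": 101, "name": "Asha", "skills": ["Hadoop", "Python"]}

# Unstructured: free text, e.g. a raw application log line
log_line = "2016-09-16 10:42:13 INFO user 101 logged in from Pune"

print(employee_row[1])            # column access by position
print(employee_json["skills"])    # access by key; the schema can vary per record
print("INFO" in log_line)         # only crude text operations are possible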

What data comes under Bigdata?

1. Social media data: Analyzing the posts, likes and shares of users on Facebook, and the likes and tweets on Twitter, is helpful for getting real-time insight into what users think about products and campaigns. This helps companies take useful decisions about pricing, promotions and campaigns.
2. Stock exchange data: It holds information about the trading done by users, i.e. their buy and sell decisions. Analysis of such data is very helpful to companies as well as to users.
3. Search engine data: Search engines retrieve data from a lot of sources and databases, and generate results and analyses according to the different filters applied by users.
4. Fraud detection data: This is one of the most compelling applications of Bigdata, and the goal of fraud detection is now largely achieved with it. Before the Bigdata era, there were techniques to detect fraud, but because of the huge data sets they took a lot of time, so the damage was already done and the only option left was to minimise further harm and prevent it from happening again.
So, any data that exhibits the 5 V's is Bigdata.

Advantages over RDBMS:

1. An RDBMS handles only structured data, while Bigdata tools work on any type of data.
2. We can store only around 600-700 GB of data in an RDBMS, whereas there is no such storage limitation with Bigdata platforms.
3. An RDBMS is costlier than Bigdata tooling (an Exadata Database Machine X6-8 High Capacity (HC) Full Rack costs up to 1.3 million USD), while Bigdata tools come under FOSS (Free and Open Source Software).

Limitations of Bigdata :

1. It works only on data at rest (because we first need to store the data in the Hadoop Distributed File System before we can process it).
2. It doesn't work on streaming data, i.e. continuously generated real-time data.
3. It doesn't work for OLTP (Online Transaction Processing).
4. A major issue before working on Bigdata is storing it in the first place.
5. You cannot modify data stored in HDFS, because it works on a write-once, read-many model.

Tools used to crunch Bigdata:

1. Apache Hadoop: made up of HDFS and MapReduce; HDFS is the storage layer for Bigdata and MapReduce is the processing framework/engine that performs operations on the stored data.
2. Apache Pig: an abstraction over MapReduce, used for Extract-Transform-Load (ETL) operations, analysing data, rapid development and ad-hoc queries.
3. Apache Hive: Hive works on structured data only and is designed specifically for SQL users. It makes ETL easy and provides a mechanism to impose structure on a variety of data.
4. Apache HBase: stores data in big tables and provides random access to the stored data. HBase is highly scalable and provides consistent read/write operations.
5. Apache Sqoop: helps move data between HDFS and relational database servers, e.g. from HDFS to MySQL and vice versa.
6. Apache Flume: used for collecting, aggregating and storing streaming data into HDFS.
7. Apache YARN (Yet Another Resource Negotiator): the cluster management technology used in Hadoop 2.0.
8. Apache Spark: designed for high-speed processing and sophisticated analytics.

Where the Bigdata is useful:

1. Banking and security domain
2. Social networking (Facebook, Twitter, Google+ etc.)
3. E-commerce business (Offergear, Amazon , Flipkart, Snapdeal etc.)
4. Medical and healthcare
5. Education (Online courses, blogs, articles etc.)
6. Finding and processing natural resources like oil, gas, coal etc.
7. Government sector
8. Transportation
9. Share-marketing (NSE,BSE etc.)
Thus, we have come to the end of this article. In this article, we have learned mainly about Bigdata, the need for it and its applications. In the next article, we will learn how to install Hadoop on Ubuntu Linux. Please share your views and feedback in the comment section below and stay tuned for more articles on Hadoop.

Tuesday, 13 September 2016

Python Function Basics - Function Definition and Function Call

Hello readers! This is the 21st article of our tutorial series on Python and we start our discussion of Python functions with this article. In our recent articles, we discussed loop statements - the for-else loop and the while-else loop. In this article, we will be learning about what functions are, how we can create our own function and how we can use it.

python-function-basics-function-definition-function-call-function-argument-and-function-parameter

A function, in any programming language, is a group of statements, identified by a name, that performs a specific task. Whenever we want to re-use these statements in the code, instead of writing them all out every time, we can invoke them using the specified name. This not only saves copy-paste effort but also reduces code redundancy in the program. Also, if any statement in the group needs a correction, one need not correct every occurrence of it; it is sufficient to make the correction in the defined function. Things will become clear when we go through some examples of functions. But to start with, let's learn how functions are created, or defined.

Function definition

To define a function, Python has the keyword def, which gives us a function type, i.e. an object of class function. Please have a look at the basic syntax first; then we will try to unfold everything.
def function_name():
    # Block of code starts
    ...
    ...
    ...
    ...
    # Block of code ends
In the syntax above, the line starting with def is often called the function header and it consists of the keyword def, the name of the function and a pair of parentheses. In the parentheses, we can mention zero or more parameters, which we will discuss later. After that, we have an indented block of statements, which we can call the 'function body', that will be executed at every run of the function. So, if we were to define our own function myFirstFunction which does nothing, we can have a pass statement in the function body as shown below:
>>> def myFirstFunction() :
...     pass
...
Congrats, you've just created your first Python function. Had we replaced the pass statement with print 'Hello World', it would have printed Hello World on the screen. Shall we check what type of object a Python function is? We need a Python built-in function for this, which is the type() function.
>>> type(myFirstFunction)
<type 'function'>
As I said earlier, the resulting object is of the function type. Now, we have created a function, but how do we use it? Let's see this in the following section.

Function Call

A function call is nothing but making use of the function you have created. To do this, you just need to mention the function name followed by a pair of parentheses. Remember, with the def keyword, you have only created the function. Unless you create a function, you cannot call it, so the function definition must come before the function call or you will face an error message (a NameError). Let's try defining a function myFunction, that prints Congrats! You have just executed a function, and calling it.
>>> def myFunction() :
...    print "Congrats! You have just executed a function"
...

>>> myFunction()
Congrats! You have just executed a function
As mentioned above, in order to call a function, we just mention its name followed by the parentheses, and the function gets executed. Next, we need to know two important terms related to Python functions, which I might have mentioned earlier but which have to be understood in detail - Function Argument and Function Parameter.

Function Argument and Function Parameter

We have already used some functions like dir() and len() in our previous articles, in which dir and len are the function names. We used them to get a list of the attributes and methods associated with an object and to find the length of an object, respectively. So, in order to get the attribute list of a list myList, we would use dir(myList), or to find its length, we would use len(myList); here we call myList an Argument to the dir() and len() functions. With an argument, we send some information, in the form of a variable or a value, to a function. On the other hand, the function receives these values using another variable, declared in the function header, which is called a Parameter.
To understand this, let's take the example of a Python built-in function, say the dir() function. The built-in functions are defined somewhere in the Python library and they must be defined with a syntax somewhat similar to what we saw earlier. So, the dir() function might look like:
def dir( param ) :
    ....
    ....
    ....
    ....
Thus, the variable param is the parameter of the function. When we actually use the function as dir(myList), by providing the argument myList to that function, the function reads the value stored in the variable myList and assigns it to the variable param. After some processing on it, we finally get the appropriate output. Take a look at the examples below.
Example 1 : Table of a number
>>> def table(myNum) :
...     print '1 * ' + str(myNum) + ' = ' + str(1 * myNum)
...     print '2 * ' + str(myNum) + ' = ' + str(2 * myNum)
...     print '3 * ' + str(myNum) + ' = ' + str(3 * myNum)
...     print '4 * ' + str(myNum) + ' = ' + str(4 * myNum)
...     print '5 * ' + str(myNum) + ' = ' + str(5 * myNum)
...     print '6 * ' + str(myNum) + ' = ' + str(6 * myNum)
...     print '7 * ' + str(myNum) + ' = ' + str(7 * myNum)
...     print '8 * ' + str(myNum) + ' = ' + str(8 * myNum)
...     print '9 * ' + str(myNum) + ' = ' + str(9 * myNum)
...     print '10 * ' + str(myNum) + ' = ' + str(10 * myNum)
...

>>> table(5)
1 * 5 = 5
2 * 5 = 10
3 * 5 = 15
4 * 5 = 20
5 * 5 = 25
6 * 5 = 30
7 * 5 = 35
8 * 5 = 40
9 * 5 = 45
10 * 5 = 50
In the above example, we created a function named table which prints the multiplication table of the number provided as an argument. It uses the parameter myNum to store the number. After some operations on myNum, we print the result. We also tried calling it with the number 5 as an argument and it printed the table of 5. Now, in order to print the table of 7, we need not write the statements again; just call the function as table(7) and you are done. Conclusion - functions improve the re-usability of the code and reduce redundancy.
Example 2 : Power of a number
>>> def power(base, exp) :
...     print str(base) + ' to the power of ' + str(exp) + ' is ' + str(base ** exp)
...

>>> power(2, 3)
2 to the power of 3 is 8
>>> power(3, 4)
3 to the power of 4 is 81
>>> power(6, 3)
6 to the power of 3 is 216
In this example, we have created a function named power that has two parameters: base to store the value of the base and exp to store the value of the exponent. This function calculates the expth power of base and prints it on the screen. While calling the function, we mentioned two arguments every time and got the appropriate results in the output.
In this way, we have come to the end of this article. This article was intended to give you very basic knowledge about functions, in which we studied how to define a function and how to call one. Meanwhile, we also learned about arguments and parameters and used them in our function definitions and function calls. Although we are not finished with functions, I recommend you practise whatever we have studied and start writing some functions of your own. Please write your reviews and feedback in the comment section below. In the next article, we will cover the scope of variables. Till then, stay tuned. Thank you.