Install and configure Apache Phoenix on Cloudera Hadoop CDH5



          Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Step 1: Download Latest version of Phoenix using command given below
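The command below fetches the 4.3.1 binary tarball from the same Apache mirror that appears in the transfer log that follows (any Apache mirror will do):

wget http://mirror.reverse.net/pub/apache/phoenix/phoenix-4.3.1/bin/phoenix-4.3.1-bin.tar.gz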


--2015-11-23 12:20:21-- http://mirror.reverse.net/pub/apache/phoenix/phoenix-4.3.1/bin/phoenix-4.3.1-bin.tar.gz
Resolving mirror.reverse.net... 208.100.14.200
Connecting to mirror.reverse.net|208.100.14.200|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 72155049 (69M) [application/x-gzip]
Saving to: “phoenix-4.3.1-bin.tar.gz.1”
100%[=====================================] 72,155,049   614K/s   in 2m 15s
2015-04-10 12:25:45 (521 KB/s) - “phoenix-4.3.1-bin.tar.gz.1” saved [72155049/72155049]

Step 2: Extract the downloaded tar file to a convenient location

[root@maniadmin ~]# tar -zxvf phoenix-4.3.1-bin.tar.gz
phoenix-4.3.1-bin/bin/hadoop-metrics2-phoenix.properties
-
-
phoenix-4.3.1-bin/examples/WEB_STAT.sql

Step 3: Copy phoenix-4.3.1-server.jar to the HBase lib directory on the master server and on each region server


On the master server, copy “phoenix-4.3.1-server.jar” to “/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/”.


On each HBase region server, copy “phoenix-4.3.1-server.jar” to the same “/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/” location (see the sketch after Step 4).


Step 4: Copy phoenix-4.3.1-client.jar to each HBase region server

Please make sure phoenix-4.3.1-client.jar is present at /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/ on each region server.
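A minimal sketch for Steps 3 and 4, assuming passwordless SSH as root, that master1, rs1 and rs2 are placeholders for your actual master and region server hostnames, and that both jars sit at the top of the extracted phoenix-4.3.1-bin directory:

# server jar to the HBase master
scp phoenix-4.3.1-bin/phoenix-4.3.1-server.jar \
    root@master1:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/

# server and client jars to every region server
for host in rs1 rs2; do
    scp phoenix-4.3.1-bin/phoenix-4.3.1-server.jar \
        phoenix-4.3.1-bin/phoenix-4.3.1-client.jar \
        root@${host}:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hbase/lib/
done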


Step 5: Restart HBase services via Cloudera Manager

Step 6: Testing – go to extracted_dir/bin and run the command below


[root@maniadmin bin]# ./psql.py localhost ../examples/WEB_STAT.sql ../examples/WEB_STAT.csv ../examples/WEB_STAT_QUERIES.sql 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
15/11/23 13:51:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
no rows upserted
Time: 2.297 sec(s)
csv columns from database.
CSV Upsert complete. 39 rows upserted
Time: 0.554 sec(s)
DOMAIN                                   AVERAGE_CPU_USAGE                        AVERAGE_DB_USAGE
---------------------------------------- ---------------------------------------- ----------------------------------------
Salesforce.com                                                           260.727                                  257.636
Google.com                                                               212.875                                   213.75
Apple.com                                                                114.111                                  119.556
Time: 0.2 sec(s)
DAY                                             TOTAL_CPU_USAGE                           MIN_CPU_USAGE                           MAX_CPU_USAGE
----------------------- ---------------------------------------- ---------------------------------------- ----------------------------------------
2013-01-01 00:00:00.000                                       35                                       35                                       35
2013-01-02 00:00:00.000                                     150                                       25                                      125
2013-01-03 00:00:00.000                                       88                                       88                                       88
-
-
2013-01-04 00:00:00.000                                       26                                        3                                       23
2013-01-05 00:00:00.000                                      550                                       75                                      475
Time: 0.09 sec(s)
HO                   TOTAL_ACTIVE_VISITORS
-- ----------------------------------------
EU                                     150
NA                                       1
Time: 0.052 sec(s)
Done.

Step 7: To get a SQL shell


[root@maniadmin bin]# ./sqlline.py localhost
Setting property: [isolation, TRANSACTION_READ_COMMITTED]
issuing: !connect jdbc:phoenix:localhost none none org.apache.phoenix.jdbc.PhoenixDriver
Connecting to jdbc:phoenix:localhost
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
15/11/23 14:58:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connected to: Phoenix (version 4.3)
Driver: PhoenixEmbeddedDriver (version 4.3)
Autocommit status: true
Transaction isolation: TRANSACTION_READ_COMMITTED
Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
77/77 (100%) Done
Done
sqlline version 1.1.8
0: jdbc:phoenix:localhost>
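From this prompt you can run sqlline meta-commands and Phoenix SQL. A small hedged example against the WEB_STAT sample table loaded in Step 6 (output omitted):

0: jdbc:phoenix:localhost> !tables
0: jdbc:phoenix:localhost> SELECT COUNT(*) FROM WEB_STAT;
0: jdbc:phoenix:localhost> !quit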



Amazon in-house interview questions

Once you clear the 2-3 telephonic rounds, they will invite you for in-house interviews (face-to-face or video conference) at the nearest Amazon office.

You will meet 5-6 Amazonians. The mix of interviewers will include managers and peers who make up the technical team.

Each meeting is a one-on-one interview session lasting approximately 45-60 minutes (about 5 hours in total).

In my case it was a video conference round; below are a few of the behavioral questions covered across the 5 rounds.


1) Why Amazon?
2) What is your understanding of the role?
3) What do you wish to change in your current environment?
4) What is the customer interaction that you are most proud of?
5) Describe your most difficult customer interaction.
6) Tell me about a time you made a significant mistake. What would you have done differently?
7) Give an example of a tough or critical piece of feedback you received. What was it and what did you do about it?
8) Describe a time when you needed the cooperation of a peer or peers who were resistant to what you were trying to do. What did you do? What was the outcome?
9) You saw a peer struggling – what did you do to help?
10) Give me an example of when you had to make an important decision in the absence of good data because there just wasn’t any. What was the situation and how did you arrive at your decision? Did the decision turn out to be the correct one? Why or why not?
11) Tell me about a time you took a big risk. What was the situation?
12) Give me an example of a time when you were able to deliver an important project under a tight deadline. What sacrifices did you have to make to meet the deadline? How did they impact the final deliverables?

Amazon interview questions for cloud support engineer


Interview process:

2-3 telephonic rounds (45 min to 1 hr each) and 5 back-to-back rounds with 5 managers (about 5 hrs).


1st round:

1. About the role?
2. Linux boot process?
3. What is GRUB?
4. What is iptables?
5. What is a default gateway and where can we configure it?
6. What parameters are there in the ifcfg-eth0 file?
7. Difference between TCP and UDP?
8. How will you check the free space?
9. What is HDFS?
10. File write process in Hadoop?
11. File read process in Hadoop?
12. How to run a job in Hadoop?
13. What is the loopback address, and what is the 0.0.0 in it?
14. What is subnet masking?
15. Asked about some port numbers, like 22 / 25 / 53 / 80 / 110 / 3306
16. What is DNS?
17. What is DHCP and how does it work?
18. What is the difference between NTFS and FAT32?
19. What is RODC?

Container [pid=26551,containerID=container_1437800838385_0177_01_000002] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 3.3 GB of 2.1 GB virtual memory used. Killing container.


  Dump of the process-tree for container_1437800838385_0177_01_000002 :


From the error message, you can see that the container is using 3.3 GB of virtual memory against a limit of 2.1 GB (the 1 GB physical allocation multiplied by the default vmem-pmem ratio of 2.1). This can be resolved in two ways:

Disable Virtual Memory Limit Checking
YARN will simply ignore the limit; in order to do this, add this to your yarn-site.xml:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

The default for this setting is true.

Increase Virtual Memory to Physical Memory Ratio

In your yarn-site.xml, change this property to a higher value than is currently set:
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>5</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
</property>
The default is 2.1
You could also increase the amount of physical memory you allocate to a container.
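For example, for a MapReduce job the per-container request can be raised at submit time instead of cluster-wide; a rough sketch, assuming the job uses ToolRunner/GenericOptionsParser and that my-job.jar / MyJob / the paths are placeholders:

hadoop jar my-job.jar MyJob \
    -Dmapreduce.map.memory.mb=2048 \
    -Dmapreduce.map.java.opts=-Xmx1638m \
    -Dmapreduce.reduce.memory.mb=4096 \
    -Dmapreduce.reduce.java.opts=-Xmx3276m \
    /input/path /output/path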

Don't forget to restart YARN after you change the config.

Clock skew too great while getting initial credentials error - Kerberos / AD Integration step

Issue  :  Clock skew too great while getting initial credentials error  - Kerberos / AD Integration step

Solution: the cluster machines' clocks are not in sync with the AD server.


Point NTP at the AD server (or a common time source) and resync the clocks, for example:
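A rough sketch of a one-off resync on an affected host, assuming RHEL/CentOS 6 and ad.example.com as a placeholder for your AD / NTP server:

service ntpd stop
ntpdate ad.example.com
service ntpd start
# keep ad.example.com listed as a server in /etc/ntp.conf so the clocks stay in sync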

Failing Oozie Launcher, Output data size [7,692] exceeds maximum [2,048]

    Issue :   Failing Oozie Launcher, Output data size [7,692] exceeds maximum [2,048]

Solution:

Add the following property to oozie-site.xml and restart Oozie:

<property>
     <name>oozie.action.max.output.data</name>
     <value>8192</value>
</property>



After upgrading from Cloudera 5.3.3 to 5.4.3, the Hive CLI throws the error below.

Error:

[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
                at jline.TerminalFactory.create(TerminalFactory.java:101)
                at jline.TerminalFactory.get(TerminalFactory.java:158)
                at org.apache.hive.beeline.BeeLineOpts.<init>(BeeLineOpts.java:73)
                at org.apache.hive.beeline.BeeLine.<init>(BeeLine.java:117)
                at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:469)
                at org.apache.hive.beeline.BeeLine.main(BeeLine.java:453)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
                at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
                at org.apache.hive.beeline.BeeLineOpts.<init>(BeeLineOpts.java:101)
                at org.apache.hive.beeline.BeeLine.<init>(BeeLine.java:117)
                at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:469)
                at org.apache.hive.beeline.BeeLine.main(BeeLine.java:453)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

                at org.apache.hadoop.util.RunJar.main(RunJar.java:136) 


Reason: Hive has upgraded to JLine2, but the old jline 0.9x jar still exists in the Hadoop lib directory and is picked up first.

Resolution: Delete (or move aside) the old jline jar from the Hadoop lib directory, as sketched below.
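A hedged sketch of the two common fixes; the exact jar name and parcel path vary by release, so treat them as examples:

# find the old jline jar that Hadoop ships
find /opt/cloudera/parcels/CDH*/lib/hadoop* -name 'jline-0.9*.jar'

# option 1: move it out of the Hadoop lib directory
mv /opt/cloudera/parcels/CDH/lib/hadoop/lib/jline-0.9.94.jar /tmp/

# option 2: let Hive's newer JLine win on the classpath
export HADOOP_USER_CLASSPATH_FIRST=true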

Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.

Error Message :


15/06/18 16:00:29 ERROR source.SpoolDirectorySource: FATAL: Spool Directory source xxxxxx: { spoolDir: /root/Documents/FlumeWorkSpace/amsInput/ }: Uncaught exception in SpoolDirectorySource thread. Restart or reconfigure Flume to continue processing.
java.lang.NullPointerException
        at com.interceptor.AMSLogParserInterceptor.intercept(AMSLogParserInterceptor.java:34)
        at com.interceptor.AMSLogParserInterceptor.intercept(AMSLogParserInterceptor.java:52)
        at org.apache.flume.interceptor.InterceptorChain.intercept(InterceptorChain.java:62)
        at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:146)
        at org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:236)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Reason:

·         A corrupted file being loaded, or a problem in the custom interceptor you have written.
·         Duplicate files being loaded into the spool directory.




RHEL 7 Installation


1. Select Install Red Hat Enterprise Linux 7.0




2. Choose Language


3. Select Time Zone



4. Select software as per the requirement



5. Device selection: here we have two options, 1. Automatic and 2. Manual.

Here I am going with automatic; select the drive and click on DONE.


6. Provide Root password



Click on DONE.



Click on Reboot




Click on Finish


IPv6 Configuration in RHEL 6

Append the lines below to /etc/sysconfig/network
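(The original screenshot is not reproduced here; the line below is the typical entry, a sketch to adjust for your environment.)

echo "NETWORKING_IPV6=yes" >> /etc/sysconfig/network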




Similarly, append the lines below to /etc/sysconfig/network-scripts/ifcfg-eth0
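(Again a sketch with example values; 2001:db8::10/64 and 2001:db8::1 are placeholder address and gateway.)

cat >> /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
IPV6INIT=yes
IPV6ADDR=2001:db8::10/64
IPV6_DEFAULTGW=2001:db8::1
EOF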



Restart the network service using service network restart


Verify the IP address using ifconfig




Ping the new IP using the ping6 command:
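For example, with the placeholder address used above (replace with your own):

ping6 2001:db8::10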




Njoy :)

Device eth0 does not seem to be present, delaying initialization

Error:  "Device eth0 does not seem to be present, delaying initialization"



Cause
This happened when I cloned one of my Linux virtual machines: the cloned machine was not able to detect the NIC, and every time I tried to restart the network it threw this error.


Solution

The MAC address details for the NIC are stored in the file mentioned below, which is automatically created every time the machine boots.

                     /etc/udev/rules.d/70-persistent-net.rules

If you open this file, the contents would look like this:
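(The original screenshot is not reproduced here; the entry below is a typical example, with a placeholder driver and MAC address.)

# PCI device 0x8086:0x100f (e1000)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:0c:29:aa:bb:cc", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"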


Delete this file

           rm -f /etc/udev/rules.d/70-persistent-net.rules

Reboot the machine, i.e. init 6

After the reboot the file is regenerated, and we can see the new MAC address and NIC name.


Copy the new MAC address and NIC name into your ifcfg-eth0 file:

vi /etc/sysconfig/network-scripts/ifcfg-eth0
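The relevant lines after the edit would look something like this (example values; use the MAC and device name from the regenerated rules file, and rename the file if the NIC did not come back as eth0):

DEVICE=eth0
HWADDR=00:0c:29:aa:bb:cc
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=dhcp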



Restart the Network service using :  service network restart

Check the IP details using ifconfig


OpenStack Swift [Object Storage]

  • OpenStack Swift, also known as OpenStack Object Storage, is an open source object storage system that is licensed under the Apache 2.0 license and runs on standard server hardware.
  • OpenStack Swift is best suited to backup and archive unstructured data, such as documents, images, audio and video files, email and virtual machine images.
  • Objects and files are written to multiple drives, and the Swift software ensures the data is replicated across a server cluster. By default, Swift places three copies of every object in as unique-as-possible locations -- first by region, then by zone, server and drive. 
  • If a server or hard drive fails, OpenStack Object Storage replicates its content from active nodes to new locations in the cluster. 
  • The system, which is accessed through a REST HTTP application programming interface (API), can scale horizontally to store petabytes of data through the  addition of nodes, which typically equate to servers. OpenStack Swift software is based on Cloud Files technology developed by Rackspace Hosting Inc. 
  • Rackspace and NASA initiated the project and co-founded the community that develops and maintains OpenStack software, which includes compute, storage and networking components for building cloud computing services.



 Swift Components:
 =================

 1. Proxy Server
Ties together the Swift architecture
Request routing
Exposes the public API

 2. Ring
Maps names to entities (accounts, containers, objects) on disk
Stores data based on zones, devices, partitions, and replicas
Weights can be used to balance the distribution of partitions
Used by the Proxy Server and many background processes

 3. Object Server
Blob storage server
Metadata kept in xattrs
Data in binary format
Object location based on name & timestamp hash

Swift & Large Object Storage:
==============================

Default 5 GB limit on the size of an uploaded object
Segmentation makes the download size of a single object virtually unlimited
Segments of a large object are uploaded individually, and a special manifest file is created; on download, all segments are concatenated and served as a single object
Greater upload speed
Possible parallel uploads of segments
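A hedged sketch using the python-swiftclient CLI, assuming your OpenStack credentials (OS_AUTH_URL, OS_USERNAME, and so on) are already exported; the container name, file name and 1 GB segment size are examples:

# upload in 1 GB segments; a manifest object is created automatically
swift upload backups big_backup.tar --segment-size 1073741824

# download; the segments are fetched and concatenated transparently
swift download backups big_backup.tar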

Swift Components (continued):
==============================

Replication
       Keep the system consistent, handle failures

Updaters
       Process failed or queued updates

Auditors
      Verify integrity of objects, containers, and accounts

Container Server:
      Handles listing of objects, stores as SQLite DB

Account Server:
       Handles listing of containers, stores as SQLite DB

Limitation:
===========
Search is limited to queries based on the object's name and to a single container. No metadata or content-based search capabilities are provided.