Daily tasks - RatneshKumarSrivastava/Ratnesh GitHub Wiki
Hive: UDF
http://www.bigdata-careers.com/?page_id=63
http://bangalore.startups-list.com/startups/big_data%20analytics
=> We can load multiple files into a Hive table by pointing LOAD DATA at a directory.
load data local inpath '/home/training/Desktop/RATNESH/rk' overwrite into table tez;
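A minimal sketch of the multi-file load, assuming a simple one-column table and that the directory above contains several plain-text files (the table name tez_demo and its column are illustrative):
hive> CREATE TABLE tez_demo (line STRING) STORED AS TEXTFILE;
-- pointing LOAD DATA at a directory copies every file inside it into the table's location
hive> LOAD DATA LOCAL INPATH '/home/training/Desktop/RATNESH/rk' OVERWRITE INTO TABLE tez_demo;
hive> SELECT COUNT(*) FROM tez_demo;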
https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
https://acadgild.com/blog/apache-hive-file-formats/
Sequence files are flat files consisting of binary key-value pairs.
Loading data into this table is somewhat different from loading into a table created with the TEXTFILE format. Because SEQUENCEFILE is a binary format, you need to insert the data from another table; Hive then compresses the data and stores it in the table. Loading a plain file directly, as you would into a TEXTFILE table, is not possible, because such files cannot simply be placed into the compressed binary layout.
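A minimal sketch of that two-step route, assuming a plain-text staging table with illustrative names and paths:
hive> CREATE TABLE txt_staging (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/tmp/sample.csv' INTO TABLE txt_staging;
hive> CREATE TABLE seq_table (id INT, name STRING) STORED AS SEQUENCEFILE;
-- the binary SEQUENCEFILE table is populated via INSERT ... SELECT, not via LOAD DATA
hive> INSERT OVERWRITE TABLE seq_table SELECT * FROM txt_staging;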
SerializerDeserializer (SerDe) for the sequence file format:
----------------------------
There are two SerDe for SequenceFile as follows:
TextSerializerDeserializer: This class can read and write data in plain text file format.
BinarySerializerDeserializer: This class can read and write data in binary file format.
There are three SequenceFile Writers based on the SequenceFile.CompressionType used to compress key/value pairs:
Writer : Uncompressed records.
RecordCompressWriter : Record-compressed files, only compress values.
BlockCompressWriter : Block-compressed files, both keys & values are collected in "blocks" separately and compressed. The size of the "block" is configurable.
The actual compression algorithm used to compress key and/or values can be specified by using the appropriate CompressionCodec.
Enabling Compression for SequenceFile Tables:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> insert overwrite table new_table select * from old_table;
For a partitioned table:
hive> create table new_table (your_cols) partitioned by (partition_cols) stored as new_format;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> insert overwrite table new_table partition(comma_separated_partition_cols) select * from old_table;
To complete a similar process for a table that includes partitions, you would specify settings similar to the following:
hive> CREATE TABLE tbl_seq (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.max.split.size=256000000;
hive> SET mapred.output.compression.type=BLOCK;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_seq PARTITION (year) SELECT * FROM tbl;
----------------------------------------------------
RC File format:
org.apache.hadoop.hive.ql.io.RCFileInputFormat
org.apache.hadoop.hive.ql.io.RCFileOutputFormat
The term "index" here is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time to skip all stripes that clearly cannot meet the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in column stores, e.g. InfoBright on MySQL, but also in Oracle Exadata appliances, dubbed "smart scan" by Oracle marketing).
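As a sketch of how this stripe/row-group skipping is exercised from Hive (table name and predicate are illustrative; hive.optimize.index.filter is the standard switch that lets ORC use its min/max statistics to filter at read time):
hive> SET hive.optimize.index.filter=true;
-- stripes/row groups whose min/max range cannot contain '2016-12-01' are skipped entirely
hive> SELECT * FROM orders_orc WHERE order_date = '2016-12-01';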
----------------------------------------------------
Difference between ORC and RC file formats:
ORC offers a number of features not available in RC files:
- Better encoding of data. Integer values are run length encoded. Strings and dates are stored in a dictionary (and the resulting pointers then run length encoded).
- Internal indexes and statistics on the data. This allows for more efficient reading of the data as well as skipping of sections of the data not relevant to a given query. These indexes can also be used by the Hive optimizer to help plan query execution.
- Predicate push down for some predicates. For example, in the query "select * from user where state = 'ca'", ORC could look at a collection of rows and use the indexes to see that no rows in that group have that value, and thus skip the group altogether.
- Tight integration with Hive's vectorized execution, which produces much faster processing of rows.
- Support for new ACID features in Hive (transactional insert, update, and delete).
- It has a much faster read time than RCFile and compresses much more efficiently.
Whether ORC is the best format for what you're doing depends on the data you're storing and how you are querying it. If you are storing data where you know the schema and you are doing analytic-type queries, it's the best choice (in fairness, some would dispute this and choose Parquet, though much of what I said above about ORC vs RC applies to Parquet as well). If you are doing queries that select the whole row each time, columnar formats like ORC won't be your friend. Also, if you are storing self-describing data such as JSON or Avro, you may find text or Avro storage to be a better format.
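A minimal sketch of creating and populating an ORC table that benefits from the features above, with illustrative table and column names (orc.compress and the vectorization switch are standard Hive/ORC settings):
hive> CREATE TABLE user_orc (id INT, name STRING, state STRING) STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY");
hive> SET hive.vectorized.execution.enabled=true;
hive> INSERT OVERWRITE TABLE user_orc SELECT * FROM user_txt;
-- thanks to the internal indexes, row groups with no state = 'ca' rows can be skipped
hive> SELECT * FROM user_orc WHERE state = 'ca';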
----------------------------------------------------
Spark vs. Hive:
Spark can resume a data load, but Hive cannot.
Spark can handle encrypted data.
----------------------------------------------------
Secondary NameNode:
As the NameNode is the single point of failure in HDFS, if the NameNode fails the entire HDFS file system is lost. To overcome this, Hadoop implemented the Secondary NameNode, whose main function is to store a copy of the FsImage file and the edits log file.
FsImage is a snapshot of the HDFS file system metadata at a certain point in time, and the EditLog is a transaction log that records every change to the file system metadata. So, at any point in time, applying the edits log records to the most recently saved FsImage gives the current state of the file system metadata.
Whenever the NameNode is restarted, the latest FsImage is built by applying the edits records to the last saved copy of FsImage. Since the NameNode merges FsImage and EditLog only at start-up, the edits log can grow very large over time on a busy cluster, and a very large EditLog means a NameNode restart causes a considerable delay before the file system becomes available. It is therefore important to keep the edits log as small as possible, which is one of the main functions of the Secondary NameNode.
Secondary NameNode Functions:
1. Stores a copy of FsImage file and edits log.
2. Periodically applies edits log records to the FsImage file and refreshes the edits log, and sends the updated FsImage file to the NameNode so that the NameNode doesn't need to re-apply the EditLog records during its start-up process. Thus the Secondary NameNode makes the NameNode start-up process fast.
3. If the NameNode fails, the file system metadata can be recovered from the last saved FsImage on the Secondary NameNode, but the Secondary NameNode can't take over the primary NameNode's functionality.
4. Checkpointing of the file system metadata is performed.
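A small sketch of how the checkpoint frequency is usually tuned on the Secondary NameNode via hdfs-site.xml (the values shown are the common defaults; treat the exact property names and defaults as assumptions to verify against your Hadoop version):
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>         <!-- seconds between checkpoints -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value>      <!-- or checkpoint after this many uncheckpointed transactions -->
</property>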
How to set the default file format in Hive:
External tables: hive.default.fileformat
Managed tables: hive.default.fileformat.managed
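A minimal sketch of setting these defaults, assuming ORC as the desired format (the format choice and table name are illustrative):
hive> SET hive.default.fileformat=ORC;
hive> SET hive.default.fileformat.managed=ORC;
-- with the defaults above, this table is created as ORC without an explicit STORED AS clause
hive> CREATE TABLE t_default (id INT, name STRING);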
HiveServer2 logs filled with "javax.security.sasl.SaslException: GSS initiate failed"
2016-12-22 16:36:49,643 WARN ipc.Client (Client.java:run(685)) - Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:563)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:378)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:732)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:728)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:727)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492)
at org.apache.hadoop.ipc.Client.call(Client.java:1402)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy23.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:773)
at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2162)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1363)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1359)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1359)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.ranger.audit.destination.HDFSAuditDestination.getLogFileStream(HDFSAuditDestination.java:226)
at org.apache.ranger.audit.destination.HDFSAuditDestination.logJSON(HDFSAuditDestination.java:123)
at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:890)
at org.apache.ranger.audit.queue.AuditFileSpool.runDoAs(AuditFileSpool.java:838)
at org.apache.ranger.audit.queue.AuditFileSpool$2.run(AuditFileSpool.java:759)
at org.apache.ranger.audit.queue.AuditFileSpool$2.run(AuditFileSpool.java:757)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:765)
at java.lang.Thread.run(Thread.java:745)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
ROOT CAUSE:
HiveServer2 is configured with the Ranger plugin, which writes audit events to both the database and HDFS. The HiveServer2 thread hiveServer2.async.multi_dest.batch_hiveServer2.async.multi_dest.batch.hdfs_destWriter is trying to write audit events to HDFS but fails because the Kerberos TGT has expired.
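One way to confirm the expired-TGT root cause on the HiveServer2 host is to inspect and refresh the service ticket; the keytab path and principal below are typical defaults and are assumptions, not values taken from this log:
klist                                                 # show the current ticket cache and its expiry
klist -kt /etc/security/keytabs/hive.service.keytab   # list principals stored in the keytab
kinit -kt /etc/security/keytabs/hive.service.keytab hive/<host-fqdn>@<REALM>   # re-acquire a TGT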
WORKAROUND:
Disable writing audit events to HDFS.
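As a sketch of that workaround, the Ranger Hive plugin's HDFS audit destination is switched off in its audit configuration (the property name follows the standard Ranger audit configs; confirm the exact file, e.g. ranger-hive-audit.xml, for your version):
<property>
  <name>xasecure.audit.destination.hdfs</name>
  <value>false</value>
</property>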
RESOLUTION:
This has been fixed in https://issues.apache.org/jira/browse/RANGER-1136; apply the patch to avoid the issue.