Frequent Issues
-
Empty HTTP API Response
Symptom
HTTP API response is OK (status 200), but there is no data in the response.
E.g. an empty response is returned for the location URL below. Note that the access token must be included in the HTTP request header.
https://<tenant.com>:<port>/dataeng/api/v1.0/config/location
Environment
Can occur in both Prod and QA clusters.
Possible Causes
There is no data in the Impala cimdata database, or locations are not loaded from the SDP server.
edge-server.log:
2018-06-27T07:05:25,212 INFO (pool-6-thread-1) [CimSDPServer(refresh:219)] source.SDPServer.produs-ckc access token refreshed: 0433e389-c042-34d4-9a8e-38845a0d21a4
2018-06-27T07:05:25,777 WARN (pool-6-thread-1) [HttpClientImpl(handleResponse:330)] response code is: 401
2018-06-27T07:05:25,778 ERROR (pool-6-thread-1) [CimSDPServer(run:107)] source.SDPServer.produs-ckc SDP refresh: query SDP location api: null
Troubleshooting
• Check SDP server status.
If the SDP server is healthy, perform the following steps.
• Check the HBase table. If the HBase table has data, the data was missed during the ETL job; get the workflow log and contact the Data Engine team.
• If the HBase table is already mapped to the cim database, you can query it with Impala SQL (see the sketch after this list).
• If HBase does not have the data, check Kafka with the Kafka console consumer utility. If Kafka has the data but HBase does not, Flume has not put the data into HBase; collect all Flume agent logs and send them to the Data Engine team.
# data-01.novalocal is a Kafka broker hostname; 9092 is the Kafka service port.
# This command outputs all Kafka messages from the beginning and shows only those containing your keyword.
kafka-console-consumer --bootstrap-server data-01.novalocal:9092 --topic cim --from-beginning | grep "your-keyword-here"
• If Kafka does not have the data, check edge-server.log on the Edge host. If there is any error, send it to the Data Engine team.
• If there is no error in the edge log, check the Device Engine HTTP and WebSocket status to see whether the relevant data is being output. If not, contact the Device Engine/Extension team.
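For the Impala check above, a query like the following confirms whether the mapped table holds data (a minimal sketch; the table name "location" is an assumption, substitute the actual mapped table):
# Run on a host with impala-shell; data-01.novalocal is an example Impala daemon host.
impala-shell -i data-01.novalocal -q "SELECT COUNT(*) FROM cim.location;"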
Verification
Rerun the HTTP API to check whether data is returned.
Post Verification
If the Core location server is running but the Data Engine still returns an empty response after about an hour, raise a CDETS for Data Engine.
-
HTTP API Response Login Failure Error
Symptom
Login Failure
E.g. a login failure error is returned for the below location URL:
https://<tenant.com>:<port>/dataeng/api/v1.0/config/location
Internal server error: Login failure for hdfs@CIM.IVSG.AUTH from keytab /usr/local/cim/conf/hdfs.keytab
Environment
Can occur in both Prod and QA clusters.
Possible Causes
• Network Failure
• Kerberos authentication failure.
• CDH not in good health condition.
Troubleshooting
• Make sure the Kerberos server is running, TCP and UDP port 88 are listening, and the port is open in the firewall (see the sketch after this list).
For other errors, perform the following steps.
• Test the API again on the Edge host.
In an AWS cluster, there is a web gateway between the web server on the Edge host and the user's browser. If the API works on the Edge host, check the gateway status.
• Check SDP Server’s connectivity.
• Check edge-server.log. Refer to 'No Route to Host' below; if a similar issue is found, fix it.
For any other errors, send the logs to the Data Engine team.
• Check the Cloudera cluster health status; refer to 'What to do if many services are in bad health?' and 'What to do if only one service is in bad health?' in the FAQ section.
• Check the Impala service with impala-shell and check the SQL response time. An SQL timeout is the most likely root cause, and a large quantity of small files is a common cause of such timeouts.
Reach out to the Data Engine team in case of any deviation from the behavior described above.
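Where Kerberos is suspected, a quick reachability and keytab check can narrow things down (a minimal sketch; the KDC hostname is an example, while the keytab path and principal are taken from the error message above):
# Verify the KDC port is reachable over TCP and UDP.
nc -zv kdc-01.novalocal 88
nc -zuv kdc-01.novalocal 88
# Verify the keytab can obtain a ticket.
kinit -kt /usr/local/cim/conf/hdfs.keytab hdfs@CIM.IVSG.AUTH && echo "keytab login OK"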
Verification
Rerun HTTP API to check if data is present in the response.
Post Verification
If it is not a network issue, raise a CDETS for Data Engine.
-
HTTP API Location ID Error
Symptom
Data of the location id is not available.
Error message: data of the location id not available:[location id]
Environment
Can occur in both Prod and QA clusters.
Possible Causes
• Core location server not responding.
• CDH cluster not in good health condition.
• LocationID is not a valid one.
Troubleshooting
• Make sure the locationId is a valid city ID.
• Check the SDP server status; refer to 'Empty HTTP API Response' above.
Verification
Rerun the location API to check the response.
Post Verification
If the location server is running, the location ID is valid, and CDH status is fine, but the Data Engine response still has no location result, raise a CDETS against Data Engine.
-
HTTP API Response Tenant Details Error
Symptom
Tenant details do not appear in the location API after Data Engine registration.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
• Core location server not responding.
• CDH cluster not in good health condition.
• LocationID is not a valid one.
Troubleshooting
• Make sure the locationId is a valid city ID.
• Check the SDP server status; refer to 'Empty HTTP API Response' above.
Verification
Rerun the location API to check the response.
Post Verification
If the location server is running, the location ID is valid, and CDH status is fine, but the Data Engine response still has no location result, raise a CDETS against Data Engine.
-
HTTP API Response Contains Unreasonable Data
Symptom
A percentage value is negative, or the Average, Minimum, or Maximum values are unreasonable.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
• Ingestion of unreasonable data.
• Invalid sensor counted in the data engine.
• Data Engine ETL failure.
Troubleshooting
• Check the data in the HBase table: query the cim database with impala-shell (a sample sanity query follows this list).
• If the avg, min, or max values are not reasonable, check the sensor count from impala-shell.
• If the raw data has negative values, check with the Device Engine/Extension team for a data input issue.
Otherwise, send the API URL and a raw-data DB dump for that time period to the Data Engine team.
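A sanity query along these lines can surface negative raw values quickly (a minimal sketch; the table name "sensor_data" and column name "value" are assumptions):
# Count negative readings and show the value range for the suspect period.
impala-shell -q "SELECT COUNT(*) AS negatives, MIN(value), MAX(value), AVG(value) FROM cim.sensor_data WHERE value < 0;"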
Verification
If the unreasonable data is in the raw data table, use TQL live to double-confirm.
If the sensor count is not right, check the sensor status.
Post Verification
If Device Engine provides unreasonable data, raise a CDETS for Device Engine.
Check the sensor status; if still no issue is found, raise a CDETS for Data Engine.
-
ETL Workflow is Stuck
Symptom
ETL workflow takes too long to complete, e.g. more than a day.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
If Data Engine runs V1.1.0.3rb1, the Data Engine merges small files periodically, which may cause Impala to get stuck.
Impala memory shortage can also cause queries to get stuck.
Troubleshooting
If Data Engine runs V1.1.0.3rb1, upgrade the Data Engine.
For other versions, refer to 'How to check if Impala is stuck' in the FAQ section.
Verification
Open Cloudera Manager Web console -> Impala -> Queries:
Check the duration of all queries; all queries are expected to finish within 30 minutes.
Post Verification
If Data Engine runs V1.1.0.4 and above, raise CDETS for Data Engine.
-
ETL Workflow is Killed
Symptom
ETL workflow does not finish successfully; ETL state: killed.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
This may be caused by an RSH issue, a port issue, or HAProxy.
RSH issue error log:
java.lang.Exception: Error in checking user & group for city cimdata from host data-01.novalocal
    at com.cisco.cim.oozie.action.CheckUserAction.main(CheckUserAction.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:64)
    at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception
HAProxy error:
Caused by: java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://cm-hue-01.novalocal:10000/;principal=hive/cm-hue-01.novalocal@CIM.IVSG.AUTH: java.net.ConnectException: Connection refused
Troubleshooting
• Get the ETL job log and check the errors in it.
• If it is an RSH issue, run the “update etl conf” action of the Cim Tool Service.
• If it is a port issue, refer to 'No Route to Host' below.
• If HAProxy is not working, restart HAProxy (see the sketch after this list).
If hosts were restarted for some unknown reason, make sure all dependent services are restarted.
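A minimal sketch for the HAProxy restart, assuming HAProxy runs as a systemd service on the affected host:
# Confirm whether HAProxy is down, then restart it.
systemctl status haproxy
systemctl restart haproxy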
Verification
Restart the ETL after one ETL cycle and check the ETL status.
Post Verification
If RSH is working fine and the network also looks good, raise a CDETS for Data Engine.
-
Failure in ETL Creation
Symptom
An error is observed when ETL creation is run from cim-tool.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
This may be caused by an RSH failure.
RSH timeout error example:
java.lang.Exception: Process timeout for command: sudo -u oozie rsh -l root data-01.novalocal "rm -rf /usr/local/etl/;mkdir -p /usr/local/etl/;chown -R oozie:oozie /usr/local/etl/"
    at com.cisco.cim.oozie.util.CLITool.executeLinuxCommand(CLITool.java:66)
    at com.cisco.cim.oozie.util.CLITool.executeRSHCommandAsRootByOozie(CLITool.java:101)
    at com.cisco.cim.oozie.util.UserManager.syncKeytabFileToHost(UserManager.java:199)
    at com.cisco.cim.oozie.util.InitialOperation.main(InitialOperation.java:122)
Troubleshooting
• Run the “update etl conf” action in the Cim Tool Service.
• If an RSH timeout exception is thrown, create the ETL manually:
# Run on the cm-hue host:
cd /opt/cloudera/parcels/cim_etl_oozie/oozie/
./run --oozie http://oozie-install-hostname-here:11000/oozie
• If RSH is still failing, install RSH manually; the sketch after this list can verify RSH connectivity.
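To verify RSH connectivity, a test modeled on the command from the error log above can be run (a minimal sketch; the hostname is an example):
# A successful run prints "RSH OK"; a hang or error points to the RSH setup.
sudo -u oozie rsh -l root data-01.novalocal "echo RSH OK"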
Verification
Re-run ETL creation in cim-tool to see if the ETL workflow is successfully created and running.
Post Verification
If it is not an RSH issue, raise a CDETS for Data Engine.
-
No Route to Host
Symptom
Such an error may occur while creating the ETL, or in the workflow log, as shown below:
2018-05-04 10:14:41,664 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.NoRouteToHostException: No Route to Host from data-01.novalocal/10.10.22.35 to data-03.novalocal:39025 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
Environment
Can occur in both Prod and QA clusters.
Possible Causes
Network connectivity issue, or a port is not open.
Troubleshooting
• Check the list of open ports and related configuration, for example iptables rules or /proc/sys/net/ipv4/ip_local_port_range:
# The ip_local_port_range value should match the expected range below.
cat /proc/sys/net/ipv4/ip_local_port_range
62000   64000
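A quick connectivity test against the failing host and port from the log above can confirm whether the port is blocked (a minimal sketch; the hostname and port are taken from the example log):
# Test reachability of the target host/port, then inspect firewall rules if it fails.
nc -zv data-03.novalocal 39025
iptables -L -n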
Verification
Create or restart ETL to check if ETL creation or workflow finishes successfully.
Post Verification
If there is no network or configuration issue identified, raise CDETS for Data Engine.
-
LeaderNotAvailable Exception
Symptom
The Kafka broker stops data ingestion and the Kafka consumer throws a LeaderNotAvailable exception.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
One Kafka broker node restarts and, with unclean leader election disabled, some partitions are left without a leader.
This is a known issue for Kafka 0.11.x.
Troubleshooting
• Restart the Kafka cluster for recovery.
• Enable unclean leader election (unclean.leader.election.enable=true) in the Kafka configuration; the sketch after this list shows how to confirm which partitions are missing a leader.
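To confirm which partitions are missing a leader, the topic can be inspected with the standard Kafka tool (a minimal sketch; the ZooKeeper hostname is an example, and Kafka 0.11's kafka-topics takes --zookeeper):
# Partitions reporting "Leader: -1" (or none) confirm the missing-leader condition.
kafka-topics --zookeeper cm-hue-01.novalocal:2181 --describe --topic cim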
Verification
Check the Kafka ingestion diagram in CM after Kafka has restarted; data should start coming in.
Post Verification
If configuration is fine, raise CDETS for Data Engine.
-
DoNotRetryIOException
Symptom
• When data is queried from some external tables of the cim database, such errors may be encountered.
(The external tables are mapped from HBase and store a very large number of data records.)
Error log:
DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
CAUSED BY: OutOfOrderScannerNextException: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 1 But the nextCallSeq got from client: 0; request=scanner_id: 572818 number_of_rows: 1024 close_scanner: false next_call_seq: 0 client_handles_partials: true client_handles_heartbeats: true track_scan_metrics: false renew: false
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2443)
Environment
Can occur in both Prod and QA clusters.
Possible Causes
RPC timeout due to configuration error.
Troubleshooting
Make the following configuration updates:
1. Cloudera Manager -> Impala -> Configuration -> hbase.rpc.timeout -> 3 Minutes. (The default value is 3 seconds)
2. If the hosts have 32 GB of memory: Cloudera Manager -> HBase -> Configuration -> Java Heap Size of HBase RegionServer in Bytes -> 4 GB.
3. Cloudera Manager -> HBase -> Configuration -> HBase Client Scanner Caching -> 50
4. Cloudera Manager -> HBase -> Configuration -> HBase Client Scanner Timeout -> 2 min
5. Cloudera Manager-> Impala -> Configuration -> HBase RPC timeout -> 2 min
Verification
Rerun the same SQL to check if it gets completed successfully.
Post Verification
If configuration is fine, raise CDETS for Data Engine.
-
Memory Limit Exceeded
Symptom
HTTP API fails.
From the http-api.log file:
Internal server error: Memory limit exceeded Codegen failed to reserve '5555712' bytes for optimization
Internal server error: Memory limit exceeded Failed to allocate tuple buffer
Internal server error: Memory limit exceeded
Internal server error: Memory limit exceeded The memory limit is set too low to initialize spilling operator (id=7). The minimum required memory to spill this operator is 264.00 MB.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
The Impala memory limit is set too low.
Troubleshooting
• Make sure the Impala Daemon memory limit is not less than 3.5 GB.
• If it is less, increase it to at least 3.5 GB; on a 32 GB memory configuration, set it to 8 GB.
• Restart the Impala service from CM.
Get in touch with the Data Engine team to run some checks before restarting the Impala service.
Verification
Rerun the same SQL to check if it gets completed successfully.
Post Verification
If configuration is fine, raise CDETS for Data Engine.
-
Clock Offset Warning / NTP Issue
Symptom
An error message is seen in the logs:
E0320 07:06:16.662436 4178 authentication.cc:160] SASL message (Kerberos (internal)): GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Clock skew too great)
The status summary shows 'Warning' for Clock Offset, and one of the hosts is in bad health.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
The NTP service is not working correctly on the host.
Troubleshooting
Restart the ntpd service on the bad-health host (see the sketch below).
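A minimal sketch of the restart and follow-up check, assuming ntpd runs as a systemd service on the affected host:
# Restart NTP, then verify peers are reachable and the offset is shrinking.
systemctl restart ntpd
ntpq -p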
Verification
Check the time skew on the host, then check whether the cluster has recovered from bad health.
If not, restart the cluster.
Post Verification
If recovered from the NTP service interruption, no action is required.
-
HBase Master No Active Instance Warning
Symptom
While upgrading, HBase hits an error such as:
Bad: Master summary: cm-hue-01.novalocal (Availability: Unknown, Health: Good), name-01.novalocal (Availability: Unknown, Health: Good). This health test is bad because the Service Monitor did not find an active Master.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
Configuration Error.
Troubleshooting
Perform the following configuration changes:
• CM -> ZooKeeper -> Configuration -> search for "Enable auto-creation of data directories".
• Check the box and click "Save Changes".
• Restart the ZooKeeper instances one after another.
Verification
HBase should recover from bad health.
Post Verification
If configuration setting is correct, raise CDETS for Data Engine.
-
Kafka Storage Occupying Too Much Space
Symptom
/var/local/kafka occupies too much disk space on one specific node.
Environment
Can occur in both Prod and QA clusters.
Possible Causes
Two replicas of a Kafka topic on the same node cause the over-capacity.
Troubleshooting
Enable the Kafka broker service on one more node and follow 'How to move Kafka partition from one node to another' in the FAQ section to migrate one replica to that node.
Verification
Disk usage under /var/local/kafka should return to normal on the affected node.
Post Verification
Check the /var/local/kafka size on all nodes running a Kafka broker (see the sketch below).
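A minimal sketch for comparing usage across brokers (run on each node hosting a Kafka broker):
# Total Kafka log usage on this node, then the largest topic-partition directories.
du -sh /var/local/kafka
du -sh /var/local/kafka/* | sort -rh | head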
-
No Write Operations on HBase Table
Symptom
Alert received from the monitoring script: “no write operations on HBase table”.
Environment
Can occur in any Prod or QA cluster with low disk I/O performance.
Possible Causes
Due to low disk write speed, HBase writes cannot catch up.
Troubleshooting
• Add the following Flume log configuration:
log4j.appender.RFA.Threshold=INFO
log4j.logger.com.cisco=DEBUG,ciscoFlm
• Add the following Flume configuration. Run the command below on all hosts, then restart the Flume service:
sed -Ei 's/(.*).type = com.cisco.cim.flume.sink.CimHBaseSink/&\n\1.batchSize = 10/g' /opt/cloudera/parcels/cim_flume_plugin/conf/flume.conf
• Add the following Kafka configuration. Run the command below on all hosts, then execute the “update flume conf” action in the cim-tool service and restart the Flume agent:
sed -i '/kafka.batchSize/ s/.*/kafka.batchSize = 50/' /opt/cloudera/parcels/cim_flume_plugin/conf/generateConfig.properties
Verification
No more alerts should be observed after restarting Flume; the sketch below offers an additional check.
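As an additional check, a row count on the affected table can be compared over time (a minimal sketch; the table name is a placeholder for the actual HBase table):
# Run twice a few minutes apart; a growing count confirms writes have resumed.
echo "count 'your_table_here'" | hbase shell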
Post Verification
If the issue still persists, raise a CDETS for Data Engine.