Dedunu

  • Increase MySQL/MariaDB max_connections online (without a restart)

    MySQL/MariaDB makes a DBA’s life much easier by allowing max_connections to be increased online, without a restart.


    You can use SHOW VARIABLES LIKE 'max_connections'; to check the number of max connections allowed.

    mysql> SHOW VARIABLES LIKE 'max_connections';
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | max_connections | 151   |
    +-----------------+-------+
    1 row in set (0.01 sec)

    You might want to see how many connections are open in your database right now. You can use the query below to check that.

    mysql> SELECT COUNT(1) FROM information_schema.processlist;
    +----------+
    | COUNT(1) |
    +----------+
    |      100 |
    +----------+
    1 row in set (0.00 sec)

    Let’s assume we need to increase the maximum allowed connections to 250. Run the query below.

    mysql> SET GLOBAL max_connections = 250;
    Query OK, 0 rows affected (0.00 sec)

    You can verify the change by running SHOW VARIABLES LIKE 'max_connections'; again.

    mysql> SHOW VARIABLES LIKE 'max_connections';
    +-----------------+-------+
    | Variable_name   | Value |
    +-----------------+-------+
    | max_connections | 250   |
    +-----------------+-------+
    1 row in set (0.01 sec)

    But our work is not done yet. If the service restarts, max_connections will reset to the old value. To make the change permanent, we need to edit the configuration file. Find the relevant configuration file; it can be one of the files below, depending on your installation.

    /etc/mysql/mysql.conf.d/mysqld.cnf
    /etc/mysql/conf.d/mysql.cnf
    /etc/mysql/mysql.cnf
    /etc/mysql/my.cnf
    /etc/my.cnf

    Once you manage to locate the file, look for the line “max_connections = 151” and change it to “max_connections = 250”. Make sure the line is not commented out. (If there is a # at the beginning of that line, remove it.)
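    To illustrate, after the edit the relevant part of the file would look roughly like this (the [mysqld] section header is where this setting usually lives; your file may be laid out differently):

```ini
[mysqld]
# Persist the new limit so it survives service restarts
# (151 is the default; 250 matches the SET GLOBAL change above)
max_connections = 250
```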


    Tags

    • database
    • mysql
    • mariadb

  • How to reset replication in MySQL/MariaDB RDS?

    RESET SLAVE; doesn’t work in AWS MySQL RDS. You should use the command below to reset replication instead.

    CALL mysql.rds_reset_external_master;

    You will see something like an error message; you can safely ignore that. Then try SHOW SLAVE STATUS\G. If you get an empty set, your instance has successfully reset the replication.


    Tags

    • mysql
    • rds
    • replication
    • mariadb
    • aws

  • Helsinki, Finland

    Happy new year, although it is quite late! I visited Helsinki, and I liked it even though everyone around here says it is boring. I took a ferry; it didn’t take more than two and a half hours to get there. I walked pretty much 20 km.

    Helsinki Cathedral

    Sea from the Market Square

    Uspenski Cathedral

    House of the Estates

    Tags

    • cleithrophobe
    • finland
    • helsinki
    • travel

  • Tallinn, Estonia

    Tere!

    It’s been a long time since I wrote a blog post, and so many things have happened in between: I moved to Estonia and started working on databases again. Looking forward to writing about databases and travelling.

    Old Town, Tallinn, Estonia

    Tags

    • cleithrophobe
    • tallinn
    • estonia
    • travel

  • LIRNEasia and University of Dhaka Research Collaboration

    LIRNEasia collaborates with the University of Dhaka Data and Design Lab on policy-related research. A research workshop was held at the University of Dhaka, where I conducted the Apache Spark and Hadoop sessions.

    Research Team

    Sriganesh sharing the experience in policy-related research

    Apache Hadoop Hands-on session

    Hadoop!!!

    If you are interested in Big Data Research opportunities, please check this link – https://lirneasia.net


    Tags

    • spark
    • hadoop
    • Big Data
    • Data Science

  • Getting Public Data Sets for Data Science Projects

    All of us are interested in doing brilliant things with data sets. Most people use Twitter data streams for their projects, but there are a lot of free data sets on the Internet. Today, I’m going to list a few of them. I found almost all of these links in a Lynda.com course called Up and Running with Public Data Sets. If you want more details, please watch the complete course on Lynda.com.

    • Quandl (https://www.quandl.com/)
    • Inforum (http://www.inforum.umd.edu/)
    • Google Public Dataset (https://www.google.com/publicdata/directory)
    • Amazon Public Dataset (https://aws.amazon.com/public-data-sets/)
    • US Open Data Portal (https://www.data.gov/)
    • Google Ngram Viewer (https://books.google.com/ngrams)
    • UK Open Data Portal (https://data.gov.uk/)
    • Corpus of Contemporary American English (http://corpus.byu.edu/coca/)
    • World Bank (http://data.worldbank.org/)
    • UN (http://data.un.org/)
    • EuroStat (http://ec.europa.eu/eurostat)
    • CIA World FactBook (https://www.cia.gov/library/publications/the-world-factbook/)
    • American FactFinder (http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml)
    • US Census Data and Statistics (https://www.usa.gov/statistics)
    • EDGAR (http://www.sec.gov/edgar/searchedgar/companysearch.html)
    • FedStats (https://fedstats.sites.usa.gov/)
    • Singapore Open Data Portal (https://data.gov.sg/)
    • Dublin Data Portal (http://dublinked.ie/)
    • Ireland Data Portal (https://data.gov.ie/data)
    • Canada Data Portal (http://open.canada.ca/en)

    Moreover, Quandl offers an R package with which you can download data into your R projects very easily. I’m hoping to write a blog post on getting Quandl data with R. This site mainly includes economic data sets.


    I’m from Sri Lanka, and Sri Lankan researchers might need Sri Lankan data sets as well. The links below will help you find them.

    • Sri Lanka Open Data Portal (https://www.data.gov.lk)
    • Sri Lanka Police (http://www.police.lk/index.php/information-technology/190)
    • Department of Census and Statistics Sri Lanka (http://www.statistics.gov.lk/)
    • Ministry of Education Statistics (http://www.moe.gov.lk/english/index.php?option=com_content&view=article&id=1220&Itemid=922)

    Hope you enjoy playing with those data sets!

    Tags

    • data
    • statistics
    • census
    • Data Science

  • Apache Spark Job with Maven

    Today, I’m going to show you how to write a sample word count application using Apache Spark. For dependency resolution and build tasks, I’m using Apache Maven; however, you could use SBT (the Simple Build Tool) instead. Most Java developers are familiar with Maven, hence I decided to show an example using Maven.

    screenshot1

    This application is pretty much similar to the WordCount example of Hadoop; this job does exactly the same thing. The content of Driver.scala is given below.

    package org.dedunu.datascience.sample

    import org.apache.spark.{SparkContext, SparkConf}

    object Driver {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("Sample Job Name")
        val sparkContext = new SparkContext(sparkConf)
        val textFile = sparkContext.textFile("file://" + args(0) + "/*")
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("file://" + args(1))
      }
    }

    This job basically reads all the files in the input folder, tokenizes every line on spaces (“ ”), and then counts each word individually. Moreover, you can see that the application reads its arguments from the args variable: the first argument is the input folder, and the second is where the output will be dumped.
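    If the flatMap → map → reduceByKey pipeline feels abstract, here is the same word-count logic sketched in plain Python, outside Spark (this is only an illustration, not part of the job):

```python
from collections import Counter

def word_count(lines):
    # flatMap: split every line on spaces and flatten into one list of words
    words = [word for line in lines for word in line.split(" ") if word]
    # map + reduceByKey: pair each word with a count and sum counts per word
    return Counter(words)

counts = word_count(["to be or not", "to be"])
print(sorted(counts.items()))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

    In the Spark job, the same three steps run distributed across partitions, with reduceByKey merging the per-partition counts.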

    Maven projects need a pom.xml. The content of the pom.xml is given below.

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>org.dedunu.datascience</groupId>
        <artifactId>sample</artifactId>
        <version>1.0-SNAPSHOT</version>
        <packaging>jar</packaging>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <spark.version>1.6.1</spark.version>
        </properties>

        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>${spark.version}</version>
                <scope>provided</scope>
            </dependency>
        </dependencies>

        <build>
            <finalName>sample-Spark-Job</finalName>
            <plugins>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <configuration>
                        <archive>
                            <manifest>
                                <mainClass>org.dedunu.datascience.sample.Driver</mainClass>
                            </manifest>
                        </archive>
                        <descriptorRefs>
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <execution>
                            <id>make-assembly</id> <!-- this is used for inheritance merges -->
                            <phase>package</phase> <!-- bind to the packaging phase -->
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <executions>
                        <execution>
                            <id>compile</id>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                            <phase>compile</phase>
                        </execution>
                        <execution>
                            <id>test-compile</id>
                            <goals>
                                <goal>testCompile</goal>
                            </goals>
                            <phase>test-compile</phase>
                        </execution>
                        <execution>
                            <phase>process-resources</phase>
                            <goals>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>

    Run the command below to build the Maven project:

    $ mvn clean package

    Maven will download all the dependencies and handle everything else on your behalf. Then all you need to do is run the job. To run it, execute the command below in your terminal window.

    $ /home/dedunu/bin/spark-1.6.1/bin/spark-submit        \
         --class org.dedunu.datascience.sample.Driver      \
         target/sample-Spark-Job-jar-with-dependencies.jar \
         /home/dedunu/input                                \
         /home/dedunu/output

    screenshot2

    screenshot3

    The output of the job will look like the screenshot below.

    screen4

    You can find the project on Github – https://github.com/dedunu/spark-example

    Enjoy Spark!

    Gists

    • https://gist.github.com/dedunumax/749a390d97fd20bf0991402c0b29d7a2
    • https://gist.github.com/dedunumax/bcf6282d76341ea7b4ae1dd644b46245

    Tags

    • spark
    • maven

  • vboxdrv setup says make not found

    After you update the kernel, you need to run vboxdrv setup. But if you are trying to compile the modules for the first time, or after removing the build-essential package, you might see the error below.

    $ sudo /etc/init.d/vboxdrv setup
    [sudo] password for user:
    Stopping VirtualBox kernel modules ...done.
    Recompiling VirtualBox kernel modules ...failed!
      (Look at /var/log/vbox-install.log to find out what went wrong)
    $ cat /var/log/vbox-install.log
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: 62:
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: 62:
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: 62:
    /usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found

    Ubuntu needs build-essential to run the above command. Run the command below to install it.

    $ sudo apt-get install build-essential
    $ sudo /etc/init.d/vboxdrv setup

    Then you can use VirtualBox!

    Tags

    • Virtualbox
    • virtualization

  • How to create an EMR cluster using Boto3?

    I wrote a blog post about Boto2 and EMR clusters a few months ago. Today I’m going to show how to create EMR clusters using Boto3. Boto3 documentation is available here.

    import boto3

    __author__ = 'dedunu'

    connection = boto3.client(
        'emr',
        region_name='us-west-1',
        aws_access_key_id='<Your AWS Access Key>',
        aws_secret_access_key='<Your AWS Secret Key>',
    )

    cluster_id = connection.run_job_flow(
        Name='test_emr_job_with_boto3',
        LogUri='s3://<your s3 location>',
        ReleaseLabel='emr-4.2.0',
        Instances={
            'InstanceGroups': [
                {
                    'Name': "Master nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm1.large',
                    'InstanceCount': 1,
                },
                {
                    'Name': "Slave nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm1.large',
                    'InstanceCount': 2,
                }
            ],
            'Ec2KeyName': '<Ec2 Keyname>',
            'KeepJobFlowAliveWhenNoSteps': True,
            'TerminationProtected': False,
            'Ec2SubnetId': '<Your Subnet ID>',
        },
        Steps=[],
        VisibleToAllUsers=True,
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
        Tags=[
            {
                'Key': 'tag_name_1',
                'Value': 'tag_value_1',
            },
            {
                'Key': 'tag_name_2',
                'Value': 'tag_value_2',
            },
        ],
    )

    print(cluster_id['JobFlowId'])

    Gists

    • https://gist.github.com/dedunumax/5491fa5430427626ef47

    Tags

    • emr
    • amazon
    • boto3
    • boto
    • aws

  • HDFS – How to recover corrupted HDFS metadata in Hadoop 1.2.X?

    You might have Hadoop in your production environment, with terabytes of data residing in it. HDFS metadata can get corrupted, and the Namenode won’t start in such cases. When you check the Namenode logs, you might see exceptions like the one below.

    ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
        at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
        at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
        at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
        at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
        at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
        at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
        at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
        at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
        at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
        at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
        at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

    If you have a development environment, you can always format the HDFS and continue.

    BUT IF YOU FORMAT HDFS, YOU LOSE ALL THE FILES IN HDFS!!!

    So Hadoop administrators can’t simply format HDFS. But you can recover your HDFS to the last checkpoint. You might lose some data files, but more than 90% of the data should be safe. Let’s see how to recover corrupted HDFS metadata.

    Hadoop creates checkpoints periodically in the Namenode directory, where you might see three folders:

    current
    image
    previous.checkpoint

    Most probably, it is the current folder that is corrupted.

    • Stop all the Hadoop services on all the nodes.
    • Back up both the “current” and “previous.checkpoint” directories.
    • Delete the “current” directory.
    • Rename “previous.checkpoint” to “current”.
    • Restart the Hadoop services.

    I have mentioned the steps I followed above. The commands below were run to recover the HDFS; they might change slightly depending on your installation.

    $ /usr/local/hadoop/stop-all.sh
    $ cd <namenode.dir>
    $ cp -r current current.old
    $ cp -r previous.checkpoint previous.checkpoint.old
    $ rm -r current
    $ mv previous.checkpoint current
    $ /usr/local/hadoop/start-all.sh

    That’s all! Everything was okay after that!

    Tags

    • metadata
    • recover
    • hadoop
    • linux
    • hdfs
