R Programming

What is R Programming

R is a programming language and software environment for statistical analysis, graphical representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Core Team.

The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions. For efficiency, R allows integration with procedures written in C, C++, .NET, Python or FORTRAN.

R is open source software and part of the GNU project. It is widely used for statistical programming, business intelligence, graphical presentation and reporting, and it is a popular choice among data scientists for data analysis and visualization. R is also well suited to statistical calculation on arrays, lists, vectors and matrices.

Local Environment Setup

R can be installed on Windows and Linux.

Windows installation: https://cran.r-project.org/bin/windows/base

Linux installation: https://cran.r-project.org/bin/linux/

When you start R after installation, it prints a startup banner that ends with:

Type ‘demo()’ for some demos, ‘help()’ for on-line help, or
‘help.start()’ for an HTML browser interface to help.
Type ‘q()’ to quit R.

You can now use the install.packages() command at the R prompt to install any required package. For example, the following command installs the plotrix package, which is required for 3D charts.

> install.packages("plotrix")

R Command Prompt

Once the local environment is set up, you can start R by typing the following command at your command prompt:

$ R

This launches the R interpreter with a > prompt, where you can type your commands.

> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"

The first statement defines a string variable myString and assigns (<-) the string "Hello, World!" to it. The next statement uses print() to print the value stored in myString.

R Script File
Usually, an R program is written in a script file and then executed at the Windows or Linux command prompt
with the R interpreter called Rscript.

Let's write a script file:
# My program in R Programming
myString <- "Hello, World!"
print(myString)

Save the above code in a file named hello.R and execute it at the command prompt:

$ Rscript hello.R

It produces the following result.

[1] "Hello, World!"

Comments
Comments are explanatory text in your R program and are ignored by the interpreter when it executes your program. A single-line comment is written with # at the beginning of the line, as follows:

# My first program in R Programming

The next blog post will cover R data types:
Vectors, Lists, Matrices, Arrays, Factors and Data Frames

MongoDB

THE QUICK GUIDE

MongoDB is a cross-platform, document-oriented database that provides high performance, high availability, and easy scalability. MongoDB works on the concepts of collections and documents.

Database

Database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.

Collection

Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purpose.

Document

A document is a set of key-value pairs. Documents have dynamic schema. Dynamic schema means that documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.

The following table shows the relationship of RDBMS terminology with MongoDB.

RDBMS            MongoDB
Database         Database
Table            Collection
Tuple/Row        Document
Column           Field
Table Join       Embedded Documents
Primary Key      Primary Key (default key _id provided by MongoDB itself)

Database Server and Client
mysqld/Oracle    mongod
mysql/sqlplus    mongo

Sample Document

The following example shows the document structure of a blog site, which is simply a set of comma-separated key-value pairs.

{
   _id: ObjectId("7df78ad8902c"),
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100, 
   comments: [	
      {
         user:'user1',
         message: 'My first comment',
         dateCreated: new Date(2011,1,20,2,15),
         like: 0 
      },
      {
         user:'user2',
         message: 'My second comments',
         dateCreated: new Date(2011,1,25,7,45),
         like: 5
      }
   ]
}

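_id is a 12-byte value, displayed as a 24-character hexadecimal string, which assures the uniqueness of every document (the shorter ids in these examples are abbreviated placeholders). You can provide _id while inserting the document. If you don't provide one, MongoDB generates a unique id for every document. Of these 12 bytes, the first 4 bytes hold the current timestamp, the next 3 bytes the machine id, the next 2 bytes the process id of the MongoDB server, and the remaining 3 bytes a simple incremental value. (Newer MongoDB versions replace the machine id and process id with a 5-byte random value.)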
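To make the byte layout concrete, here is a minimal sketch in plain JavaScript that splits a 24-character hexadecimal id into the four fields described above. The function name parseObjectId and the sample id are assumptions made for illustration; this is not part of the MongoDB shell, and it follows the older timestamp/machine/process/counter layout.

```javascript
// Hypothetical helper: split a 24-character hex ObjectId string into the
// legacy fields (4-byte timestamp, 3-byte machine id, 2-byte process id,
// 3-byte counter). Each byte is two hex characters.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hexadecimal characters (12 bytes)");
  }
  return {
    // bytes 0-3: seconds since the Unix epoch
    timestamp: new Date(parseInt(hex.slice(0, 8), 16) * 1000),
    // bytes 4-6: machine id, kept as hex text
    machineId: hex.slice(8, 14),
    // bytes 7-8: process id
    processId: parseInt(hex.slice(14, 18), 16),
    // bytes 9-11: incrementing counter
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

// Sample id chosen for illustration only.
const parts = parseObjectId("507f1f77bcf86cd799439011");
console.log(parts.timestamp.toISOString(), parts.machineId, parts.processId, parts.counter);
```

Because the timestamp comes first, ids generated later sort after ids generated earlier, which is why sorting by _id roughly follows insertion time.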

MongoDB – Advantages

Any relational database has a typical schema design that shows the number of tables and the relationships between these tables. In MongoDB, there is no concept of relationship.

Advantages of MongoDB over RDBMS

  • Schema less − MongoDB is a document database in which one collection holds different documents. Number of fields, content and size of the document can differ from one document to another.
  • Structure of a single object is clear.
  • No complex joins.
  • Deep query-ability. MongoDB supports dynamic queries on documents using a document-based query language that’s nearly as powerful as SQL.
  • Tuning.
  • Ease of scale-out − MongoDB is easy to scale.
  • Conversion/mapping of application objects to database objects not needed.
  • Uses internal memory for storing the (windowed) working set, enabling faster access of data.

Why Use MongoDB?

  • Document Oriented Storage − Data is stored in the form of JSON style documents.
  • Index on any attribute
  • Replication and high availability
  • Auto-sharding
  • Rich queries
  • Fast in-place updates
  • Professional support by MongoDB

Where to Use MongoDB?

  • Big Data
  • Content Management and Delivery
  • Mobile and Social Infrastructure
  • User Data Management
  • Data Hub

MongoDB – Environment

Let us now see how to install MongoDB on Windows.

Install MongoDB On Windows

To install MongoDB on Windows, first download the latest release of MongoDB from https://www.mongodb.org/downloads. Make sure you get the correct version of MongoDB for your Windows version. To find your Windows architecture, open the command prompt and execute the following command.

C:\>wmic os get osarchitecture
OSArchitecture
64-bit
C:\>

32-bit versions of MongoDB only support databases smaller than 2GB and are suitable only for testing and evaluation purposes.

Now extract your downloaded file to the c:\ drive or any other location. Make sure the name of the extracted folder is mongodb-win32-i386-[version] or mongodb-win32-x86_64-[version]. Here [version] is the version of the MongoDB download.

Next, open the command prompt and run the following command.

C:\>move mongodb-win64-* mongodb
   1 dir(s) moved.
C:\>

If you have extracted MongoDB to a different location, go to that path with the command cd FOLDER/DIR and then run the command given above.

MongoDB requires a data folder to store its files. The default location for the MongoDB data directory is c:\data\db. So you need to create this folder using the Command Prompt. Execute the following command sequence.

C:\>md data
C:\>md data\db

If you have installed MongoDB at a different location, you need to specify an alternate path for \data\db by passing the --dbpath option to mongod.exe. To do so, issue the following commands.

In the command prompt, navigate to the bin directory present in the MongoDB installation folder. Suppose my installation folder is D:\set up\mongodb

C:\Users\XYZ>d:
D:\>cd "set up"
D:\set up>cd mongodb
D:\set up\mongodb>cd bin
D:\set up\mongodb\bin>mongod.exe --dbpath "d:\set up\mongodb\data" 

This will show a "waiting for connections" message in the console output, which indicates that the mongod.exe process is running successfully.

Now, to use MongoDB, open another command prompt and issue the following command.

D:\set up\mongodb\bin>mongo.exe
MongoDB shell version: 2.4.6
connecting to: test
>db.test.save( { a: 1 } )
>db.test.find()
{ "_id" : ObjectId(5879b0f65a56a454), "a" : 1 }
>

This shows that MongoDB is installed and running successfully. The next time you run MongoDB, you need to issue only the following commands.

D:\set up\mongodb\bin>mongod.exe --dbpath "d:\set up\mongodb\data" 
D:\set up\mongodb\bin>mongo.exe

Install MongoDB on Ubuntu

Run the following command to import the MongoDB public GPG key −

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

Create a /etc/apt/sources.list.d/mongodb.list file using the following command.

echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' \
   | sudo tee /etc/apt/sources.list.d/mongodb.list

Now issue the following command to update the repository −

sudo apt-get update

Next, install MongoDB by using the following command −

sudo apt-get install mongodb-10gen=2.2.3

In the above installation, 2.2.3 is the MongoDB version being installed. Replace it with the latest released version. MongoDB is now installed.

Start MongoDB

sudo service mongodb start

Stop MongoDB

sudo service mongodb stop

Restart MongoDB

sudo service mongodb restart

To use MongoDB, run the following command.

mongo

This will connect you to the running MongoDB instance.

MongoDB Help

To get a list of commands, type db.help() in the MongoDB client. This prints a list of the available database commands.

MongoDB Statistics

To get stats about the MongoDB server, type the command db.stats() in the MongoDB client. This shows the database name and the number of collections and documents in the database.

MongoDB – Data Modelling

Data in MongoDB has a flexible schema. Documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection’s documents may hold different types of data.

Some considerations while designing Schema in MongoDB

  • Design your schema according to user requirements.
  • Combine objects into one document if you will use them together. Otherwise separate them (but make sure joins are not needed).
  • Duplicate data (within limits), because disk space is cheap compared to compute time.
  • Do joins on write, not on read.
  • Optimize your schema for the most frequent use cases.
  • Do complex aggregation in the schema.

Example

Suppose a client needs a database design for a blog/website. Let us see the differences between the RDBMS and MongoDB schema designs. The website has the following requirements.

  • Every post has a unique title, description and url.
  • Every post can have one or more tags.
  • Every post has the name of its publisher and the total number of likes.
  • Every post has comments given by users along with their name, message, date-time and likes.
  • On each post, there can be zero or more comments.

In an RDBMS schema, the design for the above requirements will have a minimum of three tables.

In a MongoDB schema, the design will have one collection, post, with the following structure −

{
   _id: POST_ID,
   title: TITLE_OF_POST, 
   description: POST_DESCRIPTION,
   by: POST_BY,
   url: URL_OF_POST,
   tags: [TAG1, TAG2, TAG3],
   likes: TOTAL_LIKES, 
   comments: [	
      {
         user:'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES 
      },
      {
         user:'COMMENT_BY',
         message: TEXT,
         dateCreated: DATE_TIME,
         like: LIKES
      }
   ]
}

So while showing the data, in RDBMS you need to join three tables and in MongoDB, data will be shown from one collection only.
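The difference between the two designs can be sketched in plain JavaScript, using arrays as stand-ins for tables and collections. All names and sample rows here (postsTable, tagsTable, commentsTable, postCollection and the two load functions) are hypothetical and only illustrate the idea; no database is involved.

```javascript
// Relational style: three "tables" linked by the post id (hypothetical data).
const postsTable = [{ id: 1, title: "MongoDB Overview" }];
const tagsTable = [
  { postId: 1, tag: "mongodb" },
  { postId: 1, tag: "NoSQL" },
];
const commentsTable = [{ postId: 1, user: "user1", message: "My first comment" }];

// Showing a post means joining all three tables on the post id.
function loadPostWithJoins(id) {
  const post = postsTable.find((p) => p.id === id);
  return {
    ...post,
    tags: tagsTable.filter((t) => t.postId === id).map((t) => t.tag),
    comments: commentsTable
      .filter((c) => c.postId === id)
      .map(({ user, message }) => ({ user, message })),
  };
}

// Document style: one collection already holds the assembled post.
const postCollection = [
  {
    id: 1,
    title: "MongoDB Overview",
    tags: ["mongodb", "NoSQL"],
    comments: [{ user: "user1", message: "My first comment" }],
  },
];

function loadPostEmbedded(id) {
  return postCollection.find((p) => p.id === id);
}

// Both approaches yield the same post; only the amount of work differs.
console.log(JSON.stringify(loadPostWithJoins(1)) === JSON.stringify(loadPostEmbedded(1)));
```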

MongoDB – Create Database

In this chapter, we will see how to create a database in MongoDB.

The use Command

The MongoDB command use DATABASE_NAME is used to create a database. The command creates a new database if it doesn’t exist; otherwise it switches to the existing database.

Syntax

Basic syntax of use DATABASE statement is as follows −

use DATABASE_NAME

Example

If you want to create a database named mydb, the use DATABASE statement would be as follows −

>use mydb
switched to db mydb

To check your currently selected database, use the command db

>db
mydb

If you want to check your databases list, use the command show dbs.

>show dbs
local     0.78125GB
test      0.23012GB

The database you created (mydb) is not present in the list. For a database to be displayed, you need to insert at least one document into it.

>db.movie.insert({"name":"tutorials point"})
>show dbs
local      0.78125GB
mydb       0.23012GB
test       0.23012GB

In MongoDB, the default database is test. If you don’t create any database, collections will be stored in the test database.

MongoDB – Drop Database

In this chapter, we will see how to drop a database using MongoDB command.

The dropDatabase() Method

The MongoDB db.dropDatabase() command is used to drop an existing database.

Syntax

Basic syntax of dropDatabase() command is as follows −

db.dropDatabase()

This will delete the selected database. If you have not selected any database, it will delete the default ‘test’ database.

Example

First, check the list of available databases by using the command, show dbs.

>show dbs
local      0.78125GB
mydb       0.23012GB
test       0.23012GB
>

If you want to delete the new database mydb, the dropDatabase() command would be as follows −

>use mydb
switched to db mydb
>db.dropDatabase()
{ "dropped" : "mydb", "ok" : 1 }
>

Now check list of databases.

>show dbs
local      0.78125GB
test       0.23012GB
>

MongoDB – Create Collection

In this chapter, we will see how to create a collection using MongoDB.

The createCollection() Method

The MongoDB method db.createCollection(name, options) is used to create a collection.

Syntax

Basic syntax of createCollection() command is as follows −

db.createCollection(name, options)

In the command, name is the name of the collection to be created. options is a document and is used to specify the configuration of the collection.

Parameter   Type       Description
name        String     Name of the collection to be created
options     Document   (Optional) Specify options about memory size and indexing

The options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use −

Field         Type      Description
capped        Boolean   (Optional) If true, enables a capped collection. A capped collection is a
                        fixed-size collection that automatically overwrites its oldest entries when
                        it reaches its maximum size. If you specify true, you need to specify the
                        size parameter also.
autoIndexId   Boolean   (Optional) If true, automatically creates an index on the _id field.
                        Default value is false.
size          Number    (Optional) Specifies a maximum size in bytes for a capped collection. If
                        capped is true, then you need to specify this field also.
max           Number    (Optional) Specifies the maximum number of documents allowed in the capped
                        collection.

While inserting the document, MongoDB first checks size field of capped collection, then it checks max field.
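The overwrite behaviour of a capped collection can be imitated in plain JavaScript. The sketch below is only an in-memory model: the class name CappedList is hypothetical, and the length of the JSON text stands in for the real storage size, which MongoDB measures differently.

```javascript
// In-memory model of a capped collection (hypothetical, not the MongoDB API).
class CappedList {
  constructor({ size, max = Infinity }) {
    this.size = size; // byte budget, approximated by JSON text length
    this.max = max;   // maximum number of documents
    this.docs = [];
  }

  // Approximate storage used by all documents.
  bytes() {
    return this.docs.reduce((total, doc) => total + JSON.stringify(doc).length, 0);
  }

  insert(doc) {
    this.docs.push(doc);
    // First enforce the size limit by dropping the oldest entries...
    while (this.bytes() > this.size && this.docs.length > 1) this.docs.shift();
    // ...then enforce the document-count limit.
    while (this.docs.length > this.max) this.docs.shift();
  }
}

const log = new CappedList({ size: 1000, max: 3 });
for (let i = 1; i <= 5; i++) log.insert({ n: i });
console.log(log.docs.map((doc) => doc.n)); // [ 3, 4, 5 ]
```

After five inserts only the three newest documents remain, which is the behaviour that makes capped collections a natural fit for logs.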

Examples

Basic syntax of createCollection() method without options is as follows −

>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>

You can check the created collection by using the command show collections.

>show collections
mycollection
system.indexes

The following example shows the syntax of the createCollection() method with a few important options −

>db.createCollection("mycol", { capped : true, autoIndexId : true, size : 
   6142800, max : 10000 } )
{ "ok" : 1 }
>

In MongoDB, you don’t need to create a collection explicitly. MongoDB creates the collection automatically when you insert a document.

>db.tutorialspoint.insert({"name" : "tutorialspoint"})
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

MongoDB – Drop Collection

In this chapter, we will see how to drop a collection using MongoDB.

The drop() Method

MongoDB’s db.collection.drop() is used to drop a collection from the database.

Syntax

Basic syntax of drop() command is as follows −

db.COLLECTION_NAME.drop()

Example

First, check the available collections in your database mydb.

>use mydb
switched to db mydb
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

Now drop the collection with the name mycollection.

>db.mycollection.drop()
true
>

Check the list of collections in the database again.

>show collections
mycol
system.indexes
tutorialspoint
>

drop() method will return true, if the selected collection is dropped successfully, otherwise it will return false.

MongoDB – Datatypes

MongoDB supports many datatypes. Some of them are −

  • String − This is the most commonly used datatype to store the data. String in MongoDB must be UTF-8 valid.
  • Integer − This type is used to store a numerical value. Integer can be 32 bit or 64 bit depending upon your server.
  • Boolean − This type is used to store a boolean (true/ false) value.
  • Double − This type is used to store floating point values.
  • Min/ Max keys − This type is used to compare a value against the lowest and highest BSON elements.
  • Arrays − This type is used to store arrays or list or multiple values into one key.
  • Timestamp − This type is used to store a timestamp. This can be handy for recording when a document has been modified or added.
  • Object − This datatype is used for embedded documents.
  • Null − This type is used to store a Null value.
  • Symbol − This datatype is used identically to a string; however, it’s generally reserved for languages that use a specific symbol type.
  • Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own date and time by creating a Date object and passing the year, month and day into it.
  • Object ID − This datatype is used to store the document’s ID.
  • Binary data − This datatype is used to store binary data.
  • Code − This datatype is used to store JavaScript code into the document.
  • Regular expression − This datatype is used to store regular expression.

MongoDB – Insert Document

In this chapter, we will learn how to insert document in MongoDB collection.

The insert() Method

To insert data into a MongoDB collection, you need to use MongoDB’s insert() or save() method.

Syntax

The basic syntax of insert() command is as follows −

>db.COLLECTION_NAME.insert(document)

Example

>db.mycol.insert({
   _id: ObjectId("7df78ad8902c"),
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
})

Here mycol is our collection name, as created in the previous chapter. If the collection doesn’t exist in the database, then MongoDB will create this collection and then insert a document into it.

In the inserted document, if we don’t specify the _id parameter, then MongoDB assigns a unique ObjectId for this document.

_id is a 12-byte value, written in hexadecimal, that is unique for every document in a collection. The 12 bytes are divided as follows −

_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 
   3 bytes incrementer)

To insert multiple documents in a single query, you can pass an array of documents in insert() command.

Example

>db.post.insert([
   {
      title: 'MongoDB Overview', 
      description: 'MongoDB is no sql database',
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 100
   },
	
   {
      title: 'NoSQL Database', 
      description: "NoSQL database doesn't have tables",
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 20, 
      comments: [	
         {
            user:'user1',
            message: 'My first comment',
            dateCreated: new Date(2013,11,10,2,35),
            like: 0 
         }
      ]
   }
])

To insert the document you can also use db.post.save(document). If you don’t specify _id in the document, the save() method works the same as the insert() method. If you specify _id, it replaces the whole document that has that _id with the document passed to save().
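The difference between insert() and save() can be modelled with a plain JavaScript Map keyed by _id. This is a hedged sketch of the semantics described above, not the MongoDB implementation: store, nextId, insert and save are hypothetical names, and a simple counter stands in for ObjectId generation.

```javascript
// In-memory stand-in for a collection, keyed by _id (hypothetical).
const store = new Map();
let nextId = 1; // a counter stands in for generated ObjectIds

// insert: add a new document; an existing _id is an error.
function insert(doc) {
  const _id = doc._id ?? nextId++;
  if (store.has(_id)) throw new Error("duplicate key: " + _id);
  store.set(_id, { ...doc, _id });
  return _id;
}

// save: without _id behave like insert; with _id replace the whole document.
function save(doc) {
  if (doc._id === undefined) return insert(doc);
  store.set(doc._id, { ...doc });
  return doc._id;
}

const id = insert({ title: "MongoDB Overview", likes: 100 });
save({ _id: id, title: "NoSQL Database" });

// The old "likes" field is gone because save() replaced the whole document.
console.log(store.get(id));
```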

MongoDB – Query Document

In this chapter, we will learn how to query document from MongoDB collection.

The find() Method

To query data from a MongoDB collection, you need to use MongoDB’s find() method.

Syntax

The basic syntax of find() method is as follows −

>db.COLLECTION_NAME.find()

The find() method displays all the documents in an unformatted way.

The pretty() Method

To display the results in a formatted way, you can use pretty() method.

Syntax

>db.COLLECTION_NAME.find().pretty()

Example

>db.mycol.find().pretty()
{
   "_id": ObjectId("7df78ad8902c"),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": 100
}
>

Apart from find() method, there is findOne() method, that returns only one document.

RDBMS Where Clause Equivalents in MongoDB

To query documents on the basis of some condition, you can use the following operations.

Operation             Syntax                    Example                                             RDBMS Equivalent
Equality              {<key>:<value>}           db.mycol.find({"by":"tutorials point"}).pretty()    where by = 'tutorials point'
Less Than             {<key>:{$lt:<value>}}     db.mycol.find({"likes":{$lt:50}}).pretty()          where likes < 50
Less Than Equals      {<key>:{$lte:<value>}}    db.mycol.find({"likes":{$lte:50}}).pretty()         where likes <= 50
Greater Than          {<key>:{$gt:<value>}}     db.mycol.find({"likes":{$gt:50}}).pretty()          where likes > 50
Greater Than Equals   {<key>:{$gte:<value>}}    db.mycol.find({"likes":{$gte:50}}).pretty()         where likes >= 50
Not Equals            {<key>:{$ne:<value>}}     db.mycol.find({"likes":{$ne:50}}).pretty()          where likes != 50
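To show how these operators read, here is a minimal sketch of a matcher in plain JavaScript that evaluates the same query documents against an array of objects. The names operators, matchesField and find, and the sample data, are hypothetical; the real server supports many more operators and type rules.

```javascript
// Comparison operators from the table above, as plain functions.
const operators = {
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $ne: (a, b) => a !== b,
};

// A condition is either a plain value (equality) or an operator document.
function matchesField(value, condition) {
  if (condition === null || typeof condition !== "object") {
    return value === condition;
  }
  return Object.entries(condition).every(([op, operand]) => {
    if (!(op in operators)) throw new Error("unsupported operator: " + op);
    return operators[op](value, operand);
  });
}

// Return the documents for which every key in the query matches.
function find(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([key, cond]) => matchesField(doc[key], cond))
  );
}

// Hypothetical sample data.
const docs = [
  { title: "MongoDB Overview", by: "tutorials point", likes: 100 },
  { title: "NoSQL Overview", by: "tutorials point", likes: 20 },
  { title: "Tutorials Point Overview", by: "guest", likes: 50 },
];

console.log(find(docs, { likes: { $lt: 50 } }).map((d) => d.title));  // [ 'NoSQL Overview' ]
console.log(find(docs, { likes: { $gte: 50 } }).length);              // 2
console.log(find(docs, { by: "tutorials point" }).length);            // 2
```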

AND in MongoDB

Syntax

In the find() method, if you pass multiple keys separated by ‘,’ then MongoDB treats them as an AND condition. The same condition can be written explicitly with the $and keyword. Following is the basic syntax of AND −

>db.mycol.find(
   {
      $and: [
         {key1: value1}, {key2:value2}
      ]
   }
).pretty()

Example

Following example will show all the tutorials written by ‘tutorials point’ and whose title is ‘MongoDB Overview’.

>db.mycol.find({$and:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
   "_id": ObjectId("7df78ad8902c"),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": 100
}

For the above example, the equivalent where clause is: where by = 'tutorials point' AND title = 'MongoDB Overview'. You can pass any number of key-value pairs in the find clause.

OR in MongoDB

Syntax

To query documents based on an OR condition, you need to use the $or keyword. Following is the basic syntax of OR −

>db.mycol.find(
   {
      $or: [
         {key1: value1}, {key2:value2}
      ]
   }
).pretty()

Example

Following example will show all the tutorials written by ‘tutorials point’ or whose title is ‘MongoDB Overview’.

>db.mycol.find({$or:[{"by":"tutorials point"},{"title": "MongoDB Overview"}]}).pretty()
{
   "_id": ObjectId("7df78ad8902c"),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": 100
}
>

Using AND and OR Together

Example

The following example shows the documents that have more than 10 likes and whose title is ‘MongoDB Overview’ or whose by field is ‘tutorials point’. The equivalent SQL where clause is: where likes > 10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')

>db.mycol.find({"likes": {$gt:10}, $or: [{"by": "tutorials point"},
   {"title": "MongoDB Overview"}]}).pretty()
{
   "_id": ObjectId("7df78ad8902c"),
   "title": "MongoDB Overview", 
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": 100
}
>
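The way AND and OR combine can be sketched as a small recursive matcher in plain JavaScript. This is an illustrative model only: the function matches and the sample documents are hypothetical, and of the comparison operators only $gt is handled to keep the sketch short.

```javascript
// Evaluate a query document against one document.
// Top-level keys are ANDed together; $and and $or take arrays of sub-queries.
function matches(doc, query) {
  return Object.entries(query).every(([key, cond]) => {
    if (key === "$and") return cond.every((sub) => matches(doc, sub));
    if (key === "$or") return cond.some((sub) => matches(doc, sub));
    if (cond !== null && typeof cond === "object" && "$gt" in cond) {
      return doc[key] > cond.$gt;
    }
    return doc[key] === cond; // plain equality
  });
}

// Hypothetical sample data.
const docs = [
  { title: "MongoDB Overview", by: "tutorials point", likes: 100 },
  { title: "NoSQL Overview", by: "someone else", likes: 50 },
  { title: "MongoDB Overview", by: "guest", likes: 5 },
];

// likes > 10 AND (by = 'tutorials point' OR title = 'MongoDB Overview')
const query = {
  likes: { $gt: 10 },
  $or: [{ by: "tutorials point" }, { title: "MongoDB Overview" }],
};

const hits = docs.filter((doc) => matches(doc, query));
console.log(hits.length); // 1
console.log(hits[0].by);  // tutorials point
```

The third document satisfies the OR part but fails the likes condition, so it is excluded; that is the sense in which sibling keys in a query document act as an AND.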

MongoDB – Update Document

MongoDB’s update() and save() methods are used to update a document in a collection. The update() method updates the values in the existing document, while the save() method replaces the existing document with the document passed to save().

MongoDB Update() Method

The update() method updates the values in the existing document.

Syntax

The basic syntax of update() method is as follows −

>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider the mycol collection has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example sets the new title ‘New MongoDB Tutorial’ on the document whose title is ‘MongoDB Overview’.

>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB Tutorial'}})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

By default, MongoDB will update only a single document. To update multiple documents, you need to set a parameter ‘multi’ to true.

>db.mycol.update({'title':'MongoDB Overview'},
   {$set:{'title':'New MongoDB Tutorial'}},{multi:true})
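The effect of the multi flag can be sketched with a plain JavaScript function over an array. The function update and the sample data are hypothetical and model only equality criteria and the $set operator.

```javascript
// Apply change.$set to matching documents; stop after the first match
// unless options.multi is true. Returns the number of documents modified.
function update(docs, criteria, change, options = {}) {
  let modified = 0;
  for (const doc of docs) {
    const hit = Object.entries(criteria).every(([key, value]) => doc[key] === value);
    if (!hit) continue;
    Object.assign(doc, change.$set); // only the listed fields are changed
    modified++;
    if (!options.multi) break;
  }
  return modified;
}

// Hypothetical sample data with two documents sharing a title.
const docs = [
  { title: "MongoDB Overview", likes: 100 },
  { title: "NoSQL Overview", likes: 20 },
  { title: "MongoDB Overview", likes: 5 },
];

const change = { $set: { title: "New MongoDB Tutorial" } };

// Default behaviour: only the first matching document is updated.
console.log(update(docs, { title: "MongoDB Overview" }, change)); // 1

// With multi: true, every remaining match is updated.
console.log(update(docs, { title: "MongoDB Overview" }, change, { multi: true })); // 1
console.log(docs.map((doc) => doc.title));
```

Note that $set leaves the other fields (here likes) untouched, unlike save(), which replaces the whole document.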

MongoDB Save() Method

The save() method replaces the existing document with the new document passed in the save() method.

Syntax

The basic syntax of MongoDB save() method is shown below −

>db.COLLECTION_NAME.save({_id:ObjectId(),NEW_DATA})

Example

Following example will replace the document with the _id ‘5983548781331adf45ec7’.

>db.mycol.save(
   {
      "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point New Topic",
         "by":"Tutorials Point"
   }
)
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point New Topic",
   "by":"Tutorials Point"}
>

MongoDB – Delete Document

In this chapter, we will learn how to delete a document using MongoDB.

The remove() Method

MongoDB’s remove() method is used to remove documents from a collection. The remove() method accepts two parameters: the deletion criteria and the justOne flag.

  • deletion criteria − (Optional) the criteria that select the documents to be removed.
  • justOne − (Optional) if set to true or 1, remove only one document.

Syntax

Basic syntax of remove() method is as follows −

>db.COLLECTION_NAME.remove(DELETION_CRITERIA)

Example

Consider the mycol collection has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example removes all the documents whose title is ‘MongoDB Overview’.

>db.mycol.remove({'title':'MongoDB Overview'})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

Remove Only One

If multiple records match and you want to delete only the first one, set the justOne parameter in the remove() method.

>db.COLLECTION_NAME.remove(DELETION_CRITERIA,1)

Remove All Documents

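If you don’t specify deletion criteria, MongoDB deletes all documents from the collection. This is the equivalent of SQL’s truncate command. In MongoDB 2.6 and later, remove() requires a criteria argument, so pass an empty document to match everything.

>db.mycol.remove({})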
>db.mycol.find()
>

MongoDB – Projection

In MongoDB, projection means selecting only the necessary data rather than the whole of a document. If a document has 5 fields and you need to show only 3, then select only those 3 fields.

The find() Method

MongoDB’s find() method, explained in MongoDB Query Document, accepts an optional second parameter: the list of fields that you want to retrieve. By default, find() displays all fields of a document. To limit this, you set a list of fields with the value 1 or 0, where 1 shows the field and 0 hides it.

Syntax

The basic syntax of find() method with projection is as follows −

>db.COLLECTION_NAME.find({},{KEY:1})

Example

Consider the collection mycol has the following data −

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example displays only the title of each document.

>db.mycol.find({},{"title":1,_id:0})
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
{"title":"Tutorials Point Overview"}
>

Please note that the _id field is always displayed by find() unless you hide it; if you don’t want this field, set it to 0.
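The inclusion rule, including the special treatment of _id, can be sketched in plain JavaScript. The function project is a hypothetical helper that handles only inclusion lists (fields set to 1, plus _id set to 0); the sample document is made up for illustration.

```javascript
// Return a copy of doc containing only the fields marked 1 in `fields`.
// _id is included by default and dropped only when fields._id is 0.
function project(doc, fields) {
  const result = {};
  if (fields._id !== 0 && "_id" in doc) result._id = doc._id;
  for (const key of Object.keys(fields)) {
    if (key !== "_id" && fields[key] === 1 && key in doc) {
      result[key] = doc[key];
    }
  }
  return result;
}

// Hypothetical sample document.
const doc = { _id: 1, title: "MongoDB Overview", likes: 100 };

console.log(project(doc, { title: 1 }));         // { _id: 1, title: 'MongoDB Overview' }
console.log(project(doc, { title: 1, _id: 0 })); // { title: 'MongoDB Overview' }
```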

MongoDB – Limit Records

In this chapter, we will learn how to limit records using MongoDB.

The Limit() Method

To limit the records in MongoDB, you need to use the limit() method. The method accepts one numeric argument, which is the number of documents that you want displayed.

Syntax

The basic syntax of limit() method is as follows −

>db.COLLECTION_NAME.find().limit(NUMBER)

Example

Consider the collection mycol has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example displays only two documents from the query.

>db.mycol.find({},{"title":1,_id:0}).limit(2)
{"title":"MongoDB Overview"}
{"title":"NoSQL Overview"}
>

If you don’t specify the number argument in the limit() method, it displays all documents from the collection.

MongoDB Skip() Method

Apart from the limit() method, there is one more method, skip(), which also accepts a numeric argument and is used to skip that number of documents.

Syntax

The basic syntax of skip() method is as follows −

>db.COLLECTION_NAME.find().limit(NUMBER).skip(NUMBER)

Example

The following example displays only the second document.

>db.mycol.find({},{"title":1,_id:0}).limit(1).skip(1)
{"title":"NoSQL Overview"}
>

Please note, the default value in skip() method is 0.
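On an in-memory array, skip and limit correspond to a slice. The sketch below uses a hypothetical helper named page; it applies the skip first and the limit second, which matches the result of the example above, where limit(1).skip(1) returns the second document.

```javascript
// Skip `skip` documents, then return at most `limit` documents.
function page(docs, { skip = 0, limit = Infinity } = {}) {
  return docs.slice(skip, skip + limit);
}

const titles = ["MongoDB Overview", "NoSQL Overview", "Tutorials Point Overview"];

console.log(page(titles, { limit: 2 }));          // [ 'MongoDB Overview', 'NoSQL Overview' ]
console.log(page(titles, { limit: 1, skip: 1 })); // [ 'NoSQL Overview' ]
console.log(page(titles).length);                 // 3
```

Combining the two gives simple pagination: page n of size s is skip (n - 1) * s, limit s.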

MongoDB – Sort Records

In this chapter, we will learn how to sort records in MongoDB.

The sort() Method

To sort documents in MongoDB, you need to use the sort() method. The method accepts a document containing a list of fields along with their sorting order. The sorting order is specified with 1 or -1: 1 means ascending order and -1 means descending order.

Syntax

The basic syntax of sort() method is as follows −

>db.COLLECTION_NAME.find().sort({KEY:1})

Example

Consider the collection mycol has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example displays the documents sorted by title in descending order.

>db.mycol.find({},{"title":1,_id:0}).sort({"title":-1})
{"title":"Tutorials Point Overview"}
{"title":"NoSQL Overview"}
{"title":"MongoDB Overview"}
>

Please note that if you don’t specify a sorting preference, MongoDB returns the documents in natural order (the order in which it encounters them), which is not guaranteed to be ascending by any field.
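To make the meaning of 1 and -1 concrete, here is a small in-memory sketch in plain JavaScript. The array mycol and the helper sortBy() are assumptions made for illustration; sortBy() handles a single sort key only and is not how MongoDB implements sorting internally.

```javascript
// In-memory stand-in for the mycol collection.
const mycol = [
  { title: "MongoDB Overview" },
  { title: "NoSQL Overview" },
  { title: "Tutorials Point Overview" },
];

// Hypothetical helper: spec is { field: 1 } (ascending) or { field: -1 } (descending).
function sortBy(docs, spec) {
  const [key, direction] = Object.entries(spec)[0];
  // Copy first so the input array is left untouched.
  return [...docs].sort((a, b) => {
    if (a[key] < b[key]) return -direction;
    if (a[key] > b[key]) return direction;
    return 0;
  });
}

// Like find().sort({"title": -1}): titles in descending order.
console.log(sortBy(mycol, { title: -1 }).map((d) => d.title));
```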

MongoDB – Indexing

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. This scan is highly inefficient and requires MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.
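The idea of an ordered, easy-to-traverse structure can be sketched in a few lines of JavaScript. This is a toy model under stated assumptions: MongoDB indexes are B-tree based, whereas the sketch uses a sorted array with binary search as a stand-in, and the names buildIndex() and findByIndex() are hypothetical.

```javascript
// In-memory stand-in for a collection, deliberately not sorted by title.
const docs = [
  { title: "NoSQL Overview" },
  { title: "Tutorials Point Overview" },
  { title: "MongoDB Overview" },
];

// Build a toy index: [fieldValue, positionInCollection] pairs, sorted by value.
function buildIndex(collection, field) {
  return collection
    .map((doc, pos) => [doc[field], pos])
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
}

// Binary search the index for an exact value; returns the document or null.
// This touches O(log n) index entries instead of scanning every document.
function findByIndex(collection, index, value) {
  let lo = 0;
  let hi = index.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [key, pos] = index[mid];
    if (key === value) return collection[pos];
    if (key < value) lo = mid + 1;
    else hi = mid - 1;
  }
  return null;
}

const titleIndex = buildIndex(docs, "title");
console.log(findByIndex(docs, titleIndex, "NoSQL Overview"));
```

The index holds only the indexed field and a pointer back to the document, which is why it is much smaller than the collection itself.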

The ensureIndex() Method

To create an index, you need to use the ensureIndex() method of MongoDB. Note that ensureIndex() has been deprecated since MongoDB 3.0 in favor of createIndex(), which accepts the same arguments.

Syntax

The basic syntax of ensureIndex() method is as follows −

>db.COLLECTION_NAME.ensureIndex({KEY:1})

Here KEY is the name of the field on which you want to create the index, and 1 is for ascending order. To create the index in descending order, use -1.

Example

>db.mycol.ensureIndex({"title":1})
>

You can pass multiple fields to the ensureIndex() method to create an index on multiple fields.

>db.mycol.ensureIndex({"title":1,"description":-1})
>

The ensureIndex() method also accepts a list of optional settings. Following is the list −

Parameter | Type | Description
background | Boolean | Builds the index in the background so that building an index does not block other database activities. Specify true to build in the background. The default value is false.
unique | Boolean | Creates a unique index so that the collection will not accept insertion of documents where the index key or keys match an existing value in the index. Specify true to create a unique index. The default value is false.
name | string | The name of the index. If unspecified, MongoDB generates an index name by concatenating the names of the indexed fields and the sort order.
dropDups | Boolean | Creates a unique index on a field that may have duplicates. MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key. Specify true to create a unique index. The default value is false. This option is no longer available as of MongoDB 3.0.
sparse | Boolean | If true, the index only references documents with the specified field. These indexes use less space but behave differently in some situations (particularly sorts). The default value is false.
expireAfterSeconds | integer | Specifies a value, in seconds, as a TTL to control how long MongoDB retains documents in this collection.
v | index version | The index version number. The default index version depends on the version of MongoDB running when creating the index.
weights | document | The weight is a number ranging from 1 to 99,999 and denotes the significance of the field relative to the other indexed fields in terms of the score.
default_language | string | For a text index, the language that determines the list of stop words and the rules for the stemmer and tokenizer. The default value is english.
language_override | string | For a text index, the name of the field in the document that contains the language to override the default language. The default value is language.

MongoDB – Aggregation

Aggregation operations process data records and return computed results. They group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. In SQL, count(*) with GROUP BY is the equivalent of MongoDB aggregation.

The aggregate() Method

For aggregation in MongoDB, you should use the aggregate() method.

Syntax

Basic syntax of aggregate() method is as follows −

>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)

Example

In the collection you have the following data −

{
   _id: ObjectId(7df78ad8902c),
   title: 'MongoDB Overview', 
   description: 'MongoDB is no sql database',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100
},
{
   _id: ObjectId(7df78ad8902d),
   title: 'NoSQL Overview', 
   description: 'No sql database is very fast',
   by_user: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 10
},
{
   _id: ObjectId(7df78ad8902e),
   title: 'Neo4j Overview', 
   description: 'Neo4j is no sql database',
   by_user: 'Neo4j',
   url: 'http://www.neo4j.com',
   tags: ['neo4j', 'database', 'NoSQL'],
   likes: 750
},

Now, from the above collection, if you want to display a list stating how many tutorials are written by each user, you will use the following aggregate() method −

> db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : 1}}}])
{
   "result" : [
      {
         "_id" : "tutorials point",
         "num_tutorial" : 2
      },
      {
         "_id" : "Neo4j",
         "num_tutorial" : 1
      }
   ],
   "ok" : 1
}
>

The SQL equivalent of the above query is select by_user, count(*) from mycol group by by_user.

In the above example, we have grouped documents by the field by_user, and on each occurrence of a by_user value the previous value of the sum is incremented. Following is a list of available aggregation expressions.

Expression | Description | Example
$sum | Sums up the defined value from all documents in the collection. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$sum : "$likes"}}}])
$avg | Calculates the average of all given values from all documents in the collection. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$avg : "$likes"}}}])
$min | Gets the minimum of the corresponding values from all documents in the collection. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$min : "$likes"}}}])
$max | Gets the maximum of the corresponding values from all documents in the collection. | db.mycol.aggregate([{$group : {_id : "$by_user", num_tutorial : {$max : "$likes"}}}])
$push | Inserts the value to an array in the resulting document. | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$push: "$url"}}}])
$addToSet | Inserts the value to an array in the resulting document but does not create duplicates. | db.mycol.aggregate([{$group : {_id : "$by_user", url : {$addToSet : "$url"}}}])
$first | Gets the first document from the source documents according to the grouping. This typically only makes sense after a previously applied $sort stage. | db.mycol.aggregate([{$group : {_id : "$by_user", first_url : {$first : "$url"}}}])
$last | Gets the last document from the source documents according to the grouping. This typically only makes sense after a previously applied $sort stage. | db.mycol.aggregate([{$group : {_id : "$by_user", last_url : {$last : "$url"}}}])
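What several of these expressions compute can be sketched on an in-memory array. The snippet below is an illustration in plain JavaScript, not the aggregation framework itself: groupByUser() is a hypothetical helper, and the output field names total_likes, max_likes, urls, and unique_urls are chosen here for readability.

```javascript
// In-memory copy of the three sample documents (only the fields used below).
const mycol = [
  { title: "MongoDB Overview", by_user: "tutorials point", url: "http://www.tutorialspoint.com", likes: 100 },
  { title: "NoSQL Overview", by_user: "tutorials point", url: "http://www.tutorialspoint.com", likes: 10 },
  { title: "Neo4j Overview", by_user: "Neo4j", url: "http://www.neo4j.com", likes: 750 },
];

// Hypothetical helper: one output document per distinct by_user value.
function groupByUser(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.by_user)) {
      groups.set(doc.by_user, {
        _id: doc.by_user, num_tutorial: 0, total_likes: 0,
        max_likes: -Infinity, urls: [], unique_urls: [],
      });
    }
    const g = groups.get(doc.by_user);
    g.num_tutorial += 1;                              // like {$sum : 1}
    g.total_likes += doc.likes;                       // like {$sum : "$likes"}
    g.max_likes = Math.max(g.max_likes, doc.likes);   // like {$max : "$likes"}
    g.urls.push(doc.url);                             // like {$push : "$url"}
    if (!g.unique_urls.includes(doc.url)) {
      g.unique_urls.push(doc.url);                    // like {$addToSet : "$url"}
    }
  }
  return [...groups.values()];
}

console.log(groupByUser(mycol));
```

For the user "tutorials point", urls holds the same URL twice while unique_urls holds it once, which is the difference between $push and $addToSet.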

Pipeline Concept

In a UNIX shell, a pipeline makes it possible to execute an operation on some input and use its output as the input for the next command, and so on. MongoDB supports the same concept in its aggregation framework. There is a set of possible stages, and each stage takes a set of documents as input and produces a resulting set of documents (or the final resulting JSON document at the end of the pipeline). This output can then be used as the input for the next stage, and so on.

Following are the possible stages in aggregation framework −

  • $project − Used to select some specific fields from a collection.
  • $match − This is a filtering operation and thus can reduce the number of documents that are given as input to the next stage.
  • $group − This does the actual aggregation as discussed above.
  • $sort − Sorts the documents.
  • $skip − With this, it is possible to skip forward in the list of documents by a given number of documents.
  • $limit − This limits the number of documents to look at, by the given number, starting from the current position.
  • $unwind − This is used to unwind documents that use arrays. When using an array, the data is effectively pre-joined, and this stage undoes that to produce individual documents again. Thus this stage increases the number of documents for the next stage.
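The pipeline idea, where each stage consumes the previous stage's output, can be sketched as function composition in plain JavaScript. Everything below is an assumption made for illustration: the stage helpers (match, unwind, groupCount, sortDesc, limit) and runPipeline() are hypothetical stand-ins for $match, $unwind, $group, $sort, and $limit, and they operate on an in-memory array rather than a collection.

```javascript
// Each stage is a function from an array of documents to a new array.
const match = (predicate) => (docs) => docs.filter(predicate); // like $match
const unwind = (field) => (docs) =>                            // like $unwind
  docs.flatMap((doc) => doc[field].map((value) => ({ ...doc, [field]: value })));
const groupCount = (field) => (docs) => {                      // like $group with {$sum : 1}
  const counts = new Map();
  for (const doc of docs) counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  return [...counts].map(([_id, count]) => ({ _id, count }));
};
const sortDesc = (field) => (docs) =>                          // like $sort with -1
  [...docs].sort((a, b) => b[field] - a[field]);
const limit = (n) => (docs) => docs.slice(0, n);               // like $limit

// Run the stages left to right, feeding each output into the next stage.
const runPipeline = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

const mycol = [
  { title: "MongoDB Overview", tags: ["mongodb", "database", "NoSQL"], likes: 100 },
  { title: "NoSQL Overview", tags: ["mongodb", "database", "NoSQL"], likes: 10 },
  { title: "Neo4j Overview", tags: ["neo4j", "database", "NoSQL"], likes: 750 },
];

// Most frequent tags among tutorials with at least 50 likes.
const topTags = runPipeline(mycol, [
  match((doc) => doc.likes >= 50),
  unwind("tags"),
  groupCount("tags"),
  sortDesc("count"),
  limit(2),
]);
console.log(topTags);
```

Here $match shrinks the input from three documents to two, while $unwind expands those two documents into six, one per tag, before grouping.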

MongoDB – Replication

Replication is the process of synchronizing data across multiple servers. Replication provides redundancy and increases data availability with multiple copies of data on different database servers. Replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.

Why Replication?

  • To keep your data safe
  • High (24*7) availability of data
  • Disaster recovery
  • No downtime for maintenance (like backups, index rebuilds, compaction)
  • Read scaling (extra copies to read from)
  • Replica set is transparent to the application

How Replication Works in MongoDB

MongoDB achieves replication through replica sets. A replica set is a group of mongod instances that host the same data set. In a replica set, one node is the primary node, which receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set. A replica set can have only one primary node.

  • A replica set is a group of two or more nodes (generally a minimum of 3 nodes is required).
  • In a replica set, one node is the primary node and the remaining nodes are secondary.
  • All data replicates from the primary to the secondary nodes.
  • At the time of automatic failover or maintenance, an election is held and a new primary node is elected.
  • After a failed node recovers, it rejoins the replica set and works as a secondary node.
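The points above can be illustrated with a deliberately simplified toy model in plain JavaScript. It is not MongoDB's replication protocol: the classes Node and ReplicaSet are hypothetical, the operation log is a plain array, and the "election" simply promotes the most up-to-date secondary, whereas real replica sets elect a primary by consensus among voting members.

```javascript
// Toy node: a key/value store plus a count of applied log entries.
class Node {
  constructor(name) {
    this.name = name;
    this.data = new Map();
    this.applied = 0; // how many log entries this node has applied
  }
  apply(op) {
    this.data.set(op.key, op.value);
    this.applied += 1;
  }
}

// Toy replica set: one primary, the rest secondaries, one shared operation log.
class ReplicaSet {
  constructor(names) {
    const nodes = names.map((n) => new Node(n));
    this.primary = nodes[0];
    this.secondaries = nodes.slice(1);
    this.oplog = [];
  }
  // All writes go to the primary and are recorded in the log.
  write(key, value) {
    const op = { key, value };
    this.oplog.push(op);
    this.primary.apply(op);
  }
  // Secondaries replay log entries they have not applied yet.
  sync() {
    for (const node of this.secondaries) {
      while (node.applied < this.oplog.length) node.apply(this.oplog[node.applied]);
    }
  }
  // Simplification: catch everyone up, then promote the most up-to-date secondary.
  // The old primary rejoins as a secondary, as described in the list above.
  failover() {
    this.sync();
    const failed = this.primary;
    this.secondaries.sort((a, b) => b.applied - a.applied);
    this.primary = this.secondaries.shift();
    this.secondaries.push(failed);
  }
}

const rs = new ReplicaSet(["node1", "node2", "node3"]);
rs.write("title", "MongoDB Overview");
rs.sync();
rs.failover();
rs.write("likes", 100);
rs.sync();
console.log(rs.primary.name, rs.secondaries.map((n) => n.name));
```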

In a typical MongoDB replication setup, the client application always interacts with the primary node, and the primary node then replicates the data to the secondary nodes.

[Figure: MongoDB Replication]

Replica Set Features

  • A cluster of N nodes
  • Any one node can be primary
  • All write operations go to primary
  • Automatic failover
  • Automatic recovery
  • Consensus election of primary

Set Up a Replica Set

In this tutorial, we will convert a standalone MongoDB instance to a replica set. The steps are as follows −

  • Shut down the already running MongoDB server.
  • Start the MongoDB server with the --replSet option. Following is the basic syntax of --replSet −
mongod --port "PORT" --dbpath "YOUR_DB_DATA_PATH" --replSet "REPLICA_SET_INSTANCE_NAME"

Example

mongod --port 27017 --dbpath "D:\set up\mongodb\data" --replSet rs0
  • This starts a mongod instance on port 27017 as a member of the replica set named rs0.
  • Now open a command prompt and connect to this mongod instance.
  • In the mongo client, issue the command rs.initiate() to initiate a new replica set.
  • To check the replica set configuration, issue the command rs.conf(). To check the status of the replica set, issue the command rs.status().

Add Members to Replica Set

To add members to a replica set, start mongod instances on multiple machines. Then start a mongo client and issue the command rs.add().

Syntax

The basic syntax of rs.add() command is as follows −

>rs.add("HOST_NAME:PORT")

Example

Suppose your mongod instance name is mongod1.net and it is running on port 27017. To add this instance to the replica set, issue the command rs.add() in the mongo client.

>rs.add("mongod1.net:27017")
>

You can add a mongod instance to a replica set only when you are connected to the primary node. To check whether you are connected to the primary, issue the command db.isMaster() in the mongo client.

MongoDB – Sharding

Sharding is the process of storing data records across multiple machines, and it is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or to provide acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.

Why Sharding?

  • In replication, all writes go to the primary node
  • Latency-sensitive queries still go to the primary
  • A single replica set is limited in size (12 nodes in older releases; 50 members since MongoDB 3.0)
  • Memory can’t be large enough when the active dataset is big
  • Local disk is not big enough
  • Vertical scaling is too expensive

Sharding in MongoDB

The following diagram shows sharding in MongoDB using a sharded cluster.

[Figure: MongoDB Sharding]

A sharded cluster has three main components −

  • Shards − Shards are used to store data. They provide high availability and data consistency. In a production environment, each shard is a separate replica set.
  • Config Servers − Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set to the shards. The query router uses this metadata to target operations to specific shards. In a production environment, sharded clusters have exactly 3 config servers (newer releases deploy the config servers as a replica set instead).
  • Query Routers − Query routers are mongos instances that interface with client applications and direct operations to the appropriate shard. The query router processes and targets the operations to shards and then returns results to the clients. A sharded cluster can contain more than one query router to divide the client request load. A client sends requests to one query router. Generally, a sharded cluster has many query routers.
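How a query router might use the config servers' metadata can be sketched as a lookup over key ranges. The snippet below is a toy model in plain JavaScript under several assumptions: the shard key is numeric, the metadata is a hard-coded array named chunks, each chunk covers a half-open range from min (inclusive) to max (exclusive), and the shard names and helper functions are hypothetical.

```javascript
// Toy metadata: which shard owns which range of shard key values.
const chunks = [
  { min: -Infinity, max: 1000, shard: "shardA" },
  { min: 1000, max: 2000, shard: "shardB" },
  { min: 2000, max: Infinity, shard: "shardC" },
];

// Route an exact-match operation to the single shard owning the key.
function targetShard(chunkMap, key) {
  return chunkMap.find((c) => key >= c.min && key < c.max).shard;
}

// Route a range query over [lo, hi) to every shard whose range overlaps it.
function targetShardsForRange(chunkMap, lo, hi) {
  return chunkMap.filter((c) => c.min < hi && lo < c.max).map((c) => c.shard);
}

console.log(targetShard(chunks, 1500));                // one shard
console.log(targetShardsForRange(chunks, 900, 1100));  // two shards
```

A query that does not include the shard key cannot be targeted this way and has to be sent to every shard, which is why the choice of shard key matters.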

MongoDB – Create Backup

In this chapter, we will see how to create a backup in MongoDB.

Dump MongoDB Data

To create a backup of a database in MongoDB, you should use the mongodump command. This command dumps the entire data of your server into the dump directory. There are many options available with which you can limit the amount of data or create a backup of your remote server.

Syntax

The basic syntax of mongodump command is as follows −

>mongodump

Example

Start your mongod server. Assuming that your mongod server is running on localhost and port 27017, open a command prompt, go to the bin directory of your MongoDB installation, and type the command mongodump −

>mongodump

The command connects to the server running at 127.0.0.1 on port 27017 and backs up all data of the server to the directory /bin/dump/. Following is the output of the command −

[Figure: DB Stats]

Following is a list of available options that can be used with the mongodump command.

Syntax | Description | Example
mongodump --host HOST_NAME --port PORT_NUMBER | This command will back up all databases of the specified mongod instance. | mongodump --host tutorialspoint.com --port 27017
mongodump --dbpath DB_PATH --out BACKUP_DIRECTORY | This command will back up only the specified database at the specified path. | mongodump --dbpath /data/db/ --out /data/backup/
mongodump --collection COLLECTION --db DB_NAME | This command will back up only the specified collection of the specified database. | mongodump --collection mycol --db test

Restore data

To restore backup data, MongoDB’s mongorestore command is used. This command restores all of the data from the backup directory.

Syntax

The basic syntax of mongorestore command is −

>mongorestore

Following is the output of the command −

[Figure: DB Stats]

MongoDB – Deployment

When you are preparing a MongoDB deployment, you should try to understand how your application is going to hold up in production. It’s a good idea to develop a consistent, repeatable approach to managing your deployment environment so that you can minimize any surprises once you’re in production.

The best approach incorporates prototyping your set up, conducting load testing, monitoring key metrics, and using that information to scale your set up. The key part of the approach is to proactively monitor your entire system – this will help you understand how your production system will hold up before deploying, and determine where you will need to add capacity. Having insight into potential spikes in your memory usage, for example, could help put out a write-lock fire before it starts.

To monitor your deployment, MongoDB provides some of the following commands −

mongostat

This command checks the status of all running mongod instances and returns counters of database operations. These counters include inserts, queries, updates, deletes, and cursors. The command also shows when you are hitting page faults and displays your lock percentage. High values here mean that you are running low on memory, hitting write capacity, or have some other performance issue.

To run the command, start your mongod instance. In another command prompt, go to bin directory of your mongodb installation and type mongostat.

D:\set up\mongodb\bin>mongostat

Following is the output of the command −

[Figure: mongostat]

mongotop

This command tracks and reports the read and write activity of a MongoDB instance on a per-collection basis. By default, mongotop returns information every second, and you can change this interval. You should check that this read and write activity matches what your application intends, and that you are not firing too many writes to the database at a time, reading too frequently from disk, or exceeding your working set size.

To run the command, start your mongod instance. In another command prompt, go to bin directory of your mongodb installation and type mongotop.

D:\set up\mongodb\bin>mongotop

Following is the output of the command −

[Figure: mongotop]

To make mongotop return information less frequently, specify the interval in seconds after the mongotop command.

D:\set up\mongodb\bin>mongotop 30

The above example will return values every 30 seconds.

Apart from the MongoDB tools, 10gen provides a free, hosted monitoring service, MongoDB Management Service (MMS), that provides a dashboard and gives you a view of the metrics from your entire cluster.

MongoDB – Java

In this chapter, we will learn how to set up the MongoDB Java driver.

Installation

Before you start using MongoDB in your Java programs, you need to make sure that you have the MongoDB Java driver and Java set up on the machine. You can check a Java tutorial for installing Java on your machine. Now, let us see how to set up the MongoDB Java driver.

  • You need to download the driver jar (mongo.jar). Make sure to download the latest release of it.
  • You need to include mongo.jar in your classpath.

Connect to Database

To connect to a database, you need to specify the database name. If the database doesn’t exist, MongoDB creates it automatically.

Following is the code snippet to connect to the database −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class ConnectToDB { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
   
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 
      System.out.println("Credentials ::"+ credential);     
   } 
}

Now, let’s compile and run the above program to create our database myDb as shown below.

$javac ConnectToDB.java 
$java ConnectToDB

On executing, the above program gives you the following output.

Connected to the database successfully 
Credentials ::MongoCredential{
   mechanism = null, 
   userName = 'sampleUser', 
   source = 'myDb', 
   password = <hidden>, 
   mechanismProperties = {}
}

Create a Collection

To create a collection, createCollection() method of com.mongodb.client.MongoDatabase class is used.

Following is the code snippet to create a collection −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class CreatingCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      //Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      //Creating a collection 
      database.createCollection("sampleCollection"); 
      System.out.println("Collection created successfully"); 
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection created successfully

Getting/Selecting a Collection

To get/select a collection from the database, getCollection() method of com.mongodb.client.MongoDatabase class is used.

Following is the program to get/select a collection −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 

import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class selectingCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Creating a collection 
      System.out.println("Collection created successfully"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("myCollection"); 
      System.out.println("Collection myCollection selected successfully"); 
   }
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection created successfully 
Collection myCollection selected successfully

Insert a Document

To insert a document into MongoDB, the insertOne() method of the com.mongodb.client.MongoCollection class is used.

Following is the code snippet to insert a document −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 

import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class InsertingDocument { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection"); 
      System.out.println("Collection sampleCollection selected successfully");

      Document document = new Document("title", "MongoDB") 
      .append("id", 1)
      .append("description", "database") 
      .append("likes", 100) 
      .append("url", "http://www.tutorialspoint.com/mongodb/") 
      .append("by", "tutorials point");  
      collection.insertOne(document); 
      System.out.println("Document inserted successfully");     
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Document inserted successfully

Retrieve All Documents

To select all documents from the collection, find() method of com.mongodb.client.MongoCollection class is used. This method returns a cursor, so you need to iterate this cursor.

Following is the program to select all documents −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase;  

import java.util.Iterator; 
import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class RetrievingAllDocuments { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential;
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Retrieving a collection 
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully"); 

      // Getting the iterable object 
      FindIterable<Document> iterDoc = collection.find(); 
      int i = 1; 

      // Getting the iterator (typed, so no raw-type warning)
      Iterator<Document> it = iterDoc.iterator(); 
    
      while (it.hasNext()) {  
         System.out.println(it.next());  
         i++; 
      }
   } 
}

On compiling and executing, the above program gives you the following result −

Document{{
   _id = 5967745223993a32646baab8, 
   title = MongoDB, 
   id = 1, 
   description = database, 
   likes = 100, 
   url = http://www.tutorialspoint.com/mongodb/, by = tutorials point
}}  
Document{{
   _id = 7452239959673a32646baab8, 
   title = RethinkDB, 
   id = 2, 
   description = database, 
   likes = 200, 
   url = http://www.tutorialspoint.com/rethinkdb/, by = tutorials point
}}

Update Document

To update a document from the collection, updateOne() method of com.mongodb.client.MongoCollection class is used.

Following is the program to update a document −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 
import com.mongodb.client.model.Filters; 
import com.mongodb.client.model.Updates; 

import java.util.Iterator; 
import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class UpdatingDocuments { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 
     
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection 
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully"); 

      collection.updateOne(Filters.eq("id", 1), Updates.set("likes", 150));       
      System.out.println("Document update successfully...");  
      
      // Retrieving the documents after the update 
      // Getting the iterable object
      FindIterable<Document> iterDoc = collection.find(); 
      int i = 1; 

      // Getting the iterator (typed, so no raw-type warning)
      Iterator<Document> it = iterDoc.iterator(); 

      while (it.hasNext()) {  
         System.out.println(it.next());  
         i++; 
      }     
   }  
}

On compiling and executing, the above program gives you the following result −

Document update successfully... 
Document {{
   _id = 5967745223993a32646baab8, 
   title = MongoDB, 
   id = 1, 
   description = database, 
   likes = 150, 
   url = http://www.tutorialspoint.com/mongodb/, by = tutorials point
}}

Delete a Document

To delete a document from the collection, you need to use the deleteOne() method of the com.mongodb.client.MongoCollection class.

Following is the program to delete a document −

import com.mongodb.client.FindIterable; 
import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase; 
import com.mongodb.client.model.Filters;  

import java.util.Iterator; 
import org.bson.Document; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class DeletingDocuments { 
   
   public static void main( String args[] ) {  
   
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 );
      
      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 

      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully"); 

      // Deleting the documents 
      collection.deleteOne(Filters.eq("id", 1)); 
      System.out.println("Document deleted successfully...");  
      
      // Retrieving the documents after the deletion 
      // Getting the iterable object 
      FindIterable<Document> iterDoc = collection.find(); 
      int i = 1; 

      // Getting the iterator (typed, so no raw-type warning)
      Iterator<Document> it = iterDoc.iterator(); 

      while (it.hasNext()) {  
         System.out.println("Remaining Document: "+i);  
         System.out.println(it.next());  
         i++; 
      }       
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Document deleted successfully...

Dropping a Collection

To drop a collection from a database, you need to use the drop() method of the com.mongodb.client.MongoCollection class.

Following is the program to drop a collection −

import com.mongodb.client.MongoCollection; 
import com.mongodb.client.MongoDatabase;  

import org.bson.Document;  
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class DropingCollection { 
   
   public static void main( String args[] ) {  

      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 
      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb");  
      
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully");

      // Dropping a Collection 
      collection.drop(); 
      System.out.println("Collection dropped successfully");
   } 
}

On compiling and executing, the above program gives you the following result −

Connected to the database successfully 
Collection sampleCollection selected successfully 
Collection dropped successfully

Listing All the Collections

To list all the collections in a database, you need to use the listCollectionNames() method of the com.mongodb.client.MongoDatabase class.

Following is the program to list all the collections of a database −

import com.mongodb.client.MongoDatabase; 
import com.mongodb.MongoClient; 
import com.mongodb.MongoCredential;  

public class ListOfCollection { 
   
   public static void main( String args[] ) {  
      
      // Creating a Mongo client 
      MongoClient mongo = new MongoClient( "localhost" , 27017 ); 

      // Creating Credentials 
      MongoCredential credential; 
      credential = MongoCredential.createCredential("sampleUser", "myDb", 
         "password".toCharArray()); 

      System.out.println("Connected to the database successfully");  
      
      // Accessing the database 
      MongoDatabase database = mongo.getDatabase("myDb"); 
      System.out.println("Collection created successfully"); 
      for (String name : database.listCollectionNames()) { 
         System.out.println(name); 
      } 
   }
}

On compiling and running, the above program gives you the following result −

Connected to the database successfully 
Collection created successfully 
myCollection 
myCollection1 
myCollection5

The remaining MongoDB methods such as save(), limit(), skip() and sort() work the same way as explained in the subsequent tutorial.

MongoDB – PHP

To use MongoDB with PHP, you need the MongoDB PHP driver. Download the latest release of the driver, unzip the archive, put php_mongo.dll in your PHP extension directory (“ext” by default) and add the following line to your php.ini file −

extension = php_mongo.dll

Make a Connection and Select a Database

To make a connection, you need to specify the database name. If the database doesn’t exist, MongoDB creates it automatically.

Following is the code snippet to connect to the database −

<?php
   // connect to mongodb
   $m = new MongoClient();
	
   echo "Connection to database successfully";
   // select a database
   $db = $m->mydb;
	
   echo "Database mydb selected";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected

Create a Collection

Following is the code snippet to create a collection −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->createCollection("mycol");
   echo "Collection created successfully";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection created successfully

Insert a Document

To insert a document into MongoDB, the insert() method is used.

Following is the code snippet to insert a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected successfully";

   $document = array( 
      "title" => "MongoDB", 
      "description" => "database", 
      "likes" => 100,
      "url" => "http://www.tutorialspoint.com/mongodb/",
      "by" => "tutorials point"
   );

   $collection->insert($document);
   echo "Document inserted successfully";
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Document inserted successfully

Find All Documents

To select all documents from the collection, the find() method is used.

Following is the code snippet to select all documents −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected successfully";

   $cursor = $collection->find();

   // iterate cursor to display title of documents
   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
MongoDB

Update a Document

To update a document, you need to use the update() method.

In the following example, we will update the title of inserted document to MongoDB Tutorial. Following is the code snippet to update a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected successfully";

   // now update the document
   $collection->update(array("title"=>"MongoDB"), 
      array('$set'=>array("title"=>"MongoDB Tutorial")));
   echo "Document updated successfully";

   // now display the updated document
   $cursor = $collection->find();

   // iterate cursor to display title of documents
   echo "Updated document";

   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Document updated successfully
Updated document
MongoDB Tutorial

Delete a Document

To delete a document, you need to use the remove() method.

In the following example, we will remove the documents that have the title MongoDB Tutorial. Following is the code snippet to delete a document −

<?php
   // connect to mongodb
   $m = new MongoClient();
   echo "Connection to database successfully";
	
   // select a database
   $db = $m->mydb;
   echo "Database mydb selected";
   $collection = $db->mycol;
   echo "Collection selected successfully";

   // now remove the document; justOne => false removes every match
   $collection->remove(array("title"=>"MongoDB Tutorial"), array("justOne" => false));
   echo "Documents deleted successfully";

   // now display the available documents
   $cursor = $collection->find();

   // iterate cursor to display title of the remaining documents
   echo "Remaining documents";

   foreach ($cursor as $document) {
      echo $document["title"] . "\n";
   }
?>

When the program is executed, it will produce the following result −

Connection to database successfully
Database mydb selected
Collection selected successfully
Documents deleted successfully
Remaining documents

In the above example, the second parameter is an options array. Its justOne key is a boolean that controls whether remove() deletes only the first matching document or all of them.

The remaining MongoDB methods findOne(), save(), limit(), skip(), sort(), etc. work the same way as explained above.

MongoDB – Relationships

Relationships in MongoDB represent how various documents are logically related to each other. Relationships can be modeled via Embedded and Referenced approaches. Such relationships can be either 1:1, 1:N, N:1 or N:N.

Let us consider the case of storing addresses for users. So, one user can have multiple addresses making this a 1:N relationship.

Following is the sample document structure of user document −

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "name": "Tom Benzamin",
   "contact": "987654321",
   "dob": "01-01-1991"
}

Following is the sample document structure of address document −

{
   "_id":ObjectId("52ffc4a5d85242602e000000"),
   "building": "22 A, Indiana Apt",
   "pincode": 123456,
   "city": "Los Angeles",
   "state": "California"
}

Modeling Embedded Relationships

In the embedded approach, we will embed the address document inside the user document.

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin",
   "address": [
      {
         "building": "22 A, Indiana Apt",
         "pincode": 123456,
         "city": "Los Angeles",
         "state": "California"
      },
      {
         "building": "170 A, Acropolis Apt",
         "pincode": 456789,
         "city": "Chicago",
         "state": "Illinois"
      }
   ]
}

This approach maintains all the related data in a single document, which makes it easy to retrieve and maintain. The whole document can be retrieved in a single query such as −

>db.users.findOne({"name":"Tom Benzamin"},{"address":1})

Note that in the above query, db and users are the database and collection respectively.

The drawback is that if the embedded document keeps growing in size, it can degrade read/write performance.

Modeling Referenced Relationships

This is the approach of designing normalized relationships. In this approach, the user and address documents are maintained separately, but the user document contains a field that references the address document’s id field.

{
   "_id":ObjectId("52ffc33cd85242f436000001"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin",
   "address_ids": [
      ObjectId("52ffc4a5d85242602e000000"),
      ObjectId("52ffc4a5d85242602e000001")
   ]
}

As shown above, the user document contains the array field address_ids which contains ObjectIds of corresponding addresses. Using these ObjectIds, we can query the address documents and get address details from there. With this approach, we will need two queries: first to fetch the address_ids fields from user document and second to fetch these addresses from address collection.

>var result = db.users.findOne({"name":"Tom Benzamin"},{"address_ids":1})
>var addresses = db.address.find({"_id":{"$in":result["address_ids"]}})
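
The two-step lookup can also be sketched outside the shell. The following Python snippet is only an illustration: plain lists stand in for the users and address collections, ObjectIds are kept as strings, and the helper name find_addresses_for is an assumption made for this sketch, not part of any MongoDB driver.

```python
# In-memory stand-ins for the users and address collections (illustrative only).
users = [
    {
        "_id": "52ffc33cd85242f436000001",
        "name": "Tom Benzamin",
        "address_ids": ["52ffc4a5d85242602e000000", "52ffc4a5d85242602e000001"],
    }
]
address = [
    {"_id": "52ffc4a5d85242602e000000", "building": "22 A, Indiana Apt", "city": "Los Angeles"},
    {"_id": "52ffc4a5d85242602e000001", "building": "170 A, Acropolis Apt", "city": "Chicago"},
]


def find_addresses_for(name):
    """Hypothetical helper: mimic findOne() on users, then an $in lookup on address."""
    # Query 1: fetch the user document and read its address_ids field.
    user = next((u for u in users if u["name"] == name), None)
    if user is None:
        return []
    wanted = set(user["address_ids"])
    # Query 2: fetch every address whose _id is in that list.
    return [a for a in address if a["_id"] in wanted]


cities = [a["city"] for a in find_addresses_for("Tom Benzamin")]
print(cities)
```

The point of the sketch is the cost model: the referenced approach always pays for a second lookup, which the embedded approach avoids.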

MongoDB – Database References

As seen in the last chapter of MongoDB relationships, to implement a normalized database structure in MongoDB, we use the concept of Referenced Relationships, also referred to as Manual References, in which we manually store the referenced document’s id inside another document. However, in cases where a document contains references from different collections, we can use MongoDB DBRefs.

DBRefs vs Manual References

As an example scenario, where we would use DBRefs instead of manual references, consider a database where we are storing different types of addresses (home, office, mailing, etc.) in different collections (address_home, address_office, address_mailing, etc). Now, when a user collection’s document references an address, it also needs to specify which collection to look into based on the address type. In such scenarios where a document references documents from many collections, we should use DBRefs.

Using DBRefs

There are three fields in DBRefs −

  • $ref − This field specifies the collection of the referenced document
  • $id − This field specifies the _id field of the referenced document
  • $db − This is an optional field and contains the name of the database in which the referenced document lies

Consider a sample user document having DBRef field address as shown in the code snippet −

{
   "_id":ObjectId("53402597d852426020000002"),
   "address": {
      "$ref": "address_home",
      "$id": ObjectId("534009e4d852427820000002"),
      "$db": "tutorialspoint"
   },
   "contact": "987654321",
   "dob": "01-01-1991",
   "name": "Tom Benzamin"
}

The address DBRef field here specifies that the referenced address document lies in address_home collection under tutorialspoint database and has an id of 534009e4d852427820000002.

The following code dynamically looks in the collection specified by $ref parameter (address_home in our case) for a document with id as specified by $id parameter in DBRef.

>var user = db.users.findOne({"name":"Tom Benzamin"})
>var dbRef = user.address
>db[dbRef.$ref].findOne({"_id":(dbRef.$id)})

The above code returns the following address document present in address_home collection −

{
   "_id" : ObjectId("534009e4d852427820000002"),
   "building" : "22 A, Indiana Apt",
   "pincode" : 123456,
   "city" : "Los Angeles",
   "state" : "California"
}
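
The same resolution logic can be sketched in Python with nested dictionaries standing in for databases and collections. This is a minimal illustration under stated assumptions: resolve_dbref, the default_db argument and the empty address_office collection are hypothetical names introduced here, and ObjectIds are kept as plain strings.

```python
# Databases and collections modelled as nested dictionaries (illustrative only).
databases = {
    "tutorialspoint": {
        "address_home": [
            {"_id": "534009e4d852427820000002", "building": "22 A, Indiana Apt",
             "pincode": 123456, "city": "Los Angeles", "state": "California"},
        ],
        "address_office": [],
    }
}


def resolve_dbref(dbref, default_db="tutorialspoint"):
    """Hypothetical helper: follow $db (optional), $ref and $id to the document."""
    db = databases[dbref.get("$db", default_db)]
    collection = db[dbref["$ref"]]
    return next((doc for doc in collection if doc["_id"] == dbref["$id"]), None)


user = {
    "name": "Tom Benzamin",
    "address": {"$ref": "address_home", "$id": "534009e4d852427820000002",
                "$db": "tutorialspoint"},
}
home = resolve_dbref(user["address"])
print(home["city"])
```

Because the collection name travels with the reference, the caller does not need to know in advance which address collection to search.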

MongoDB – Covered Queries

In this chapter, we will learn about covered queries.

What is a Covered Query?

As per the official MongoDB documentation, a covered query is a query in which −

  • All the fields in the query are part of an index.
  • All the fields returned in the query are in the same index.

Since all the fields present in the query are part of an index, MongoDB matches the query conditions and returns the result using the same index without actually looking inside the documents. Since indexes are usually held in RAM, fetching data from indexes is much faster than fetching data by scanning documents.

Using Covered Queries

To test covered queries, consider the following document in the users collection −

{
   "_id": ObjectId("53402597d852426020000002"),
   "contact": "987654321",
   "dob": "01-01-1991",
   "gender": "M",
   "name": "Tom Benzamin",
   "user_name": "tombenzamin"
}

We will first create a compound index for the users collection on the fields gender and user_name using the following query −

>db.users.ensureIndex({gender:1,user_name:1})

Now, this index will cover the following query −

>db.users.find({gender:"M"},{user_name:1,_id:0})

That is to say that for the above query, MongoDB would not go looking into database documents. Instead it would fetch the required data from indexed data which is very fast.

Since our index does not include _id field, we have explicitly excluded it from result set of our query, as MongoDB by default returns _id field in every query. So the following query would not have been covered inside the index created above −

>db.users.find({gender:"M"},{user_name:1})

Lastly, remember that an index cannot cover a query if −

  • Any of the indexed fields is an array
  • Any of the indexed fields is a subdocument
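
The two conditions for a covered query can be expressed as a small check. The Python function below is a simplified sketch, not MongoDB's query planner: is_covered is a hypothetical name, it assumes an inclusion-style projection, and it ignores the array and subdocument restrictions listed above.

```python
def is_covered(index_fields, query, projection):
    """Hypothetical check for the two covered-query conditions in this chapter."""
    indexed = set(index_fields)
    # Condition 1: every field used in the query filter is part of the index.
    if not set(query) <= indexed:
        return False
    # Fields explicitly requested by an inclusion projection.
    returned = {field for field, flag in projection.items() if flag}
    # _id is returned by default unless the projection sets it to 0.
    if projection.get("_id", 1):
        returned.add("_id")
    # Condition 2: every returned field is part of the same index.
    return returned <= indexed


index = ["gender", "user_name"]
with_id_excluded = is_covered(index, {"gender": "M"}, {"user_name": 1, "_id": 0})
with_id_included = is_covered(index, {"gender": "M"}, {"user_name": 1})
print(with_id_excluded, with_id_included)
```

The second call reports False only because _id is returned implicitly, which mirrors the reason for excluding it in the shell query.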

MongoDB – Analyzing Queries

Analyzing queries is a very important aspect of measuring how effective the database and indexing design is. We will learn about the frequently used $explain and $hint queries.

Using $explain

The $explain operator provides information on the query, indexes used in a query and other statistics. It is very useful when analyzing how well your indexes are optimized.

In the last chapter, we had already created an index for the users collection on fields gender and user_name using the following query −

>db.users.ensureIndex({gender:1,user_name:1})

We will now use $explain on the following query −

>db.users.find({gender:"M"},{user_name:1,_id:0}).explain()

The above explain() query returns the following analyzed result −

{
   "cursor" : "BtreeCursor gender_1_user_name_1",
   "isMultiKey" : false,
   "n" : 1,
   "nscannedObjects" : 0,
   "nscanned" : 1,
   "nscannedObjectsAllPlans" : 0,
   "nscannedAllPlans" : 1,
   "scanAndOrder" : false,
   "indexOnly" : true,
   "nYields" : 0,
   "nChunkSkips" : 0,
   "millis" : 0,
   "indexBounds" : {
      "gender" : [
         [
            "M",
            "M"
         ]
      ],
      "user_name" : [
         [
            {
               "$minElement" : 1
            },
            {
               "$maxElement" : 1
            }
         ]
      ]
   }
}

We will now look at the fields in this result set −

  • The true value of indexOnly indicates that this query was answered from the index alone, that is, it was a covered query.
  • The cursor field specifies the type of cursor used. BtreeCursor type indicates that an index was used and also gives the name of the index used. BasicCursor indicates that a full scan was made without using any indexes.
  • n indicates the number of matching documents returned.
  • nscannedObjects indicates the total number of documents scanned.
  • nscanned indicates the total number of documents or index entries scanned.

Using $hint

The $hint operator forces the query optimizer to use the specified index to run a query. This is particularly useful when you want to test performance of a query with different indexes. For example, the following query specifies the index on fields gender and user_name to be used for this query −

>db.users.find({gender:"M"},{user_name:1,_id:0}).hint({gender:1,user_name:1})

To analyze the above query using $explain −

>db.users.find({gender:"M"},{user_name:1,_id:0}).hint({gender:1,user_name:1}).explain()

MongoDB – Atomic Operations

MongoDB versions before 4.0 do not support multi-document atomic transactions. However, MongoDB does provide atomic operations on a single document. So if a document has a hundred fields, an update statement will either update all the fields or none, hence maintaining atomicity at the document level.

Model Data for Atomic Operations

The recommended approach to maintain atomicity is to keep all the related information that is frequently updated together in a single document, using embedded documents. This makes sure that all the updates for a single document are atomic.

Consider the following products document −

{
   "_id":1,
   "product_name": "Samsung S3",
   "category": "mobiles",
   "product_total": 5,
   "product_available": 3,
   "product_bought_by": [
      {
         "customer": "john",
         "date": "7-Jan-2014"
      },
      {
         "customer": "mark",
         "date": "8-Jan-2014"
      }
   ]
}

In this document, we have embedded the information of the customer who buys the product in the product_bought_by field. Now, whenever a new customer buys the product, we will first check if the product is still available using product_available field. If available, we will reduce the value of product_available field as well as insert the new customer’s embedded document in the product_bought_by field. We will use findAndModify command for this functionality because it searches and updates the document in the same go.

>db.products.findAndModify({ 
   query:{_id:1,product_available:{$gt:0}}, 
   update:{ 
      $inc:{product_available:-1}, 
      $push:{product_bought_by:{customer:"rob",date:"9-Jan-2014"}} 
   }    
})

Our approach of using an embedded document with a findAndModify query makes sure that the product purchase information is updated only if the product is available. And since the whole of this transaction happens in the same query, it is atomic.

In contrast to this, consider the scenario where we keep the product availability and the information on who has bought the product separately. In this case, we will first check if the product is available using the first query. Then in the second query we will update the purchase information. However, it is possible that between the executions of these two queries, some other user has purchased the product and it is no longer available. Without knowing this, our second query will update the purchase information based on the result of our first query. This will make the database inconsistent because we have sold a product which is not available.
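
The difference between the two scenarios can be illustrated with a small Python sketch. It is not MongoDB code: a threading.Lock stands in for the single-document atomicity of findAndModify, the helper name purchase is hypothetical, and the customer names other than rob are invented for the illustration.

```python
import threading

product = {"_id": 1, "product_available": 3, "product_bought_by": []}
_lock = threading.Lock()  # stands in for MongoDB's single-document atomicity


def purchase(customer, date):
    """Hypothetical helper: check availability and update in one indivisible step."""
    with _lock:
        if product["product_available"] <= 0:
            return False  # the query part matched nothing, so nothing is updated
        product["product_available"] -= 1  # corresponds to $inc
        product["product_bought_by"].append({"customer": customer, "date": date})  # $push
        return True


results = []
threads = [
    threading.Thread(target=lambda c=c: results.append(purchase(c, "9-Jan-2014")))
    for c in ["rob", "amy", "lee", "kim", "joe"]
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results), product["product_available"])
```

Five buyers compete for three items, and exactly three succeed. If the check and the update were two separate steps outside the lock, more than three purchases could be recorded.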

MongoDB – Advanced Indexing

Consider the following document of the users collection −

{
   "address": {
      "city": "Los Angeles",
      "state": "California",
      "pincode": "123"
   },
   "tags": [
      "music",
      "cricket",
      "blogs"
   ],
   "name": "Tom Benzamin"
}

The above document contains an address sub-document and a tags array.

Indexing Array Fields

Suppose we want to search user documents based on the user’s tags. For this, we will create an index on tags array in the collection.

Creating an index on an array in turn creates a separate index entry for each of its elements. So in our case, when we create an index on the tags array, separate index entries will be created for its values music, cricket and blogs.

To create an index on tags array, use the following code −

>db.users.ensureIndex({"tags":1})

After creating the index, we can search on the tags field of the collection like this −

>db.users.find({tags:"cricket"})

To verify that proper indexing is used, use the following explain command −

>db.users.find({tags:"cricket"}).explain()

The above command resulted in "cursor" : "BtreeCursor tags_1" which confirms that proper indexing is used.

Indexing Sub-Document Fields

Suppose that we want to search documents based on city, state and pincode fields. Since all these fields are part of address sub-document field, we will create an index on all the fields of the sub-document.

For creating an index on all the three fields of the sub-document, use the following code −

>db.users.ensureIndex({"address.city":1,"address.state":1,"address.pincode":1})

Once the index is created, we can search for any of the sub-document fields utilizing this index as follows −

>db.users.find({"address.city":"Los Angeles"})

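Remember that the query has to use a leading prefix of the indexed fields, in the order specified in the index: address.city alone, address.city with address.state, or all three fields. So the index created above would support the following queries −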

>db.users.find({"address.city":"Los Angeles","address.state":"California"})

It will also support the following query −

>db.users.find({"address.city":"Los Angeles","address.state":"California",
   "address.pincode":"123"})

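One way to read the ordering rule above is that the queried fields should form a leading prefix of the indexed fields. The Python function below is a deliberately simplified sketch of that rule; matches_index_prefix is a hypothetical name, and the real query planner is more flexible (for example, it may still use part of an index when only the first field matches).

```python
def matches_index_prefix(index_fields, query_fields):
    """Hypothetical check: do the queried fields form a leading prefix of the index?"""
    queried = set(query_fields)
    prefix = set(index_fields[:len(queried)])
    return len(queried) > 0 and prefix == queried


index = ["address.city", "address.state", "address.pincode"]
city_only = matches_index_prefix(index, ["address.city"])
city_and_state = matches_index_prefix(index, ["address.state", "address.city"])
state_only = matches_index_prefix(index, ["address.state"])
print(city_only, city_and_state, state_only)
```

Note that the order in which fields are written inside the query document does not matter in this sketch; only which fields are present does.
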
MongoDB – Indexing Limitations

In this chapter, we will learn about Indexing Limitations and its other components.

Extra Overhead

Every index occupies some space as well as causes an overhead on each insert, update and delete. So if you rarely use your collection for read operations, it makes sense not to use indexes.

RAM Usage

Since indexes are stored in RAM, you should make sure that the total size of the indexes does not exceed the available RAM. If the total size exceeds the available RAM, parts of the indexes have to be read from disk, causing performance loss.

Query Limitations

Indexes cannot be used efficiently in queries which use −

  • Regular expressions or negation operators like $nin, $not, etc.
  • Arithmetic operators like $mod, etc.
  • $where clause

Hence, it is always advisable to check the index usage for your queries.

Index Key Limits

Starting from version 2.6, MongoDB will not create an index if the value of existing index field exceeds the index key limit.

Inserting Documents Exceeding Index Key Limit

MongoDB will not insert any document into an indexed collection if the indexed field value of this document exceeds the index key limit. Same is the case with mongorestore and mongoimport utilities.

Maximum Ranges

  • A collection cannot have more than 64 indexes.
  • The length of the index name cannot be longer than 125 characters.
  • A compound index can have maximum 31 fields indexed.

MongoDB – ObjectId

We have been using MongoDB Object Id in all the previous chapters. In this chapter, we will understand the structure of ObjectId.

An ObjectId is a 12-byte BSON type having the following structure −

  • The first 4 bytes represent the seconds since the unix epoch
  • The next 3 bytes are the machine identifier
  • The next 2 bytes consist of the process id
  • The last 3 bytes are a counter, starting from a random value

MongoDB uses ObjectIds as the default value of the _id field of each document, which is generated when the document is created. The combination of these parts makes all the _id fields unique.

Creating New ObjectId

To generate a new ObjectId use the following code −

>newObjectId = ObjectId()

The above statement returned the following uniquely generated id −

ObjectId("5349b4ddd2781d08c09890f3")

Instead of MongoDB generating the ObjectId, you can also provide a 12-byte id −

>myObjectId = ObjectId("5349b4ddd2781d08c09890f4")

Creating Timestamp of a Document

Since the _id ObjectId by default stores the 4-byte timestamp, in most cases you do not need to store the creation time of any document. You can fetch the creation time of a document using getTimestamp method −

>ObjectId("5349b4ddd2781d08c09890f4").getTimestamp()

This will return the creation time of this document in ISO date format −

ISODate("2014-04-12T21:49:17Z")
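
To make the byte layout concrete, the following Python sketch splits the 24-character hex string into the four parts listed earlier and decodes the timestamp. The function name split_object_id is hypothetical, and reading the process id and counter as big-endian integers is an assumption made for this sketch.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into the four parts described in this chapter."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes")
    # First 4 bytes: seconds since the unix epoch, stored big-endian.
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                       # next 3 bytes
        "process_id": int.from_bytes(raw[7:9], "big"),   # next 2 bytes (byte order assumed)
        "counter": int.from_bytes(raw[9:12], "big"),     # last 3 bytes
    }


parts = split_object_id("5349b4ddd2781d08c09890f4")
print(parts["timestamp"].isoformat())  # 2014-04-12T21:49:17+00:00
```

The decoded time agrees with the value returned by getTimestamp() above.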

Converting ObjectId to String

In some cases, you may need the value of ObjectId in a string format. To convert the ObjectId to a string, use the following code −

>newObjectId.str

The above code will return the string format of the ObjectId −

5349b4ddd2781d08c09890f3

MongoDB – Map Reduce

As per the MongoDB documentation, Map-reduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. MongoDB uses mapReduce command for map-reduce operations. MapReduce is generally used for processing large data sets.

MapReduce Command

Following is the syntax of the basic mapReduce command −

>db.collection.mapReduce(
   function() {emit(key,value);},  //map function
   function(key,values) {return reduceFunction}, {   //reduce function
      out: collection,
      query: document,
      sort: document,
      limit: number
   }
)

The map-reduce function first queries the collection, then maps the result documents to emit key-value pairs, which are then reduced for the keys that have multiple values.

In the above syntax −

  • map is a javascript function that maps a value with a key and emits a key-value pair
  • reduce is a javascript function that reduces or groups all the documents having the same key
  • out specifies the location of the map-reduce query result
  • query specifies the optional selection criteria for selecting documents
  • sort specifies the optional sort criteria
  • limit specifies the optional maximum number of documents to be passed to the map function

Using MapReduce

Consider the following document structure storing user posts. The document stores user_name of the user and the status of post.

{
   "post_text": "tutorialspoint is an awesome website for tutorials",
   "user_name": "mark",
   "status":"active"
}

Now, we will use a mapReduce function on our posts collection to select all the active posts, group them on the basis of user_name and then count the number of posts by each user using the following code −

>db.posts.mapReduce( 
   function() { emit(this.user_name,1); }, 
	
   function(key, values) {return Array.sum(values)}, {  
      query:{status:"active"},  
      out:"post_total" 
   }
)

The above mapReduce query outputs the following result −

{
   "result" : "post_total",
   "timeMillis" : 9,
   "counts" : {
      "input" : 4,
      "emit" : 4,
      "reduce" : 2,
      "output" : 2
   },
   "ok" : 1
}

The result shows that a total of 4 documents matched the query (status:"active"), the map function emitted 4 key-value pairs and finally the reduce function grouped the emitted pairs having the same keys into 2 results.

To see the result of this mapReduce query, use the find operator −

>db.posts.mapReduce( 
   function() { emit(this.user_name,1); }, 
   function(key, values) {return Array.sum(values)}, {  
      query:{status:"active"},  
      out:"post_total" 
   }
	
).find()

The above query gives the following result which indicates that both users tom and mark have two posts in active states −

{ "_id" : "tom", "value" : 2 }
{ "_id" : "mark", "value" : 2 }

In a similar manner, MapReduce queries can be used to construct large complex aggregation queries. The use of custom JavaScript functions makes MapReduce very flexible and powerful.
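
The three phases can be imitated in a few lines of Python to show how the counts above come about. This is an in-memory sketch, not the mapReduce command: the map_reduce function, its parameter names and the sample posts (including the one with status disabled) are assumptions made for illustration. Unlike MongoDB, this sketch calls the reduce function even for keys with a single value.

```python
from collections import defaultdict

posts = [
    {"user_name": "tom", "status": "active"},
    {"user_name": "tom", "status": "active"},
    {"user_name": "mark", "status": "active"},
    {"user_name": "mark", "status": "active"},
    {"user_name": "mark", "status": "disabled"},
]


def map_reduce(documents, query, map_fn, reduce_fn):
    """Hypothetical in-memory imitation of the query, map and reduce phases."""
    grouped = defaultdict(list)
    for doc in documents:
        # Query phase: keep only documents matching every key in the query.
        if all(doc.get(k) == v for k, v in query.items()):
            # Map phase: collect each emitted (key, value) pair under its key.
            for key, value in map_fn(doc):
                grouped[key].append(value)
    # Reduce phase: collapse the list of values emitted for each key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


post_total = map_reduce(
    posts,
    query={"status": "active"},
    map_fn=lambda doc: [(doc["user_name"], 1)],
    reduce_fn=lambda key, values: sum(values),
)
print(post_total)  # {'tom': 2, 'mark': 2}
```

Four documents pass the query, four pairs are emitted, and two keys remain after the reduce phase, matching the input, emit and output counts shown earlier.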

MongoDB – Text Search

Starting from version 2.4, MongoDB supports text indexes to search inside string content. Text Search uses stemming techniques to look for specified words in the string fields and drops stop words like a, an, the, etc. At present, MongoDB supports around 15 languages.

Enabling Text Search

Initially, Text Search was an experimental feature but starting from version 2.6, the configuration is enabled by default. But if you are using the previous version of MongoDB, you have to enable text search with the following code −

>db.adminCommand({setParameter:true,textSearchEnabled:true})

Creating Text Index

Consider the following document under posts collection containing the post text and its tags −

{
   "post_text": "enjoy the mongodb articles on tutorialspoint",
   "tags": [
      "mongodb",
      "tutorialspoint"
   ]
}

We will create a text index on post_text field so that we can search inside our posts’ text −

>db.posts.ensureIndex({post_text:"text"})

Using Text Index

Now that we have created the text index on post_text field, we will search for all the posts having the word tutorialspoint in their text.

>db.posts.find({$text:{$search:"tutorialspoint"}})

The above command returned the following result documents having the word tutorialspoint in their post text −

{ 
   "_id" : ObjectId("53493d14d852429c10000002"), 
   "post_text" : "enjoy the mongodb articles on tutorialspoint", 
   "tags" : [ "mongodb", "tutorialspoint" ]
}
{
   "_id" : ObjectId("53493d1fd852429c10000003"), 
   "post_text" : "writing tutorials on mongodb",
   "tags" : [ "mongodb", "tutorial" ] 
}

If you are using old versions of MongoDB, you have to use the following command −

>db.posts.runCommand("text",{search:"tutorialspoint"})

Using Text Search is considerably more efficient than a normal search that scans every document.

Deleting Text Index

To delete an existing text index, first find the name of index using the following query −

>db.posts.getIndexes()

After getting the name of your index from above query, run the following command. Here, post_text_text is the name of the index.

>db.posts.dropIndex("post_text_text")

MongoDB – Regular Expression

Regular Expressions are frequently used in all languages to search for a pattern or word in any string. MongoDB also provides functionality of regular expression for string pattern matching using the $regex operator. MongoDB uses PCRE (Perl Compatible Regular Expression) as regular expression language.

Unlike text search, we do not need to do any configuration or command to use regular expressions.

Consider the following document structure under posts collection containing the post text and its tags −

{
   "post_text": "enjoy the mongodb articles on tutorialspoint",
   "tags": [
      "mongodb",
      "tutorialspoint"
   ]
}

Using regex Expression

The following regex query searches for all the posts containing string tutorialspoint in it −

>db.posts.find({post_text:{$regex:"tutorialspoint"}})

The same query can also be written as −

>db.posts.find({post_text:/tutorialspoint/})

Using regex Expression with Case Insensitive

To make the search case insensitive, we use the $options parameter with value i. The following command will look for strings having the word tutorialspoint, irrespective of lower or upper case −

>db.posts.find({post_text:{$regex:"tutorialspoint",$options:"i"}})

One of the results returned from this query is the following document which contains the word tutorialspoint in different cases −

{
   "_id" : ObjectId("53493d37d852429c10000004"),
   "post_text" : "hey! this is my post on TutorialsPoint", 
   "tags" : [ "tutorialspoint" ]
} 

Using regex for Array Elements

We can also use regex on an array field. This is particularly important when we implement the functionality of tags. So, if you want to search for all the posts having tags beginning with the word tutorial (either tutorial or tutorials or tutorialpoint or tutorialphp), you can use the following code, where ^ anchors the match to the start of each tag −

>db.posts.find({tags:{$regex:"^tutorial"}})

Optimizing Regular Expression Queries

  • If the document fields are indexed, the query will make use of indexed values to match the regular expression. This makes the search very fast as compared to the regular expression scanning the whole collection.
  • If the regular expression is a prefix expression, all the matches are meant to start with certain characters. For example, if the regex expression is ^tut, then the query has to search for only those strings that begin with tut.
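
The matching behaviour described in this chapter can be tried out with Python's re module. This is a sketch only: Python's regex dialect is close to, but not identical to, PCRE, and find_by_regex together with the three sample posts is a hypothetical stand-in for a posts collection.

```python
import re

posts = [
    {"post_text": "enjoy the mongodb articles on tutorialspoint",
     "tags": ["mongodb", "tutorialspoint"]},
    {"post_text": "hey! this is my post on TutorialsPoint",
     "tags": ["tutorialspoint"]},
    {"post_text": "writing tutorials on mongodb",
     "tags": ["mongodb", "tutorial"]},
]


def find_by_regex(documents, field, pattern, options=""):
    """Hypothetical helper: match a pattern against a string field or an array field."""
    flags = re.IGNORECASE if "i" in options else 0
    compiled = re.compile(pattern, flags)
    matches = []
    for doc in documents:
        value = doc[field]
        # An array field matches when any one of its elements matches.
        candidates = value if isinstance(value, list) else [value]
        if any(compiled.search(item) for item in candidates):
            matches.append(doc)
    return matches


case_sensitive = find_by_regex(posts, "post_text", "tutorialspoint")
case_insensitive = find_by_regex(posts, "post_text", "tutorialspoint", "i")
prefix_tags = find_by_regex(posts, "tags", "^tut")
print(len(case_sensitive), len(case_insensitive), len(prefix_tags))  # 1 2 3
```

The case-insensitive search picks up the post written as TutorialsPoint, and the ^tut prefix pattern matches every post that has at least one tag starting with tut.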

Working with RockMongo

RockMongo is a MongoDB administration tool with which you can manage your server, databases, collections, documents, indexes, and a lot more. It provides a very user-friendly way for reading, writing, and creating documents. It is similar to the phpMyAdmin tool for PHP and MySQL.

Downloading RockMongo

You can download the latest version of RockMongo from here: http://rockmongo.com/downloads

Installing RockMongo

Once downloaded, you can unzip the package in your server root folder and rename the extracted folder to rockmongo. Open any web browser and access the index.php page from the folder rockmongo. Enter admin/admin as username/password respectively.

Working with RockMongo

We will now be looking at some basic operations that you can perform with RockMongo.

Creating New Database

To create a new database, click the Databases tab, then click Create New Database. On the next screen, provide the name of the new database and click on Create. You will see the new database added in the left panel.

Creating New Collection

To create a new collection inside a database, click on that database from the left panel. Click on the New Collection link on top. Provide the required name of the collection. Do not worry about the other fields of Is Capped, Size and Max. Click on Create. A new collection will be created and you will be able to see it in the left panel.

Creating New Document

To create a new document, click on the collection under which you want to add documents. When you click on a collection, you will be able to see all the documents within that collection listed there. To create a new document, click on the Insert link at the top. You can enter the document’s data either in JSON or array format and click on Save.

Export/Import Data

To import or export the data of any collection, click on that collection and then click on the Export/Import link on the top panel. Follow the instructions to export your data in zip format, and later import the same zip file to restore the data.

MongoDB – GridFS

GridFS is the MongoDB specification for storing and retrieving large files such as images, audio files, video files, etc. It works like a file system, but its data is stored within MongoDB collections. GridFS can store files larger than MongoDB’s document size limit of 16 MB.

GridFS divides a file into chunks and stores each chunk of data in a separate document, each with a maximum size of 255 KB by default.

By default, GridFS uses two collections, fs.files and fs.chunks, to store the file’s metadata and the chunks. Each chunk is identified by its unique _id ObjectId field. A document in fs.files serves as the parent document. The files_id field in each fs.chunks document links the chunk to its parent.

Following is a sample document of fs.files collection −

{
   "filename": "test.txt",
   "chunkSize": NumberInt(261120),
   "uploadDate": ISODate("2014-04-13T11:32:33.557Z"),
   "md5": "7b762939321e146569b07f72c62cca4f",
   "length": NumberInt(646)
}

The document specifies the file name, chunk size, upload date, MD5 checksum, and length.

Following is a sample document of fs.chunks collection −

{
   "files_id": ObjectId("534a75d19f54bfec8a2fe44b"),
   "n": NumberInt(0),
   "data": "Mongo Binary Data"
}

Adding Files to GridFS

Now, we will store an mp3 file in GridFS using the put command. For this, we will use the mongofiles.exe utility present in the bin folder of the MongoDB installation folder.

Open your command prompt, navigate to the mongofiles.exe in the bin folder of MongoDB installation folder and type the following code −

>mongofiles.exe -d gridfs put song.mp3

Here, gridfs is the name of the database in which the file will be stored. If the database is not present, MongoDB will automatically create it on the fly. song.mp3 is the name of the file uploaded. To see the file’s document in the database, you can use a find query −

>db.fs.files.find()

The above command returned the following document −

{
   _id: ObjectId('534a811bf8b4aa4d33fdf94d'),
   filename: "song.mp3",
   chunkSize: 261120,
   uploadDate: new Date(1397391643474),
   md5: "e4f53379c909f7bed2e9d631e15c1c41",
   length: 10401959
}

We can also see all the chunks present in fs.chunks collection related to the stored file with the following code, using the document id returned in the previous query −

>db.fs.chunks.find({files_id:ObjectId('534a811bf8b4aa4d33fdf94d')})

In my case, the query returned 40 documents, meaning that the whole mp3 file was divided into 40 chunks of data.
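The chunk count follows directly from the length and chunkSize fields of the fs.files document. The following sketch reproduces the arithmetic in plain JavaScript; the helper name chunkLayout is hypothetical and is not part of MongoDB or mongofiles.

```javascript
// Sketch: how many chunks a file of a given length occupies,
// and how large the final (partial) chunk is.
// chunkLayout is a hypothetical helper, not a MongoDB API.
function chunkLayout(length, chunkSize) {
  const count = Math.ceil(length / chunkSize);
  const lastChunkSize = count === 0 ? 0 : length - (count - 1) * chunkSize;
  return { count, lastChunkSize };
}

// Values taken from the song.mp3 document shown earlier.
console.log(chunkLayout(10401959, 261120)); // { count: 40, lastChunkSize: 218279 }

// Values taken from the sample test.txt document: a small file fits in one chunk.
console.log(chunkLayout(646, 261120)); // { count: 1, lastChunkSize: 646 }
```

With a length of 10401959 bytes and a chunk size of 261120 bytes, 39 chunks are full and the 40th holds the remaining 218279 bytes, which agrees with the 40 documents returned by the query.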

MongoDB – Capped Collections

Capped collections are fixed-size circular collections that follow the insertion order to support high performance for create, read, and delete operations. Circular means that when the fixed size allocated to the collection is exhausted, MongoDB starts deleting the oldest documents in the collection without any explicit commands.

Capped collections restrict updates to documents if the update results in increased document size. Since capped collections store documents in the order of the disk storage, this ensures that a document never outgrows the space allocated to it on disk. Capped collections are best for storing log information, cache data, or any other high-volume data.
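The circular behaviour can be pictured with a small in-memory model. This is a simplified sketch, not how MongoDB is implemented: the class CappedModel is hypothetical, and it caps the collection by document count only, whereas a real capped collection is limited by size in bytes (and optionally by a maximum document count).

```javascript
// Sketch: a simplified model of a capped collection.
// Assumption: the cap is a document count; real capped collections
// are bounded by bytes, with an optional document limit on top.
class CappedModel {
  constructor(max) {
    this.max = max;
    this.docs = []; // kept in insertion order
  }

  insert(doc) {
    this.docs.push(doc); // new documents always go to the tail
    if (this.docs.length > this.max) {
      this.docs.shift(); // the oldest document is dropped automatically
    }
  }

  // Documents come back in insertion order.
  find() {
    return this.docs.slice();
  }

  // Reverse insertion order, newest first.
  findNewestFirst() {
    return this.docs.slice().reverse();
  }
}

const log = new CappedModel(3);
["a", "b", "c", "d", "e"].forEach((msg) => log.insert({ msg }));

console.log(log.find().map((d) => d.msg)); // [ 'c', 'd', 'e' ]
console.log(log.findNewestFirst().map((d) => d.msg)); // [ 'e', 'd', 'c' ]
```

After five inserts into a collection capped at three documents, only the three most recent remain, and no explicit delete was issued.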

Creating Capped Collection

To create a capped collection, we use the normal createCollection command but with capped option as true and specifying the maximum size of collection in bytes.

>db.createCollection("cappedLogCollection",{capped:true,size:10000})

In addition to collection size, we can also limit the number of documents in the collection using the max parameter −

>db.createCollection("cappedLogCollection",{capped:true,size:10000,max:1000})

If you want to check whether a collection is capped or not, use the following isCapped command −

>db.cappedLogCollection.isCapped()

If there is an existing collection which you are planning to convert to capped, you can do it with the following code −

>db.runCommand({"convertToCapped":"posts",size:10000})

This code would convert our existing collection posts to a capped collection.

Querying Capped Collection

By default, a find query on a capped collection will display results in insertion order. But if you want the documents to be retrieved in reverse order, use the sort command as shown in the following code −

>db.cappedLogCollection.find().sort({$natural:-1})

There are a few other important points regarding capped collections worth knowing −

  • We cannot delete documents from a capped collection.
  • Since MongoDB 2.2, a capped collection has an index on the _id field by default; earlier versions created no default indexes on a capped collection, not even on the _id field.
  • While inserting a new document, MongoDB does not have to actually look for a place to accommodate new document on the disk. It can blindly insert the new document at the tail of the collection. This makes insert operations in capped collections very fast.
  • Similarly, while reading documents MongoDB returns the documents in the same order as present on disk. This makes the read operation very fast.

MongoDB – Auto-Increment Sequence

MongoDB does not have out-of-the-box auto-increment functionality, like SQL databases. By default, it uses the 12-byte ObjectId for the _id field as the primary key to uniquely identify the documents. However, there may be scenarios where we may want the _id field to have some auto-incremented value other than the ObjectId.

Since this is not a default feature in MongoDB, we will programmatically achieve this functionality by using a counters collection as suggested by the MongoDB documentation.

Using Counter Collection

Consider the following products document. We want the _id field to be an auto-incremented integer sequence 1, 2, 3, 4, up to n.

{
  "_id":1,
  "product_name": "Apple iPhone",
  "category": "mobiles"
}

For this, create a counters collection, which will keep track of the last sequence value for all the sequence fields.

>db.createCollection("counters")

Now, we will insert the following document in the counters collection with productid as its key −

{
  "_id":"productid",
  "sequence_value": 0
}

The field sequence_value keeps track of the last value of the sequence.

Use the following code to insert this sequence document in the counters collection −

>db.counters.insert({_id:"productid",sequence_value:0})

Creating JavaScript Function

Now, we will create a function getNextSequenceValue which will take the sequence name as its input, increment the sequence number by 1 and return the updated sequence number. In our case, the sequence name is productid.

>function getNextSequenceValue(sequenceName){

   var sequenceDocument = db.counters.findAndModify({
      query:{_id: sequenceName },
      update: {$inc:{sequence_value:1}},
      new:true
   });
	
   return sequenceDocument.sequence_value;
}

Using the JavaScript Function

We will now use the function getNextSequenceValue while creating a new document and assigning the returned sequence value as document’s _id field.

Insert two sample documents using the following code −

>db.products.insert({
   "_id":getNextSequenceValue("productid"),
   "product_name":"Apple iPhone",
   "category":"mobiles"
})

>db.products.insert({
   "_id":getNextSequenceValue("productid"),
   "product_name":"Samsung S3",
   "category":"mobiles"
})

As you can see, we have used the getNextSequenceValue function to set value for the _id field.
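The counter logic can also be tried without a database. The sketch below is an in-memory stand-in written under stated assumptions: the Map named counters replaces the counters collection, and nextSequenceValue is a hypothetical helper that mimics what the $inc update with new:true returns. It does not show the atomicity that findAndModify provides on a real server.

```javascript
// Sketch: in-memory model of the counters collection.
// Assumption: a Map replaces the collection; this is not MongoDB code.
const counters = new Map([["productid", { sequence_value: 0 }]]);

// Increment the named sequence by 1 and return the updated value,
// mirroring an $inc update that returns the new document.
function nextSequenceValue(sequenceName) {
  const counter = counters.get(sequenceName);
  if (!counter) {
    throw new Error("unknown sequence: " + sequenceName);
  }
  counter.sequence_value += 1;
  return counter.sequence_value;
}

const products = [
  { _id: nextSequenceValue("productid"), product_name: "Apple iPhone", category: "mobiles" },
  { _id: nextSequenceValue("productid"), product_name: "Samsung S3", category: "mobiles" },
];

console.log(products.map((p) => p._id)); // [ 1, 2 ]
```

Each call advances the stored value before returning it, so the first document receives 1 and the second receives 2.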

To verify the functionality, let us fetch the documents using find command −

>db.products.find()

The above query returned the following documents having the auto-incremented _id field −

{ "_id" : 1, "product_name" : "Apple iPhone", "category" : "mobiles"}

{ "_id" : 2, "product_name" : "Samsung S3", "category" : "mobiles" }

Advanced Jenkins Interview Questions And Answers 2018

Jenkins

Top 28 Jenkins Interview Questions And Answers For Experienced 2018. If you are looking for Jenkins interview questions with answers, then you are at the right place. Here Coding Compiler shares a list of 28 real-time interview questions on Jenkins. These Jenkins interview questions for DevOps will help you to crack your next Jenkins job interview. Happy reading and all the best for your future.

Jenkins Interview Questions
What is Jenkins?
Why do we use Jenkins?
What is Maven and what is Jenkins?
What is the difference between Hudson and Jenkins?
What is meant by continuous integration in Jenkins?
Why do we use Jenkins with selenium?
What are CI Tools?
What is a CI CD pipeline?
What is build pipeline in Jenkins?
What is a Jenkins pipeline?
What is a DSL Jenkins?
What is continuous integration and deployment?
What is the tool used for provisioning and configuration?
What is the difference between Maven, Ant, and Jenkins?
Which SCM tools does Jenkins support?
How do you schedule a build in Jenkins?
Why do we use Pipelines in Jenkins?
What is a Jenkinsfile?
How do you create Multibranch Pipeline in Jenkins?
What is the blue ocean in Jenkins?
What are the important plugins in Jenkins?
What are Jobs in Jenkins?
How do you create a Job in Jenkins?
How do you configure automatic builds in Jenkins?
How to create a backup and copy files in Jenkins?
Jenkins Interview Questions And Answers
Jenkins is open source software.
Jenkins is an automation server.
Jenkins helps to automate the software development process.
Jenkins automates the process with continuous integration and facilitates technical aspects of continuous delivery.
Jenkins is a fork of a project called Hudson.
Jenkins is released under the MIT License.
Jenkins is written in Java.

# 1) What is Jenkins?

Answer # Jenkins is an open source automation server and a continuous integration tool developed in Java. Jenkins helps to automate the non-human part of the software development process, with continuous integration and by facilitating technical aspects of continuous delivery.

# 2) Why do we use Jenkins?

Answer # Jenkins is an open-source continuous integration software tool written in the Java programming language for testing and reporting on isolated changes in a larger code base in real time. The Jenkins software enables developers to find and solve defects in a code base rapidly and to automate testing of their builds.

# 3) What is Maven and what is Jenkins?

Answer # Maven is a build tool, in short a successor of Ant. It helps with builds and dependency management. Jenkins, however, is a continuous integration system in which Maven is used for the build. Jenkins can also be used to automate the deployment process.

# 4) What is the difference between Hudson and Jenkins?

Answer # Jenkins is the new Hudson. It really is more like a rename, not a fork, since the whole development community moved to Jenkins. (Oracle is left sitting in a corner holding their old ball “Hudson“, but it’s just a soul-less project now.). In a nutshell Jenkins CI is the leading open-source continuous integration server.

# 5) What is meant by continuous integration in Jenkins?

Answer # Continuous integration is a process in which all development work is integrated as early as possible. The resulting artifacts are automatically created and tested. This process allows errors to be identified as early as possible. Jenkins is a popular open source tool to perform continuous integration and build automation.

# 6) Why do we use Jenkins with selenium?

Answer # Running Selenium tests in Jenkins allows you to run your tests every time your software changes and deploy the software to a new environment when the tests pass. Jenkins can also schedule your tests to run at a specific time.

# 7) What are CI Tools?

Answer # Here is the list of the top 8 Continuous Integration tools:

Jenkins
TeamCity
Travis CI
Go CD
Bamboo
GitLab CI
CircleCI
Codeship

# 8) What is a CI CD pipeline?

Answer # A continuous integration and continuous deployment (CI/CD) pipeline is an important aspect of a software project. It saves a ton of manual, error-prone deployment work. It results in higher-quality software through continuous integration, automated tests, and code metrics.

# 9) What is build pipeline in Jenkins?

Answer # Job chaining in Jenkins is the process of automatically starting other job(s) after the execution of a job. This approach lets you build multi-step build pipelines or trigger the rebuild of a project if one of its dependencies is updated.

# 10) What is a Jenkins pipeline?

Answer # The Jenkins Pipeline plugin is a game changer for Jenkins users. Based on a Domain Specific Language (DSL) in Groovy, the Pipeline plugin makes pipelines scriptable and it is an incredibly powerful way to develop complex, multi-step DevOps pipelines.

# 11) What is a DSL Jenkins?

Answer # The Jenkins “Job DSL / Plugin” is made up of two parts: The Domain Specific Language (DSL) itself that allows users to describe jobs using a Groovy-based language, and a Jenkins plugin which manages the scripts and the updating of the Jenkins jobs which are created and maintained as a result.

# 12) What is continuous integration and deployment?

Answer # Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.

# 13) What is the tool used for provisioning and configuration?

Answer # Ansible is an agent-less configuration management as well as orchestration tool. In Ansible, the configuration modules are called “Playbooks”. Like other tools, Ansible can be used for cloud provisioning.

# 14) What is the difference between Maven, Ant and Jenkins?

Answer # Maven and Ant are both build tools, but the main difference is that Maven also provides dependency management, a standard project layout, and project management. As for the difference between Maven, Ant, and Jenkins, the latter is a continuous integration tool, which is much more than a build tool.

# 15) Which SCM tools does Jenkins support?

Answer # Jenkins supports version control tools, including AccuRev, CVS, Subversion, Git, Mercurial, Perforce, ClearCase and RTC, and can execute Apache Ant, Apache Maven and sbt based projects as well as arbitrary shell scripts and Windows batch commands.

# 16) How do you schedule a build in Jenkins?

Answer # In Jenkins, under the job configuration we can define various build triggers. Simply find the ‘Build Triggers’ section and check the ‘Build Periodically’ checkbox. With a periodic build you can schedule the build by the date or day of the week and the time to execute the build.

The format of the ‘Schedule’ textbox is as follows:

MINUTE (0-59), HOUR (0-23), DAY (1-31), MONTH (1-12), DAY OF THE WEEK (0-7)
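To make the field order concrete, here is a minimal sketch that splits a schedule line into the five fields listed above and checks each numeric value against its range. It is not Jenkins code: parseSchedule and SCHEDULE_FIELDS are hypothetical names, the fields are assumed to be separated by whitespace as in cron, and only plain numbers and * are handled (ranges, lists, steps, and other extensions are deliberately left out).

```javascript
// Field order and numeric ranges as listed in the answer above.
const SCHEDULE_FIELDS = [
  { name: "minute", min: 0, max: 59 },
  { name: "hour", min: 0, max: 23 },
  { name: "dayOfMonth", min: 1, max: 31 },
  { name: "month", min: 1, max: 12 },
  { name: "dayOfWeek", min: 0, max: 7 },
];

// Hypothetical helper: split a schedule line into named fields.
// Only "*" and plain integers are accepted in this sketch.
function parseSchedule(line) {
  const parts = line.trim().split(/\s+/);
  if (parts.length !== SCHEDULE_FIELDS.length) {
    throw new Error("expected 5 fields, got " + parts.length);
  }
  const schedule = {};
  SCHEDULE_FIELDS.forEach((field, i) => {
    const value = parts[i];
    if (value !== "*") {
      const n = Number(value);
      if (!Number.isInteger(n) || n < field.min || n > field.max) {
        throw new Error(field.name + " out of range: " + value);
      }
    }
    schedule[field.name] = value;
  });
  return schedule;
}

// Minute 30, hour 2, any day of month, any month, day of week 1:
// a build at 02:30 every Monday.
console.log(parseSchedule("30 2 * * 1"));
```

Reading the fields left to right in this fixed order is usually enough to decode a schedule line at a glance.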

# 17) Why do we use Pipelines in Jenkins?

Answer # Pipeline adds a powerful set of automation tools onto Jenkins, supporting use cases that span from simple continuous integration to comprehensive continuous delivery pipelines. By modeling a series of related tasks, users can take advantage of the many features of Pipeline:

Code: Pipelines are implemented in code and typically checked into source control, giving teams the ability to edit, review, and iterate upon their delivery pipeline.
Durable: Pipelines can survive both planned and unplanned restarts of the Jenkins master.
Pausable: Pipelines can optionally stop and wait for human input or approval before continuing the Pipeline run.
Versatile: Pipelines support complex real-world continuous delivery requirements, including the ability to fork/join, loop, and perform work in parallel.
Extensible: The Pipeline plugin supports custom extensions to its DSL and multiple options for integration with other plugins.

# 18) What is a Jenkinsfile?

Answer # A Jenkinsfile is a text file that contains the definition of a Jenkins Pipeline and is checked into source control.

Creating a Jenkinsfile, which is checked into source control, provides a number of immediate benefits:

Code review/iteration on the Pipeline
Audit trail for the Pipeline
Single source of truth for the Pipeline, which can be viewed and edited by multiple members of the project.

# 19) How do you create Multibranch Pipeline in Jenkins?

Answer # The Multibranch Pipeline project type enables you to implement different Jenkinsfiles for different branches of the same project. In a Multibranch Pipeline project, Jenkins automatically discovers, manages and executes Pipelines for branches which contain a Jenkinsfile in source control.

# 20) What is blue ocean in Jenkins?

Answer # Blue Ocean is a project that rethinks the user experience of Jenkins, modelling and presenting the process of software delivery by surfacing information that’s important to development teams with as few clicks as possible, while still staying true to the extensibility that is core to Jenkins.

# 21) What are the important plugins in Jenkins?

Answers # Here is the list of some important Plugins in Jenkins:

Maven 2 project
Git
Amazon EC2
HTML publisher
Copy artifact
Join
Green Balls

#22) What are Jobs in Jenkins?

Answer # Jenkins can be used to perform the typical build server work, such as doing continuous/official/nightly builds, run tests, or perform some repetitive batch tasks. This is called “free-style software project” in Jenkins.

# 23) How do you create a Job in Jenkins?

Answer # Go to Jenkins top page, select “New Job”, then choose “Build a free-style software project”. This job type consists of the following elements:

optional SCM, such as CVS or Subversion, where your source code resides.
optional triggers to control when Jenkins will perform builds.
some sort of build script that performs the build (ant, maven, shell script, batch file, etc.) where the real work happens.
optional steps to collect information out of the build, such as archiving the artifacts and/or recording javadoc and test results.
optional steps to notify other people/systems with the build result, such as sending e-mails, IMs, updating issue tracker, etc.

# 24) How do you configure automatic builds in Jenkins?

Answer # Builds in Jenkins can be triggered periodically (on a schedule, specified in configuration), or when source changes in the project have been detected, or they can be automatically triggered by requesting the URL:

http://YOURHOST/jenkins/job/PROJECTNAME/build

# 25) How to create a backup and copy files in Jenkins?

Answer # To create a backup, all you need to do is to periodically back up your JENKINS_HOME directory. This contains all of your build jobs configurations, your slave node configurations, and your build history. To create a back-up of your Jenkins setup, just copy this directory.

#26) What is the trustAnchors parameter must be non-empty error and how can you solve it?

A) This trustAnchors parameter must be non-empty error means that the truststore you specified was not found, or couldn’t be opened due to access permissions for example.

EJP basically answered the question (and I realize this has an accepted answer) but I just dealt with this edge-case gotcha and wanted to immortalize my solution. I had the InvalidAlgorithmParameterException error on a hosted jira server that I had previously set up for SSL-only access.

The issue was that I had set up my keystore in the PKCS#12 format, but my truststore was in the JKS format. In my case, I had edited my server.xml file to specify the keystoreType to PKCS, but did not specify the truststoreType, so it defaults to whatever the keystoreType is. Specifying the truststoreType explicitly as JKS solved it for me.

#27) What are the feature differences between Jenkins and Hudson?

A) Jenkins is the recent fork by the core developers of Hudson. To understand why, you need to know the history of the project. It was originally open source and supported by Sun. Like much of what Sun did, it was fairly open, but there was a bit of benign neglect. The source, trackers, website, etc. were hosted by Sun on their relatively closed java.net platform.

Then Oracle bought Sun. For various reasons Oracle has not been shy about leveraging what it perceives as its assets. Those include some control over the logistic platform of Hudson, and particularly control over the Hudson name. Many users and contributors weren’t comfortable with that and decided to leave.

So it comes down to what Hudson vs Jenkins offers. Both Oracle’s Hudson and Jenkins have the code. Hudson has Oracle and Sonatype’s corporate support and the brand. Jenkins has most of the core developers, the community, and (so far) much more actual work.

In fact, arguably it was Oracle who did the forking! And technically, too, that’s kinda what happened.

It’s interesting to see what comes out of “Hudson” though. While the “Winston summarizes the state and rosy future of the Hudson project” stuff they posted on the (new) Hudson website originally seemed like odd humour to me, perhaps this was a purposeful takeover, and the Sonatype guys actually have some big ideas up their sleeve. This analysis, suggesting a deliberate strategy by Oracle/Sonatype to oust Kohsuke and crew to create a more “enterprisy” Hudson is a very interesting read!

In any case, this brief comparison a fortnight after the split—while not exactly scientific—shows Jenkins to be by far more active of the two projects.

Jenkins has continued the path well-trodden by the original Hudson with frequent releases including many minor updates.

Oracle seems to have largely delegated work on the future path for Hudson to the Sonatype team, who has performed some significant changes, especially with respect to Maven. They have jointly moved it to the Eclipse foundation.

I would suggest that if you like the sound of:

Less frequent releases but ones that are more heavily tested for backwards compatibility (more of an “enterprise-style” release cycle)

A product focused primarily on strong Maven and/or Nexus integration (i.e., you have no interest in Gradle and Artifactory etc)

Professional support offerings from Sonatype or maybe Oracle in preference to Cloudbees etc

You don’t mind having a smaller community of plugin developers etc.

then I would suggest Hudson.

Conversely, if you prefer:

More frequent updates, even if they require a bit more frequent tweaking and are perhaps slightly riskier in terms of compatibility (more of a “latest and greatest” release cycle)

A system with more active community support for e.g., other build systems / artifact repositories

Support offerings from the original creator et al. and/or you have no interest in professional support (e.g., you’re happy as long as you can get a fix in next week’s “latest and greatest”)

A classical OSS-style witches’ brew of a development ecosystem

then I would suggest Jenkins.

#28) How to trigger a build remotely from Jenkins? How to configure Git post commit hook?

The requirement is whenever changes are made in the Git repository for a particular project it will automatically start Jenkins build for that project.

A) As mentioned in “Polling must die: triggering Jenkins builds from a git hook”, you can notify Jenkins of a new commit:

With the latest Git plugin 1.1.14 (that I just released now), you can now do this more easily by simply executing the following command:

curl http://yourserver/jenkins/git/notifyCommit?url=
This will scan all the jobs that’s configured to check out the specified URL, and if they are also configured with polling, it’ll immediately trigger the polling (and if that finds a change worth a build, a build will be triggered in turn.)

This allows a script to remain the same when jobs come and go in Jenkins.
Or if you have multiple repositories under a single repository host application (such as Gitosis), you can share a single post-receive hook script with all the repositories. Finally, this URL doesn’t require authentication even for secured Jenkins, because the server doesn’t directly use anything that the client is sending. It runs polling to verify that there is a change, before it actually starts a build.

As mentioned here, make sure to use the right address for your Jenkins server:

since we’re running Jenkins as standalone Webserver on port 8080 the URL should have been without the /jenkins, like this:

http://jenkins:8080/git/notifyCommit?url=git@gitserver:tools/common.git
To reinforce that last point, ptha adds in the comments:

It may be obvious, but I had issues with:

curl http://yourserver/jenkins/git/notifyCommit?url=.
The url parameter should match exactly what you have in Repository URL of your Jenkins job.
When copying examples I left out the protocol, in our case ssh://, and it didn’t work.
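Because the url parameter has to match the job's Repository URL exactly, protocol included, it can help to build the notification URL in one place. The sketch below is an illustration under assumptions: notifyCommitUrl is a hypothetical helper, the host and repository values are the placeholders used in the text above, and percent-encoding the repository URL with encodeURIComponent is a choice made here, not something the quoted sources prescribe.

```javascript
// Hypothetical helper: build the notifyCommit URL for a Jenkins base URL
// and the repository URL exactly as configured in the Jenkins job.
function notifyCommitUrl(jenkinsBase, repositoryUrl) {
  // Drop trailing slashes so the path is joined with a single "/".
  const base = jenkinsBase.replace(/\/+$/, "");
  // Assumption: the repository URL is percent-encoded as a query value.
  return base + "/git/notifyCommit?url=" + encodeURIComponent(repositoryUrl);
}

// Standalone Jenkins on port 8080, so there is no /jenkins prefix.
console.log(notifyCommitUrl("http://jenkins:8080", "git@gitserver:tools/common.git"));
// http://jenkins:8080/git/notifyCommit?url=git%40gitserver%3Atools%2Fcommon.git

// Jenkins deployed under /jenkins, repository URL including its protocol.
console.log(notifyCommitUrl("http://yourserver/jenkins/", "ssh://git@gitserver/tools/common.git"));
// http://yourserver/jenkins/git/notifyCommit?url=ssh%3A%2F%2Fgit%40gitserver%2Ftools%2Fcommon.git
```

Keeping the base URL and the repository URL as separate inputs makes both pitfalls mentioned above (the /jenkins prefix and the missing protocol) visible at the call site.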

You can also use a simple post-receive hook like in “Push based builds using Jenkins and GIT”

#!/bin/bash
/usr/bin/curl --user USERNAME:PASS -s \
  "http://jenkinsci/job/PROJECTNAME/build?token=1qaz2wsx"
Configure your Jenkins job to be able to “Trigger builds remotely” and use an authentication token (1qaz2wsx in this example).

However, this is a project-specific script, and the author mentions a way to generalize it.
The first solution is easier as it doesn’t depend on authentication or a specific project.

The build should start only if the change set contains at least one Java file.
If the developers changed only XML files or property files, then the build should not start.

Basically, your build script can:

put a ‘build’ note (see git notes) on the first call
on subsequent calls, grab the list of files changed between the commit referenced by the git notes ‘build’ (git show refs/notes/build) and the HEAD of your branch candidate for build: git diff --name-only SHA_build HEAD.
your script can parse that list and decide if it needs to go on with the build.
in any case, create/move your git notes ‘build’ to HEAD.
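The decision step in the list above can be sketched as follows. The input is assumed to be the text printed by a name-only diff, one path per line; parseChangedFiles and shouldBuild are hypothetical helper names, and the sample paths are made up for illustration.

```javascript
// Hypothetical helper: turn name-only diff output (one path per line)
// into a list of non-empty paths.
function parseChangedFiles(diffOutput) {
  return diffOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Build only when at least one changed file is a Java source file.
function shouldBuild(changedFiles) {
  return changedFiles.some((file) => file.toLowerCase().endsWith(".java"));
}

// Sample outputs, invented for this sketch.
const mixedChange = "src/main/java/App.java\nconfig/settings.xml\n";
const configOnlyChange = "pom.xml\napp.properties\n";

console.log(shouldBuild(parseChangedFiles(mixedChange))); // true
console.log(shouldBuild(parseChangedFiles(configOnlyChange))); // false
```

A build script could run this check first and exit early when it returns false, while still moving the ‘build’ note to HEAD as the last step describes.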

What is Data Management

A Definition of Data Management

Data management is an administrative process that includes acquiring, validating, storing, protecting, and processing required data to ensure the accessibility, reliability, and timeliness of the data for its users. Organizations and enterprises are making use of Big Data more than ever before to inform business decisions and gain deep insights into customer behavior, trends, and opportunities for creating extraordinary customer experiences.

To make sense of the vast quantities of data that enterprises are gathering, analyzing, and storing today, companies turn to data management solutions and platforms. Data management solutions make processing, validation, and other essential functions simpler and less time-intensive.

Leading data management platforms allow enterprises to leverage Big Data from all data sources, in real-time, to allow for more effective engagement with customers, and for increased customer lifetime value (CLV). Data management software is essential, as we are creating and consuming data at unprecedented rates. Top data management platforms give enterprises and organizations a 360-degree view of their customers and the complete visibility needed to gain deep, critical insights into consumer behavior that give brands a competitive edge.

Data Management Challenges

While some companies are good at collecting data, they are not managing it well enough to make sense of it. Simply collecting data is not enough; enterprises and organizations need to understand from the start that data management and data analytics will only be successful when they first put some thought into how they will gain value from their raw data. They can then move beyond raw data collection with efficient systems for processing, storing, and validating data, as well as effective analysis strategies.

Another challenge of data management occurs when companies categorize data and organize it without first considering the answers they hope to glean from the data. Each step of data collection and management must lead toward acquiring the right data and analyzing it in order to get the actionable intelligence necessary for making truly data-driven business decisions.

Data Management Best Practices

The best way to manage data, and eventually get the insights needed to make data-driven decisions, is to begin with a business question and acquire the data that is needed to answer that question. Companies must collect vast amounts of information from various sources and then utilize best practices while going through the process of storing and managing the data, cleaning and mining the data, and then analyzing and visualizing the data in order to inform their business decisions.

It’s important to keep in mind that data management best practices result in better analytics. By correctly managing and preparing the data for analytics, companies optimize their Big Data. A few data management best practices organizations and enterprises should strive to achieve include:

  • Simplify access to traditional and emerging data
  • Scrub data to infuse quality into existing business processes
  • Shape data using flexible manipulation techniques

It is with the help of data management platforms that organizations have the ability to gather, sort, and house their information and then repackage it in visualized ways that are useful to marketers. Top performing data management platforms are capable of managing all of the data from all data sources in a central location, giving marketers and executives the most accurate business and customer information available.

Benefits of Data Management and Data Management Platforms

Managing your data is the first step toward handling the large volume of data, both structured and unstructured, that floods businesses daily. It is only through data management best practices that organizations are able to harness the power of their data and gain the insights they need to make the data useful.

In fact, data management via leading data management platforms enables organizations and enterprises to use data analytics in beneficial ways, such as:

  • Personalizing the customer experience
  • Adding value to customer interactions
  • Identifying the root causes of marketing failures and business issues in real time
  • Reaping the revenues associated with data-driven marketing
  • Improving customer engagement
  • Increasing customer loyalty

Hadoop Distributed File System

This post discusses the Hadoop Distributed File System (HDFS).
Hadoop is divided into two parts:
1) HDFS – Storing files
2) MapReduce – Processing files
HDFS: It is a special file system for storing large data on a cluster of commodity hardware with a streaming access pattern. A streaming access pattern means you can write a file once and read it any number of times, but you cannot change the content of the file once it is stored in HDFS.
HDFS is suitable for batch processing: data access is optimized for high throughput rather than low latency. It supports very large data sets. It manages file storage across multiple disks, each of which is attached to a different machine in the cluster.

Difference – Unix file system and HDFS
In a Unix file system the default block size is 4KB. Suppose your file is 6KB; then you need 2 blocks of 4KB each. A total of 8KB is allocated even though you need only 6KB, so the extra 2KB is wasted. In HDFS the default block size is 64MB (128MB in Hadoop 2.x). Suppose your file is 200MB; then HDFS requires 4 blocks (3 of 64MB + 1 of 8MB). The remaining 56MB of the last block is not occupied; instead, that space is released.
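
The arithmetic above can be checked with a short Python sketch. The helper names `wasted_space` and `split_into_blocks` are made up for this illustration and are not part of any Hadoop API; the sketch only reproduces the 6KB and 200MB examples from the text.

```python
KB = 1024
MB = 1024 * 1024


def wasted_space(file_size, block_size):
    """Space lost when every block is allocated at full size (Unix-style)."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size - file_size


def split_into_blocks(file_size, block_size):
    """Sizes of the blocks when the last block only takes what it needs (HDFS-style)."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# 6KB file with 4KB blocks: 2 blocks allocated, 2KB wasted.
print(wasted_space(6 * KB, 4 * KB) // KB)

# 200MB file with 64MB blocks: three full blocks plus one 8MB block.
print([size // MB for size in split_into_blocks(200 * MB, 64 * MB)])
```

The second call shows why the last block does not waste 56MB: only the 8MB actually written is stored.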

Why is the block size 64MB (128MB)?
The NameNode maintains metadata for each block in HDFS. If the block size were kept small, the metadata itself would take up relatively much more space. To reduce this overhead, the block size is kept large. A large block size also reduces network traffic, and Hadoop can fetch 64MB of data at a time for processing.

Services in Hadoop
1. NameNode
2. DataNode
3. NodeManager
4. ResourceManager
5. SecondaryNameNode

The NameNode contains metadata about the DataNodes. It is basically like a table of contents (index table). It maintains the directory structure. Any request from a client passes through the NameNode. The DataNodes contain the actual data.
In production, multiple NameNodes are present. Each stores the metadata and block mapping of its files and maintains their directory structure. The list of subdirectories managed by a NameNode is called a namespace volume. The blocks for files belonging to a namespace are called a block pool. If one NameNode fails, the namespace volumes managed by the other NameNodes are still accessible, so the entire cluster does not go down.

 

 


Storing files in HDFS:

 

Large text files, gigabytes to petabytes in size, are broken into data blocks. Each block is the same size. HDFS always deals with blocks only. High fault tolerance is achieved by replicating blocks; the default replication factor is 3. Equal-sized blocks keep processing time predictable.
There is a trade-off with respect to block size:
  • Increase block size – Reduce parallelism
  • Small block size   – Increase overhead to maintain metadata
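
To make the trade-off concrete, here is a rough sketch that counts the blocks needed for 1TB of data at a few block sizes. It assumes, for illustration only, that the NameNode keeps one metadata entry per block and that at most one task works on each block; the function name `block_count` is hypothetical.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def block_count(data_size, block_size):
    """Number of blocks (ceiling division) needed to hold data_size bytes."""
    return -(-data_size // block_size)


# Smaller blocks mean more metadata entries; larger blocks mean fewer
# blocks that can be processed in parallel.
for block_mb in (1, 64, 128):
    blocks = block_count(1 * TB, block_mb * MB)
    print(f"{block_mb} MB blocks: {blocks} blocks to track and to process")
```

With 1MB blocks the NameNode would have to track over a million blocks for a single terabyte, while 128MB blocks cut that to a few thousand at the cost of fewer units of parallel work.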

The time taken to read a block of data from disk is broken into 2 parts:
  • Use the metadata in the NameNode to look up block locations
  • Read the block from the respective location

Detail working of HDFS:

1. The client has a file of, say, 200MB and wants to put it in HDFS.
2. The client sends a request to the NameNode for a list of available blocks.
3. The NameNode, in response, provides the list of available blocks and maintains the metadata of the file (e.g. block numbers 1, 3, 5, 7).
4. The client sends the file to the available blocks and HDFS internally splits the file according to the block size.
5. This 200MB file is divided into a.txt (64MB), b.txt (64MB), c.txt (64MB) and d.txt (8MB). The default replication factor in HDFS is 3. The client puts a.txt in DataNode 1 (block 1), and this block is replicated into blocks 2 and 4.
6. Block 4 sends an acknowledgement (ack) to block 2, and block 2 sends an ack to block 1.
7. Block 1 gives an ack to the client. This ack contains information about the replication of the block 1 data.
8. At a fixed interval, every DataNode sends a block report and a heartbeat (the default heartbeat interval is 3 seconds) to the NameNode.
9. The block report contains the metadata of the blocks, and the heartbeat signals that the node is alive.
10. If any DataNode in the cluster fails, the NameNode uses another node that holds the replicated data and arranges for another free node to maintain the replication factor.
11. Next, the JobTracker comes into the picture.
12. The JobTracker needs data for processing, so it contacts the NameNode for block information. The JobTracker sends code to the particular DataNode for processing.
13. As in master-slave communication, the JobTracker cannot contact a DataNode directly; it contacts the TaskTracker.
14. Map: The process where the JobTracker sends information to the TaskTracker.
15. The TaskTracker then communicates with the DataNode and processes the data available in the DataNode.
16. The Reducer resides on any of the DataNodes and combines all output files.
17. Along with the heartbeat, information about the output file is stored in the metadata.
18. The client learns the output information by reading the metadata and fetches the output file directly.
19. If a DataNode fails, the JobTracker assigns the task to another DataNode where the replicated data is available.
How does the JobTracker track whether a TaskTracker is alive? The TaskTracker sends a heartbeat to the JobTracker every 3 seconds.
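
Steps 5 to 7 (data flowing forward through the replicas, acknowledgements flowing back) can be pictured with a small simulation. This is only a toy model under stated assumptions: the node names and the `write_block` function are invented for the example, and no real HDFS client code is involved.

```python
def write_block(block_name, pipeline):
    """Simulate writing one block through a replication pipeline.

    Data travels client -> first node -> ... -> last node, and the
    acknowledgements travel back along the same path in reverse.
    """
    events = []

    # Forward pass: each node receives the block from the previous hop.
    sender = "client"
    for node in pipeline:
        events.append(f"{sender} -> {node}: data {block_name}")
        sender = node

    # Backward pass: the last replica acks first, the client is acked last.
    receivers = ["client"] + pipeline[:-1]
    for node, receiver in zip(reversed(pipeline), reversed(receivers)):
        events.append(f"{node} -> {receiver}: ack {block_name}")

    return events


# Replication factor 3: the block lands on three (hypothetical) nodes.
for event in write_block("a.txt", ["node1", "node2", "node4"]):
    print(event)
```

The client only sees a single ack from the first node, which matches step 7: that ack stands for the whole replication chain.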

 

 

 

 

What is Big Data

Big Data is “the next frontier for innovation”.
Data that is beyond the storage and processing capacity of conventional database management systems is called “Big Data”. A huge amount of data, measured in petabytes, is generated daily, and the rate of data generation is increasing rapidly.

Characterization of BIG-DATA by the “4 V’s”
Volume: It is very common for enterprises to have terabytes or petabytes of storage. (Volume is simply the size of the data: MB, TB, PB, EB, ZB, YB…)
Velocity: The speed at which data moves through the network for processing.
Variety: Structured, semi-structured, and unstructured data.
Veracity: Uncertainty of data.

Sources of BIG-DATA
The data comes from various sources: transactions, social media, sensors, digital images, CCTV cameras, online shopping, airline black boxes, videos, audio, search engines, and click-streams, for domains including healthcare, retail, energy, and utilities. About 90% of all the data available in the world was generated in the last decade. Ex. New York Stock Exchange – 1 TB/day, Facebook – 1 PB/day, Internet Archive – 20 TB/month, Large Hadron Collider near Geneva – 15 PB/year.
Where is the use of BIG-DATA
1. Understanding and Targeting Customers.
2. Understanding and Optimizing Business Processes.
3. Improving Healthcare and Public Health.
4. Improving Science and Research.
5. Optimizing Machine and Device Performance.
6. Financial trading, and many other fields.
Different types of Data
1. Structured Data :
All data that can be stored in a database in row and column format, i.e. in a relational database. It is very simple to manage. Structured data is only 5-10% of all data.
2. Semi-structured Data :
Semi-structured data doesn’t reside in an RDBMS but has some organizational properties that make it easier to analyze. Ex. log files, CSV, XML.
3. Unstructured Data :
All remaining data is considered unstructured data. It includes video, images, email, photos, audio, web pages and much more. It doesn’t fit neatly into a database. Unstructured data makes up 80% of all data, and it is growing exponentially faster than the other types of data. This data is either machine generated or human generated.
Machine-generated data: Satellite images, scientific data, photographs, videos, radar or sensor data, and so on.
Human-generated unstructured data: Mobile data, website data, social media data, text data, and so on.
Challenges with BIG-DATA

1) Capturing & Storing the data. (Collection and Storage)
2) Understanding and analysis of the data. (Data Analysis)
3) Synchronization across the Data Sources. (Data Transfer)
4) Getting and displaying meaningful Information out of that data. (Visualization)
Limitations of RDBMS
1) An RDBMS cannot handle huge data volumes properly; it needs the database management system to be scaled up vertically.
2) The majority of data comes in a semi-structured or unstructured format. An RDBMS can handle only structured data.
3) Big Data is generated at very high velocity. An RDBMS struggles with high velocity because it is designed for steady data retention rather than rapid growth. Even if an RDBMS is used to handle and store “big data,” it will turn out to be very expensive.
Tools for BIG-DATA

NoSQL: MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Zookeeper
MapReduce: Hadoop, Hive, Pig, Cascading, Cascalog, Caffeine, S4, MapR, Flume, Kafka, Oozie
Storage: S3, Hadoop Distributed File System.
Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
Processing: R, Yahoo! Pipes, Mechanical Turk, ElasticSearch, Datameer, BigSheets, Tinkerpop
Applications of BIG-DATA
Recommendation
Online advertising
Stock exchange analysis
Social networking analysis
Spam filtering
Telecommunication network monitoring and much more.
Case Study
Facebook:
Current Storage = 300 PB
Process/Day = 600TB
User/Month = 1 billion
Like/Day = 2.7 billion
Photo uploaded/Day = 300 million

NSA:
Current Storage = 5 EB
Process/Day = 30 PB
NSA touches 1.6% of internet traffic/day
(web searches, websites visited, phone calls, credit/debit card transactions, health, and finance info)

Google:
Current Storage = 15 EB
Process/Day = 100 PB
Searches/Second = 2.3 million
Unique Search/Month > 1 billion

LinkedIn:
Users: 37 million in 2009
Users: 450 million in 2016
Big Data System Requirement
To understand how the tools used to process large amounts of data work, you must understand how a distributed computing framework works.
Storage: Store the massive amount of data
Process: Process the data in a timely manner
Scale: Scale easily as data grows
Traditional data technologies are not able to handle and process such huge amounts of data.

Approaches to solve Big Data problems
We can solve big data problems in two ways:
Scale up
Scale out

Challenges with scaling up:
Complex
Costly
Less reliable
Challenges with scaling out:
Co-ordination between machines
Handling failure of machines
Hadoop is the solution for handling Big Data.


YARN in Hadoop

YARN – “Yet Another Resource Negotiator”.
This article discusses YARN. With the introduction of YARN, Hadoop becomes more powerful. YARN was introduced in Hadoop 2.x. It provides advantages over previous versions of Hadoop, including better scalability, cluster utilization, and user agility. YARN provides full backward compatibility with existing MapReduce tasks and applications. The YARN project was started by the Apache community to give Hadoop the ability to run non-MapReduce programs on the Hadoop framework.
The figure illustrates how YARN fits into the new Hadoop ecosystem.
The ResourceManager runs on a single master node, and a NodeManager runs on each of the other nodes.

Services in Hadoop (2.x)

After installing Hadoop 2.x, format the NameNode, start all services with the start-all.sh command in the terminal, and then run the jps command. If all the services shown in fig. 2 are running, the installation of Hadoop was successful.
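
The check can also be scripted. The sketch below parses text in the usual `jps` output shape (one `<pid> <name>` pair per line) and reports which of the five services listed earlier are missing. The sample output and process IDs are made up, and `missing_services` is a hypothetical helper, not part of Hadoop.

```python
REQUIRED_SERVICES = {
    "NameNode",
    "DataNode",
    "NodeManager",
    "ResourceManager",
    "SecondaryNameNode",
}


def missing_services(jps_output):
    """Return the required Hadoop services that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # expected shape: "<pid> <process name>"
            running.add(parts[1])
    return REQUIRED_SERVICES - running


# Made-up sample of what jps might print on a single-node setup.
SAMPLE_OUTPUT = """\
4210 NameNode
4338 DataNode
4533 SecondaryNameNode
4690 ResourceManager
4821 NodeManager
5120 Jps
"""

missing = missing_services(SAMPLE_OUTPUT)
print("all services running" if not missing else f"missing: {sorted(missing)}")
```

On a real machine the text could come from `subprocess.run(["jps"], capture_output=True, text=True).stdout` instead of the sample string.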

In Hadoop 1.x, the JobTracker was responsible for both resource management and job scheduling/monitoring. The fundamental idea of YARN is to split these two major responsibilities of the JobTracker into separate daemons: a global ResourceManager and a per-application ApplicationMaster.

The NodeManager is the per-machine slave, which is responsible for launching the application’s containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager. The ResourceManager divides the resources among all the applications in the system. The ResourceManager has a pluggable scheduler component, which is responsible for allocating resources to the various running applications. The scheduler performs its scheduling function based on the resource requirements of an application by using the abstract notion of a resource container, which incorporates resource dimensions such as memory, CPU, disk, and network. The per-application ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.

Execution steps of YARN:

  • A container executes a specific application
  • A single NodeManager can have multiple containers
  • After a container has been assigned on a NodeManager, the ResourceManager starts the ApplicationMaster process within the container
  • Perform the computation required for the task
  • In MapReduce, the ApplicationMaster process is the mapper/reducer process
  • If the ApplicationMaster requires extra resources, the ApplicationMaster running on the NodeManager requests them from the ResourceManager; additional resources come in the form of containers
  • The ResourceManager scans the cluster and finds a free NodeManager
  • The requesting NodeManager has no information about the other free NodeManagers
  • The ApplicationMaster on the original node starts the ApplicationMaster on the newly assigned nodes.

Various scheduler options

  1. FIFO scheduler: Basically a simple “first come, first served” scheduler in which the Job-Tracker pulls jobs from a work queue, oldest job first.
  2. Capacity scheduler: The Capacity scheduler is another pluggable scheduler for YARN that allows multiple groups to securely share a large Hadoop cluster.
    • Capacity is distributed among different queues
    • Each queue is allocated a share of the cluster resources
    • A job can be submitted to a specific queue
    • Within a queue, FIFO scheduling is used
    • Default scheduler – Capacity Scheduler
  3. Fair scheduler: Fair scheduling is a method of assigning resources to applications such that all applications get, on average, an equal share of resources over time.

Configure scheduling policy:

$ vi yarn-site.xml
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

We can define different queues for development and production

$ vi etc/hadoop/capacity-scheduler.xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev, prod</value>
  <description>The queues at this level (root is the root queue).
  </description>
</property>
</property>

<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>

<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
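
A quick sanity check for a configuration like this is that the capacities of the queues under one parent add up to 100. The Python sketch below parses the same three properties and sums them. It assumes the properties are wrapped in a `<configuration>` root element, as Hadoop configuration files normally are; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three properties from above, wrapped in a <configuration> root (assumed).
CONFIG = """
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>dev, prod</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dev.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.prod.capacity</name>
    <value>70</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Read name/value pairs from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): prop.findtext("value").strip()
        for prop in root.iter("property")
    }


def queue_capacities(props, parent="root"):
    """Map each child queue of `parent` to its configured capacity."""
    prefix = f"yarn.scheduler.capacity.{parent}"
    queues = [name.strip() for name in props[f"{prefix}.queues"].split(",")]
    return {queue: float(props[f"{prefix}.{queue}.capacity"]) for queue in queues}


capacities = queue_capacities(load_properties(CONFIG))
total = sum(capacities.values())
print(capacities, "total:", total)
```

Here dev gets 30% and prod gets 70% of the cluster, so the total is 100.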

We can submit the job to the specific queue

$ hadoop jar sample.jar wordCount.Main -D mapreduce.job.queuename=prod input output

You can check the queue at localhost:8088/cluster

Containers:

A container is a collection of physical resources such as RAM, CPU cores, and disks on a single node. There can be multiple containers on a single node (or a single large one). Every node in the system is considered to be composed of multiple containers of a minimum size of memory (e.g., 512 MB or 1 GB) and CPU. The ApplicationMaster can request any container so as to occupy a multiple of the minimum size.
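
The "multiple of the minimum size" rule can be illustrated with a small rounding function. This is a simplified sketch of the idea only; the function name and the 1024 MB default are assumptions for the example, and the exact rounding behaviour in a real cluster depends on the scheduler and its configuration.

```python
def normalize_request(requested_mb, minimum_mb=1024):
    """Round a memory request up to the next multiple of the minimum container size."""
    if requested_mb <= 0 or minimum_mb <= 0:
        raise ValueError("sizes must be positive")
    multiples = -(-requested_mb // minimum_mb)  # ceiling division
    return multiples * minimum_mb


# A 1500 MB request with a 1024 MB minimum occupies two minimum-size units.
print(normalize_request(1500))                  # 2048
# A request that is already a multiple is left unchanged.
print(normalize_request(512, minimum_mb=512))   # 512
```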

NodeManager:

The NodeManager is YARN’s per-node “worker” agent, taking care of the individual compute nodes in a Hadoop cluster. Its duties include keeping up-to-date with the ResourceManager, overseeing application containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, log management, and auxiliary services that may be exploited by different YARN applications. On start-up, the NodeManager registers with the ResourceManager; it then sends heartbeats with its status and waits for instructions. Its primary goal is to manage application containers assigned to it by the ResourceManager.

ApplicationMaster(AM):

The AM is the process that coordinates an application’s execution in the cluster. Each application has its own unique AM, which is tasked with negotiating resources (containers) from the ResourceManager and working with the NodeManager to execute and monitor the tasks. In the YARN design, MapReduce is just one application framework; this design permits building and deploying distributed applications using other frameworks. Once the AM is started (as a container), it will periodically send heartbeats to the ResourceManager to affirm its health and to update the record of its resource demands.

ResourceManager(RM):

RM is primarily a pure scheduler. It is strictly limited to arbitrating requests for available resources in the system made by the competing applications. It optimizes for cluster utilization (i.e., keeps all resources in use all the time) against various constraints such as capacity guarantees, fairness, and service level agreements (SLAs). To allow for different policy constraints, the RM has a pluggable scheduler that enables different algorithms such as those focusing on capacity and fair scheduling to be used as necessary.

That’s all for this article. I hope you are now clear about the YARN architecture and working model.

Business Intelligence

What is it (Business Intelligence)

Business intelligence (BI) is a broad category of application programs and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications support the activities of decision support, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. BI includes a set of concepts and methods to improve business decision making by using fact-based support systems.

 

Business intelligence (BI) is about creating value for our organizations based on data or, more precisely, facts. It may seem like another buzzword for what successful entrepreneurs have been doing for years, if not centuries: using business common sense. From a modern business-value perspective, corporations use BI to enhance decision-making capabilities for managerial processes (e.g., planning, budgeting, controlling, assessing, measuring, and monitoring) and to ensure critical information is exploited in a timely manner. Computer systems are the tools that help us do that better, faster, and with more reliability.

 

Defining BI
Business Intelligence is the art of gaining a business advantage from data by answering fundamental questions, such as how various customers rank, how the business is doing now and where it will be if it continues on the current path, and which clinical trials should be continued and which should stop having money dumped into them. With strong BI, companies can support decisions with more than just a gut feeling. Creating a fact-based “decisioning” framework via a strong computer system provides confidence in any decisions made.

 

Business intelligence is not necessarily about tools and technologies; rather, it is a strategy of combining data from various sources with methodologies that make those facts solidify in a cohesive manner. The data part of this strategy is data warehousing. Once the data is sourced, scrubbed, enriched, conformed, and finally housed in “access-ready” formats, BI tools can make the data sing and dance.

 

Traditional BI makes use of past data points (what you know about the data from a historical perspective) and displays them for the end user to make important inferences. Historical reporting takes advantage of the dimensionality in the data to “slice and dice” by reporting facts along any number of dimensions. Early reporting tools allowed programmers to define exactly what they wanted to present in varying levels of granularity and aggregation. In the 1980s a plethora of OLAP-style data structures emerged, including MOLAP, ROLAP and Hybrid-ROLAP, all of which provided the ability to drill in, around and through to make sense of the data presented.
While OLAP is certainly not dead, highly structured interfaces to the data came out of an organization’s executive branch interested in the details. In other words, taking data from “green bar” and simply transferring it to the “browser” was not enough.
Management needed to synthesize the data into meaningful bits of information. “Tell me what’s wrong. Highlight the facts for me,” was the driving force behind the dashboards and scorecards in today’s electronic toolbox.
Reporting on the past can only show what has happened, not what the future may bring. Past information must be combined with some real-time information and then layered with analytics in order to have true foreknowledge. This is where data mining, forecasting and other predictive analytics play an important role. This also turns out to be a major differentiator for SAS relative to its competitors.

 

 

SAS and BI
Having just celebrated 30 years of providing software for decision support, it is safe to say SAS has always done BI. From the early days of helping agricultural universities share statistical algorithms to supporting Fortune 100 companies today, SAS solutions take data and make sense of the patterns and provide flexibility and power in how to display and share information.
In SAS software, Business Intelligence includes:
• A set of client applications designed for a specific type of business or analyst
• SAS server processes designed to provide specific types of services for the client applications
• A centralized metadata management facility

Future of BI—Business Intelligence 2.0

As it relates to modern computing, the concepts behind BI have been in use for decades. One of the paradigm shifts noted in recent years is moving beyond simple reporting to proactive analysis of the data and providing prescriptive recommendations on how to interpret the data. This, of course, relies on the fact that corporations can successfully move from caring about what happened in the past to a desire to know not only what’s wrong but what is likely to get worse if nothing is done. That is where advanced analytics plays such a powerful role. What lies ahead if the leap can be made that tools like reporting, querying, OLAP, dashboards, scorecards and portals can be successfully used to help make sense of the world around us?
Wikipedia defines BI like this:
… a business management term which refers to applications and technologies… used to gather, provide access to, and analyze data and information about… company operations. Business intelligence systems… help companies have a more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production, [and] internal operations… [BI systems] can help companies… make better business decisions. (Source: Wikipedia, 2007)

Top Ten Static Website Generators

These days, speed and security are the name of the game.

Website visitors abandon sites after just a second or two of delay, and database hacks have become commonplace. Just look at the news to see the latest scandal laid bare by hackers who gained access to sensitive information due to poorly maintained WordPress or Drupal installs.

That’s why developers, agencies and producers of web content are turning to static website generators. With modern browsers, sites built with JavaScript, APIs and Markup offer the ability to serve highly dynamic content without the shackles of the standard, painfully slow (and expensive) backend database and a server building a site each time a visitor makes a request. Flat files can be served from CDNs around the world, increasing both speed and uptime, and managing static sites with version control systems like Git means the process of creating and updating sites is highly efficient.

Of course, if you are looking to make the switch, the myriad choices can seem daunting. That’s why we’re here. We’ll take a look at a lineup of the most popular static website generators and what they’re best suited for.

To decide what to cover, we are using StaticGen.com, a leaderboard of the top open-source static site generators (full disclosure: Netlify runs staticgen.com). We’re letting the community decide by covering the tools with the greatest number of stars on GitHub.

Before we start, you’ll probably notice that ReactJS isn’t on this list. You can’t talk about front end development in 2016 without talking about React, but React is more rightly a set of tools that can be used for many things, including static site and single page app development. React has been used to create some of the SSBs on this list, and will undoubtedly continue to have a hand in the future of the modern web, but for the purposes of this article, we’re looking at tools that can build entire sites and apps, not just components.


1. Jekyll

Jekyll is far and away the most popular static site generator. That’s no surprise, considering it underpins GitHub Pages and was created by GitHub co-founder Tom Preston-Werner.

Jekyll is built with Ruby, and is most often used for blogs and personal projects, due to its close integration with GitHub. Jekyll takes a directory filled with text files, renders that content with Markdown and Liquid templates, and generates a publish-ready static website. Its large community and wide array of plugins make it a great jumping off place for bloggers coming from the world of WordPress and Drupal, making it easy to import content from those formats and more.
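
To give a feel for what "text files in, static HTML out" means, here is a deliberately tiny Python sketch of the idea. It is not how Jekyll works internally: it swaps Markdown and Liquid for Python's built-in `string.Template` and a naive paragraph splitter, and the function names and sample content are invented for the illustration.

```python
import html
from string import Template

# A single hard-coded layout standing in for a real template engine.
LAYOUT = Template(
    "<html><head><title>$title</title></head><body>$body</body></html>"
)


def render_page(text):
    """Treat the first line as the title and blank-line-separated chunks as paragraphs."""
    title, _, rest = text.partition("\n")
    paragraphs = [chunk.strip() for chunk in rest.split("\n\n") if chunk.strip()]
    body = "".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return LAYOUT.substitute(title=html.escape(title.strip()), body=body)


def build_site(sources):
    """Map each source file name (e.g. 'index.txt') to a rendered .html page."""
    return {
        name.rsplit(".", 1)[0] + ".html": render_page(text)
        for name, text in sources.items()
    }


site = build_site({"index.txt": "Hello\n\nFirst post."})
print(site["index.html"])
```

The output is plain HTML that any web server or CDN can serve as-is, which is the property all the generators in this list share.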

jekyllrb.com


2. Hexo

Hexo is a build tool created with nodeJS, which allows for super speedy rendering, even with extremely large sites. Hexo focuses on being a blog framework that is highly extensible, with full support for Octopress plugins out of the box, and many Jekyll plugins with a minimum of tweaking.

hexo.io


3. Hugo

Hugo is a consistently namechecked static site generator built around Google’s Go programming language. It is optimized for speed (Hugo sites can be built in milliseconds) and easy to use. With no dependencies, Hugo is easy to install and update…all you need is the binary.

Hugo takes a directory with content and templates and renders them into a full HTML website. It’s a great choice for blogs and documentation. Content can be written in Markdown, organized however you want with any URL structure, and metadata can be defined in YAML, TOML or JSON. All this is done with almost no configuration, meaning with Hugo, you can just get straight to work.

gohugo.io


4. Octopress

Octopress began its life as a modified version of Jekyll, but it has taken on a life and a community of its own. Octopress’ theme is written in Semantic HTML5 and is easy to read on mobile devices. Users of Jekyll will find themselves right at home, as many Octopress plugins can be used with minimal modification, and its out of the box framework means users can get up and running in seconds.

Octopress self-identifies as a blogging framework for hackers. It allows users to easily embed code into their posts from gists, jsFiddle or their own file systems, all with Solarized styling. It features built-in third-party integration, supporting Twitter, Pinboard, Google Analytics, Disqus comments and more.

octopress.org

 

5. Pelican

Pelican is a static site generator written in Python. Content can be written in Markdown or reStructuredText formats, and can be published in multiple languages.

Jinja2 templates allow users to customize the theme, and Pelican supports code syntax highlighting. Pelican also supports Atom and RSS feeds, and integrates with social media accounts, external commenting tools like Disqus, and Google Analytics. Content that lives elsewhere can be imported from WordPress, Dotclear or RSS feeds.

getpelican.com


6. Brunch

Brunch is an ultra-fast HTML5 assembler and build pipeline. Brunch compiles scripts, templates and style sheets, lints them, wraps them in CommonJS or AMD modules, and concatenates the result.

Brunch uses skeletons to get users up and running. Brunch is better suited for users planning on building something closer to an app on the app/blog static site spectrum.

Brunch is actually better compared to Grunt or Gulp than to a blogging framework like Jekyll or Hugo. It doesn’t care about programming languages, frameworks or libraries. It just builds stuff.

brunch.io


7. Middleman

Middleman was built as a framework for advanced marketing and documentation websites, instead of a static blogging engine. It’s grown to become one of the most widely used static build tools for enterprise sites, with companies like MailChimp, Sequoia Capital and Vox Media creating their sites in Middleman.

Middleman is a command-line tool that uses Ruby and Ruby Gems to build web applications with CoffeeScript, asset management solutions like Sprockets, and uses ERB and HAML for dynamic pages and simplified HTML syntax. Additionally, Middleman’s powerful API allows extension authors to hook in to the toolchain at different points.

middlemanapp.com


8. Metalsmith

With Metalsmith, the sky’s the limit. That’s because Metalsmith is extremely simple – it’s a collection of user-defined plugins. Because of that, Metalsmith can build just about anything, from blogs to documentation to webapps and just about anything in between.

It’s worth noting that Metalsmith’s structure means that users should have a fairly high level of technical proficiency before trying to tackle a project. Beginners would be better served by one of the other tools on this list. But if you want something infinitely flexible, Metalsmith could be the tool of choice for you.

metalsmith.io


9. Harp

Harp is a static web server that also serves Jade, Markdown, EJS, Less, Stylus, Sass, and CoffeeScript as HTML, CSS, and JavaScript without any configuration. Harp allows you to reuse partials and common elements, so that you can preserve consistency across design and layouts. It compiles assets on an as-needed basis, so changes can be displayed with just a simple save and refresh.

harpjs.com/


10. Exposé

Exposé is quite a bit different than the other offerings on this list. It’s actually just a Bash script that turns images and videos into beautiful photoessays. It’s best experienced, rather than explained, so do yourself a favor and look at the personal blogs of Expose’s creator, located at jack.ventures and jack.works.

github.com/Jack000/Expose

Honorable Mentions:

There’s always a few favorites that get left out of any top ten list, and this one is no different. We decided to add a few of our personal favorites to the list, just because we like ‘em.

Gatsby

Gatsby takes Markdown and other static data sources and turns them into dynamic blogs and websites using ReactJS. By supporting the component-driven development model of React, Gatsby is able to re-use components across a site, adding consistency and speed. Blogs developed in Gatsby function as a single-page app, with JS bundles preloaded, so page transitions are instantaneous.

github.com/gatsbyjs

Roots

Roots is a static site compiler built in NodeJS that generates static HTML, CSS and JavaScript files. A product of digital agency Carrot Creative, Roots is streamlined for use by freelancers and agencies to make highly variable builds quicker and easier. Roots is a static site build tool by default, but also includes templates and plugins for Express and Rails. It comes with out-of-the-box support for Jade, CoffeeScript and Stylus, with an easily extensible asset pipeline.

roots.cx

GitBook

GitBook is quite a bit different from your standard static web tool, but follows one of the cardinal rules of the static site toolchain: do one thing and do it well. In the case of GitBook, that one thing is eBooks.

With GitBook, you can write your book in Markdown or AsciiDoc format, and publish by pushing to GitHub. If you aren’t comfortable working in the command line, you also have the option of using a web UI or a desktop editor. And once you are done, you can output your book as a website or an eBook in pdf, epub or mobi format.

gitbook.com

Cactus

Cactus sets itself apart from the crowd by being a little more beginner-friendly than some of the other options listed above, thanks to the Cactus Mac app. The application allows for simple setup of frameworks for blogs, portfolios and profiles, with built-in deployment to Amazon S3. Underpinning all that is a modern build tool that runs on Python and uses Django’s templating language.

github.com/koenbok/Cactus/


Once you’ve chosen a static site generator, you can use Netlify to host and deploy your static site or app. Although there are many static site hosting services, only Netlify gives you built-in performance, security, and flexibility. And of course, it supports every generator listed here and many more.


An Introduction to Static Site Generators

What You Need to Know

Static site generators have become more and more popular recently, but they’re not one of those ephemeral novelties that fall into oblivion as quickly as they rose to popularity. For over a decade, many different projects (394 of them, to be more precise) have been maintained by a varied community and built with a diverse range of programming languages and technologies.

I often read in articles about this subject that “static sites are not for everyone”, partly due to the lack of a UI to manage content and to the sometimes unfriendly installation process. But actually I think they can be for everyone, just not for everything.

The aim of this article is to help people of all skill levels understand exactly what static site generators are, acknowledge their advantages, and understand if their limitations are a deal-breaker or if, on the contrary, they can be overcome. With that, you’ll hopefully be able to make an informed decision on whether or not a static site can be the solution for your next project.

The concepts described throughout the article are valid for all static site generators, since they all share the same philosophy, although I’ll have Jekyll in mind when I write, purely because that’s the one I use and have the most experience with. It’s quite a mature product, has a huge community and has the big bonus of being natively supported by GitHub Pages. However, alternatives such as Docpad, Hugo and Wintersmith are also widely used and definitely worth investigating.

How dynamic sites work
Try to imagine for a second that the only way for people to know what’s happening in the world is to go to the nearby news kiosk and ask to read the latest news. Yes, I know it’s silly, but it will all make sense in a bit, so please bear with me.

The attendant has no way to know what the latest news is, so they pass the request on to a back room full of telephone operators (picture a big telephone switchboard room in the 1950s). When an operator becomes available, they will take the request, phone a long list of news agencies, ask for the latest news and then write the results as bullet points on a piece of paper.

The operator will then pass the rough notes on to a scribbler, who will write the final copy on a nice sheet of paper, arrange it in a certain layout and add a few bits and pieces such as the kiosk branding and contact information. Finally, the attendant takes the finished paper and serves it to the happy customer. The entire process will then be repeated for every person who arrives at the kiosk.

That is essentially how a dynamic website works. When a visitor gets to a website (the kiosk) expecting the latest content (the news), a server-side script (the operators) will query one or multiple databases (the news agencies) to get the content and pass the results to a templating engine (the scribbler), which will format and arrange everything properly and generate an HTML file (the finished newspaper) for the user to consume.
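To make the metaphor concrete, here is a rough sketch of that request flow in Python. It isn't how any particular CMS works: the `articles` table and the function names are made up for illustration, and an in-memory SQLite database stands in for the news agencies.

```python
import html
import sqlite3
from string import Template

# Hypothetical page layout; a real site would use a full templating engine.
PAGE = Template("<html><body><h1>$title</h1><ul>$items</ul></body></html>")

def make_database():
    """Create an in-memory database standing in for the news agencies."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, headline TEXT)")
    db.executemany(
        "INSERT INTO articles (headline) VALUES (?)",
        [("First story",), ("Second story",)],
    )
    return db

def fetch_latest(db, limit=10):
    """The operators: query the database, newest article first."""
    rows = db.execute(
        "SELECT headline FROM articles ORDER BY id DESC LIMIT ?", (limit,)
    )
    return [headline for (headline,) in rows]

def render_page(headlines):
    """The scribbler: format the query results as HTML."""
    items = "".join(f"<li>{html.escape(h)}</li>" for h in headlines)
    return PAGE.substitute(title="Latest news", items=items)

def handle_request(db):
    """Runs in full for every visitor: query, render, respond."""
    return render_page(fetch_latest(db))
```

The point to notice is that `handle_request` repeats both the query and the rendering for each visitor, even when the content hasn't changed since the last one.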

How static sites work
The proposition of a static site is to shift the heavy load from the moment visitors request the content to the moment content actually changes. Going back to our news kiosk metaphor, think of a scenario where it’s the news agencies who call the kiosk whenever something newsworthy happens.

The kiosk operators and scribblers will then compile, format and style the stories and produce a finished newspaper right away, even though nobody has ordered one yet. They will print out a huge number of copies (infinite, actually) and pile them up by the storefront.

When customers arrive, there’s no need to wait for an operator to become available, place the phone call, pass the results to the scribbler and wait for the final product. The newspaper is already there, waiting in a pile, so the customer can be served instantly.

And that is how static site generators work. They take the content, typically stored in flat files rather than databases, run it through layouts or templates and generate a structure of purely static HTML files that are ready to be delivered to the users.
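A toy generator fits in a few lines of Python. This is only a sketch under some assumptions of mine, not how Jekyll or any real engine is implemented: content lives in `.txt` files whose first line is the title, and a single hard-coded `LAYOUT` plays the role of the templates.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical layout shared by every page.
LAYOUT = Template(
    "<html><head><title>$title</title></head><body>$body</body></html>"
)

def build_site(content_dir, output_dir):
    """Render every .txt file in content_dir to an .html file in output_dir.

    Assumed file format: the first line is the title, the rest is the
    body, with paragraphs separated by blank lines.
    """
    content_dir, output_dir = Path(content_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(content_dir.glob("*.txt")):
        title, _, body = source.read_text(encoding="utf-8").partition("\n")
        paragraphs = "".join(
            f"<p>{html.escape(p.strip())}</p>"
            for p in body.split("\n\n")
            if p.strip()
        )
        target = output_dir / (source.stem + ".html")
        target.write_text(
            LAYOUT.substitute(title=html.escape(title.strip()), body=paragraphs),
            encoding="utf-8",
        )
        written.append(target)
    return written
```

All the work happens once, when `build_site` runs. After that, the output directory contains nothing but plain HTML files for the web server to hand out.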

Advantages of static
1) Speed
Perhaps the most immediately noticeable characteristic of a static site is how fast it is. As mentioned above, there are no database queries to run, no templating and no processing whatsoever on every request.

Web servers are really good at delivering static pages quickly, and the entire site consists of static HTML files that are sitting on the server, waiting to be served, so a request is served back to the user pretty much instantly.

2) Version control for content
You can’t even imagine working on a project without version control anymore, can you? Having a repository where people can collaboratively work on files, control exactly who does what and roll back changes when something goes wrong is essential in any software project, no matter how small.

But what about the content? That’s the keystone of any site and yet it usually sits in a database somewhere else, completely separated from the codebase and its version control system. In a static site, the content is typically stored in flat files and treated as any other component of the codebase. In a blog, for example, that means being able to have the actual posts stored in a GitHub repository and allowing your readers to file an issue when something is wrong or to add a correction with a pull request — how cool is that?
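To give an idea of what such a flat file looks like, Jekyll posts start with a block of metadata (the front matter) between two `---` lines, followed by the post body. The sketch below reads that shape in Python. It only understands flat `key: value` lines, which is a small subset of the YAML that Jekyll accepts, and `parse_post` is a name I made up.

```python
def parse_post(text):
    """Split a post into (metadata, body).

    Handles only flat `key: value` lines between two `---` markers.
    Anything without a leading front matter block is treated as body.
    """
    lines = text.splitlines()
    markers = [i for i, line in enumerate(lines) if line.strip() == "---"]
    if len(markers) < 2 or markers[0] != 0:
        return {}, text
    end = markers[1]
    metadata = {}
    for line in lines[1:end]:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body
```

Because the whole post is one text file, a fixed typo shows up in the repository history as an ordinary one-line diff.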

3) Security
Platforms like WordPress are used by millions of people around the world, meaning they’re common targets for hackers and malicious attacks — no way around it. Wherever there’s user input/authentication or multiple processes running code on every request, there’s a potential security hole to exploit. To be on top of the situation, site administrators need to keep patching their systems with security updates, constantly playing cat and mouse with attackers, a routine that may be overlooked by less experienced users.

Static sites keep it simple, since there’s not much to mess up when there’s only a web server serving plain HTML pages.

4) Less hassle with the server
Installing and maintaining the infrastructure required to run a dynamic site can be quite challenging, especially when multiple servers are involved or when something needs to be migrated. There are packages, libraries, modules and frameworks with different versions and dependencies, and there are different web servers and database engines on different operating systems.

Sure, a static site generator is a software package with its dependencies as well, but that’s only relevant at build time, when the site is generated. Ultimately, the end result is a collection of HTML files that can be served anywhere, scaled and migrated as needed regardless of the server-side technologies. As for the site generation process, that can be done from an environment that you control locally and not necessarily on the web server that will run the site — heck, you can build an entire site on your laptop and push the result to the web when it’s done.

5) Traffic surges
Unexpected traffic peaks on a website can be a problem, especially when it relies intensively on database calls or heavy processing. Introducing caching layers such as Varnish or Memcached certainly helps, but that ends up introducing more possible points of failure in the system.

A static site is generally better prepared for those situations, as serving static HTML pages consumes a very small amount of server resources.

Disadvantages of static (and potential solutions)
1) No real-time content
With a static site you lose the ability to have real-time data, such as an indication of which stories have been trending for the past hour, or content that dynamically changes for each visitor, like a “recommended articles for you” kind of thing. Static is static and will be the same for everyone.

There’s not really a solution for this, I’m afraid. It’s the ultimate price to pay for using a static site, so it’s important that you ask yourself the question “how real-time does my site need to be?” — if its concept is based around delivering real-time information then perhaps a static site isn’t the right choice.

A dangerous solution: There’s an easy way out whenever you’re faced with the challenge of dynamically updating content on a static site: “I can do it with JavaScript”. Doing processing on the client side and appending the results to the page after it’s been served can be the right approach for some cases, but must not be seen as the magic solution that turns your static site into a fully dynamic one. It can prevent some users from seeing the injected content, hurt your SEO and introduce other problems, potentially taking away the peace of mind and sense of control that come with using a static site.

2) No user input
Adding user generated content to a static site is a bit of a challenge. Take a commenting system for a blog, for example — how do you process user comments and append them to a post using just plain HTML pages? You don’t.

Solution
You can’t get around this limitation per se and start processing data in your static pages, but you can find alternative solutions for individual cases. If you need to create a contact form, there are a lot of third-party services that will handle POST requests and email you the data, or export it to a format of your choice.

A commenting system is a slightly different animal though, since it involves not only processing user data but also appending it to a certain page. Platforms like Disqus are often used as a workaround for this and they do the job, but I’m personally not a big fan.

First of all, there’s what we discussed in point 1 — Disqus will append the comments to the page with JavaScript after it’s been served, so technically the comments don’t exist on your site until the JavaScript kicks in. Secondly, with this approach you contradict the premise of keeping the content together and versioned within a repository.

I’ve written about taking a different approach to a commenting system for Jekyll, which basically uses a server-side handler to process comments, add them to the repository and push to GitHub, keeping comments together with the rest of the site.
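The general idea can be sketched as follows. To be clear, this is not the handler I wrote: the `_data/comments` layout, the JSON field names and both function names are assumptions made for the example, and a real handler would also need spam protection and stricter input checks.

```python
import json
import re
import subprocess
from pathlib import Path

def save_comment(repo_dir, post_slug, author, message, timestamp):
    """Write one comment as a JSON file under _data/comments/<post_slug>/.

    The directory layout and field names are assumptions for illustration.
    """
    # Reject slugs that could escape the comments folder (e.g. "../etc").
    if not re.fullmatch(r"[a-z0-9-]+", post_slug):
        raise ValueError(f"invalid post slug: {post_slug!r}")
    folder = Path(repo_dir) / "_data" / "comments" / post_slug
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{timestamp}.json"
    comment = {"author": author, "message": message, "date": timestamp}
    path.write_text(json.dumps(comment, indent=2), encoding="utf-8")
    return path

def commit_and_push(repo_dir, path):
    """Commit the new comment file and push it, using the git CLI."""
    for args in (
        ["git", "add", str(path)],
        ["git", "commit", "-m", f"Add comment {path.name}"],
        ["git", "push"],
    ):
        subprocess.run(args, cwd=repo_dir, check=True)
```

Once the push lands, the site can be regenerated like after any other content change, and the comment becomes part of the static HTML.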

3) No admin UI
It’s incredibly easy to publish a blog post to WordPress or Medium. It can be done from anywhere, even from a phone, without having to install any additional software. That’s not really the case with a static site.

Typically, posts are composed in a text editor and formatted with a language like Markdown or Textile. To publish them, you’d need to regenerate the site (most engines have a watch functionality to detect file changes and regenerate the site automatically) and deploy the files to a server. It’s a bit hard to do all that on a phone sitting on the beach, isn’t it?
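The watch functionality mentioned above usually boils down to noticing that a file changed and triggering a rebuild. A bare-bones version that polls modification times might look like the sketch below. Real engines tend to rely on operating system file events instead, and `rebuild` is a placeholder for whatever regenerates your site.

```python
import time
from pathlib import Path

def snapshot(content_dir):
    """Map every file under content_dir to its last-modified time."""
    return {
        path: path.stat().st_mtime_ns
        for path in Path(content_dir).rglob("*")
        if path.is_file()
    }

def changed_files(before, after):
    """Return files that were added, removed or modified between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )

def watch(content_dir, rebuild, interval=1.0):
    """Poll content_dir forever and call rebuild(changed) when files change."""
    previous = snapshot(content_dir)
    while True:
        time.sleep(interval)
        current = snapshot(content_dir)
        changed = changed_files(previous, current)
        if changed:
            rebuild(changed)
        previous = current
```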

Solution
There are platforms that provide a web interface for creating, editing and deleting files directly on a GitHub repository, offering a WYSIWYG editor for Markdown to create a friendly composition interface. Examples are Prose, a free and open source solution, and the more advanced CloudCannon, a commercial product that allows users to edit entire sections of a static site and see a live preview of the changes.

There are also mobile apps, available for both iOS and Android, that are a viable option for people interested in writing and publishing content on the move. The apps connect with GitHub and the changes are instantly pushed to the repository.

Another option is to set up a service that allows users to post to a static blog by email, which can be a viable solution for those who need to constantly write on the move. It works by listening for emails on a certain address and picking up the post metadata from the subject line, the images from the attachments and the post body from the message itself.
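The parsing half of such a service can be sketched with Python’s standard `email` package. The subject convention used here (“Post title [tag1, tag2]”) is something I made up for the example, and so is the `email_to_post` name. Receiving the mail and committing the result to the repository are left out.

```python
import email
from email import policy

def email_to_post(raw_message):
    """Turn a raw email into a dict describing a blog post.

    Assumed (hypothetical) subject convention: "Post title [tag1, tag2]".
    """
    msg = email.message_from_string(raw_message, policy=policy.default)

    # Metadata from the subject line.
    subject = msg["Subject"] or ""
    title, _, tag_part = subject.partition("[")
    tags = [t.strip() for t in tag_part.rstrip("] ").split(",") if t.strip()]

    # Post body from the plain-text part of the message.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content().strip() if body_part else ""

    # Images from the attachments (file names only in this sketch).
    images = [
        part.get_filename()
        for part in msg.iter_attachments()
        if part.get_content_maintype() == "image"
    ]
    return {"title": title.strip(), "tags": tags, "body": body, "images": images}
```

The returned dictionary has everything needed to write a new post file into the repository, which then goes through the usual build.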

Conclusions
Switching to a static site can potentially save you time and money, as it requires less maintenance and fewer server resources. Static sites are reliable, scalable and can handle high volumes of traffic quite well.

In 2012, Obama’s presidential campaign raised $250M through a Jekyll website. In 2013, Healthcare.gov switched to a CMS-free approach using Jekyll as well. Static sites are powering huge projects and are definitely not limited to blogging. There’s also a strong open source community maintaining and pushing forward a wide range of engines with different flavours and features.

However, static sites are not some magical solution that will solve all problems: they’re perfect for some cases, but terrible for others. It’s vital to understand how they work and what they can do in order to assess, on a per-project basis, whether or not they’re the right tool for the job.