The post Logistic Regression Using R appeared first on R Data science.

Logistic regression is one of the machine learning algorithms used for solving classification problems. It is used to estimate the probability that an instance belongs to a class. If the estimated probability is greater than a threshold, the model predicts that the instance belongs to that class; otherwise, it predicts that it does not, as shown in Fig 1. This makes it a binary classifier. Logistic regression is employed where the value of the dependent variable is 0/1, true/false or yes/no.

**Example 1**

Suppose we are interested in knowing whether a candidate will pass the entrance exam. The result depends upon the candidate's attendance in class, the teacher-student ratio, the knowledge of the teacher and the interest of the student in the subject; these are all independent variables, and the result is the dependent variable. The value of the result will be yes or no. So, it is a binary classification problem.

**Why Logistic Regression, Not Linear Regression?**

Linear regression models the relationship between the dependent variable and the independent variables by fitting a straight line, as shown in the figure.

In linear regression, the predicted value of Y can fall outside the 0 to 1 range. As discussed earlier, logistic regression gives us a probability, and the value of a probability always lies between 0 and 1. The logistic function is defined as:

1 / (1 + e^-value)

Where e is the base of the natural logarithm and value is the actual numerical value that you want to transform. The output of this function always lies between 0 and 1.
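As a quick numeric check, the logistic function can be evaluated for a few inputs (a Python sketch; the function name is my own):

```python
import math

def logistic(value):
    """Map any real number into the (0, 1) range."""
    return 1 / (1 + math.exp(-value))

print(logistic(0))    # 0.5: zero maps to the midpoint
print(logistic(4))    # close to 1
print(logistic(-4))   # close to 0
```

Large positive inputs saturate towards 1 and large negative inputs towards 0, which is exactly what makes the output usable as a probability.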

The equation of linear regression is

Y = B0 + B1X1 + … + BpXp

The logistic function is applied to convert the output to the 0 to 1 range:

P(Y=1) = 1 / (1 + exp(−(B0 + B1X1 + … + BpXp)))

We need to reformulate the equation so that the linear term is on the right side of the formula:

log(P(Y=1) / (1 − P(Y=1))) = B0 + B1X1 + … + BpXp

where log(P(Y=1) / (1 − P(Y=1))) is called the log-odds (logit).

**How to find the threshold value**

library(ROCR)

# Predicted probabilities on the training set
res <- predict(model, training, type = "response")

# ROC curve: true positive rate vs false positive rate at each cutoff
ROCRPred <- prediction(res, training$target)
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPerf, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

While selecting the threshold value, we should take care that the true positive rate is as high as possible and the false negative rate as low as possible, because if a person has a disease but the model predicts that he does not, it may cost someone's life.

The plot shows that if we take threshold = 0.4, the true positive rate increases.

res <- predict(model, testing, type = "response")

table(Actualvalue = testing$target, Predictedvalue = res > 0.4)

Here, we can see that the number of true negatives decreases from 7 to 5.

**Accuracy of the model**

The accuracy of the model comes out the same whether we use a threshold value of 0.4 or 0.5, but with threshold = 0.4 the true negative count decreases while fewer positive cases are missed. So, it is better to take 0.4 as the threshold value. The choice of threshold value depends upon the use case. In the case of medical problems, our focus is to decrease false negatives, because if a person has the disease but the model predicts that he does not, it may cost someone's life.
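The effect of lowering the threshold can be sketched with plain Python; the labels and probabilities below are made up for illustration, not taken from the model above:

```python
def confusion(actual, prob, threshold):
    """Cross-tabulate actual 0/1 labels against (prob > threshold) predictions."""
    pred = [p > threshold for p in prob]
    return {
        "TP": sum(1 for a, p in zip(actual, pred) if a == 1 and p),
        "TN": sum(1 for a, p in zip(actual, pred) if a == 0 and not p),
        "FP": sum(1 for a, p in zip(actual, pred) if a == 0 and p),
        "FN": sum(1 for a, p in zip(actual, pred) if a == 1 and not p),
    }

# Hypothetical labels and predicted probabilities
actual = [1, 1, 1, 0, 0, 1, 0, 0]
prob   = [0.9, 0.45, 0.7, 0.3, 0.55, 0.42, 0.2, 0.48]

print(confusion(actual, prob, 0.5))  # {'TP': 2, 'TN': 3, 'FP': 1, 'FN': 2}
print(confusion(actual, prob, 0.4))  # {'TP': 4, 'TN': 2, 'FP': 2, 'FN': 0}
```

Moving the threshold from 0.5 to 0.4 trades some true negatives for fewer false negatives, mirroring the trade-off discussed above.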

Resource Article : https://www.excelr.com/blog/data-science/regression/understanding-logistic-regression-using-r


The post Statistical Analysis for Data science appeared first on R Data science.

Through this blog, we aim to give the reader a definitive understanding of how the process of statistical analysis for data science can be carried out on an actual business use case.

**Let’s start :**

Data can be analysed to get valuable insights; without analysis, data is just a bunch of numbers that would not make any sense.

According to Croxton and Cowden,

Statistics is a science of the collection, presentation, analysis and interpretation of numerical data.

**A few examples include:**

Route Optimisation in the Airline Industry

ROI Prediction of a Company

Stock Market Share Price Prediction

Predictive Maintenance in Manufacturing

For any data set, statistical analysis for data science can be done according to the six steps shown below. They form the skeleton of statistical analysis.

**The steps are as follows :**

Defining business objective of study

Collection of Data

Data Visualization

Data Pre-Processing

Data Modelling

Interpretation of Data

**Step 1: Defining the objective of the analysis**

The first step is to understand the business objective and the reason for the analysis.

The objective can also be an exercise aimed at reducing costs, improving efficiency, and so on.

In this case, our objective is clear: to predict the quantity that will be sold in December 2020 using the past data.

**Step 2: Collection of Data**

This is the most important step in the analysis process, because here you have to gather the required data from various sources.

**Step 3: Data Visualization**

This step is crucial because it helps us understand the non-uniformities in a data set. It lets us visualize the data in a manner that helps fill the gaps and expedites the process of analysis. Tools like Tableau and Power BI can be used for the purpose of data visualization.

**Step 4: Data Pre-Processing**

Data preprocessing (also called data wrangling or data cleaning) is the process of gathering, selecting, and transforming data for easier analysis. It is the most important process, as it accounts for about 80% of the entire duration of an analysis.

**Step 5: Data Modelling**

After data preprocessing, the data is ready for analysis. We must choose statistical techniques like ANOVA, regression or other methods, based on the variables in the data.

To find the sales for the month of December 2020, we will use the moving average technique.

Note: There are many techniques, like moving average, exponential smoothing, advanced smoothing, etc., that can be used for forecasting sales. Here, based on the objective, the author's inclination is towards the moving average technique.

Based on the data, the six-month moving average is 245. Here's how we got the moving average.
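The worked table itself is not shown here, but the calculation can be sketched in a few lines (the monthly figures below are hypothetical, chosen so that the six-month average comes to 245):

```python
def moving_average_forecast(series, window=6):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly sales, June to November 2020
sales = [230, 250, 240, 255, 245, 250]
print(moving_average_forecast(sales))  # 245.0
```

The December forecast is simply the average of the most recent six months; as a new month's figure arrives, the window slides forward by one.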

**Step 6: Interpretation:**

We then come to the final step of our analysis, which is interpretation. Based on the modelling analysis, our interpretation is that for the month of December 2020 we can sell 245 packets of 1 kg quantity. In this way, we can predict future sales using historical data.

**Conclusion**

The six steps in this blog enhance your understanding of the various applications of statistical concepts in data science. Further, statistics can be divided into categories like descriptive statistics, inferential statistics, predictive statistics, etc., based on the data set and objective we deal with. Check out these blogs now to understand how each of these aspects of statistics can be used in detail.

Resource Article : https://www.excelr.com/blog/data-science/statistical-analysis/the-ultimate-guide-to-statistical-analysis-for-data-science-6-step-framework


The post Understanding Data Pre-Processing in Statistical Analysis appeared first on R Data science.

In simple words, data preprocessing is the process of collecting, selecting, and transforming data to make analysis easier. Basically, it means converting data into an understandable format.

It is also known as data cleaning or munging. It is the most time-consuming part of statistical analysis, as it accounts for about 80% of the time taken. Data preprocessing is very important in any statistical analysis; otherwise, it will directly impact the success rate of the project.

Data preprocessing allows us to remove any unnecessary data with the use of various techniques, this allows the user to have a dataset that contains more valuable information.

These datasets are edited to correct data corruption or human error. This is an important step to get accurate quantifiers like true positives, true negatives, false positives, and false negatives found in a confusion matrix, which are commonly used for medical diagnosis.

Any unnecessary information is removed from the data, which allows the analyst to achieve higher accuracy. Analysts use Python scripts accompanied by the pandas library, which gives them the ability to import data from comma-separated values (CSV) files as a DataFrame.
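A minimal pandas sketch of that import step (the column names and values are invented, and `io.StringIO` stands in for a CSV file on disk):

```python
import io
import pandas as pd

# A small CSV, inline so the example is self-contained
csv_text = """patient_id,age,glucose
1,54,148
2,31,85
3,62,183
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['patient_id', 'age', 'glucose']
```

In a real analysis, `pd.read_csv("data.csv")` is the usual call; everything downstream (missing value treatment, outlier correction) then operates on the resulting DataFrame.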

Now let us discuss the methods of data preprocessing. Data Preprocessing is done using the two methods :

**i. Missing Value Treatment**

**ii. Outlier Correction**

Let's understand missing value treatment first:

**i. Missing value treatment** – There could be several reasons behind missing values, such as human error, data incorrectly received, output error, and so on. To fill the missing values, we use **imputation techniques** like mean, median, and mode.
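A minimal sketch of mean/median/mode imputation using only the standard library (the ages below are made up):

```python
import statistics

# A numeric column with missing entries recorded as None (hypothetical data)
ages = [25, 31, None, 40, 31, None, 52]
observed = [a for a in ages if a is not None]

mean_fill   = statistics.mean(observed)    # 35.8
median_fill = statistics.median(observed)  # 31
mode_fill   = statistics.mode(observed)    # 31

# Impute with the mean; median or mode imputation works the same way
imputed = [a if a is not None else mean_fill for a in ages]
print(imputed)
```

The median or mode is usually preferred over the mean when the column contains outliers, since the mean gets pulled towards them.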

Let us now look at the second method of data pre-processing

**ii. Outlier Correction / Treatment**

Before we understand the procedure, let's understand what **outliers** are:

An outlier is the data point that lies outside the range of the remaining data points in a dataset.

E.g.: the sale of electronic goods during holidays like the Black Friday sale, New Year, and so on could produce outliers in the sales data for that entire year. This is because electronic goods sales are far higher on these days than on normal days.

To detect outliers, we have a simple technique called the box plot method. The values that fall outside the upper and lower limits (the whiskers) are called outliers.
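The box plot rule can be sketched with the usual 1.5 × IQR whiskers (the sales figures are made up, with one Black Friday-style spike):

```python
import statistics

def iqr_outliers(values):
    """Flag points outside the box plot whiskers (1.5 * IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Daily sales with one holiday spike (made-up numbers)
sales = [20, 22, 19, 24, 21, 23, 20, 95]
print(iqr_outliers(sales))  # [95]
```

Whether a flagged point should be corrected, capped, or kept depends on the use case; a genuine Black Friday spike may be real signal rather than an error.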

Data preprocessing is a very crucial stage in the whole process of statistical analysis, and you cannot afford any sort of mistake at this point. If you make an error here, you will not get the desired result from the analysis. As said, this step itself takes almost 80% of the time of the statistical analysis process.


The post What is data science, and how it is helping brands stand out from the competition? appeared first on R Data science.

Data science is a technology that analyzes complex data to provide key business insights that help businesses make better decisions. This, in turn, improves the profits of a company.

Most people think that data science is just a programming language. In fact, data science requires three key ingredients: the first is domain knowledge, the second is statistics and probability, and the third is programming skills. So data science is not merely a programming language but something more than that.

**Let us try to understand the concept of data science using an example.**

In the 2019–20 season, Liverpool Football Club of England won the English Premier League. This all started in 2015, when new manager Jurgen Klopp was hired and decided to use data science with the in-house expertise of Ian Graham, who heads research and data science at Liverpool Football Club.

But you must be wondering why Liverpool Football Club took several years to win the championship despite adopting data science in 2015.

Data Science takes time to give results, and it’s a long process consisting of various steps.

**The four-step process of data science is given below: **

1. Objective of analysis.

2. Data collection.

3. Data cleansing.

4. Data visualization.

**Let’s understand them one by one- **

**1. Objective of analysis:** First of all, you have to find out the purpose of doing the analysis. In Liverpool's case, the purpose of the analysis was to work out how to plug gaps in the team and win the championship in the coming years.

**2. Data Collection:** This is the most crucial step, as here we collect the required data that is going to be the pillar for everything; the outcome will depend on this step.

So, in Liverpool’s case, they collected the required data from all the newspapers, magazines, and fan clubs and made a consolidated list of which players and what skills they wanted.

**3. Data Cleansing:** After the data has been collected from various sources, it has to be cleansed. Data cleansing means segregating the data accordingly and removing unnecessary data.

Data cleansing is tedious yet the most important aspect in the process of data science. It is important to have crisp and clear data if one wants to have the desired results.

**4. Data Visualization:** Data visualization is the final step in this process. It means presenting the data in the easiest way possible. Commonly, graphs and pie charts are used for visualization, because not everyone can understand the complex statistics involved.


The post NLP: Text Cleaning & Preprocessing Methods appeared first on R Data science.

NLP, or natural language processing, is a convergence of linguistics, computer science, machine learning, and artificial intelligence. It is a set of algorithms that aims to analyze and model high volumes of human language. In simple words, it acts as an interconnection between human language and computers and provides users with optimum results.

As we all know, computers only understand binary code, 0s and 1s, rather than words. So every language is converted into such codes before processing what we want to search. Therefore, a lot of research and development is always happening in natural language processing.

Some practical implementations of NLP are Microsoft Cortana, Amazon Alexa, Apple Siri, and Google Assistant. With smart coding and applications of ML and AI, they can analyze and understand the questions we raise and answer them within a few seconds, whether it is about weather updates or news; spam mail filtering is another example of NLP.

But the applications of NLP are not limited to the above; it has a big hand in text classification and sentiment analysis, and in text summarization, along with other classification models. The input data always arrives the natural way humans write: sentences and paragraphs. While processing, NLP converts human languages into machine-understandable form by reducing the variations of words to their root format, and gives the desired output. NLP plays a crucial role in our everyday life.

For building any machine learning or artificial intelligence model, data preprocessing is the fundamental step that makes the data cleaner and helps in reducing dimensions. The Python library used for pre-processing tasks in NLP is NLTK, the Natural Language Toolkit.

**Tokenization**

It is the process of converting sentences into words.

import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once to fetch the tokenizer models

token = word_tokenize("My Email address is: imexpert@gmail.com")
token

**Lowercasing**

It converts the tokenized words into lowercase format. The words 'nlp' and 'NLP' have the same meaning, but if they are not converted into lowercase, the two will be treated as non-identical words in vector space models.

Lowercase = []
for word in token:
    Lowercase.append(word.lower())
Lowercase

**Stop Word and Punctuation Removal**

This step removes words that carry no significance when distinguishing two different documents (such as 'a', 'an', 'the', etc.), so they can be dropped.

from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once
stop_words = stopwords.words('english')

from string import punctuation
punct = list(punctuation)

# 'dataset' is assumed to have been loaded earlier as a list of records with a 'quote' field
print(dataset[1]['quote'])
tokens = word_tokenize(dataset[1]['quote'])
len(tokens)
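The code above builds the stop word and punctuation lists but stops short of the filtering itself; here is a dependency-free sketch of that step, using a hard-coded mini stop list in place of NLTK's downloaded corpus:

```python
# A tiny stand-in for stopwords.words('english') so the sketch is self-contained
stop_words = {'a', 'an', 'the', 'is', 'my'}
punct = {'.', ',', ':', '!', '?'}

tokens = ['My', 'Email', 'address', 'is', ':', 'imexpert', '@', 'gmail.com']
cleaned = [t for t in tokens if t.lower() not in stop_words and t not in punct]
print(cleaned)  # ['Email', 'address', 'imexpert', '@', 'gmail.com']
```

With NLTK, `stop_words` and `punct` built earlier slot straight into the same list comprehension.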

**Stemming**

It is the process in which words get converted to their base form. In simple words, when it sees a variety of words having a common root term, it considers them all the same.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem('jumping'))  # jump
print(ps.stem('lately'))   # late
print(ps.stem('assess'))   # assess
print(ps.stem('ran'))      # ran (stemming does not handle irregular forms)

**Lemmatization**

The major difference between stemming and lemmatization is that lemmatization reduces words to root words that actually exist in the language.

For example, the words 'has' and 'is' are changed to 'have' and 'be'.

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # run once

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('ran', 'v'))     # run
print(lemmatizer.lemmatize('better', 'a'))  # good


The post Ridge and Lasso Regression appeared first on R Data science.

In a general manner, to make things regular or acceptable is what we mean by the term regularization, and this is exactly why we use it in applied machine learning. In the domain of machine learning, regularization is the process that prevents overfitting by discouraging an overly complex or flexible model, and it eventually regularizes or shrinks the coefficients towards zero. The basic idea is to penalize complex models, i.e. to add a complexity term in such a way that complex models incur a bigger loss.

**Lasso Regression (L1 Regularization)**

This regularization technique performs L1 regularization. Unlike ridge regression, it modifies the RSS by adding a penalty (shrinkage quantity) equal to the sum of the absolute values of the coefficients.

Looking at the Lasso objective, Cost = RSS + λ * Σ |βj|, we can observe that, similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is quite capable of reducing the variability and improving the accuracy of linear regression models.

**Ridge Regression (L2 Regularization)**

This technique performs L2 regularization. The main idea is to modify the RSS by adding a penalty equal to the square of the magnitude of the coefficients: Cost = RSS + λ * Σ βj². It is considered a method to use when the data suffers from multicollinearity (independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which deviates the observed value away from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. It tends to solve the multicollinearity problem through the shrinkage parameter λ.

Now, let's see whether ridge regression or lasso works better. For ridge regression, we introduce GridSearchCV. This allows us to automatically perform 5-fold cross-validation over a range of regularization parameters in order to find the optimal value of alpha. You should see that the optimal value of alpha is 100, with a negative MSE of -29.90570. We can observe a small improvement compared with basic multiple linear regression.

**The code looks like this:**
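The code block itself does not appear here; below is a sketch of the GridSearchCV approach described above, using synthetic data in place of the original (unspecified) dataset, so the best alpha it finds will differ from the 100 quoted in the text:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the original dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation over a grid of regularization strengths
params = {'alpha': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(Ridge(), params, cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # negative MSE; closer to 0 is better
```

Swapping `Ridge()` for `Lasso()` reuses the same grid search, which is how the two penalties can be compared on equal footing.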

Resource Article : https://www.excelr.com/blog/data-science/regression/l1_and_l2_regularization


The post Bag Of Words Using Python appeared first on R Data science.

The Bag of Words model is a technique for pre-processing text by converting it into a number/vector format that keeps a count of the total occurrences of the most frequently used words in the document. This model is mainly visualized using a table that contains the count of each word. In other words, it can be explained as a method to extract features from text documents and use those features for training machine learning algorithms. It builds a vocabulary of all the unique words occurring in the training set of documents.

**Example:**

We shall take a popular example to explain Bag-of-Words (BoW) and make this journey of understanding a better one. We all love online shopping, and yes, it is always important to look for reviews of a product before we decide to buy it. So, we will use that example here.

Here's a sample of reviews about a particular cosmetic product:

Review 1: This product is useful and fancy

Review 2: This product is useful but not trending

Review 3: This product is awesome and fancy

We could easily gather a hundred such contrasting reviews about the product and its features; there are lots of interesting insights we can draw from them, and eventually we can predict which product is best for us to buy.

Now, the essential requirement is to process the text and convert it into vectorized form. This can be easily done using Bag of Words, which is the simplest form of representing text as numbers.
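A minimal sketch of that vectorization for the three reviews (the review strings are lightly normalized, and lowercasing plus whitespace splitting are simplifications of real tokenization):

```python
from collections import Counter

reviews = [
    "This product is useful and fancy",
    "This product is useful but not trending",
    "This product is awesome and fancy",
]

# Vocabulary of unique words across all documents
vocab = sorted({w.lower() for r in reviews for w in r.split()})

# One count vector per review, with one column per vocabulary word
vectors = [[Counter(r.lower().split())[w] for w in vocab] for r in reviews]

print(vocab)
print(vectors[0])
```

Each review becomes a fixed-length row of counts, which is exactly the table view described above and the feature matrix a classifier would train on.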

**Limitations:**

But this word embedding technique has some pitfalls, due to which developers prefer using TF-IDF or word2vec when handling a large amount of data.

**Let’s discuss them:**

The first issue arises when new sentences contain new words. If that happens, the vocabulary size increases, and thereby the length of the vectors increases too.

Additionally, the vectors would contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).

Secondly, we gain no information about the grammar of the sentences, nor do we retain the order of the words in the text.

Resource Article : https://www.excelr.com/blog/data-science/natural-language-processing/implementation-of-bag-of-words-using-python


The post Time Series Analysis appeared first on R Data science.

**What Is a Time Series?**

Time series is an ordered sequence of data points spread over a period of time. Here, time is generally an independent variable while the other variable/s keep changing values. The time series data is monitored over constant temporal intervals. This data can be in any measurable and quantifiable parameter related to the field of business, science, finance, etc.

**What is Time Series Analysis?**

Time series analysis refers to identifying the common patterns displayed by the data over a period of time. For this, experts employ specific methods to study the data's characteristics and extract meaningful statistics that eventually aid in business forecasting.

Learn forecasting in our data science course, designed for beginners, for a better understanding of the concept.

**Time Series Analysis and Forecasting Tactics**

Certain features of the given time series are used to create models that assist in predicting the future behaviour of business metrics. The better you can figure out the given data's characteristics, the more accurate the forecasts will be. Below is an overview of 18 crucial concepts, methods, and things to know for efficient business forecasting:

- Time series forecasting methods are a group of statistical techniques that can be vital for any business when estimating different variables.
- To obtain accurate forecasts, you need to check for three essential features in a time series. These are autocorrelation, seasonality, and stationarity.

**Autocorrelation and Seasonality**

- Autocorrelation is a mathematical term that indicates the extent of similarity between the given time series and its delayed version over a particular time. This time series refers to a set of values of a variable/entity.
- Autocorrelation helps determine the relationship between current values and the past values of an entity. By using the past and current data, the professionals can identify and analyse the data patterns, establish relations, and plan for the future.
- When an entity exhibits similar values periodically, i.e. after every fixed time interval, it makes way for measuring seasonality. For example, business sales of certain products show a similar increase in every festive season.
- Seasonality lays the ground for predictability of the variable as per a particular time of the day, month, season, or occasion. With the help of seasonal variation data, the salespeople can devise their strategy ahead of that specific period.
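Autocorrelation at a chosen lag can be sketched in a few lines; the toy series below repeats with period 4, so the seasonal lag stands out:

```python
def autocorr(series, lag):
    """Sample autocorrelation between a series and its lag-shifted copy."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

# A strongly seasonal toy series with period 4 (made-up sales numbers)
sales = [10, 20, 30, 40] * 5
print(round(autocorr(sales, 4), 2))  # 0.8: high at the seasonal lag
print(round(autocorr(sales, 1), 2))  # much lower at lag 1
```

A spike in autocorrelation at a particular lag is exactly the signature of seasonality described above: the series resembles its own copy shifted by one full season.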

**Stationarity and Trends**

- When the statistical properties of a time series remain constant over time, it is said to be stationary. In other words, the mean and variance of the series stay the same. Entities like stock prices are usually not stationary.
- Stationarity of a time series is checked by conducting a KPSS test, Dickey-Fuller test, or extended versions of these tests. Methods to detect stationarity are primarily statistical in nature. These tests basically evaluate a null hypothesis in one way or the other.
- Stationarity is regarded as quite crucial in a series; otherwise, a model describing the data shows different accuracy at different time points. So, before modelling, professionals use techniques to transform a given non-stationary time series into a stationary one.
- Trends are recorded over a long time. Depending upon the nature of the entity and related influencing factors, its trend may decrease, increase, or remain stable. For example, population, birth rate, death rate, etc. are some of the entities that mostly show movement and thus, cannot form a stationary time series.

Resource Article : https://www.excelr.com/blog/data-science/forecasting/18-time-series-analysis-tactics-that-will-help-you-win-in-2020


The post Correlation Vs Covariance appeared first on R Data science.

Despite the similarities between these mathematical terms, they are different from each other.

Covariance measures how two variables vary together, whereas correlation measures both the direction and the strength of the relationship between changes in one variable and changes in another.

**Covariance**

Covariance signifies the direction of the linear relationship between the two variables. By direction we mean whether the variables are directly proportional or inversely proportional to each other (increasing the value of one variable may have a positive or a negative impact on the value of the other variable).

The values of covariance can be any number between the two opposite infinities. Also, it is important to mention that covariance only measures how two variables change together, not the dependency of one variable on another.

**Correlation**

Correlation analysis is a method of statistical evaluation used to study the strength of a relationship between two numerically measured, continuous variables.

It shows not only the kind of relation (in terms of direction) but also how strong the relationship is. Thus, we can say the correlation values have standardized notions, whereas the covariance values are not standardized and cannot be used to compare how strong or weak the relationship is, because the magnitude has no direct significance. Correlation can assume values from -1 to +1.

Covariance and correlation are related to each other, in the sense that covariance determines the type of interaction between two variables, while correlation determines the direction as well as the strength of the relationship between two variables.
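The contrast can be sketched numerically with hand-rolled sample covariance and Pearson correlation (the data is made up); rescaling one variable changes the covariance but not the correlation:

```python
import math

def covariance(x, y):
    """Sample covariance: how two variables vary together (unbounded units)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance standardized into the [-1, 1] range."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [2, 4, 6, 8, 10]
y = [1, 3, 7, 9, 15]

print(covariance(x, y), covariance([v * 100 for v in x], y))  # 17.0 1700.0
print(round(correlation(x, y), 3))                            # 0.981
print(round(correlation([v * 100 for v in x], y), 3))         # 0.981
```

This is why covariance magnitudes cannot be compared across data sets, while correlation values can.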

Resource Article : https://www.excelr.com/blog/data-science/statistics-for-data-scientist/Correlation-vs-covariance


The post Simple Linear Regression In R appeared first on R Data science.

Once we have built a statistically significant model, it is possible to use it for predicting future outcomes on the basis of new x values.

Consider that we want to evaluate the impact of the advertising budgets of three media (YouTube, Facebook and newspaper) on future sales. This kind of problem can be modeled with linear regression.

The mathematical formula of linear regression can be written as y = b0 + b1*x + e, where:

b0 and b1 are referred to as the regression beta coefficients or parameters:

1) b0 is the intercept of the regression line; that is, the predicted value when x = 0.

2) b1 is the slope of the regression line.

3) e is the error term (also referred to as the residual error), the part of y that cannot be explained by the regression model.

The figure below illustrates the linear regression model, where:

* The best-fit regression line is in blue

* The intercept (b0) and the slope (b1) are shown in green

* The error terms (e) are represented by vertical red lines

From the scatter plot above, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it; overall, the residual errors (e) have approximately mean zero.

The sum of the squares of the residual errors is called the Residual Sum of Squares, or RSS.

The average variation of the points around the fitted regression line is called the Residual Standard Error (RSE). This is one of the metrics used to evaluate the overall quality of the fitted regression model. The lower the RSE, the better.

Since the mean error term is zero, the outcome variable y can be approximately estimated as follows:

y ~ b0 + b1*x

Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. This method of determining the beta coefficients is technically called least squares regression, or ordinary least squares (OLS) regression.
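The least squares estimates have a simple closed form, sketched below (the budget and sales numbers are invented and chosen to lie exactly on y = 1 + 2x, so the fit is perfect):

```python
def ols_fit(x, y):
    """Ordinary least squares estimates for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical advertising budget vs sales
budget = [1, 2, 3, 4, 5]
sales  = [3, 5, 7, 9, 11]

b0, b1 = ols_fit(budget, sales)
print(b0, b1)  # 1.0 2.0
```

In R, the equivalent is `lm(sales ~ budget)`, which computes the same b0 and b1 along with the RSE and t-tests mentioned below.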

Once the beta coefficients are calculated, a t-test is performed to check whether these coefficients are significantly different from zero. A non-zero beta coefficient means there is a significant relationship between the predictor (x) and the outcome variable (y).

Resource Article : https://www.excelr.com/blog/data-science/regression/simple-linear-regression

