Basic data visualization guide for data scientists
This article gives an introduction to basic data visualization and its importance in the field of data science. It is part of our students’ knowledge share program and is written by Chaitra. Candidates looking for data science courses in Bangalore can connect with us for detailed information on classes and content.
A massive amount of data is generated every day. Data visualization is very important for understanding, analyzing, interpreting, and presenting that data. Visualizations help to assign meaning to the data and arrive at a relevant conclusion. Several packages can be used for this purpose; popular Python choices are Plotly, seaborn, and Matplotlib.
Plot types for visualization
The important types of plots used for data visualization are as follows:
- Bar Graph
- Line Graph
- Pie Chart
- Histogram
- Area Chart
- Dot Graph
- Scatter plot
- Box Plot
- Geographic plot
A bar graph represents categorical data with rectangular bars and is mainly used for comparison. One axis carries the categories and the other carries the numerical values, so if the x-axis is categorical, the y-axis is numerical. The length of each bar is proportional to the value it represents, and the bars can be vertical or horizontal. Variants such as grouped and stacked bar graphs are used when the dataset has subgroups, which are differentiated by distinct colors.
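As a minimal sketch of a grouped bar graph, the snippet below uses Matplotlib with made-up sales figures. The category names, numbers, and variable names are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: units sold per product category in two quarters.
categories = ["Books", "Toys", "Games"]
q1_sales = [120, 80, 150]
q2_sales = [140, 95, 130]

x = np.arange(len(categories))  # one slot per category on the x-axis
width = 0.35                    # width of each bar inside a slot

fig, ax = plt.subplots()
# Shift the two subgroups left and right so they sit side by side.
bars_q1 = ax.bar(x - width / 2, q1_sales, width, label="Q1")
bars_q2 = ax.bar(x + width / 2, q2_sales, width, label="Q2")

ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Units sold")
ax.set_title("Grouped bar graph")
ax.legend()  # the legend maps each color to its subgroup
# plt.show()  # uncomment to display the figure
```

For a stacked version, one option is to draw both series at the same x positions and pass `bottom=q1_sales` to the second `ax.bar` call.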
A line plot shows the relationship between two variables. Points are marked in 2D at their x and y coordinates and then joined with straight lines to produce a line chart. A line graph is good for visualizing trends, progress, or time-series information. The two main kinds are simple line graphs and multiple line graphs. A simple line graph plots a single line on the figure to map the independent variable to the dependent variable. A multiple line graph contains more than one line, each representing a different variable in the dataset with respect to the same independent variable. Multiple line charts are a great way to compare variables and to see how they move together.
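A multiple line graph can be sketched in Matplotlib roughly as follows. The temperature values and city names are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data: average monthly temperature (in Celsius) for two cities.
months = [1, 2, 3, 4, 5, 6]
city_a = [21, 23, 26, 28, 27, 24]
city_b = [15, 17, 20, 24, 28, 30]

fig, ax = plt.subplots()
# Each call to plot() adds one line against the same independent variable.
ax.plot(months, city_a, marker="o", label="City A")
ax.plot(months, city_b, marker="s", label="City B")

ax.set_xlabel("Month")
ax.set_ylabel("Temperature (C)")
ax.set_title("Multiple line graph")
ax.legend()
# plt.show()  # uncomment to display the figure
```

Dropping the second `ax.plot` call leaves a simple line graph.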
A pie chart is a circular chart used for representing composition. Each slice represents one category's percentage share of the whole in univariate data. A slice is a part of the complete circle, where the full 360 degrees is equivalent to 100%. Pie charts are best limited to 5 or 6 slices; beyond that, the slices become thin and difficult to read. Variants include exploded charts, donut charts, and 3D pie charts. Pie charts are also a great way to show how the ratings of data science training institutes in Bangalore are split across rating levels!
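The snippet below is a small sketch of an exploded pie chart in Matplotlib. The language shares are made-up numbers, and the choice to pull out the first slice is arbitrary.

```python
import matplotlib.pyplot as plt

# Hypothetical composition: share of programming languages used in a team.
labels = ["Python", "R", "SQL", "Other"]
shares = [40, 30, 20, 10]
explode = [0.1, 0, 0, 0]  # pull the first slice out of the circle

fig, ax = plt.subplots()
# autopct prints each slice's percentage; the slice angles always add up to 360 degrees.
ax.pie(shares, labels=labels, explode=explode, autopct="%1.1f%%", startangle=90)
ax.set_aspect("equal")  # keep the pie circular
ax.set_title("Exploded pie chart")
# plt.show()  # uncomment to display the figure
```

Passing `wedgeprops={"width": 0.4}` to `ax.pie` is one way to turn the same data into a donut chart.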
A histogram is used to visualize the distribution of data over a continuous interval. The data is divided into non-overlapping intervals called bins. Each bar in a histogram represents the frequency, or sometimes the probability, of values falling in that bin. Histograms help to estimate where the data is dense and concentrated. They are among the most used plots in data science and are extremely useful for representing the distribution of both univariate and bivariate data. Checking whether data follows a Gaussian (normal) distribution is a good use case for histograms.
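To illustrate binning, here is a minimal sketch that draws a histogram of synthetic, normally distributed data with Matplotlib and NumPy. The sample (heights with mean 170 and standard deviation 10) and the bin count of 20 are assumptions chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 1000 values drawn from a Gaussian distribution.
rng = np.random.default_rng(seed=42)
heights = rng.normal(loc=170, scale=10, size=1000)

fig, ax = plt.subplots()
# hist() splits the range into 20 equal, non-overlapping bins and
# returns the frequency per bin along with the bin edges.
counts, bin_edges, _ = ax.hist(heights, bins=20, edgecolor="black")

ax.set_xlabel("Height (cm)")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")
# plt.show()  # uncomment to display the figure
```

Passing `density=True` to `ax.hist` plots a probability density instead of raw frequencies.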
Area charts are similar to line charts, except that the area under the line is colored for comparison. The colored area is proportional to each category's share of the total. The area chart is a variation of the line chart designed for comparing continuous values across discrete categories. Area plots are best limited to four or five categories so that they remain readable. Like line graphs, area graphs are also used to display how quantitative values develop over a time period. Two popular kinds are grouped and stacked area graphs.
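A stacked area graph can be sketched with Matplotlib's `stackplot`. The yearly traffic figures and device names below are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical data: website visits (in thousands) per device type and year.
years = [2019, 2020, 2021, 2022, 2023]
mobile = [10, 14, 19, 25, 30]
desktop = [20, 21, 20, 18, 17]
tablet = [5, 6, 6, 5, 4]

fig, ax = plt.subplots()
# stackplot() draws each series on top of the previous one, so the
# top edge of the chart shows the total across categories.
layers = ax.stackplot(
    years, mobile, desktop, tablet,
    labels=["Mobile", "Desktop", "Tablet"], alpha=0.8,
)

ax.set_xlabel("Year")
ax.set_ylabel("Visits (thousands)")
ax.set_title("Stacked area graph")
ax.legend(loc="upper left")
# plt.show()  # uncomment to display the figure
```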
Scatter plots are commonly used to examine the correlation between two continuous variables. Each point is marked at its x and y coordinates, and the overall pattern reveals the relationship between the independent and dependent variables. Scatter plots display how one variable changes with the other. A positive correlation in the scatter plot signifies an increase in the y variable with an increase in the x variable. The variables are said to be negatively correlated when one decreases as the other increases. Scatter plots are also great for showing the density of the data and anomalies such as outliers, which are points that end up far outside the general cluster. When the number of data points is very large, a hexbin plot is recommended to avoid overlapping points.
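The following sketch puts a scatter plot and a hexbin plot side by side for the same synthetic data. The data generation (y roughly equal to 2x plus noise), the point count, and the grid size are assumptions made for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data with a positive correlation: y rises as x rises.
rng = np.random.default_rng(seed=0)
n_points = 5000
x = rng.normal(size=n_points)
y = 2 * x + rng.normal(scale=0.5, size=n_points)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

fig, (ax_scatter, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Plain scatter plot: with thousands of points the markers overlap heavily.
ax_scatter.scatter(x, y, s=5, alpha=0.3)
ax_scatter.set_title(f"Scatter plot (r = {r:.2f})")

# Hexbin plot: points are counted per hexagonal cell and shown by color.
hb = ax_hex.hexbin(x, y, gridsize=30, cmap="viridis")
ax_hex.set_title("Hexbin plot")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")
# plt.show()  # uncomment to display the figure
```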
A box plot is a good way to examine the distribution of data. Box plots are named after the box in the graph, which spans from the 25th percentile (Q1) to the 75th percentile (Q3), with the median (Q2) marked by a line inside it. Two additional lines, called whiskers, extend above and below the box. The difference between Q3 and Q1 is the IQR (interquartile range). The whiskers extend to the most extreme data points that still lie within Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The data points outside these whiskers are called ‘outliers’ as they deviate significantly from the rest of the data points. A box and whisker plot (or box plot) is used for outlier detection and for analyzing the spread between the quartiles.
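The quartile and whisker rules above can be worked through on a small made-up sample. This sketch computes the values with NumPy's default percentile method and then draws the same data with Matplotlib; the sample and variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample with one obviously extreme value.
data = [12, 14, 15, 16, 18, 19, 20, 21, 22, 60]

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points beyond the limits are treated as outliers.
outliers = [v for v in data if v < lower_limit or v > upper_limit]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"limits=({lower_limit}, {upper_limit}), outliers={outliers}")

fig, ax = plt.subplots()
# The default whis=1.5 applies the same 1.5 * IQR rule; the whiskers stop
# at the most extreme data points inside the limits.
box = ax.boxplot(data)
ax.set_title("Box plot")
# plt.show()  # uncomment to display the figure
```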
Geo Plot or Choropleth Plot
A choropleth map is a geographical representation of statistical values by region, in which each region is shaded according to its value. Choropleth visualizations are typically built from GeoJSON data that describes the region boundaries, and individual locations can also be marked on the same map using latitude and longitude. For example, a map showing the spread of a disease in a country, colored by state, can give a lot of information on which states are at higher risk!
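The idea of shading regions by value can be sketched with plain Matplotlib, using toy square "states" in place of real boundaries. The region shapes, names, and case counts are all hypothetical; for real maps, Plotly's `px.choropleth` with actual GeoJSON boundaries would be a more typical choice.

```python
import matplotlib.pyplot as plt
from matplotlib import cm, colors
from matplotlib.patches import Polygon

# Hypothetical GeoJSON-like features: toy square regions with made-up case counts.
features = [
    {"name": "State A", "cases": 120, "coords": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"name": "State B", "cases": 450, "coords": [(1, 0), (2, 0), (2, 1), (1, 1)]},
    {"name": "State C", "cases": 80, "coords": [(0, 1), (1, 1), (1, 2), (0, 2)]},
    {"name": "State D", "cases": 300, "coords": [(1, 1), (2, 1), (2, 2), (1, 2)]},
]

# Map each value onto a color scale: low values light, high values dark.
values = [f["cases"] for f in features]
norm = colors.Normalize(vmin=min(values), vmax=max(values))
cmap = plt.get_cmap("Reds")

fig, ax = plt.subplots()
for f in features:
    ax.add_patch(Polygon(f["coords"], closed=True,
                         facecolor=cmap(norm(f["cases"])), edgecolor="black"))
    # Label each region at the center of its polygon.
    cx = sum(p[0] for p in f["coords"]) / len(f["coords"])
    cy = sum(p[1] for p in f["coords"]) / len(f["coords"])
    ax.text(cx, cy, f["name"], ha="center", va="center")

ax.set_xlim(0, 2)
ax.set_ylim(0, 2)
ax.set_aspect("equal")
ax.set_title("Toy choropleth: cases by state")
fig.colorbar(cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="Cases")
# plt.show()  # uncomment to display the figure
```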
There is more to plotting, which we will cover in the next article. Our data science and deep learning course in Rajajinagar, Bangalore covers a lot of ground on plotting techniques and data wrangling. See you all in the next article. Happy learning!!