Data, data analysis, and preparing data for analysis

What is data?

  • Data is information
  • It may be in many forms – table (as given below), images, text, audio, video, etc.
  • Data size may vary – from a few kilobytes to petabytes or more

The following is an example of a simple dataset:

Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    Target
Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   Yes
Yes  No   No   Yes  Full  $      No    No   Thai     30-60  No
No   Yes  No   No   Some  $      No    No   Burger   0-10   Yes
Yes  No   Yes  Yes  Full  $      No    No   Thai     10-30  Yes
Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    No
No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   Yes
No   Yes  No   No   None  $      Yes   No   Burger   0-10   No
No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   Yes
No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    No
Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  No
No   No   No   No   None  $      No    No   Thai     0-10   No
Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  Yes
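
As a minimal sketch of how such a table can be handled in code, the snippet below parses the rows into a list of dictionaries and counts the values in the Target column. The names RAW and parse_table are illustrative choices, not part of the original material, and only the Python standard library is assumed.

```python
from collections import Counter

# The table from the text, copied as whitespace-separated rows.
RAW = """\
Alt Bar Fri Hun Pat Price Rain Res Type Est Target
Yes No No Yes Some $$$ No Yes French 0-10 Yes
Yes No No Yes Full $ No No Thai 30-60 No
No Yes No No Some $ No No Burger 0-10 Yes
Yes No Yes Yes Full $ No No Thai 10-30 Yes
Yes No Yes No Full $$$ No Yes French >60 No
No Yes No Yes Some $$ Yes Yes Italian 0-10 Yes
No Yes No No None $ Yes No Burger 0-10 No
No No No Yes Some $$ Yes Yes Thai 0-10 Yes
No Yes Yes No Full $ Yes No Burger >60 No
Yes Yes Yes Yes Full $$$ No Yes Italian 10-30 No
No No No No None $ No No Thai 0-10 No
Yes Yes Yes Yes Full $ No No Burger 30-60 Yes
"""


def parse_table(text):
    """Split whitespace-separated text into a list of {column: value} dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


records = parse_table(RAW)
target_counts = Counter(record["Target"] for record in records)

print(len(records), "rows,", len(records[0]), "columns")
print(target_counts)
```

Every value is kept as a string here; converting columns such as Est or Price to numbers would be a separate formatting step.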

What is data analysis?

  1. Extracting useful information from data
  2. Learning patterns and structures in data, and using them to answer questions about new data
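
To make the second point concrete, here is a rough sketch that learns one very simple pattern from the table above and then answers a question about a new record. It looks only at the Pat column and predicts the Target value seen most often with each Pat value. This one-attribute rule is chosen purely for illustration and is not presented as the method the notes have in mind; the function names and the fallback answer "No" are assumptions.

```python
from collections import Counter, defaultdict

# (Pat, Target) pairs taken from the twelve rows of the example table.
TRAINING = [
    ("Some", "Yes"), ("Full", "No"), ("Some", "Yes"), ("Full", "Yes"),
    ("Full", "No"), ("Some", "Yes"), ("None", "No"), ("Some", "Yes"),
    ("Full", "No"), ("Full", "No"), ("None", "No"), ("Full", "Yes"),
]


def learn_rule(pairs):
    """Map each attribute value to the Target seen most often with it."""
    counts = defaultdict(Counter)
    for value, target in pairs:
        counts[value][target] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(rule, value, default="No"):
    """Answer for a new record; `default` (an arbitrary choice) covers unseen values."""
    return rule.get(value, default)


rule = learn_rule(TRAINING)
correct = sum(predict(rule, value) == target for value, target in TRAINING)

print(rule)
print(f"{correct} of {len(TRAINING)} training rows reproduced")
print("New record with Pat=Some:", predict(rule, "Some"))
```

The rule does not reproduce every training row, which shows that a single column is usually not enough and that richer models combine several columns.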

Examples of data analysis

  • Predict which team will win a given match
  • Estimate the rent for a given house
  • Recognize objects/handwriting in images
  • And much more.

Preparing data for analysis

  1. Data should be understood – just as requirements should be understood before developing software
    • The meaning of each column should be known for efficient analysis
  2. Access permissions, schema, and location of the data should be known
    • For example, we should be able to join all the required tables of a relational database into a single table for analysis
  3. Data should be cleaned by removing
    • duplicate rows
    • inconsistent or impossible values, such as a negative tip in restaurant bill data
    • invalid data, such as readings recorded while a sensor was being moved to the test location
    • redundant data – for example, a duration stored in both hours and minutes carries the same information twice, so only one of the two columns is needed
    • outliers
  4. Data might need to be formatted – for example, by converting names to identifying numbers
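
The preparation steps above can be sketched on a small, made-up example. The bill rows, the shifts table, and all function names below are hypothetical and exist only to illustrate removing duplicates, dropping impossible values, joining a second table, and replacing names with identifying numbers. Outlier handling is left out because it depends on the data and the question being asked.

```python
# Hypothetical restaurant-bill rows: (bill_id, server_name, total, tip)
bills = [
    (1, "Asha", 40.0, 6.0),
    (2, "Ben", 25.0, -3.0),   # impossible value: negative tip
    (1, "Asha", 40.0, 6.0),   # exact duplicate of the first row
    (3, "Asha", 60.0, 9.0),
    (4, "Chen", 18.0, 2.5),
]

# Hypothetical second table: server name -> shift
shifts = {"Asha": "evening", "Ben": "lunch", "Chen": "lunch"}


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def drop_impossible(rows):
    """Remove rows whose tip is negative."""
    return [row for row in rows if row[3] >= 0]


def join_shift(rows, shift_by_name):
    """Append the shift from the second table to each bill row.

    Raises KeyError if a server name is missing from the second table.
    """
    return [row + (shift_by_name[row[1]],) for row in rows]


def encode_names(rows):
    """Replace each server name with an integer id, in order of first appearance."""
    ids = {}
    encoded = []
    for bill_id, name, total, tip, shift in rows:
        server_id = ids.setdefault(name, len(ids))
        encoded.append((bill_id, server_id, total, tip, shift))
    return encoded, ids


clean = drop_impossible(drop_duplicates(bills))
joined = join_shift(clean, shifts)
encoded, server_ids = encode_names(joined)

for row in encoded:
    print(row)
print(server_ids)
```

The order of the steps is a design choice: cleaning before joining keeps bad rows from reaching the combined table, and the name-to-number mapping is returned so that the original names can be recovered later.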