3.1 Intro to Data Science
Describe what information can be extracted from data.
Describe what information can be extracted from metadata.
Identify the challenges associated with processing data.
Information is the collection of facts and patterns extracted from data.
Data provide opportunities for identifying trends, making connections, and addressing problems.
Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a causal relationship exists. Additional research is needed to understand the exact nature of the relationship.
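As a minimal sketch of this point, the Python below computes a correlation coefficient for two made-up series. The numbers, the variable names, and the pearson helper are all hypothetical and chosen only for illustration. Both series rise with temperature, so they correlate strongly with each other even though neither one causes the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily figures; both are driven by the day's temperature.
ice_cream_sales = [41, 48, 52, 60, 64, 72]
sunburn_cases = [14, 22, 33, 40, 51, 58]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.3f}")  # close to 1, yet ice cream does not cause sunburn
```

A value of r near 1 only says the two series move together. Deciding whether one causes the other, or whether a third variable drives both, takes research beyond the data set itself.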
Often, a single source does not contain the data needed to draw a conclusion. It may be necessary to combine data from a variety of sources to formulate a conclusion.
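One way to picture combining sources is a small Python sketch that joins two hypothetical data sets on a shared key. The city names, figures, and the visits_per_resident function are invented for illustration. Neither source alone can answer "how many library visits per resident?", but the combination can.

```python
# Source A (hypothetical): population per city.
population = {"Springfield": 120000, "Riverton": 45000, "Lakeside": 80000}

# Source B (hypothetical): yearly library visits per city.
library_visits = {"Springfield": 360000, "Lakeside": 120000, "Hilltop": 15000}

def visits_per_resident(population, visits):
    """Combine both sources, keeping only cities that appear in each."""
    combined = {}
    for city in population.keys() & visits.keys():
        combined[city] = visits[city] / population[city]
    return combined

result = visits_per_resident(population, library_visits)
for city in sorted(result):
    print(city, result[city])
```

Cities that appear in only one source (Riverton and Hilltop here) drop out of the result, which is itself a reminder that combined data can be incomplete.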
Metadata are data about data. For example, if the data are an image, the metadata may include the date of creation or the file size of the image.
Changes and deletions made to metadata do not change the primary data.
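The sketch below illustrates the last two statements with a temporary file, using Python's standard os and tempfile modules. The file name and its contents are placeholders, not a real image. The program reads two pieces of metadata (file size and modification time), changes the modification time, and then checks that the file's contents are the same as before.

```python
import os
import tempfile

original = b"pretend these bytes are an image"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "photo.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(original)

    # Read metadata: information about the file, not the file's contents.
    info = os.stat(path)
    size_in_bytes = info.st_size
    modified_before = info.st_mtime

    # Change metadata only: set access and modification times to a fixed
    # moment (1,000,000,000 seconds after the Unix epoch, in 2001).
    os.utime(path, (1_000_000_000, 1_000_000_000))
    modified_after = os.stat(path).st_mtime

    # The primary data are unchanged.
    with open(path, "rb") as f:
        contents_after = f.read()

print(size_in_bytes, modified_before != modified_after, contents_after == original)
```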
Metadata are used for finding, organizing, and managing information.
Metadata can increase the effective use of data or data sets by providing additional information.
Metadata allow data to be structured and organized.
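As a rough illustration of finding and organizing by metadata, the Python below keeps a small list of hypothetical photo records. The file names, dates, sizes, and both helper functions are assumptions made for this sketch. The photos themselves are never opened; only their metadata are used.

```python
from datetime import date

# Hypothetical metadata records; the image data are stored elsewhere.
photos = [
    {"name": "beach.jpg", "created": date(2023, 7, 4), "size_kb": 2048},
    {"name": "cat.png", "created": date(2024, 1, 15), "size_kb": 512},
    {"name": "hike.jpg", "created": date(2024, 6, 2), "size_kb": 3072},
]

def created_in(records, year):
    """Find: names of photos whose creation date falls in the given year."""
    return [r["name"] for r in records if r["created"].year == year]

def largest_first(records):
    """Organize: names of photos ordered from largest to smallest file."""
    ordered = sorted(records, key=lambda r: r["size_kb"], reverse=True)
    return [r["name"] for r in ordered]

print(created_in(photos, 2024))
print(largest_first(photos))
```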
The ability to process data depends on the capabilities of the users and their tools.
Data sets pose challenges regardless of size, such as the need to clean data and the need to combine data sources.
Depending on how data were collected, they may not be uniform. For example, if users enter data into an open field, the way they choose to abbreviate, spell, or capitalize something may vary from user to user.
Cleaning data is a process that makes the data uniform without changing their meaning (e.g., replacing all equivalent abbreviations, spellings, and capitalizations with the same word).
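A minimal sketch of this kind of cleaning, assuming users typed a state name into an open text field. The CANONICAL lookup table and the clean function are hypothetical. A real data set would need a table built from the variations actually observed in it.

```python
# Hypothetical lookup table: every known variation maps to one spelling.
CANONICAL = {
    "ca": "California",
    "calif.": "California",
    "california": "California",
    "tx": "Texas",
    "texas": "Texas",
}

def clean(entry):
    """Return a uniform spelling for an entry without changing its meaning."""
    # Collapse extra whitespace and ignore capitalization before the lookup.
    key = " ".join(entry.split()).lower()
    # Unknown entries are kept (trimmed) rather than guessed at.
    return CANONICAL.get(key, entry.strip())

raw_entries = ["  CA ", "calif.", "California", "TEXAS", "tx", "Ohio"]
cleaned = [clean(entry) for entry in raw_entries]
print(cleaned)
```

Leaving unrecognized entries untouched is a deliberate choice in this sketch: guessing at their meaning could change the data rather than clean them.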
Problems of bias are often created by the type or source of data being collected. Bias is not eliminated by simply collecting more data.
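The simulation below is a sketch of why more data do not remove bias. The scenario is invented: a school where half the students are athletes, and a survey about weekly exercise hours that is only handed out at the gym, so it only reaches athletes. All numbers are generated with a fixed random seed and are not real measurements.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: weekly exercise hours for 1,000 students.
athletes = [rng.uniform(8, 12) for _ in range(500)]
non_athletes = [rng.uniform(0, 4) for _ in range(500)]
population = athletes + non_athletes

def mean(values):
    return sum(values) / len(values)

true_mean = mean(population)

# Biased source: the survey only ever reaches athletes.
small_estimate = mean(rng.sample(athletes, 10))
large_estimate = mean(rng.sample(athletes, 400))

print(f"true mean:            {true_mean:.2f}")
print(f"biased survey, n=10:  {small_estimate:.2f}")
print(f"biased survey, n=400: {large_estimate:.2f}")
```

The larger survey gives a steadier number, but it is still an estimate for athletes only. The gap to the true mean comes from where the data were collected, not from how many responses were gathered.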
The size of a data set affects the amount of information that can be extracted from it.
Large data sets are difficult to process using a single computer and may require parallel systems.
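The idea behind parallel processing can be sketched in a few lines of Python: split the data set into chunks, process the chunks at the same time, and combine the partial results. The function names and the choice of four workers are assumptions for illustration. Threads are used here only to keep the sketch short and portable; a genuinely large, computation-heavy data set would more likely be spread across multiple processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, n_chunks):
    """Split a list into at most n_chunks pieces of nearly equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, n_workers=4):
    """Sum a list by summing its chunks concurrently, then combining."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

readings = list(range(1_000_000))  # stand-in for a large data set
total = parallel_sum(readings)
print(total == sum(readings))
```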
Scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.