There's a new free, self-paced online course about Data Science from Microsoft's Azure Cloud Advocates. Its 20-lesson curriculum, expected to take 10 weeks to complete, is targeted at those new to Data Science. Of course, it uses Python.
The term Data Science is used to cover a vast array of topics. Its definition has become even broader through overlap with the sibling fields of applied mathematics, statistics, machine learning and AI.
If asked about the difference between ML and Data Science, I would describe ML as a technology that trains models on data in order to tune an algorithm. Data Science is broader than that; it involves collecting, cleaning and aggregating data, visualizing it and using it statistically in order to reach data-driven decisions. However, as we'll discover in this course, both fields converge at certain points, such as data processing and training predictive models.
The definition that this course provides for Data Science is that it encompasses all the following processes:
1. Data Acquisition: The first step is to collect the data. While in many cases it can be a straightforward process, like data coming to a database from a web application, sometimes we need to use special techniques.
2. Data Storage: Storing the data can be challenging, especially if we are talking about big data. There are several ways data can be stored:
A relational database stores a collection of tables, and uses a special language called SQL to query them.
A NoSQL database, such as Cosmos DB, does not enforce a schema on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs.
Data Lake storage is used for large collections of data in raw form.
3. Data Processing: This step converts the data from its original form into a form that can be used for visualization or model training. When dealing with unstructured data such as text or images, we may need to use some AI techniques to extract features from the data, thus converting it to structured form.
4. Visualization / Human Insights: Often, a data scientist needs to "play with data", visualizing it many times and looking for relationships. We may also use techniques from statistics to test hypotheses or establish correlation between different pieces of data.
5. Training a predictive model: Because the ultimate goal of data science is to be able to make decisions based on data, we may want to use the techniques of Machine Learning to build a predictive model that will be able to solve our problem.
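To make the five steps concrete, here is a minimal sketch in Python that walks through them using only the standard library. It is not taken from the course: the dataset (hours studied versus exam score), the table name results and the helper functions pearson and fit_line are all hypothetical, chosen purely for illustration.

```python
import sqlite3
from statistics import mean

# 1. Acquisition: hypothetical (hours_studied, exam_score) records,
#    made up for this illustration.
raw_records = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70)]

# 2. Storage: a relational table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (hours REAL, score REAL)")
conn.executemany("INSERT INTO results VALUES (?, ?)", raw_records)

# 3. Processing: query the table back into plain Python lists.
rows = conn.execute(
    "SELECT hours, score FROM results ORDER BY hours"
).fetchall()
conn.close()
hours = [row[0] for row in rows]
scores = [row[1] for row in rows]


# 4. Insight: Pearson correlation between the two columns.
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# 5. Predictive model: a straight line fitted by ordinary least squares.
def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    intercept = my - slope * mx
    return slope, intercept


correlation = pearson(hours, scores)
slope, intercept = fit_line(hours, scores)
predicted_6h = slope * 6 + intercept  # predicted score for 6 hours

print(f"correlation: {correlation:.3f}")
print(f"model: score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score after 6 hours: {predicted_6h:.1f}")
```

A real project would typically replace the hand-written helpers with a library such as Pandas or scikit-learn; the point here is only to show how the steps hand data to one another.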
The course therefore looks at each of those processes in detail. This includes:
Statistics and Probability Theory: mean, variance and standard deviation; mode, median and quartiles
Working with Data: relational databases and their relationships; retrieving and joining data
Working with Non-Relational Data: spreadsheets, NoSQL, JSON and document data stores with Azure Cosmos DB
Working with Tabular Data and Dataframes: Python and the Pandas library, practiced on real-world examples such as modelling COVID spread and analyzing COVID scientific papers
Data Preparation and Visualization with Matplotlib: cleaning data; visualizing quantities, proportions and distributions
Data Science in the Cloud with Azure: training models using low-code tools, deploying models with Azure Machine Learning Studio
Data Science Ethics
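As a taste of two of the topics listed above, the following sketch computes the descriptive statistics named in the statistics lesson and then joins two relational tables. It uses Python's built-in statistics and sqlite3 modules rather than the course's own notebooks, and the sample values, table names (cities, readings) and city data are invented for the example.

```python
import sqlite3
import statistics

# Hypothetical sample, invented for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "mean": statistics.mean(values),
    "variance": statistics.variance(values),  # sample variance (n - 1)
    "stdev": statistics.stdev(values),        # sample standard deviation
    "mode": statistics.mode(values),
    "median": statistics.median(values),
    # Three cut points dividing the data into four groups (Python 3.8+).
    "quartiles": statistics.quantiles(values, n=4),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# Joining relational data: two hypothetical tables linked by city id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE readings (city_id INTEGER, temperature REAL);
    INSERT INTO cities VALUES (1, 'Oslo'), (2, 'Cairo');
    INSERT INTO readings VALUES (1, 4.0), (1, 6.0), (2, 30.0);
    """
)
avg_by_city = conn.execute(
    """
    SELECT cities.name, AVG(readings.temperature)
    FROM cities
    JOIN readings ON readings.city_id = cities.id
    GROUP BY cities.name
    ORDER BY cities.name
    """
).fetchall()
conn.close()
print(avg_by_city)
```

Note that statistics.quantiles defaults to the "exclusive" method, so other tools may report slightly different quartiles for the same data.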
The syllabus in detail:
Defining Data Science
Data Science Ethics
Defining Data
Introduction to Statistics & Probability
Working with Relational Data
Working with NoSQL Data
Working with Python
Data Preparation
Visualizing Quantities
Visualizing Distributions of Data
Visualizing Proportions
Visualizing Relationships
Meaningful Visualizations
Introduction to the Data Science lifecycle
Analyzing
Communication
Data Science in the Cloud
Data Science in the Wild
Resource-wise, it's pretty much a complete class: it includes nice sketches, supplemental videos, quizzes, step-by-step guides on how to build the projects, knowledge checks, challenges and assignments, which should be enough to get your journey started.
By way of prerequisites, it's recommended that you have a basic understanding of Python and Visual Studio Code, and that you are able to run code in Jupyter Notebooks.
After going through it, you'll be looking for the next steps. A good option on a familiar path is to go with another Microsoft course, "Machine Learning for Beginners", which follows the same structure.