This course will provide an overview of key concepts for creating an effective data science project and will introduce tools and techniques for data wrangling, statistical modelling, visualisation and reproducible reporting using R, a free, open-source language for data analysis. The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.
The R system has an extensive library of packages that offer state-of-the-art capabilities. Many of the analyses they offer are not available in any of the standard statistical software packages. R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software packages. It enables easy experimentation and exploration, which improves data analysis. Sharing what you discover through data analysis is necessary to make it useful. R is a tool that enables modern data analyses to be reported in a reproducible manner. It makes an analysis more useful to others because the data and code that actually produced it can be made available and easily shared. As such, R has become the lingua franca of quantitative research. Accordingly, this course will emphasise packages that help you carry out data analysis and visualisation and communicate the results to a wider audience.
The course will start by introducing the fundamental concepts of R: basic use of the R console through the RStudio IDE, inputting and importing data, record keeping and general good practice in an R project workflow. It will then progress to basic statistical concepts and statistical modelling techniques. Basic statistical concepts, which may seem complex in theory, can be communicated more effectively through visualisation. Hence, the formal, abstract nature of statistics can be demystified by visualising its application context. This is why the focus is on building appropriate visualisations for a given data analysis problem and on reporting intelligent, reproducible data analysis using RMarkdown. Using real data and real examples, we will introduce you to fundamental statistical concepts to set the stage for key statistical modelling techniques. We will finish the course by introducing you to the key Machine Learning (ML) algorithms, providing you with insight into how ML adapts and modifies assumptions through its three-step process (data -> model -> action) and by reacting to errors.
Version control has become an essential tool for keeping track of changes when working on data science (DS) projects, as well as for collaborating. RStudio supports working with Git, an open-source distributed version control system, which is easy to use when combined with GitHub, a web-based Git repository hosting service. Throughout the course you will be introduced to GitHub and you’ll become acquainted with good practice when incorporating the use of Git into your R project workflow.
The material is structured within four weekly modules. Each module is a day-long lesson split into morning (part I) and afternoon (part II) sessions.
Each module will be taught by Dr Tatjana Kecojevic and will cover various related topics through appropriate case studies, presentations, readings and discussion forums. Essential data handling and statistical modelling techniques are introduced during the teaching sessions. The conceptual models come to life in the hands-on taught sessions through the application of R. Students are then expected to use their own time both to deepen their understanding of the data models presented in the session and to practise and hone the data handling expertise acquired during the taught sessions. Students are given the opportunity to test their knowledge, both conceptual and practical, on a weekly basis through interactive student/teacher workshops.
Students are expected to participate fully in all of these delivery modes, but in particular are expected to have attempted any pre-set work and come fully prepared to discuss any problems encountered and debate the ideas and any issues raised.
We recommend you complete each of the following before the end of each week:
This course is for people from varying backgrounds and diverse profiles. It is designed for people who recognise the paramount importance of data and its use.
This course will benefit anyone who has the curiosity and desire to enter the realm of data science. We will make sense of the world of data and learn effective and attractive ways to visually analyse and communicate related information. With the knowledge gained on this course, you will be ready to undertake your very own data analysis for the first time.
Data Science is not simply fashionable jargon, but rather a discipline with a set of tools that empower data-enriched living, so whatever industry you’re in, this is relevant to you!
Workshop delivery will be in English and Serbian!
© 2020 Tatjana Kecojevic