Other Workshops and Lectures
Workshops by Prof. Thomas Lumley
Choosing Good Subsamples for Measuring New Variables
Researchers often want to add new measurements to an existing database: new assays on stored specimens, coding of free-text responses, or validation of EHR data against clinical notes. Measuring the new variables on everyone is expensive, so subsampling is attractive. It is possible to do much better than simple random sampling: any information you already have can be used to identify the most informative records to measure. It is also possible to recover a lot of information from the records not chosen for the subsample. Software already exists in R to support most analyses you would want to run on the subsampled data.
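One such tool is the twophase() function in the survey package, which analyses a subsample while using the information available on the full database. A minimal sketch follows; the data frame dat, the stratifying variable strat, the phase-2 indicator in_phase2, and the model variables are all hypothetical placeholders, not part of the workshop materials.

```r
# A minimal sketch of analysing a stratified two-phase subsample with the
# 'survey' package. Here 'dat' holds the full database, 'strat' is a
# stratifying variable already measured on everyone, and 'in_phase2' is
# TRUE for records chosen for the new, expensive measurement.
library(survey)

des <- twophase(
  id     = list(~1, ~1),        # no clustering at either phase
  strata = list(NULL, ~strat),  # phase 2 stratified on existing information
  subset = ~in_phase2,          # which records were subsampled
  data   = dat
)

# Fit a regression using the newly measured variable; the design object
# supplies weights and variances that account for the subsampling
fit <- svyglm(outcome ~ new_assay + age, design = des)
summary(fit)
```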
Professor Lumley Workshop Materials →
Analyzing Larger Data in R
Even as computing power grows, researchers sometimes want to work with datasets much bigger than memory. The interfaces that allow selection, summarisation, and aggregation of very large datasets from R are increasingly transparent and easy to set up. I will demonstrate simple analyses of large datasets in R, and show how some more complicated analyses can be partitioned between R and a database to exploit the advantages of both systems. I will primarily use duckdb, but will also refer to other large-data interfaces.
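As a flavour of the approach, here is a minimal sketch using DBI and duckdb, pushing the aggregation into the database so that only a small summary table ever enters R's memory; the file name big_data.parquet and the columns year and value are hypothetical placeholders.

```r
# A minimal sketch of aggregating a larger-than-memory dataset with duckdb.
# The Parquet file and column names are hypothetical.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# duckdb scans the Parquet file directly; the GROUP BY runs in the
# database, and only the per-year summary comes back to R
res <- dbGetQuery(con, "
  SELECT year, AVG(value) AS mean_value, COUNT(*) AS n
  FROM 'big_data.parquet'
  GROUP BY year
  ORDER BY year
")

dbDisconnect(con, shutdown = TRUE)
```

The same query can also be written as ordinary dplyr verbs via the dbplyr package, which translates them to SQL and runs them in the database.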