Merging and Joining Data
Overview
Teaching: 45 min
Exercises: 10 minQuestions
How can I combine data from different sources?
Objectives
Learn how to stack data frames (ex. different years’ data)
Learn how to join - inner, outer, etc.
Explore after a join to verify results as expected
Introduction goes here
Very often, studies record different types of data in different tables. For example, there may be one table with demographic information about participants; there might be another table with lab measurements associated with the participants.
Discussion:
When might be some scenarios where it might be preferable not to record all of the study information in a single table?
For the next several lessons, we’ll be working with a data set from the Study of Women’s Health Across the Nation (SWAN). SWAN is a multi-site longitudinal, epidemiologic study. The study, which began in 1994 and is still ongoing, examines the physical, biological, psychological, and social changes in women during their middle years.
Citation:
Sutton-Tyrell, Kim, Selzer, Faith, Sowers, MaryFran R. (Mary Francis Roy), Finkelstein, Joel, Powell, Lynda, Gold, Ellen B., … Matthews, Karen. Study of Women’s Health Across the Nation (SWAN), 2002-2004: Visit 06 Dataset. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2018-11-13. https://doi.org/10.3886/ICPSR31181.v2
The SWAN study collected data in 3-year cycles. We’ll be working with the 2002-2004 cycle data. Although the study collected data on hundreds of variables across thousands of women, we’ll focus our analysis on the following smaller set of variables:
Variable | Description |
---|---|
SWANID | Participant’s Study Identifier |
AGE6 | Age |
RACE | Race at Screener |
BMI6 | Body Mass Index (BMI) |
LDLRESU6 | Low-density lipoprotein cholesterol (estimated) mg/dl |
HDLRESU6 | High density lipoprotein cholesterol mg/dl |
GLUCRES6 | Glucose mg/dl |
CRPRESU6 | C-reactive protein (CRP) mg/l |
DIABP16 | Diastolic blood pressure mmHg |
SYSBP16 | Systolic blood pressure mmHg |
EXERCIS6 | During the past 12 months, have you exercised? |
SMOKERE6 | Since your last study visit, have you smoked cigarettes regularly (at least one cigarette a day)? |
STATUS6 | Menopausal Status |
STATUS6 | Menopausal Status : 1 = Post by BSO (Bilateral Salpingo Oophorectomy) 2 = Natural Post 3 = Late Peri 4 = Early Peri 5 = Pre 6 = Pregnant/breastfeeding 7 = Unknown due to HT use 8 = Unknown due to hysterectomy
RACE Race at Screener 1: Black/African American 2: Chinese/Chinese American 3: Japanese/Japanese American 4: Caucasian/ White Non-Hispanic 5: Hispanic
In our case, we have two tables, a demographic table and a table with lab measurements.
We’ll start by reading each of them into a separate data frame.
swan_demographic <- read.csv('data/SWAN_2002_2004/DS0001/swan_demo.csv')
swan_lab <- read.csv('data/SWAN_2002_2004/DS0001/swan_lab.csv')
Let’s take a peek at each data frame:
head(swan_demographic)
SWANID VISIT INTDAY6 AGE6 LANGINT6 RACE
1 10046 6 2310 58 1 2
2 10056 6 2199 57 1 4
3 10126 6 2213 54 1 1
4 10153 6 2201 57 1 3
5 10196 6 2213 52 1 2
6 10245 6 2296 54 1 4
swan_lab
is very wide, but feel free to peek at it using head() and/or by clicking on it in RStudio’s Environment pane. Notice that when there are many variables, RStudio provides buttons for paginating left and right through the variables.
Let’s look at the variable names (column names):
colnames(swan_demographic)
[1] "SWANID" "VISIT" "INTDAY6" "AGE6" "LANGINT6" "RACE"
colnames(swan_lab)
[1] "SWANID" "VISIT" "PREGNAN6" "PREVBLO6" "EATDRIN6" "STRTPER6"
[7] "BLEDAY6" "BLDRWAT6" "BLDDRAW6" "SPEDAY6" "ANTICO16" "ACOATW16"
[13] "ANTICO26" "ACOATW26" "HEART16" "HARTTW16" "HEART26" "HARTTW26"
[19] "CHOLST16" "CHOLTW16" "CHOLST26" "CHOLTW26" "BP16" "BPTW16"
[25] "BP26" "BPTW26" "DIURET16" "DIURTW16" "DIURET26" "DIURTW26"
[31] "THYROI16" "THYRTW16" "THYROI26" "THYRTW26" "INSULN16" "INSUTW16"
[37] "INSULN26" "INSUTW26" "NERVS16" "NERVTW16" "NERVS26" "NERVTW26"
[43] "STEROI16" "STERTW16" "STEROI26" "STERTW26" "FERTIL16" "FRTLTW16"
[49] "FERTIL26" "FRTLTW26" "BCP16" "BCPTWI16" "BCP26" "BCPTWI26"
[55] "BCREAS6" "BCRES_S6" "ESTROG16" "ESTRTW16" "ESTROG26" "ESTRTW26"
[61] "ESTRDA16" "ESTRDA26" "ESTRNJ16" "EINJTW16" "ESTRNJ26" "EINJTW26"
[67] "COMBIN16" "COMBTW16" "COMBIN26" "COMBTW26" "PROGES16" "PROGTW16"
[73] "PROGES26" "PROGTW26" "PROGDA16" "PROGDA26" "OSTEPR16" "OSTETW16"
[79] "OSTEPR26" "OSTETW26" "ARTHRT16" "ARTHTW16" "ARTHRT26" "ARTHTW26"
[85] "OTHMED16" "OTHRTW16" "OTHMED26" "OTHRTW26" "OTHMED36" "OTHRTW36"
[91] "OTHMED46" "OTHRTW46" "OTHMED56" "OTHRTW56" "OTHMED66" "OTHRTW66"
[97] "OTHMED76" "OTHRTW76" "OTHMED86" "OTHRTW86" "OTHMED96" "OTHRTW96"
[103] "OTHME106" "OTHTW106" "OTHME116" "OTHTW116" "OTHME126" "OTHTW126"
[109] "OTHME136" "OTHTW136" "OTHME146" "OTHTW146" "OTHME156" "OTHTW156"
[115] "ESTLSTV6" "REDUHAR6" "OSTEOPO6" "MENOSYM6" "YOUNGLK6" "HCPADVI6"
[121] "FRNADVI6" "IMPRMEM6" "REGPERI6" "HORMOTH6" "DONTKNO6" "HORMDAY6"
[127] "PRBBLEE6" "HAVEPER6" "LIKEFEL6" "SIDEEFF6" "CANCER6" "ADVISTO6"
[133] "EXPENSI6" "NOLIKE6" "NOREMEB6" "DNTKNOW6" "STOPOTH6" "NOREASO6"
[139] "REGVITA6" "ONCEADA6" "ANTIOXI6" "VITCOMB6" "VTMOTH16" "VTMOTH26"
[145] "VTMOTH36" "VTMOTH46" "VITAMNA6" "BETACAR6" "VITAMNC6" "VITAMND6"
[151] "VITAMNE6" "CALCTUM6" "IRON6" "ZINC6" "SELENIU6" "VTMSING6"
[157] "SVTMOT16" "SVTMOT26" "SVTMOT36" "SVTMOT46" "SVTMOT56" "SVTMOT66"
[163] "SVTMOT76" "SVTMOT86" "SVTMOT96" "SVTMO106" "PAIN16" "PAINTW16"
[169] "PAIN26" "PAINTW26" "SLEEP16" "SLEPTW16" "SLEEP26" "SLEPTW26"
[175] "OTC16" "OTCTW16" "OTC26" "OTCTW26" "OTC36" "OTCTW36"
[181] "OTC46" "OTCTW46" "ALCHL246" "SOYYSNO6" "SOYPROT6" "CEREACA6"
[187] "BREADCA6" "ORANGCA6" "ANEMIA6" "DIABETE6" "HIGHBP6" "HBCHOLE6"
[193] "MIGRAIN6" "STROKE6" "OSTEOAR6" "THYROID6" "HEARTAT6" "ANGINA6"
[199] "OSTEOPR6" "CANCERS6" "SITESPE6" "TAMOXIF6" "CHEMOTH6" "BROKEBO6"
[205] "BONES16" "HAPPEN16" "BONES26" "HAPPEN26" "BONES36" "HAPPEN36"
[211] "DANDC6" "NUMDAND6" "HYSTERE6" "HYSTDAY6" "OOPHORE6" "ONEOVAR6"
[217] "ABLATIN6" "UTERPRO6" "THYRREM6" "ENDO6" "ENDODIF6" "PELVCPN6"
[223] "DIFPELV6" "PROLAPS6" "DIFPROL6" "PELVCNC6" "DIFCANC6" "ABBLEED6"
[229] "DIFBLED6" "FIBRUTR6" "DIFFIBR6" "PRVIDER6" "PROFDEG6" "SPECIFY6"
[235] "PROVSPC6" "SPECIAL6" "PROVTIM6" "BLEEDNG6" "BLD3MON6" "LMPDAY6"
[241] "DESCPER6" "LENGCYL6" "PRGNANT6" "OUTCOME6" "BRSTFEE6" "QLTYLIF6"
[247] "LISTEN6" "TAKETOM6" "CONFIDE6" "HELPSIC6" "CONTROL6" "ABILITY6"
[253] "YOURWAY6" "PILING6" "BOTHER6" "APPETIT6" "BLUES6" "GOOD6"
[259] "KEEPMIN6" "DEPRESS6" "EFFORT6" "HOPEFUL6" "FAILURE6" "FEARFUL6"
[265] "RESTLES6" "HAPPY6" "TALKLES6" "LONELY6" "UNFRNDL6" "ENJOY6"
[271] "CRYING6" "SAD6" "DISLIKE6" "GETGOIN6" "SEPEXP6" "SEPTHR6"
[277] "SEPINJ6" "SEPFAM6" "SEPLEV6" "SEPSAF6" "SEPCON6" "SEPEAC6"
[283] "SEPFUT6" "SEPGOV6" "SEPCNF6" "SEPANX6" "CHNGJOB6" "JOB6"
[289] "CHANGHR6" "STRTIM16" "STPTIM16" "ROTAT16" "STRTIM26" "STPTIM26"
[295] "ROTAT26" "STRTIM36" "STPTIM36" "ROTAT36" "HOURSPA6" "VOLUNTE6"
[301] "VLNTHR16" "VLNTHR26" "VLNTHR36" "MARITAL6" "CHGHHLD6" "HOUSEHL6"
[307] "RELAT16" "SEX16" "AGE16" "RELAT26" "SEX26" "AGE26"
[313] "RELAT36" "SEX36" "AGE36" "RELAT46" "SEX46" "AGE46"
[319] "RELAT56" "SEX56" "AGE56" "RELAT66" "SEX66" "AGE66"
[325] "RELAT76" "SEX76" "AGE76" "RELAT86" "SEX86" "AGE86"
[331] "RELAT96" "SEX96" "AGE96" "RELAT106" "SEX106" "AGE106"
[337] "RELAT116" "SEX116" "AGE116" "RELAT126" "SEX126" "AGE126"
[343] "FIBROID6" "FORMINT6" "STATUS6" "SAADAY6" "FORMSAA6" "LANGSAA6"
[349] "OVERHLT6" "HOSPSTA6" "MDTALK6" "NERVES6" "PAPSMEA6" "BRSTEXA6"
[355] "MAMOGRA6" "MEDICDY6" "INSURDR6" "INSDRCO6" "INSURRX6" "INSRXCO6"
[361] "INSURHO6" "INSHOCO6" "HLTHSER6" "INSURAN6" "NOTAFFR6" "NOTRANS6"
[367] "NOPROVI6" "TOOBUSY6" "NOTRUST6" "BETTROF6" "FAILOTH6" "SMOKERE6"
[373] "AVCIGDA6" "DRNKBEE6" "GLASBEE6" "GLASWIN6" "GLASLIQ6" "HLTHAYR6"
[379] "V_ACTI6" "M_ACTI6" "LIFTING6" "CLIMBS6" "CLIMB1_6" "BENDING6"
[385] "WALKM6" "WALKS6" "WALK1_6" "BATHING6" "PHYCTDW6" "PHYACCO6"
[391] "PHYLIMI6" "PHYDFCL6" "EMOCTDW6" "EMOACCO6" "EMOCARE6" "INTERFR6"
[397] "BODYPAI6" "PAINTRF6" "PEP6" "NERV4WK6" "CHER4WK6" "CALM4WK6"
[403] "ENERGY6" "BLUE4WK6" "WORNOUT6" "HAPY4WK6" "TIRED6" "SOCIAL6"
[409] "HEALSIC6" "HEALTHY6" "HEALWOR6" "HEALEXC6" "CARING6" "MEALS6"
[415] "ROUTNCH6" "MODERAT6" "VIGOROU6" "PHYSACT6" "WATCHTV6" "WALKBIK6"
[421] "SWEATPA6" "SPORTS6" "SPOREX16" "RATEIN16" "MTHSAC16" "HRSACT16"
[427] "OTHSPOR6" "SPOREX26" "RATEIN26" "MTHSAC26" "HRSACT26" "PHYSWOR6"
[433] "WORKTIR6" "WRKACTA6" "WRKACTB6" "WRKACTC6" "WRKACTD6" "WRKACTE6"
[439] "WRKACTF6" "WRKACTG6" "STIFF6" "ACHES6" "COLDSWE6" "NITESWE6"
[445] "VAGINDR6" "FEELBLU6" "DIZZY6" "IRRITAB6" "NRVOUS6" "FORGET6"
[451] "MOODCHG6" "HARTRAC6" "FEARFULA6" "HDACHE6" "HOTFLAS6" "BRSTPAI6"
[457] "TRBLSLE6" "WAKEUP6" "WAKEARL6" "SLEEPQL6" "GETUPUR6" "INVOLEA6"
[463] "DAYSLEA6" "COUGHLE6" "URGEVOI6" "OTHRLEA6" "NUTRIRE6" "HERBREM6"
[469] "PSYCMET6" "PHYSMET6" "FOLKMED6" "OTHRTHE6" "SPECOTH6" "STARTNE6"
[475] "WORKTRB6" "QUITJOB6" "WORKLOA6" "PRTUNEM6" "MONEYPR6" "WORSREL6"
[481] "RELATEN6" "SERIPRO6" "CHILDMO6" "RESPCAR6" "LEGALPR6" "CRELDIE6"
[487] "CLOSDIE6" "SELFVIO6" "FAMLVIO6" "PHYSILL6" "MAJEVEN6" "EMPLYPA6"
[493] "REWRDJO6" "STRSSJO6" "CRNTCAR6" "RWRDCAR6" "STRSCAR6" "CRNTMAR6"
[499] "RWRDREL6" "STRSREL6" "CHILDRE6" "REWRDMO6" "STRSSMO6" "INCOME6"
[505] "HOW_HAR6" "LOSSINC6" "FINAN1Y6" "FINAN2Y6" "FINAN3Y6" "HOMEOWN6"
[511] "HOMESPE6" "LADERCO6" "LADERUS6" "INTRPAN6" "DISIPAN6" "EXCIPAN6"
[517] "UPSEPAN6" "STROPAN6" "GUILPAN6" "SCARPAN6" "HOSTPAN6" "ENTHPAN6"
[523] "PROUPAN6" "IRRIPAN6" "ALERPAN6" "ASHAPAN6" "INSPPAN6" "NERVPAN6"
[529] "DETEPAN6" "ATTEPAN6" "JITTPAN6" "ACTIPAN6" "AFRAPAN6" "STRENEX6"
[535] "MODEREX6" "ACUPUNC6" "ACUPHAR6" "ACUPOST6" "ACUPMEN6" "ACUPLOO6"
[541] "ACUPMEM6" "ACUPPER6" "ACUPGEN6" "ACUPWGH6" "ACUPADV6" "ACUPOTH6"
[547] "ACUPSPE6" "BCOHOSH6" "BCOHHAR6" "BCOHOST6" "BCOHMEN6" "BCOHLOO6"
[553] "BCOHMEM6" "BCOHPER6" "BCOHGEN6" "BCOHWGH6" "BCOHADV6" "BCOHOTH6"
[559] "BCOHSPE6" "DHEA6" "DHEAHAR6" "DHEAOST6" "DHEAMEN6" "DHEALOO6"
[565] "DHEAMEM6" "DHEAPER6" "DHEAGEN6" "DHEAWGH6" "DHEAADV6" "DHEAOTH6"
[571] "DHEASPE6" "DQUAI6" "DQUAHAR6" "DQUAOST6" "DQUAMEN6" "DQUALOO6"
[577] "DQUAMEM6" "DQUAPER6" "DQUAGEN6" "DQUAWGH6" "DQUAADV6" "DQUAOTH6"
[583] "DQUASPE6" "DIETNUT6" "DIETHAR6" "DIETOST6" "DIETMEN6" "DIETLOO6"
[589] "DIETMEM6" "DIETPER6" "DIETGEN6" "DIETWGH6" "DIETADV6" "DIETOTH6"
[595] "DIETSPE6" "HEALENR6" "HEALHAR6" "HEALOST6" "HEALMEN6" "HEALLOO6"
[601] "HEALMEM6" "HEALPER6" "HEALGEN6" "HEALWGH6" "HEALADV6" "HEALOTH6"
[607] "HEALSPE6" "EXERCIS6" "EXERHAR6" "EXEROST6" "EXERMEN6" "EXERLOO6"
[613] "EXERMEM6" "EXERPER6" "EXERGEN6" "EXERWGH6" "EXERADV6" "EXEROTH6"
[619] "EXERSPE6" "FLAXSEE6" "FLAXHAR6" "FLAXOST6" "FLAXMEN6" "FLAXLOO6"
[625] "FLAXMEM6" "FLAXPER6" "FLAXGEN6" "FLAXWGH6" "FLAXADV6" "FLAXOTH6"
[631] "FLAXSPE6" "GINKGO6" "GINKHAR6" "GINKOST6" "GINKMEN6" "GINKLOO6"
[637] "GINKMEM6" "GINKPER6" "GINKGEN6" "GINKWGH6" "GINKADV6" "GINKOTH6"
[643] "GINKSPE6" "GINSENG6" "GINSHAR6" "GINSOST6" "GINSMEN6" "GINSLOO6"
[649] "GINSMEM6" "GINSPER6" "GINSGEN6" "GINSWGH6" "GINSADV6" "GINSOTH6"
[655] "GINSSPE6" "GLUSAMI6" "GLUSHAR6" "GLUSOST6" "GLUSMEN6" "GLUSLOO6"
[661] "GLUSMEM6" "GLUSPER6" "GLUSGEN6" "GLUSWGH6" "GLUSADV6" "GLUSOTH6"
[667] "GLUSSPE6" "MYAMPRO6" "MYAMHAR6" "MYAMOST6" "MYAMMEN6" "MYAMLOO6"
[673] "MYAMMEM6" "MYAMPER6" "MYAMGEN6" "MYAMWGH6" "MYAMADV6" "MYAMOTH6"
[679] "MYAMSPE6" "MSM6" "MSMHAR6" "MSMOST6" "MSMMEN6" "MSMLOO6"
[685] "MSMMEM6" "MSMPER6" "MSMGEN6" "MSMWGH6" "MSMADV6" "MSMOTH6"
[691] "MSMSPE6" "PRAYER6" "PRAYHAR6" "PRAYOST6" "PRAYMEN6" "PRAYLOO6"
[697] "PRAYMEM6" "PRAYPER6" "PRAYGEN6" "PRAYWGH6" "PRAYADV6" "PRAYOTH6"
[703] "PRAYSPE6" "SAME6" "SAMEHAR6" "SAMEOST6" "SAMEMEN6" "SAMELOO6"
[709] "SAMEMEM6" "SAMEPER6" "SAMEGEN6" "SAMEWGH6" "SAMEADV6" "SAMEOTH6"
[715] "SAMESPE6" "SELFHEL6" "SELFHAR6" "SELFOST6" "SELFMEN6" "SELFLOO6"
[721] "SELFMEM6" "SELFPER6" "SELFGEN6" "SELFWGH6" "SELFADV6" "SELFOTH6"
[727] "SELFSPE6" "SOYSUPP6" "SOYHAR6" "SOYOST6" "SOYMEN6" "SOYLOO6"
[733] "SOYMEM6" "SOYPER6" "SOYGEN6" "SOYWGH6" "SOYADV6" "SOYOTH6"
[739] "SOYSPE6" "WORTSTJ6" "WORTHAR6" "WORTOST6" "WORTMEN6" "WORTLOO6"
[745] "WORTMEM6" "WORTPER6" "WORTGEN6" "WORTWGH6" "WORTADV6" "WORTOTH6"
[751] "WORTSPE6" "TAICHI6" "TAIHAR6" "TAIOST6" "TAIMEN6" "TAILOO6"
[757] "TAIMEM6" "TAIPER6" "TAIGEN6" "TAIWGH6" "TAIADV6" "TAIOTH6"
[763] "TAISPE6" "WVITAMI6" "WVITHAR6" "WVITOST6" "WVITMEN6" "WVITLOO6"
[769] "WVITMEM6" "WVITPER6" "WVITGEN6" "WVITWGH6" "WVITADV6" "WVITOTH6"
[775] "WVITSPE6" "YOGA6" "YOGAHAR6" "YOGAOST6" "YOGAMEN6" "YOGALOO6"
[781] "YOGAMEM6" "YOGAPER6" "YOGAGEN6" "YOGAWGH6" "YOGAADV6" "YOGAOTH6"
[787] "YOGASPE6" "OTHALT6" "OTHALTS6" "OTHHAR6" "OTHOST6" "OTHMEN6"
[793] "OTHLOO6" "OTHMEM6" "OTHPER6" "OTHGEN6" "OTHWGH6" "OTHADV6"
[799] "OTHALTR6" "WHYOTHA6" "FLGSAAV6" "SABDAY6" "LANGSAB6" "IMPORSE6"
[805] "DESIRSE6" "ENGAGSE6" "NOPARTN6" "PARTPRO6" "PHYSPRO6" "NOSEXOT6"
[811] "SATISFY6" "KISSING6" "TOUCHIN6" "ORALSEX6" "INTCOUR6" "AROUSED6"
[817] "PELVIC6" "LUBRICN6" "ABLECLM6" "IMPCLMX6" "SATISFD6" "FREQUEN6"
[823] "MEN6MOS6" "MASTURB6" "FLGSABV6" "PHYDAY6" "PULSE6" "SYSBP16"
[829] "DIABP16" "SYSBP26" "DIABP26" "HEIGHT6" "HTMETHO6" "WEIGHT6"
[835] "SCALE6" "WAIST6" "WASTMEA6" "HIP6" "HIPMEAS6" "FLGPHYV6"
[841] "BMI6" "COGDAY6" "LANGCOG6" "IMEDTHR6" "IMEDCH16" "IMEDHOU6"
[847] "IMEDFIR6" "IMEDFMN6" "IMEDCLM6" "IMEDCH26" "IMEDRES6" "IMEDMIN6"
[853] "IMEDINJ6" "IMEDEVR6" "IMEDWEL6" "TOTIDE16" "SDMTSTA6" "SDMTADM6"
[859] "SDMTPRA6" "SDMTATM6" "SDMTCOR6" "DIGIT1A6" "DIGIT1B6" "DIGIT2A6"
[865] "DIGIT2B6" "DIGIT3A6" "DIGIT3B6" "DIGIT4A6" "DIGIT4B6" "DIGIT5A6"
[871] "DIGIT5B6" "DIGIT6A6" "DIGIT6B6" "DIGTOT6" "DLAYTHR6" "DLAYCH16"
[877] "DLAYHOU6" "DLAYFIR6" "DLAYFMN6" "DLAYCLM6" "DLAYCH26" "DLAYRES6"
[883] "DLAYMIN6" "DLAYINJ6" "DLAYEVR6" "DLAYWEL6" "TOTIDE26" "FLGCOGV6"
[889] "HRMDAY6" "CYCDAY6" "DHAS6" "FSH6" "SHBG6" "T6"
[895] "E2AVE6" "FLGCV6" "FLGDIF6" "CVRDAY6" "FLAGSER6" "CHOLRES6"
[901] "TRIGRES6" "LDLRESU6" "HDLRESU6" "GLUCRES6" "INSURES6" "PAIRESU6"
[907] "TPARESU6" "LPARESU6" "APOARES6" "APOBRES6" "CRPRESU6" "SPSCDAY6"
[913] "SPSCTIM6" "SPSCMOD6" "HPSCDAY6" "HPSCTIM6" "HPSCMOD6" "SPBMDT6"
[919] "HPBMDT6" "BMDFLG6" "BIODAY6" "LANGBIO6" "AICDPUM6" "EXER12H6"
[925] "EAT5HR6" "ALCO24H6" "METJEWL6" "COMPBIA6" "SIDE6" "CONDRAW6"
[931] "CONDFRZ6" "IMPERAW6" "IMPEFRZ6" "PBFBIA6" "SKELMM6" "TBWNHAN6"
[937] "FFMNHAN6" "TBFNHAN6" "PBFNHAN6" "MISSCON6" "MISSPHY6" "FLAGSRP6"
[943] "FLGBIOV6" "CAMDAY6"
What we’d ultimately like is to match up the observations in the lab table with the demographic data, only keeping columns of potential interest.
- From swan_demographic we would like to use STUDYID, AGE6, RACE.
- From swan_lab we would like to use LDLRESU6, BMI6, GLUCRES6, CRPRESU6, DIABP16, SYSBP16, EXERCIS6, SMOKERE6
We’ll use the merge()
function to join the data frames together. The function’s parameters refer to the “left” data frame as “x”, and the “right” data frame as “y”.
The merge()
function allows us to make a number of choices:
- We can specify which variable is the one to match by. In many cases, the variable may have one name in one table, and a different name in the other table, so
by.x
andby.y
could be different. You can even specify multiple columns to match on (you would setby.x
andby.y
to vectors of variable names). - We can specify whether to keep rows from the “x” table which do not match up with any rows in the “y” table, and whether to keep rows from the “y” table which do not match up with any rows in the “x” table, using the
all.x
andall.y
parameters.
swan_df <- merge(x = swan_demographic, y = swan_lab, by.x="SWANID", by.y="SWANID",
all.x=FALSE, all.y=TRUE)
Let’s take a peek at a few of the columns in swan_df
, the data frame resulting from the merge.
head(colnames(swan_df), n=10)
[1] "SWANID" "VISIT.x" "INTDAY6" "AGE6" "LANGINT6" "RACE"
[7] "VISIT.y" "PREGNAN6" "PREVBLO6" "EATDRIN6"
We notice something unusual here. swan_df
has columns named VISIT.x
and VISIT.y
. Sometimes both of the data frames in a merge have some columns with the same names. merge()
keeps both columns, but to keep them distinct, it appends .x
to the variable from the “x” data frame, and .y
from the “y” data frame.
HERE WE NEED TO CONSIDER DPLYR FOR JOINING. #TODO
It might be a good idea to write out an intermediate data file with what we have so far, because this is the data we’ll be working with from this point forward. In a common format like CSV, you might wish to share the file with others at this point.
#TODO: Say why we have to use row.names=FALSE
write.csv(swan_df, file = 'data_out/swan_merge.csv', row.names = FALSE)
outcome variable. –>
Key Points
Use
merge()
to stack dataUse
merge()
to join dataUnderstand joins