"The Syntactic Acceptability Dataset as a resource for machine learning and linguistic analysis"
October 26, 2022, Schedule:
- Nespresso & Teatime (417 DSL - Commons)
- 03:00 to 03:30 PM Eastern Time (US and Canada)
- Colloquium - F2F (499 DSL) / Virtual (Zoom)
- 03:30 to 04:30 PM Eastern Time (US and Canada)
Meeting # 942 7359 5552
Abstract:
Linguistic datasets are popular in machine learning, particularly in the emerging field of few-shot learning (learning from limited data), because linguistic data is often complex and difficult to generalize from and thus poses a welcome challenge (Wang et al. 2020). In this talk, I will outline ongoing research on building a new dataset valuable to both the machine learning community and the linguistic community. The new dataset will be based on CoLA (Corpus of Linguistic Acceptability; Warstadt et al. 2018), a popular dataset in machine learning. I will briefly introduce CoLA, the challenges it poses, and relevant linguistic distinctions (acceptability vs. grammaticality). Further, I will motivate the need for new data of a different kind and outline its structure and its expected relevance to machine learning and linguistics.